Run VoiceAI locally

ASR, TTS, voice calling, diarization, metadata, AI coaching — one Docker command.
Models download automatically. No cloud dependency.

Quick start

$ docker run -d --name whissle \
  -p 9000:9000 -p 8001:8001 -p 8003:8003 \
  -v whissle-models:/models -v whissle-data:/data \
  -e VARIANT=en-full \
  -e ANTHROPIC_API_KEY=your-key \
  whissleasr/whissle-gateway:latest

en-full · Downloads ~2 GB on first run (cached after) · free, no sign-up

What happens when you run it:

═══════════════════════════════════════════════
  Whissle Gateway — en-full
═══════════════════════════════════════════════
No GPU detected → using CPU

Shared models:
  ✓ speaker encoder + VAD           26 MB
  ✓ punctuation                    254 MB
  ✓ ITN (English + Hinglish)       1.5 MB

Variant: en-full
  ✓ en-in-tech-misc (485 MB)
  ✓ KenLM ENGLISH (1.5 GB)

Auth:
  Mode:    local
  Token:   wh_a1b2c3d4e5f6... (admin)
  Manage:  curl -H 'Authorization: Bearer ...' localhost:9000/auth/tokens

Starting services...
  PostgreSQL: :5432  ●
  ASR:        :8001  ●
  Video:      :8002  ●
  TTS:        :8003  ●
  Agent:      :8765  ●
  Pipecat:    :8000  ●
  Gateway:    :9000  ●
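
The first run downloads models before the services listed above come up, so a script that calls the API right after `docker run` may fail. The following is a minimal sketch of a readiness poll, assuming only the published ports from the quick start command (8001, 8003, 9000). The function names are hypothetical, and an open TCP port is a coarse signal: it shows that something accepts connections, not that the models have finished loading.

```python
import socket
import time

# Ports published with -p in the quick start command; adjust if yours differ.
SERVICE_PORTS = {"ASR": 8001, "TTS": 8003, "Gateway": 9000}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_services(ports: dict, host: str = "localhost",
                      deadline_s: float = 300.0, poll_s: float = 2.0) -> dict:
    """Poll until every port accepts connections or the deadline passes.

    Returns a {service name: reachable} mapping either way.
    """
    end = time.monotonic() + deadline_s
    status = {name: False for name in ports}
    while True:
        for name, port in ports.items():
            if not status[name]:
                status[name] = port_open(host, port)
        if all(status.values()) or time.monotonic() >= end:
            return status
        time.sleep(poll_s)


# Usage (blocks for up to deadline_s seconds):
#   for name, up in wait_for_services(SERVICE_PORTS).items():
#       print(f"{name}: {'up' if up else 'not reachable'}")
```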

API

Six interfaces — batch REST, streaming WebSocket, text-to-speech, video intelligence, voice calling, and an intelligent agent.

POST localhost:8001/transcribe

$ curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "metadata_prob=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Response — transcript + metadata per segment + AI analysis

{
  "segments": [
    {
      "speaker":  "SPEAKER_00",
      "text":     "Hello, good morning.",
      "start":    1.0,  "end": 1.9,
      "metadata": {
        "emotion":  "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role":     "ROLE_INTERVIEWER",
        "age":      "AGE_30_45",
        "gender":   "GENDER_MALE"
      },
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ],
  "analysis": {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices":     { "followed": 6, "total": 8 },
    "highlights":    [...]
  }
}
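
To illustrate how the response might be consumed, here is a small sketch that reads the `segments` array and derives per-speaker talk time and a readable transcript. It assumes the field names shown in the sample above (`speaker`, `text`, `start`, `end`); the helper names and the trimmed sample are illustrative only.

```python
import json
from collections import defaultdict

# Trimmed copy of the sample response above (analysis block omitted).
SAMPLE = """
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0, "end": 1.9,
      "words": [{"word": "Hello", "start": 1.0, "end": 1.3}]
    }
  ]
}
"""


def talk_time(result: dict) -> dict:
    """Sum segment durations (seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in result.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


def format_transcript(result: dict) -> list:
    """Render one '[start-end] SPEAKER: text' line per segment."""
    return [
        f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['speaker']}: {seg['text']}"
        for seg in result.get("segments", [])
    ]


result = json.loads(SAMPLE)
print(talk_time(result))
print("\n".join(format_transcript(result)))
```

With a real call, `result` would come from `json.load(open("result.json"))` after running the curl command above.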

Parameters

All parameters for POST /transcribe.

Parameter	Type	Default	Description
file	file	required	Audio file (MP3, WAV, FLAC, OGG, M4A)
language	string	auto	Language hint: en, hi, zh
diarize	bool	false	Speaker diarization
num_speakers	int	auto	Exact speaker count (if known)
punctuation	bool	true	Restore punctuation and capitalization
itn	bool	true	Inverse text normalization (numbers, currency)
use_lm	bool	true	KenLM language model beam search
metadata_prob	bool	false	Probability distributions for metadata
word_timestamps	bool	false	Per-word start/end timestamps
speech_analysis	bool	false	Speech patterns (pace, fillers, fluency)
summarize	string	—	AI analysis: true, sales_coaching, collections, or custom prompt
hotwords	string	—	Comma-separated hotwords for boosting
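
The parameters above are sent as multipart form fields, as in the curl example. The sketch below shows one way to make the same request from Python using only the standard library. The helper names (`to_form_fields`, `encode_multipart`, `transcribe`) are hypothetical, and it assumes the endpoint needs no auth header, since the curl example sends none.

```python
import json
import os
import urllib.request
import uuid


def to_form_fields(params: dict) -> dict:
    """Convert Python values to form strings; booleans become 'true'/'false'."""
    return {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in params.items()
    }


def encode_multipart(fields: dict, filename: str, audio: bytes) -> tuple:
    """Encode text fields plus one part named 'file' as multipart/form-data.

    Returns (body bytes, Content-Type header value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode())
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode())
    parts.append(audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path: str, url: str = "http://localhost:8001/transcribe", **params) -> dict:
    """POST an audio file with the given parameters and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart(
            to_form_fields(params), os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires the container to be running):
#   result = transcribe("call.mp3", diarize=True, num_speakers=2,
#                       summarize="sales_coaching")
```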

AI analysis modes

Add -F "summarize=mode" to any transcription. The diarized transcript + metadata is sent to your configured LLM for analysis.

sales_coaching

Sales Coaching

8 best practices scored. Rep/buyer identification. Highlights with timestamps. Behavior labels per segment. Overall score 0–100.

collections

Collections Compliance

Identity verification, reason stated, amount mentioned, no harassment. Call outcome (Promise to Pay / Dispute / Hardship). Next action.

true

General Summary

Overview, participants, key topics, emotional dynamics, entities, outcome. Markdown format.

your prompt here

Custom Prompt

Pass any prompt string. The LLM receives your instructions + full transcript with per-segment metadata.
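
As a rough illustration, the sketch below separates the built-in modes from a custom prompt and condenses a `sales_coaching` result into one line. It assumes the `analysis` fields from the sample response earlier on this page (`overall_score`, `buyer_outcome`, `practices`); other modes may return a different shape, and the function names are hypothetical.

```python
# Built-in values for the summarize parameter, as listed above.
BUILTIN_MODES = {"true", "sales_coaching", "collections"}


def is_custom_prompt(summarize: str) -> bool:
    """Anything that is not a built-in mode is treated as a custom prompt."""
    return summarize not in BUILTIN_MODES


def coaching_report(analysis: dict) -> str:
    """Condense a sales_coaching analysis block into a single line."""
    practices = analysis.get("practices", {})
    followed = practices.get("followed", 0)
    total = practices.get("total", 0)
    pct = round(100 * followed / total) if total else 0
    return (
        f"score {analysis.get('overall_score', 'n/a')}/100, "
        f"outcome {analysis.get('buyer_outcome', 'n/a')}, "
        f"practices {followed}/{total} ({pct}%)"
    )


# Values copied from the sample response.
sample_analysis = {
    "overall_score": 78,
    "buyer_outcome": "Converted",
    "practices": {"followed": 6, "total": 8},
}
print(coaching_report(sample_analysis))
```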

Models

Each model extracts different metadata in a single ASR forward pass — no separate models or API calls.

en-tiny

57 MB

CTC · BPE

13M params, NVIDIA Conformer CTC Small. Fastest inference — 50x realtime on CPU. Greedy decode only, no metadata classifiers.

English · 1025 vocab, greedy

en-in-tech-misc

485 MB

BEHAVIOR · EMOTION · EVAL · ROLE · AGE · GENDER · ENTITY

120M params. 26 behavioral codes for coaching, therapy, interviews. 8 evaluation labels.

English · 6 heads, 51 classes

hinglish-loans

479 MB

INTENT · EMOTION · ROLE · AGE · GENDER · ENTITY

115M params. Debt collection intents: pay-back, disputes, hardship. Agent/Customer role detection.

Hindi-English · 5 heads, 26 classes

zh

627 MB

DIALECT · AGE · GENDER · ENTITY

160M params, Mandarin with North/South dialect detection.

Mandarin · 3 heads, 12 classes

whissle-large

2.4 GB

INTENT · EMOTION · AGE · GENDER · ENTITY

600M params, inline action tokens. 31 intent groups, 18K vocabulary.

23 languages · 5,500+ action tokens

Kokoro TTS

82 MB

55 voices

Non-autoregressive text-to-speech. Sub-200ms TTFB on CPU. Always included.

10 languages · Baked in

Punctuation + ITN

255 MB

Capitalization · Numbers

Punctuation restoration and inverse text normalization.

EN + Hinglish · Auto-downloaded

Metadata per segment

Every segment includes the tags below that its model produces. Common tags appear on all models with metadata classifiers (en-tiny has none). Additional tags depend on the model.

Tag	Values	Models
emotion	EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE	All
age	AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+	All
gender	GENDER_MALE, GENDER_FEMALE	All
behavior	26 types (BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION, BEHAVIOR_ACKNOWLEDGE, ...)	en-in-tech-misc
eval	EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP	en-in-tech-misc
role	ROLE_INTERVIEWER / ROLE_INTERVIEWEE or ROLE_AGENT / ROLE_CUSTOMER	en-in-tech-misc, hinglish-loans
intent	13 collections intents or 31 general intents (INTENT_GREETING, INTENT_QUESTION, ...)	hinglish-loans, whissle-large
dialect	DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS	zh
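
Because tag values follow a `KIND_VALUE` pattern, they are easy to split and aggregate. The following sketch counts one tag per speaker across segments. It assumes each metadata value is a plain string label, as in the sample response; the shape may differ when `metadata_prob=true` returns distributions. The function names are hypothetical.

```python
from collections import Counter


def split_tag(tag: str) -> tuple:
    """Split 'EMOTION_NEUTRAL' into ('emotion', 'neutral').

    Only the first underscore separates kind from value, so
    'AGE_30_45' becomes ('age', '30_45').
    """
    kind, _, value = tag.partition("_")
    return kind.lower(), value.lower()


def tag_counts(segments: list, key: str) -> dict:
    """Count the values of one metadata tag per speaker."""
    counts = {}
    for seg in segments:
        tag = seg.get("metadata", {}).get(key)
        if tag is None:
            continue  # this model does not emit the requested tag
        value = split_tag(tag)[1]
        counts.setdefault(seg["speaker"], Counter())[value] += 1
    return counts


# Illustrative segments using tag values from the table above.
segments = [
    {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
    {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_HAPPY"}},
    {"speaker": "SPEAKER_01", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
    {"speaker": "SPEAKER_01", "metadata": {}},
]
print(tag_counts(segments, "emotion"))
```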

Variants

Choose your variant based on language and quality needs. Switch by changing VARIANT= and restarting. Cached models are reused.

Variant	Languages	Download	Best for
`en-tiny`	English	~310 MB	Fastest, IoT/edge, lightweight deployments
`hinglish`	Hindi-English	~515 MB	Debt collections, Hindi-English call centers
`en-lite`	English	~500 MB	Quick testing, development
`en-full`★	English	~2 GB	Sales coaching, interviews, therapy
`multi-full`	23 languages	~4 GB	Multilingual, highest quality
`multi-zh`	23 langs + Mandarin	~5 GB	Multilingual + dialect detection
`all`	All	~6 GB	Maximum flexibility

Runs everywhere

From your laptop (CPU) to data center GPUs. Same Docker, same API. Auto-detects GPU.

Hardware	VRAM	Variant	Concurrent
Raspberry Pi / Edge	CPU	`en-tiny`	1
MacBook / Laptop	CPU	`en-tiny` / `en-lite`	1–5
Mac Mini M4 Pro	24 GB unified	`en-full`	3–8
NVIDIA T4	16 GB	`en-lite`	5–10
RTX 4090	24 GB	`en-full`	20–50
A100 40GB	40 GB	`multi-full`	50–80
RTX 6000 Ada	48 GB	`all`	50–100
H100	80 GB	`all`	150–300
DGX Spark	128 GB unified	`all`	30–60
H200	141 GB	`all`	250–500

Docker Tag	Arch	Runtime
whissleasr/whissle-gateway:latest	amd64	CPU — Mac (Rosetta), Linux, Windows
whissleasr/whissle-gateway:gpu	amd64	NVIDIA CUDA 12.4 + onnxruntime-gpu
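
Switching variant or image tag only changes a few arguments of the quick start command. The sketch below assembles that command for a given variant and CPU or GPU image. The variant names and tags come from the tables above; passing `--gpus all` for the `gpu` tag is an assumption based on standard Docker usage, not something this page states, and `docker_run_args` is a hypothetical helper.

```python
import shlex

# Variant names from the Variants table; image tags from the Docker Tag table.
VARIANTS = {"en-tiny", "hinglish", "en-lite", "en-full", "multi-full", "multi-zh", "all"}
IMAGE = "whissleasr/whissle-gateway"


def docker_run_args(variant: str = "en-full", gpu: bool = False,
                    name: str = "whissle", api_key: str = "") -> list:
    """Build the argument list for `docker run`, mirroring the quick start."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(VARIANTS)}")
    args = [
        "docker", "run", "-d", "--name", name,
        "-p", "9000:9000", "-p", "8001:8001", "-p", "8003:8003",
        "-v", "whissle-models:/models", "-v", "whissle-data:/data",
        "-e", f"VARIANT={variant}",
    ]
    if api_key:
        args += ["-e", f"ANTHROPIC_API_KEY={api_key}"]
    if gpu:
        # Assumption: the gpu image needs the container to see the GPUs.
        args += ["--gpus", "all"]
    args.append(f"{IMAGE}:{'gpu' if gpu else 'latest'}")
    return args


# Print a copy-pasteable command; subprocess.run(args, check=True) would run it.
print(shlex.join(docker_run_args("en-lite")))
```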

Architecture

┌──────────────────────────────────────────────────────────────┐
│                     Docker Container                        │
│                                                             │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌──────────┐  │
│  │ ASR    │ │ Video  │ │ TTS    │ │Pipecat │ │ Agent    │  │
│  │ :8001  │ │ :8002  │ │ :8003  │ │ :8000  │ │ :8765    │  │
│  │        │ │        │ │ Kokoro │ │        │ │          │  │
│  │ ONNX   │ │MediaPip│ │ 82M    │ │ WebRTC │ │ Any LLM  │  │
│  │ +KenLM │ │+Vision │ │55 voice│ │ Twilio │ │ Cloud or │  │
│  │ +ECAPA │ │  LLM   │ │        │ │Voice AI│ │  Local   │  │
│  │ +VAD   │ │        │ │        │ │        │ │          │  │
│  │ +Punct │ │Face    │ │        │ │ Auth   │ │Summarize │  │
│  │ +ITN   │ │Gesture │ │        │ │Multiorg│ │ Coach    │  │
│  └────────┘ └────────┘ └────────┘ └────────┘ └──────────┘  │
│                     │                                       │
│              ┌──────────────┐                               │
│              │ PostgreSQL   │                               │
│              │ :5432        │                               │
│              └──────────────┘                               │
│                                                             │
│  /models  (Docker volume — cached ASR models)               │
│  /data    (Docker volume — PostgreSQL, auth, conversations) │
└──────────────────────────────────────────────────────────────┘

whissle-models volume

ASR models, KenLM, punctuation, ITN. Downloaded on first run and cached until the volume is removed. Survives container restarts.

whissle-data volume

Conversations, analytics, agent configs, auth tokens. Persists across restarts. Only deleted by docker volume rm.

Get started

One command. Models download automatically. Ready in 2 minutes.
Built for contact centers, sales intelligence, behavioral AI, and more.