API Documentation
Complete reference for Whissle Gateway — ASR, TTS, diarization, metadata, AI analysis.
Overview
Whissle Gateway runs three services (ASR, TTS, Agent) and a unified gateway proxy in a single Docker container:
| Service | Port | Protocol | Function |
|---|---|---|---|
| ASR | 8001 | REST + WebSocket | Speech recognition, diarization, metadata |
| TTS | 8003 | WebSocket | Text-to-speech (Kokoro 82M, 55 voices) |
| Agent | 8765 | REST + SSE | LLM processing, summarization, coaching |
| Gateway | 9000 | REST + WebSocket | Unified proxy (requires API token) |
For development and testing, hit the service ports directly (8001, 8003, 8765). The gateway proxy at 9000 adds authentication, rate limiting, and usage tracking for production use.
Authentication
The gateway includes a built-in local auth system — no external backend required. An admin API token is auto-generated on first start and printed in the Docker logs.
Getting your admin token
# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"
# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt
Creating user tokens
# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
-H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"user_id": "my-app", "label": "My Application"}'
# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}
Using tokens
# REST: Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
-H "Authorization: Bearer wh_YOUR_TOKEN" \
-F "file=@call.mp3" -F "diarize=true"
# WebSocket: query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"
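The two token conventions above (a Bearer header for REST, a token query parameter for WebSocket) can be wrapped in small helpers. This is a minimal sketch: the function names `rest_headers` and `ws_url` are illustrative and not part of the gateway, and the `https` to `wss` mapping is an assumption for deployments behind TLS.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

GATEWAY_HTTP = "http://localhost:9000"  # gateway port from the overview table


def rest_headers(token: str) -> dict:
    """Headers for REST calls made through the gateway."""
    return {"Authorization": f"Bearer {token}"}


def ws_url(path: str, token: str, base: str = GATEWAY_HTTP) -> str:
    """WebSocket URL with the token passed as a query parameter."""
    parts = urlsplit(base)
    # Assumption: an https base maps to wss, a plain http base to ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    return urlunsplit((scheme, parts.netloc, path, urlencode({"token": token}), ""))


print(rest_headers("wh_YOUR_TOKEN"))
print(ws_url("/asr/stream", "wh_YOUR_TOKEN"))
```

Keeping the token out of hand-built strings also means any special characters in it are URL-encoded consistently.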
Token management
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/tokens | Create a new API token (requires admin) |
| GET | /auth/tokens | List all active tokens (requires admin) |
| GET | /auth/tokens?user_id=X | List tokens for a specific user |
| DELETE | /auth/tokens/{id} | Revoke a token by ID (requires admin) |
| GET | /auth/usage?user_id=X&days=7 | Usage logs for a user |
| GET | /auth/usage/summary | Aggregated usage stats |
All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.
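As a sketch of how the token-management endpoints in the table might be called from Python, the helpers below build `urllib.request.Request` objects for creating a token, revoking one, and reading usage. The helper names are hypothetical; the paths, methods, and JSON body fields are taken from the table and the curl example above. Nothing is sent until a request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:9000"  # gateway port


def _admin_headers(admin_token):
    return {"Authorization": f"Bearer {admin_token}"}


def create_token_request(admin_token, user_id, label):
    """POST /auth/tokens with a JSON body."""
    body = json.dumps({"user_id": user_id, "label": label}).encode("utf-8")
    headers = {**_admin_headers(admin_token), "Content-Type": "application/json"}
    return urllib.request.Request(f"{BASE}/auth/tokens", data=body, headers=headers, method="POST")


def revoke_token_request(admin_token, token_id):
    """DELETE /auth/tokens/{id}."""
    return urllib.request.Request(
        f"{BASE}/auth/tokens/{token_id}", headers=_admin_headers(admin_token), method="DELETE"
    )


def usage_request(admin_token, user_id, days=7):
    """GET /auth/usage?user_id=X&days=N."""
    query = urllib.parse.urlencode({"user_id": user_id, "days": days})
    return urllib.request.Request(
        f"{BASE}/auth/usage?{query}", headers=_admin_headers(admin_token), method="GET"
    )


req = create_token_request("wh_YOUR_ADMIN_TOKEN", "my-app", "My Application")
print(req.get_method(), req.full_url)
# To actually send it against a running gateway:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```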
ASR — Batch Transcription
Upload an audio file and receive a complete transcription with metadata.
Basic usage
curl -X POST http://localhost:8001/transcribe \
-F "file=@call.mp3"
# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
-F "file=@call.mp3" \
-F "diarize=true" \
-F "num_speakers=2" \
-F "punctuation=true" \
-F "itn=true" \
-F "metadata_prob=true" \
-F "word_timestamps=true" \
-F "speech_analysis=true" \
-F "summarize=sales_coaching" \
-o result.json
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file — MP3, WAV, FLAC, OGG, M4A, WebM |
| language | string | auto | Language hint: en, hi, zh, ja, ko, es, fr, de, etc. |
| model | string | default | ASR model: en-in-tech-misc, hinglish-loans, zh, whissle-large |
| diarize | bool | false | Enable speaker diarization (ECAPA-TDNN + agglomerative clustering) |
| num_speakers | int | auto | Exact number of speakers (skips auto-detection) |
| min_speakers | int | 1 | Minimum speakers for auto-detection |
| max_speakers | int | 10 | Maximum speakers for auto-detection |
| punctuation | bool | true | Restore punctuation (commas, periods, question marks) and capitalization |
| itn | bool | true | Inverse text normalization — 'twenty three' → '23', 'पाँच सौ' → '500' |
| use_lm | bool | true | Use KenLM n-gram language model for beam search decoding |
| lm_mode | string | balanced | LM mode: greedy, balanced, strict — controls LM weight on decoding |
| beam_width | int | 100 | Beam width for KenLM beam search (higher = better, slower) |
| lm_alpha | float | 0.1 | LM weight (alpha) for beam search scoring |
| lm_beta | float | 0.5 | Word insertion bonus (beta) for beam search |
| metadata_prob | bool | false | Include probability distributions for emotion, intent, age, gender |
| word_timestamps | bool | false | Include per-word start and end timestamps with confidence scores |
| speech_analysis | bool | false | Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace |
| speaker_embedding | bool | false | Return 192-dim speaker embedding vector |
| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt text |
| hotwords | string | — | Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI') |
| hotword_weight | float | 10.0 | Hotword boost weight in beam search |
| noise_reduce | bool | false | Apply noise reduction before transcription |
| gain_normalize | bool | false | Normalize audio gain before transcription |
| profanity_filter | bool | false | Mask profanity in transcript output |
| n_best | int | 1 | Number of alternative transcriptions to return |
| format | string | — | Output format: srt, vtt, segments (for subtitles) |
| structured_intents | bool | false | Two-tier intent classification: speech_act + domain |
| intent_labels | string | — | Filter intent predictions to specific labels |
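When calling /transcribe from code rather than curl, the parameters above have to be sent as string form fields. The sketch below converts Python values into that shape. The helper name `form_fields` is illustrative, and the lowercase "false" spelling is an assumption inferred from the `diarize=true` curl examples.

```python
def form_fields(**options):
    """Convert Python values into string form fields for /transcribe."""
    fields = {}
    for name, value in options.items():
        if value is None:
            continue  # omit unset options so the server defaults apply
        if isinstance(value, bool):  # checked before other types: bool is an int subclass
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            fields[name] = ",".join(str(v) for v in value)  # e.g. hotwords
        else:
            fields[name] = str(value)
    return fields


fields = form_fields(
    diarize=True,
    num_speakers=2,
    hotwords=["LoanTap", "EMI"],
    language=None,  # left out, so the server auto-detects
)
print(fields)
# With the third-party requests library, the fields could be sent as:
# requests.post("http://localhost:8001/transcribe",
#               files={"file": open("call.mp3", "rb")}, data=fields)
```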
Response Format
{
"transcript": "Hello, good morning. How are you?",
"duration": 5.2,
"inference_time": "0.823s",
"model": "whissle-large",
"decoder": "beam_search_kenlm",
// Present when diarize=true
"num_speakers": 2,
"segments": [
{
"speaker": "SPEAKER_00",
"text": "Hello, good morning.",
"start": 1.0,
"end": 1.9,
"confidence": 0.96,
// Always present — top-1 prediction per category
"metadata": {
"emotion": "EMOTION_NEUTRAL",
"behavior": "BEHAVIOR_DIRECT",
"role": "ROLE_INTERVIEWER",
"eval": "EVAL_NONE",
"age": "AGE_30_45",
"gender": "GENDER_MALE"
},
// Present when metadata_prob=true — full distributions
"metadata_probs": {
"emotion": {
"EMOTION_NEUTRAL": 0.82,
"EMOTION_HAPPY": 0.12,
"EMOTION_SAD": 0.03,
"EMOTION_ANGRY": 0.02,
"EMOTION_FEAR": 0.01
},
"age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
},
// Present when word_timestamps=true
"words": [
{"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
{"word": "good", "start": 1.4, "end": 1.6, "confidence": 0.95},
{"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
],
"entities": [
{"type": "PERSON", "value": "Diana"},
{"type": "ORG", "value": "American Amicable"}
]
}
],
// Present when speech_analysis=true
"speech_analysis": {
"lexical_fluency": 0.92,
"vocabulary_range": 0.85,
"grammar_score": 0.88
},
// Present when summarize=... is set
"summary": "...", // text or markdown
"analysis": { ... } // structured JSON (sales_coaching, collections)
}
Other Batch Endpoints
| Endpoint | Description |
|---|---|
| POST /transcribe | Full transcription with all metadata and analysis options |
| POST /transcribe/clean | Returns only clean transcript text — fastest, no metadata |
| POST /transcribe/raw | Returns raw model output including inline action tokens |
| POST /transcribe/pcm | Transcribe raw PCM s16le audio bytes (no file header needed) |
| POST /transcribe/long | Long audio with VAD segmentation — returns segments array |
| POST /transcribe/batch | Batch multiple files in one request |
| GET /models | List loaded ASR models and their metadata |
| GET /intents | List available intent labels for the loaded model |
| GET /languages | List supported languages and their KenLM availability |
| GET /status | Server status — loaded models, memory, uptime |
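To illustrate consuming the /transcribe response described under Response Format, the sketch below walks the `segments` array, prints one line per segment, and totals talk time per speaker. The helper names are hypothetical, and the sample dictionary is a trimmed copy of the documented response, not real output.

```python
from collections import defaultdict


def speaker_lines(response):
    """One readable line per diarized segment."""
    lines = []
    for seg in response.get("segments", []):
        speaker = seg.get("speaker", "SPEAKER_??")  # only set when diarize=true
        emotion = seg.get("metadata", {}).get("emotion", "UNKNOWN")
        lines.append(
            f"[{speaker}] {seg['start']:.1f}-{seg['end']:.1f}s {emotion}: {seg['text']}"
        )
    return lines


def talk_time(response):
    """Seconds of speech per speaker, rounded to 2 decimals."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg.get("speaker", "SPEAKER_??")] += seg["end"] - seg["start"]
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


sample = {
    "transcript": "Hello, good morning. How are you?",
    "num_speakers": 2,
    "segments": [
        {
            "speaker": "SPEAKER_00",
            "text": "Hello, good morning.",
            "start": 1.0,
            "end": 1.9,
            "metadata": {"emotion": "EMOTION_NEUTRAL"},
        }
    ],
}

for line in speaker_lines(sample):
    print(line)
print(talk_time(sample))
```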
ASR — WebSocket Streaming
Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.
WebSocket ws://localhost:8001/stream
# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize
Python example
import asyncio, json, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))
        # 2. Stream audio chunks (100ms = 3200 bytes at 16kHz s16le)
        # Note: a .wav file begins with a header; strip it if strict raw PCM is needed
        with open(audio_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace to real time: one 100ms chunk per 100ms
        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))
        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(f"FINAL: {data['text']}")
                print(f"  emotion={data['metadata']['emotion']}")
            elif data.get("type") == "transcript":
                print(f"  interim: {data['text']}")
            elif data.get("type") == "end":
                break

asyncio.run(stream("call.wav"))
Streaming Config
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| language | string | auto | Language hint |
| model | string | default | ASR model to use |
| interim_results | bool | true | Send interim (partial) transcripts |
| channel | string | microphone | Audio channel name (for multi-channel tagging) |
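The chunk size used when streaming follows from the sample rate: 16-bit mono audio at 16 kHz is 32,000 bytes per second, so 100 ms is 3,200 bytes. The sketch below computes that for any rate and builds a config message from the fields in the table. Both function names are illustrative, not part of the API.

```python
import json


def chunk_bytes(sample_rate=16000, chunk_ms=100, bytes_per_sample=2, channels=1):
    """Bytes in one chunk of PCM audio, aligned to whole samples."""
    samples = sample_rate * chunk_ms // 1000
    return samples * bytes_per_sample * channels


def stream_config(sample_rate=16000, language=None, model=None,
                  interim_results=True, channel="microphone"):
    """Config message for the /stream WebSocket; unset fields use server defaults."""
    config = {
        "type": "config",
        "sample_rate": sample_rate,
        "interim_results": interim_results,
        "channel": channel,
    }
    if language:
        config["language"] = language
    if model:
        config["model"] = model
    return config


print(chunk_bytes())                       # 16 kHz, 100 ms
print(chunk_bytes(sample_rate=8000, chunk_ms=20))
print(json.dumps(stream_config(language="en")))
```

If chunks are sent with a sleep between them, the sleep should match `chunk_ms` to keep the stream at real-time pace.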
Streaming Response
// Interim transcript (partial, updates as more audio arrives)
{
"type": "transcript",
"channel": "microphone",
"text": "Hello how are",
"is_final": false,
"utterance_end": false
}
// Final transcript (complete utterance with metadata)
{
"type": "transcript",
"text": "Hello, how are you?",
"is_final": true,
"utterance_end": true,
"process_ms": 510,
"metadata": {
"emotion": "EMOTION_NEUTRAL",
"behavior": "BEHAVIOR_DIRECT",
"role": "ROLE_INTERVIEWER",
"age": "AGE_30_45",
"gender": "GENDER_MALE"
}
}
// Session end
{ "type": "end" }
Speaker Diarization
Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.
# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
-F "file=@meeting.wav" \
-F "diarize=true" \
-F "num_speakers=3"
# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."
# Each segment has its own metadata (emotion, age, gender per speaker)
Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
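The last pipeline stage, merging adjacent same-speaker segments, can be sketched as follows. This is an illustration of the idea and not the gateway's internal code: the `max_gap` threshold is a hypothetical parameter, and the segment dictionaries use the `speaker`, `text`, `start`, and `end` fields from the batch response.

```python
def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive segments from the same speaker.

    max_gap (seconds) is an assumed threshold: segments separated by a
    longer pause stay separate even if the speaker is the same.
    """
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["text"] = f"{prev['text']} {seg['text']}"
            prev["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input list is not mutated
    return merged


segments = [
    {"speaker": "SPEAKER_00", "text": "Hello,", "start": 0.0, "end": 0.4},
    {"speaker": "SPEAKER_00", "text": "good morning.", "start": 0.5, "end": 1.2},
    {"speaker": "SPEAKER_01", "text": "Hi, how are you?", "start": 1.5, "end": 2.3},
    {"speaker": "SPEAKER_00", "text": "I'm doing well.", "start": 2.6, "end": 3.4},
]

for seg in merge_adjacent(segments):
    print(f"[{seg['speaker']}] {seg['text']}")
```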
Metadata
Every ASR segment includes metadata extracted in a single forward pass — no separate models or API calls. Tags available depend on which ASR model is loaded.
Common tags (all models)
| Tag | Values | Description |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | Speaker emotion per utterance |
| age | AGE_CHILD, AGE_YOUNG, AGE_ADULT, AGE_SENIOR | Estimated speaker age range |
| gender | GENDER_MALE, GENDER_FEMALE | Speaker gender |
en-in-tech-misc (en-full, en-lite variants)
| Tag | Classes | Values |
|---|---|---|
| behavior | 26 types | BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN |
| eval | 8 types | EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP |
| role | 3 types | ROLE_INTERVIEWER, ROLE_INTERVIEWEE |
hinglish-loans (hinglish variant)
| Tag | Classes | Values |
|---|---|---|
| intent | 13 types | INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER |
| role | 3 types | ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER |
zh (multi-zh, all variants)
| Tag | Classes |
|---|---|
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS |
whissle-large (multi-full, multi-zh, all variants)
| Tag | Classes |
|---|---|
| intent | 31 groups — inline action tokens extracted from the vocabulary |
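When `metadata_prob=true` is set, each segment carries full distributions in `metadata_probs` alongside the top-1 labels. A minimal sketch of reducing those distributions back to a best label with its score is shown below. The helper name `top_labels` is illustrative, and the sample values are copied from the Response Format example.

```python
def top_labels(metadata_probs):
    """Highest-probability (label, score) pair for each tag."""
    return {
        tag: max(dist.items(), key=lambda item: item[1])
        for tag, dist in metadata_probs.items()
        if dist  # skip empty distributions
    }


probs = {
    "emotion": {
        "EMOTION_NEUTRAL": 0.82,
        "EMOTION_HAPPY": 0.12,
        "EMOTION_SAD": 0.03,
        "EMOTION_ANGRY": 0.02,
        "EMOTION_FEAR": 0.01,
    },
    "age": {"AGE_30_45": 0.65, "AGE_18_30": 0.25},
}

for tag, (label, score) in top_labels(probs).items():
    print(f"{tag}: {label} ({score:.2f})")
```

Keeping the score next to the label makes it straightforward to ignore low-confidence predictions downstream.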
AI Analysis
Add summarize=<mode> to any batch transcription request. The diarized transcript and per-segment metadata are sent to the configured LLM (Claude or Gemini) for analysis.
Preset: sales_coaching
Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.
curl -F "file=@call.mp3" -F "diarize=true" \
-F "summarize=sales_coaching" http://localhost:8001/transcribe
# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)
Preset: collections
Compliance scoring for debt collection calls: identity verification, reason stated, amount mentioned, no harassment. Also reports the call outcome and next action.
-F "summarize=collections"
# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action
Preset: true (general)
Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.
-F "summarize=true"
Custom prompt
Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.
-F "summarize=You are a medical call quality analyst. \
Score this call on: 1) empathy (0-10), \
2) medical history completeness, 3) HIPAA compliance. \
Return JSON with scores and recommendations."
# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.
TTS — Text-to-Speech
Kokoro 82M — non-autoregressive, single forward pass, 55 voices, 10 languages. Sub-200ms TTFB on CPU.
WebSocket ws://localhost:8003/stream
# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak: {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete
Python example
import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))
        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg  # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break
    # Wrap the raw PCM in a WAV container
    with wave.open("out.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))
TTS Config
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| voice | string | af_heart | Voice ID (see /voices endpoint) |
| language | string | en | Language for phonemization |
| speed | float | 1.0 | Speech speed (0.5 – 2.0) |
| temperature | float | 0.6 | Sampling temperature (0.0 – 2.0) |
| top_k | int | 50 | Top-k sampling (0 – 200) |
| exaggeration | float | 0.0 | Voice exaggeration (prosody variation) |
Speak message fields:
| Field | Type | Description |
|---|---|---|
| type | string | Must be 'speak' |
| text | string | Text to synthesize |
| instruct | string | Optional style instruction (e.g. 'speak slowly and calmly') |
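The config and speak messages can be built with simple range checks so that out-of-range values fail on the client before anything is sent. This is a sketch under the ranges listed in the TTS Config table. The function names are hypothetical, and how the server itself handles out-of-range values is not documented here.

```python
import json


def tts_config(voice="af_heart", language="en", speed=1.0, temperature=0.6, top_k=50):
    """Config message for the TTS WebSocket, validated against the documented ranges."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    if not 0 <= top_k <= 200:
        raise ValueError("top_k must be between 0 and 200")
    return {
        "type": "config",
        "voice": voice,
        "language": language,
        "speed": speed,
        "temperature": temperature,
        "top_k": top_k,
    }


def speak_message(text, instruct=None):
    """Speak message; instruct is included only when provided."""
    message = {"type": "speak", "text": text}
    if instruct:
        message["instruct"] = instruct
    return message


print(json.dumps(tts_config(speed=1.2)))
print(json.dumps(speak_message("Hello!", instruct="speak slowly and calmly")))
```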
Voices
Query available voices via REST:
curl http://localhost:8003/voices
# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}
55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern {accent}{gender}_{name}.
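The voice ID pattern can be split into its parts with a small parser. This is a sketch with stated assumptions: the accent code is taken to be a single lowercase letter, `f` and `m` are read as female and male, and only the two accents named in the text (a = American, b = British) are mapped. Other accent codes are returned unchanged.

```python
import re

VOICE_ID = re.compile(r"^(?P<accent>[a-z])(?P<gender>[fm])_(?P<name>\w+)$")
ACCENTS = {"a": "American", "b": "British"}  # only these two are named in the docs
GENDERS = {"f": "female", "m": "male"}       # assumed meaning of the second letter


def parse_voice_id(voice_id):
    """Split a voice ID such as 'af_heart' into accent, gender, and name."""
    match = VOICE_ID.match(voice_id)
    if match is None:
        raise ValueError(f"unrecognized voice ID: {voice_id!r}")
    return {
        "accent": ACCENTS.get(match["accent"], match["accent"]),
        "gender": GENDERS[match["gender"]],
        "name": match["name"],
    }


print(parse_voice_id("af_heart"))
print(parse_voice_id("am_adam"))
```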
Agent
Intelligent agent powered by Claude or Gemini. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.
Process (single-turn)
POST http://localhost:8765/process
curl -X POST http://localhost:8765/process \
-H "Content-Type: application/json" \
-H "X-Device-Id: user-1" \
-d '{
"transcript": "What meetings do I have today?",
"user_id": "user-1",
"language": "en"
}'
# Response: mode, processed_text, is_command, etc.
Chat (multi-turn streaming)
POST http://localhost:8765/voice-agent/chat/stream
curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Summarize the key takeaways."}
],
"user_id": "user-1"
}'
# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
Configuration
All configuration is done through environment variables passed with docker run -e.
| Variable | Default | Description |
|---|---|---|
| VARIANT | en-full | Model variant: hinglish, en-lite, en-full, multi-full, multi-zh, all |
| ASR_DEVICE | auto | Inference device: auto, cpu, cuda |
| LLM_PROVIDER | claude | LLM for agent + summarization: claude or gemini |
| ANTHROPIC_API_KEY | — | Anthropic Claude API key |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Claude model ID |
| GEMINI_API_KEY | — | Google Gemini API key |
| STORAGE_MODE | sqlite | Data storage: sqlite (local) or firestore (cloud) |
| TTS_ENGINE | kokoro | TTS engine (kokoro is default and baked in) |
| AUTH_MODE | local | Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote) |
| AUTH_DB_PATH | /data/auth/tokens.db | Path to SQLite token database |
| GATEWAY_INTERNAL_SECRET | auto | Shared secret for internal service-to-service auth |
| HIPAA_MODE | false | Enable HIPAA mode: 1h session TTL, encrypted DB |
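As a sketch of how these variables might be assembled into a `docker run` invocation, the helper below builds the argument list from a dictionary. The image name and the volume name are placeholders and not documented values. The container name, the four ports, and the /data mount point come from earlier sections of this page.

```python
import shlex


def docker_run_command(image, env, name="whissle-gateway",
                       ports=(8001, 8003, 8765, 9000),
                       volume="whissle-data:/data"):
    """Build a docker run argument list; image and volume name are placeholders."""
    args = ["docker", "run", "-d", "--name", name]
    for port in ports:
        args += ["-p", f"{port}:{port}"]
    args += ["-v", volume]  # keeps /data/auth/tokens.db across restarts
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


command = docker_run_command(
    "YOUR_IMAGE:TAG",  # hypothetical image reference
    {
        "VARIANT": "en-full",
        "ASR_DEVICE": "auto",
        "LLM_PROVIDER": "claude",
        "ANTHROPIC_API_KEY": "YOUR_KEY",
    },
)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the container. Printing it first makes it easy to review which variables are being set.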
Need help?
See industry-specific solutions or contact us for custom integrations.
