API Documentation

Complete reference for Whissle Gateway — ASR, TTS, diarization, metadata, AI analysis.

Overview

Whissle Gateway runs three services and a unified gateway proxy in a single Docker container:

Service	Port	Protocol	Function
ASR	8001	REST + WebSocket	Speech recognition, diarization, metadata
TTS	8003	REST + WebSocket	Text-to-speech (Kokoro 82M, 55 voices)
Agent	8765	REST + SSE	LLM processing, summarization, coaching
Gateway	9000	REST + WebSocket	Unified proxy (requires API token)

For development and testing, hit the service ports directly (8001, 8003, 8765). The gateway proxy at 9000 adds authentication, rate limiting, and usage tracking for production use.

Authentication

The gateway includes a built-in local auth system — no external backend required. An admin API token is auto-generated on first start and printed in the Docker logs.

Getting your admin token

# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"

# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt

Creating user tokens

# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
  -H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "my-app", "label": "My Application"}'

# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}
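The same token-creation call can be sketched in Python with only the standard library. The function name `build_create_token_request` is illustrative and not part of any Whissle SDK; the endpoint, headers, and body fields are taken from the curl example above. The request is built but not sent, so the sketch can be inspected without a running gateway.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:9000"  # gateway port from the Overview table


def build_create_token_request(admin_token: str, user_id: str, label: str) -> urllib.request.Request:
    """Build (but do not send) the POST /auth/tokens request."""
    body = json.dumps({"user_id": user_id, "label": label}).encode("utf-8")
    return urllib.request.Request(
        f"{GATEWAY_URL}/auth/tokens",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


req = build_create_token_request("wh_YOUR_ADMIN_TOKEN", "my-app", "My Application")
print(req.get_method(), req.full_url)

# To send it against a running gateway:
#   with urllib.request.urlopen(req) as resp:
#       token = json.load(resp)["token"]
```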

Using tokens

# REST — Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
  -H "Authorization: Bearer wh_YOUR_TOKEN" \
  -F "file=@call.mp3" -F "diarize=true"

# WebSocket — query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"
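A small Python sketch of the two ways a token is attached, as described above: an Authorization header for REST and a `token` query parameter for WebSocket. The helper names `auth_header` and `ws_url` are hypothetical; the paths come from the wscat examples.

```python
from urllib.parse import urlencode


def auth_header(token: str) -> dict[str, str]:
    """REST requests carry the token in the Authorization header."""
    return {"Authorization": f"Bearer {token}"}


def ws_url(path: str, token: str, host: str = "localhost:9000") -> str:
    """WebSocket connections carry the token as a URL-encoded query parameter."""
    return f"ws://{host}{path}?{urlencode({'token': token})}"


print(auth_header("wh_YOUR_TOKEN"))
print(ws_url("/asr/stream", "wh_YOUR_TOKEN"))
```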

Token management

Method	Endpoint	Description
POST	/auth/tokens	Create a new API token (requires admin)
GET	/auth/tokens	List all active tokens (requires admin)
GET	/auth/tokens?user_id=X	List tokens for a specific user
DELETE	/auth/tokens/{id}	Revoke a token by ID (requires admin)
GET	/auth/usage?user_id=X&days=7	Usage logs for a user
GET	/auth/usage/summary	Aggregated usage stats

All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.

ASR — Batch Transcription

Upload an audio file and receive a complete transcription with metadata.

Basic usage

curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3"

# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "itn=true" \
    -F "metadata_prob=true" \
    -F "word_timestamps=true" \
    -F "speech_analysis=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Parameters

Parameter	Type	Default	Description
file	file	required	Audio file — MP3, WAV, FLAC, OGG, M4A, WebM
language	string	auto	Language hint: en, hi, zh, ja, ko, es, fr, de, etc.
model	string	default	ASR model: en-in-tech-misc, hinglish-loans, zh, whissle-large
diarize	bool	false	Enable speaker diarization (ECAPA-TDNN + agglomerative clustering)
num_speakers	int	auto	Exact number of speakers (skips auto-detection)
min_speakers	int	1	Minimum speakers for auto-detection
max_speakers	int	10	Maximum speakers for auto-detection
punctuation	bool	true	Restore punctuation (commas, periods, question marks) and capitalization
itn	bool	true	Inverse text normalization — 'twenty three' → '23', 'पाँच सौ' → '500'
use_lm	bool	true	Use KenLM n-gram language model for beam search decoding
lm_mode	string	balanced	LM mode: greedy, balanced, strict — controls LM weight on decoding
beam_width	int	100	Beam width for KenLM beam search (higher = more accurate but slower)
lm_alpha	float	0.1	LM weight (alpha) for beam search scoring
lm_beta	float	0.5	Word insertion bonus (beta) for beam search
metadata_prob	bool	false	Include probability distributions for emotion, intent, age, gender
word_timestamps	bool	false	Include per-word start and end timestamps with confidence scores
speech_analysis	bool	false	Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace
speaker_embedding	bool	false	Return 192-dim speaker embedding vector
summarize	string	—	AI analysis: true, sales_coaching, collections, or custom prompt text
hotwords	string	—	Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI')
hotword_weight	float	10.0	Hotword boost weight in beam search
noise_reduce	bool	false	Apply noise reduction before transcription
gain_normalize	bool	false	Normalize audio gain before transcription
profanity_filter	bool	false	Mask profanity in transcript output
n_best	int	1	Number of alternative transcriptions to return
format	string	—	Output format: srt, vtt, segments (for subtitles)
structured_intents	bool	false	Two-tier intent classification: speech_act + domain
intent_labels	string	—	Filter intent predictions to specific labels
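The parameters in the table above are sent as multipart form fields, as in the curl examples. The following is a minimal standard-library sketch of how such a form body could be assembled in Python. `encode_multipart` is an illustrative helper, not a Whissle API, and the sketch assumes the server accepts a generic `application/octet-stream` part for the file.

```python
import uuid


def encode_multipart(fields: dict[str, str], file_field: str,
                     filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data POST."""
    boundary = uuid.uuid4().hex
    parts = []
    # One part per plain form field, e.g. diarize=true
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(header.encode("utf-8") + value.encode("utf-8") + b"\r\n")
    # The audio file part
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    # Closing boundary
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


options = {"diarize": "true", "num_speakers": "2", "word_timestamps": "true"}
body, content_type = encode_multipart(options, "file", "call.mp3", b"<audio bytes>")
print(content_type, len(body))
```

The resulting `body` and `content_type` can be passed to `urllib.request.Request(..., data=body, headers={"Content-Type": content_type})` targeting `http://localhost:8001/transcribe`.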

Response Format

{
  "transcript": "Hello, good morning. How are you?",
  "duration": 5.2,
  "inference_time": "0.823s",
  "model": "whissle-large",
  "decoder": "beam_search_kenlm",

  // Present when diarize=true
  "num_speakers": 2,
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "confidence": 0.96,

      // Always present — top-1 prediction per category
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "eval": "EVAL_NONE",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },

      // Present when metadata_prob=true — full distributions
      "metadata_probs": {
        "emotion": {
          "EMOTION_NEUTRAL": 0.82,
          "EMOTION_HAPPY": 0.12,
          "EMOTION_SAD": 0.03,
          "EMOTION_ANGRY": 0.02,
          "EMOTION_FEAR": 0.01
        },
        "age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
      },

      // Present when word_timestamps=true
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
        {"word": "good",  "start": 1.4, "end": 1.6, "confidence": 0.95},
        {"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
      ],

      "entities": [
        {"type": "PERSON", "value": "Diana"},
        {"type": "ORG", "value": "American Amicable"}
      ]
    }
  ],

  // Present when speech_analysis=true
  "speech_analysis": {
    "lexical_fluency": 0.92,
    "vocabulary_range": 0.85,
    "grammar_score": 0.88
  },

  // Present when summarize=... is set
  "summary": "...",           // text or markdown
  "analysis": { ... }         // structured JSON (sales_coaching, collections)
}
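As an illustration of consuming this response, the sketch below renders diarized segments as one line per segment. It uses only field names shown in the sample above; `format_dialogue` is a hypothetical helper, and the fallbacks for missing keys are assumptions about how optional fields behave.

```python
def format_dialogue(result: dict) -> list[str]:
    """Render diarized segments as one readable line per segment."""
    lines = []
    for seg in result.get("segments", []):  # absent unless diarize=true
        speaker = seg.get("speaker", "UNKNOWN")
        emotion = seg.get("metadata", {}).get("emotion", "n/a")
        lines.append(
            f"[{speaker}] {seg['start']:.1f}-{seg['end']:.1f}s ({emotion}): {seg['text']}"
        )
    return lines


sample = {
    "transcript": "Hello, good morning. How are you?",
    "segments": [
        {
            "speaker": "SPEAKER_00",
            "text": "Hello, good morning.",
            "start": 1.0,
            "end": 1.9,
            "metadata": {"emotion": "EMOTION_NEUTRAL"},
        }
    ],
}
for line in format_dialogue(sample):
    print(line)
```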

Other Batch Endpoints

Endpoint	Description
POST /transcribe	Full transcription with all metadata and analysis options
POST /transcribe/clean	Returns only clean transcript text — fastest, no metadata
POST /transcribe/raw	Returns raw model output including inline action tokens
POST /transcribe/pcm	Transcribe raw PCM s16le audio bytes (no file header needed)
POST /transcribe/long	Long audio with VAD segmentation — returns segments array
POST /transcribe/batch	Batch multiple files in one request
GET /models	List loaded ASR models and their metadata
GET /intents	List available intent labels for the loaded model
GET /languages	List supported languages and their KenLM availability
GET /status	Server status — loaded models, memory, uptime

ASR — WebSocket Streaming

Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.

WebSocket ws://localhost:8001/stream

# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize

Python example

import asyncio, json, wave, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))

        # 2. Stream audio chunks (100 ms = 1600 frames = 3200 bytes at 16 kHz s16le).
        #    wave strips the WAV header; the file must already be 16 kHz, 16-bit, mono.
        with wave.open(audio_path, "rb") as wf:
            while chunk := wf.readframes(1600):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace to real time: one 100 ms chunk per 100 ms

        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))

        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("type") == "end":
                break
            if data.get("is_final"):
                print(f"FINAL: {data['text']}")
                print(f"  emotion={data.get('metadata', {}).get('emotion')}")
            elif data.get("type") == "transcript":
                print(f"  interim: {data['text']}")

asyncio.run(stream("call.wav"))

Streaming Config

Field	Type	Default	Description
type	string	required	Must be 'config'
sample_rate	int	16000	Audio sample rate in Hz
language	string	auto	Language hint
model	string	default	ASR model to use
interim_results	bool	true	Send interim (partial) transcripts
channel	string	microphone	Audio channel name (for multi-channel tagging)
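The chunk size used when streaming follows directly from the audio format: 16-bit mono PCM is 2 bytes per sample, so a chunk of a given duration is `sample_rate * 2 * ms / 1000` bytes. The sketch below works through that arithmetic and builds a config message from the fields in the table above. Both helper names are illustrative, and whether sample rates other than 16000 are accepted is not stated here.

```python
import json

BYTES_PER_SAMPLE = 2  # s16le: 16-bit samples, mono


def chunk_bytes(sample_rate: int, chunk_ms: int) -> int:
    """Bytes of raw PCM covering chunk_ms milliseconds of audio."""
    return sample_rate * BYTES_PER_SAMPLE * chunk_ms // 1000


def config_message(sample_rate: int = 16000, language: str = "en",
                   interim_results: bool = True, channel: str = "microphone") -> str:
    """First message of a streaming session, using fields from the config table."""
    return json.dumps({
        "type": "config",
        "sample_rate": sample_rate,
        "language": language,
        "interim_results": interim_results,
        "channel": channel,
    })


print(chunk_bytes(16000, 100))  # 3200, matching the Python example
print(config_message())
```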

Streaming Response

// Interim transcript (partial, updates as more audio arrives)
{
  "type": "transcript",
  "channel": "microphone",
  "text": "Hello how are",
  "is_final": false,
  "utterance_end": false
}

// Final transcript (complete utterance with metadata)
{
  "type": "transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "utterance_end": true,
  "process_ms": 510,
  "metadata": {
    "emotion": "EMOTION_NEUTRAL",
    "behavior": "BEHAVIOR_DIRECT",
    "role": "ROLE_INTERVIEWER",
    "age": "AGE_30_45",
    "gender": "GENDER_MALE"
  }
}

// Session end
{ "type": "end" }
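A client typically discards interim messages once the final transcript for an utterance arrives. The sketch below shows one way to do that over a list of raw JSON messages shaped like the three examples above; `collect_finals` is a hypothetical helper, not part of the gateway.

```python
import json


def collect_finals(raw_messages: list[str]) -> list[str]:
    """Keep only final transcripts; stop at the session-end message."""
    finals = []
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("type") == "end":
            break
        if msg.get("type") == "transcript" and msg.get("is_final"):
            finals.append(msg["text"])
    return finals


messages = [
    '{"type": "transcript", "text": "Hello how are", "is_final": false}',
    '{"type": "transcript", "text": "Hello, how are you?", "is_final": true}',
    '{"type": "end"}',
    '{"type": "transcript", "text": "ignored after end", "is_final": true}',
]
print(collect_finals(messages))
```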

Speaker Diarization

Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.

# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
    -F "file=@meeting.wav" \
    -F "diarize=true" \
    -F "num_speakers=3"

# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."

# Each segment has its own metadata (emotion, age, gender per speaker)

Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
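The last pipeline step, merging adjacent same-speaker segments, can be illustrated as follows. This is a sketch of the idea under simple assumptions (segments arrive in time order, and text is joined with a space), not the gateway's actual implementation; `merge_adjacent` is a hypothetical name.

```python
def merge_adjacent(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments that share a speaker label."""
    merged: list[dict] = []
    for seg in segments:
        if merged and merged[-1]["speaker"] == seg["speaker"]:
            last = merged[-1]
            last["text"] = f"{last['text']} {seg['text']}"
            last["end"] = seg["end"]  # extend the merged span
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged


segments = [
    {"speaker": "SPEAKER_00", "text": "Hello,", "start": 1.0, "end": 1.3},
    {"speaker": "SPEAKER_00", "text": "good morning.", "start": 1.4, "end": 1.9},
    {"speaker": "SPEAKER_01", "text": "Hi, how are you?", "start": 2.1, "end": 3.0},
]
for seg in merge_adjacent(segments):
    print(seg)
```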

Metadata

Every ASR segment includes metadata extracted in a single forward pass, with no separate models or API calls. The available tags depend on which ASR model is loaded.

Common tags (all models)

Tag	Values	Description
emotion	EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE	Speaker emotion per utterance
age	AGE_CHILD, AGE_YOUNG, AGE_ADULT, AGE_SENIOR	Estimated speaker age range
gender	GENDER_MALE, GENDER_FEMALE	Speaker gender

en-in-tech-misc (en-full, en-lite variants)

Tag	Classes	Values
behavior	26 types	BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN
eval	8 types	EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP
role	3 types	ROLE_INTERVIEWER, ROLE_INTERVIEWEE

hinglish-loans (hinglish variant)

Tag	Classes	Values
intent	13 types	INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER
role	3 types	ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER

zh (multi-zh, all variants)

Tag	Classes
dialect	DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS

whissle-large (multi-full, multi-zh, all variants)

Tag	Classes
intent	31 groups — inline action tokens extracted from the vocabulary
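Because each segment carries its own tags, simple aggregates are easy to compute on the client. The sketch below tallies emotion labels per speaker and picks the most likely label from a `metadata_probs` distribution. The function names are illustrative, and the segment shape follows the batch response format described earlier on this page.

```python
from collections import Counter, defaultdict


def top_label(probs: dict[str, float]) -> str:
    """Most probable label in a metadata_probs distribution."""
    return max(probs, key=probs.get)


def emotions_by_speaker(segments: list[dict]) -> dict[str, dict[str, int]]:
    """Count top-1 emotion tags per speaker across all segments."""
    tally: dict[str, Counter] = defaultdict(Counter)
    for seg in segments:
        tally[seg["speaker"]][seg["metadata"]["emotion"]] += 1
    return {speaker: dict(counts) for speaker, counts in tally.items()}


segs = [
    {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
    {"speaker": "SPEAKER_01", "metadata": {"emotion": "EMOTION_HAPPY"}},
    {"speaker": "SPEAKER_00", "metadata": {"emotion": "EMOTION_NEUTRAL"}},
]
print(emotions_by_speaker(segs))
print(top_label({"EMOTION_NEUTRAL": 0.82, "EMOTION_HAPPY": 0.12, "EMOTION_SAD": 0.03}))
```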

AI Analysis

Add summarize=<mode> to any batch transcription. The diarized transcript and per-segment metadata are sent to the configured LLM (Claude or Gemini) for analysis.

Preset: sales_coaching

Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.

curl -F "file=@call.mp3" -F "diarize=true" \
    -F "summarize=sales_coaching" http://localhost:8001/transcribe

# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)

Preset: collections

Compliance scoring for debt collection calls: identity verification, reason stated, amount mentioned, no harassment. Also reports the call outcome and next action.

-F "summarize=collections"

# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action

Preset: true (general)

Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.

-F "summarize=true"

Custom prompt

Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.

-F "summarize=You are a medical call quality analyst. \
    Score this call on: 1) empathy (0-10), \
    2) medical history completeness, 3) HIPAA compliance. \
    Return JSON with scores and recommendations."

# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.

TTS — Text-to-Speech

Kokoro 82M: non-autoregressive, single forward pass, 55 voices, 10 languages. Sub-200 ms time to first byte (TTFB) on CPU.

WebSocket ws://localhost:8003/stream

# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak:  {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete

Python example

import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))

        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg           # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break

        with wave.open("out.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))

TTS Config

Field	Type	Default	Description
type	string	required	Must be 'config'
voice	string	af_heart	Voice ID (see /voices endpoint)
language	string	en	Language for phonemization
speed	float	1.0	Speech speed (0.5 – 2.0)
temperature	float	0.6	Sampling temperature (0.0 – 2.0)
top_k	int	50	Top-k sampling (0 – 200)
exaggeration	float	0.0	Voice exaggeration (prosody variation)

Speak message fields:

Field	Type	Description
type	string	Must be 'speak'
text	string	Text to synthesize
instruct	string	Optional style instruction (e.g. 'speak slowly and calmly')

Voices

Query available voices via REST:

curl http://localhost:8003/voices

# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}

55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern: {accent}{gender}_{name}.
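The voice ID pattern can be split apart as sketched below. Only the `a` (American) and `b` (British) accent prefixes are named above; reading `f` and `m` as female and male is an assumption based on the pattern's gender slot, and `parse_voice_id` is a hypothetical helper. Unknown prefix letters are passed through unchanged.

```python
ACCENTS = {"a": "American", "b": "British"}  # other accent letters are not listed here
GENDERS = {"f": "female", "m": "male"}       # assumed meaning of the second letter


def parse_voice_id(voice_id: str) -> dict[str, str]:
    """Split an ID such as 'af_heart' into accent, gender, and name."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    accent, gender = prefix
    return {
        "accent": ACCENTS.get(accent, accent),
        "gender": GENDERS.get(gender, gender),
        "name": name,
    }


print(parse_voice_id("af_heart"))
print(parse_voice_id("am_adam"))
```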

Agent

Intelligent agent powered by Claude or Gemini. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.

Process (single-turn)

POST http://localhost:8765/process

curl -X POST http://localhost:8765/process \
    -H "Content-Type: application/json" \
    -H "X-Device-Id: user-1" \
    -d '{
      "transcript": "What meetings do I have today?",
      "user_id": "user-1",
      "language": "en"
    }'

# Response: mode, processed_text, is_command, etc.

Chat (multi-turn streaming)

POST http://localhost:8765/voice-agent/chat/stream

curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "Summarize the key takeaways."}
      ],
      "user_id": "user-1"
    }'

# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
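A client reading this stream can reassemble the reply by concatenating `text_chunk` events until `done` arrives. The sketch below parses already-received lines and makes no network call; `collect_sse_text` is an illustrative name, and only the two event types shown above are handled.

```python
import json


def collect_sse_text(lines: list[str]) -> str:
    """Concatenate text_chunk contents from SSE 'data:' lines until 'done'."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "text_chunk":
            parts.append(event["content"])
        elif event.get("type") == "done":
            break
    return "".join(parts)


stream_lines = [
    'data: {"type": "text_chunk", "content": "The key "}',
    "",
    'data: {"type": "text_chunk", "content": "takeaways..."}',
    "",
    'data: {"type": "done"}',
]
print(collect_sse_text(stream_lines))
```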

Configuration

All configuration is done through environment variables passed with docker run -e.

Variable	Default	Description
VARIANT	en-full	Model variant: hinglish, en-lite, en-full, multi-full, multi-zh, all
ASR_DEVICE	auto	Inference device: auto, cpu, cuda
LLM_PROVIDER	claude	LLM for agent + summarization: claude or gemini
ANTHROPIC_API_KEY	—	Anthropic Claude API key
ANTHROPIC_MODEL	claude-sonnet-4-6	Claude model ID
GEMINI_API_KEY	—	Google Gemini API key
STORAGE_MODE	sqlite	Data storage: sqlite (local) or firestore (cloud)
TTS_ENGINE	kokoro	TTS engine (kokoro is default and baked in)
AUTH_MODE	local	Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote)
AUTH_DB_PATH	/data/auth/tokens.db	Path to SQLite token database
GATEWAY_INTERNAL_SECRET	auto	Shared secret for internal service-to-service auth
HIPAA_MODE	false	Enable HIPAA mode: 1h session TTL, encrypted DB
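To show how these variables fit into a launch command, the sketch below assembles a `docker run` invocation from a dictionary. Several details are assumptions rather than documented values: the image name is a placeholder, the `-d` flag and one-to-one port mappings are illustrative, and the named volume `whissle-data` is hypothetical. The container name, ports, and `/data` path are taken from earlier sections of this page.

```python
import shlex


def docker_run_command(image: str, env: dict[str, str],
                       name: str = "whissle-gateway",
                       ports: tuple[int, ...] = (8001, 8003, 8765, 9000),
                       data_volume: str = "whissle-data") -> str:
    """Assemble a docker run command line with -e flags for each variable."""
    args = ["docker", "run", "-d", "--name", name]
    for port in ports:
        args += ["-p", f"{port}:{port}"]       # assumed 1:1 host-to-container mapping
    args += ["-v", f"{data_volume}:/data"]     # persists /data/auth/tokens.db
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return shlex.join(args)


cmd = docker_run_command(
    "IMAGE_PLACEHOLDER",  # substitute the real image name
    {"VARIANT": "en-lite", "ASR_DEVICE": "cpu", "LLM_PROVIDER": "gemini"},
)
print(cmd)
```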

Need help?

See industry-specific solutions or contact us for custom integrations.
