🔒 Service Notice: Cloud services temporarily down — reinforcing our on-prem AI. Contact: hello@whissle.ai

API Documentation

Complete reference for Whissle Gateway — ASR, TTS, diarization, metadata, AI analysis.

Overview

Whissle Gateway runs three services and a unified gateway proxy in a single Docker container:

| Service | Port | Protocol | Function |
|---|---|---|---|
| ASR | 8001 | REST + WebSocket | Speech recognition, diarization, metadata |
| TTS | 8003 | WebSocket | Text-to-speech (Kokoro 82M, 55 voices) |
| Agent | 8765 | REST + SSE | LLM processing, summarization, coaching |
| Gateway | 9000 | REST + WebSocket | Unified proxy (requires API token) |

For development and testing, hit the service ports directly (8001, 8003, 8765). The gateway proxy at 9000 adds authentication, rate limiting, and usage tracking for production use.

Authentication

The gateway includes a built-in local auth system — no external backend required. An admin API token is auto-generated on first start and printed in the Docker logs.

Getting your admin token

# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"

# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt

Creating user tokens

# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
  -H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "my-app", "label": "My Application"}'

# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}

Using tokens

# REST — Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
  -H "Authorization: Bearer wh_YOUR_TOKEN" \
  -F "file=@call.mp3" -F "diarize=true"

# WebSocket — query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"

Token management

| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/tokens | Create a new API token (requires admin) |
| GET | /auth/tokens | List all active tokens (requires admin) |
| GET | /auth/tokens?user_id=X | List tokens for a specific user |
| DELETE | /auth/tokens/{id} | Revoke a token by ID (requires admin) |
| GET | /auth/usage?user_id=X&days=7 | Usage logs for a user |
| GET | /auth/usage/summary | Aggregated usage stats |

All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.
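
The token endpoints can also be called from Python. The sketch below only builds the request objects, so it runs without a gateway. `build_auth_request` is a hypothetical helper name, and the base URL assumes the gateway is reachable on localhost:9000 as in the curl examples.

```python
import json
import urllib.parse
import urllib.request

GATEWAY = "http://localhost:9000"  # assumed: gateway port from the overview table


def build_auth_request(method, path, admin_token, query=None, body=None):
    """Build (but do not send) a request to an /auth/* endpoint."""
    url = GATEWAY + path
    if query:
        url += "?" + urllib.parse.urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {admin_token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


# Usage logs for one user over the last 7 days
usage_req = build_auth_request(
    "GET", "/auth/usage", "wh_YOUR_ADMIN_TOKEN",
    query={"user_id": "my-app", "days": 7},
)

# Create a token for an application
create_req = build_auth_request(
    "POST", "/auth/tokens", "wh_YOUR_ADMIN_TOKEN",
    body={"user_id": "my-app", "label": "My Application"},
)
```

Pass either object to `urllib.request.urlopen` to send it. Building the query string with `urlencode` keeps user IDs that contain special characters safe in the URL.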

ASR — Batch Transcription

Upload an audio file and receive a complete transcription with metadata.

Basic usage

curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3"

# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "itn=true" \
    -F "metadata_prob=true" \
    -F "word_timestamps=true" \
    -F "speech_analysis=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file: MP3, WAV, FLAC, OGG, M4A, WebM |
| language | string | auto | Language hint: en, hi, zh, ja, ko, es, fr, de, etc. |
| model | string | default | ASR model: en-in-tech-misc, hinglish-loans, zh, whissle-large |
| diarize | bool | false | Enable speaker diarization (ECAPA-TDNN + agglomerative clustering) |
| num_speakers | int | auto | Exact number of speakers (skips auto-detection) |
| min_speakers | int | 1 | Minimum speakers for auto-detection |
| max_speakers | int | 10 | Maximum speakers for auto-detection |
| punctuation | bool | true | Restore punctuation (commas, periods, question marks) and capitalization |
| itn | bool | true | Inverse text normalization: 'twenty three' → '23', 'पाँच सौ' → '500' |
| use_lm | bool | true | Use KenLM n-gram language model for beam search decoding |
| lm_mode | string | balanced | LM mode: greedy, balanced, strict; controls the LM weight during decoding |
| beam_width | int | 100 | Beam width for KenLM beam search (higher = more accurate, slower) |
| lm_alpha | float | 0.1 | LM weight (alpha) for beam search scoring |
| lm_beta | float | 0.5 | Word insertion bonus (beta) for beam search |
| metadata_prob | bool | false | Include probability distributions for emotion, intent, age, gender |
| word_timestamps | bool | false | Include per-word start and end timestamps with confidence scores |
| speech_analysis | bool | false | Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace |
| speaker_embedding | bool | false | Return 192-dim speaker embedding vector |
| summarize | string | | AI analysis: true, sales_coaching, collections, or custom prompt text |
| hotwords | string | | Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI') |
| hotword_weight | float | 10.0 | Hotword boost weight in beam search |
| noise_reduce | bool | false | Apply noise reduction before transcription |
| gain_normalize | bool | false | Normalize audio gain before transcription |
| profanity_filter | bool | false | Mask profanity in transcript output |
| n_best | int | 1 | Number of alternative transcriptions to return |
| format | string | | Output format: srt, vtt, segments (for subtitles) |
| structured_intents | bool | false | Two-tier intent classification: speech_act + domain |
| intent_labels | string | | Filter intent predictions to specific labels |
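
Multipart form fields are plain strings, so a client written in Python has to convert booleans and lists before sending them. The following is a minimal sketch of that conversion. `to_form_fields` is a hypothetical helper, and the lowercase "true"/"false" spelling is taken from the curl examples above.

```python
def to_form_fields(**options):
    """Convert Python values to the string form used by multipart fields."""
    fields = {}
    for name, value in options.items():
        if value is None:
            continue  # omit the field so the server default applies
        if isinstance(value, bool):  # checked before other types: bool is an int subclass
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            fields[name] = ",".join(str(v) for v in value)  # e.g. hotwords
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields(
    diarize=True,
    num_speakers=2,
    hotwords=["LoanTap", "EMI"],
    lm_alpha=0.1,
    language=None,
)
print(fields)
# {'diarize': 'true', 'num_speakers': '2', 'hotwords': 'LoanTap,EMI', 'lm_alpha': '0.1'}
```

The resulting dict can be passed as the form data of any HTTP client alongside the `file` upload.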

Response Format

{
  "transcript": "Hello, good morning. How are you?",
  "duration": 5.2,
  "inference_time": "0.823s",
  "model": "whissle-large",
  "decoder": "beam_search_kenlm",

  // Present when diarize=true
  "num_speakers": 2,
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "confidence": 0.96,

      // Always present — top-1 prediction per category
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "eval": "EVAL_NONE",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },

      // Present when metadata_prob=true — full distributions
      "metadata_probs": {
        "emotion": {
          "EMOTION_NEUTRAL": 0.82,
          "EMOTION_HAPPY": 0.12,
          "EMOTION_SAD": 0.03,
          "EMOTION_ANGRY": 0.02,
          "EMOTION_FEAR": 0.01
        },
        "age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
      },

      // Present when word_timestamps=true
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
        {"word": "good",  "start": 1.4, "end": 1.6, "confidence": 0.95},
        {"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
      ],

      "entities": [
        {"type": "PERSON", "value": "Diana"},
        {"type": "ORG", "value": "American Amicable"}
      ]
    }
  ],

  // Present when speech_analysis=true
  "speech_analysis": {
    "lexical_fluency": 0.92,
    "vocabulary_range": 0.85,
    "grammar_score": 0.88
  },

  // Present when summarize=... is set
  "summary": "...",           // text or markdown
  "analysis": { ... }         // structured JSON (sales_coaching, collections)
}
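
As an illustration of how a client might consume this response, the sketch below groups segment text by speaker and picks the most likely label for a category. Both function names are hypothetical. The code treats `metadata` and `metadata_probs` as optional so it also works when those fields are absent.

```python
from collections import defaultdict


def text_by_speaker(response):
    """Group segment text by speaker label, in order of first appearance."""
    grouped = defaultdict(list)
    for seg in response.get("segments", []):
        grouped[seg.get("speaker", "UNKNOWN")].append(seg["text"])
    return {spk: " ".join(parts) for spk, parts in grouped.items()}


def top_label(segment, category):
    """Return (label, probability) from metadata_probs when present,
    otherwise the top-1 metadata label with probability None."""
    probs = segment.get("metadata_probs", {}).get(category)
    if probs:
        label = max(probs, key=probs.get)
        return label, probs[label]
    return segment.get("metadata", {}).get(category), None


# Trimmed-down response in the documented shape (values are illustrative)
response = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": "Hello, good morning.",
         "metadata": {"emotion": "EMOTION_NEUTRAL"},
         "metadata_probs": {"emotion": {"EMOTION_NEUTRAL": 0.82, "EMOTION_HAPPY": 0.12}}},
        {"speaker": "SPEAKER_01", "text": "Hi, how are you?",
         "metadata": {"emotion": "EMOTION_HAPPY"}},
        {"speaker": "SPEAKER_00", "text": "I'm doing well."},
    ]
}

print(text_by_speaker(response))
print(top_label(response["segments"][0], "emotion"))
```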

Other Batch Endpoints

| Endpoint | Description |
|---|---|
| POST /transcribe | Full transcription with all metadata and analysis options |
| POST /transcribe/clean | Returns only clean transcript text; fastest, no metadata |
| POST /transcribe/raw | Returns raw model output including inline action tokens |
| POST /transcribe/pcm | Transcribe raw PCM s16le audio bytes (no file header needed) |
| POST /transcribe/long | Long audio with VAD segmentation; returns segments array |
| POST /transcribe/batch | Batch multiple files in one request |
| GET /models | List loaded ASR models and their metadata |
| GET /intents | List available intent labels for the loaded model |
| GET /languages | List supported languages and their KenLM availability |
| GET /status | Server status: loaded models, memory, uptime |

ASR — WebSocket Streaming

Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.

WebSocket ws://localhost:8001/stream

# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize

Python example

import asyncio, json, wave, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))

        # 2. Stream audio chunks (100 ms = 1600 frames = 3200 bytes at 16kHz s16le mono).
        #    The wave module skips the WAV header so only raw PCM is sent;
        #    the input file is expected to be 16 kHz, 16-bit, mono.
        with wave.open(audio_path, "rb") as wf:
            while chunk := wf.readframes(1600):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace to real-time: one 100 ms chunk per 100 ms

        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))

        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("type") == "end":
                break
            if data.get("is_final"):
                emotion = data.get("metadata", {}).get("emotion")
                print(f"FINAL: {data['text']}")
                print(f"  emotion={emotion}")
            elif data.get("type") == "transcript":
                print(f"  interim: {data['text']}")

asyncio.run(stream("call.wav"))
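
The 3200-byte chunk size in the example follows from the audio format: sample rate × bytes per sample × channels × chunk duration. A small sketch of that arithmetic, using a hypothetical helper name, in case a different sample rate or chunk length is needed:

```python
def pcm_chunk_bytes(sample_rate=16000, chunk_ms=100, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM needed to hold chunk_ms milliseconds of audio."""
    frames = sample_rate * chunk_ms // 1000
    return frames * bytes_per_sample * channels


print(pcm_chunk_bytes())                  # 3200 (16 kHz, 16-bit, mono, 100 ms)
print(pcm_chunk_bytes(chunk_ms=20))       # 640
print(pcm_chunk_bytes(sample_rate=8000))  # 1600
```

To pace a stream at real time, sleep for the chunk duration (here 0.1 s) after sending each chunk.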

Streaming Config

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| language | string | auto | Language hint |
| model | string | default | ASR model to use |
| interim_results | bool | true | Send interim (partial) transcripts |
| channel | string | microphone | Audio channel name (for multi-channel tagging) |

Streaming Response

// Interim transcript (partial, updates as more audio arrives)
{
  "type": "transcript",
  "channel": "microphone",
  "text": "Hello how are",
  "is_final": false,
  "utterance_end": false
}

// Final transcript (complete utterance with metadata)
{
  "type": "transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "utterance_end": true,
  "process_ms": 510,
  "metadata": {
    "emotion": "EMOTION_NEUTRAL",
    "behavior": "BEHAVIOR_DIRECT",
    "role": "ROLE_INTERVIEWER",
    "age": "AGE_30_45",
    "gender": "GENDER_MALE"
  }
}

// Session end
{ "type": "end" }
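
Because interim messages are superseded by later ones, a client that wants the finished transcript typically keeps only messages with `is_final` set. The sketch below shows one way to do that over a list of raw JSON messages. `collect_final_text` is a hypothetical helper, not part of the gateway.

```python
import json


def collect_final_text(raw_messages):
    """Join the text of final transcripts, stopping at the session-end message."""
    finals = []
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg.get("type") == "end":
            break
        if msg.get("type") == "transcript" and msg.get("is_final"):
            finals.append(msg["text"])
    return " ".join(finals)


messages = [
    '{"type": "transcript", "text": "Hello how are", "is_final": false}',
    '{"type": "transcript", "text": "Hello, how are you?", "is_final": true}',
    '{"type": "transcript", "text": "I am", "is_final": false}',
    '{"type": "transcript", "text": "I am fine.", "is_final": true}',
    '{"type": "end"}',
]
print(collect_final_text(messages))  # Hello, how are you? I am fine.
```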

Speaker Diarization

Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.

# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
    -F "file=@meeting.wav" \
    -F "diarize=true" \
    -F "num_speakers=3"

# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."

# Each segment has its own metadata (emotion, age, gender per speaker)

Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
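
The last pipeline step, merging adjacent same-speaker segments, can be illustrated in a few lines. This is a sketch of the general idea only; the gateway's actual merge criteria are not documented here, and the `max_gap` threshold is an assumption made for the example.

```python
def merge_adjacent(segments, max_gap=0.5):
    """Merge consecutive segments from the same speaker.

    max_gap (seconds) is a hypothetical threshold: segments further apart
    than this stay separate even when the speaker is the same.
    """
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["text"] += " " + seg["text"]
            prev["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input list is not modified
    return merged


segments = [
    {"speaker": "SPEAKER_00", "text": "Hello,", "start": 0.0, "end": 0.4},
    {"speaker": "SPEAKER_00", "text": "good morning.", "start": 0.5, "end": 1.2},
    {"speaker": "SPEAKER_01", "text": "Hi.", "start": 1.5, "end": 1.8},
    {"speaker": "SPEAKER_00", "text": "Let's start.", "start": 2.0, "end": 2.6},
]
for seg in merge_adjacent(segments):
    print(seg["speaker"], seg["start"], seg["end"], seg["text"])
```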

Metadata

Every ASR segment includes metadata extracted in a single forward pass, with no separate models or API calls. The available tags depend on which ASR model is loaded.

Common tags (all models)

| Tag | Values | Description |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | Speaker emotion per utterance |
| age | AGE_CHILD, AGE_YOUNG, AGE_ADULT, AGE_SENIOR | Estimated speaker age range |
| gender | GENDER_MALE, GENDER_FEMALE | Speaker gender |

en-in-tech-misc (en-full, en-lite variants)

| Tag | Classes | Values |
|---|---|---|
| behavior | 26 types | BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN |
| eval | 8 types | EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP |
| role | 3 types | ROLE_INTERVIEWER, ROLE_INTERVIEWEE |

hinglish-loans (hinglish variant)

| Tag | Classes | Values |
|---|---|---|
| intent | 13 types | INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER |
| role | 3 types | ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER |

zh (multi-zh, all variants)

| Tag | Classes |
|---|---|
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS |

whissle-large (multi-full, multi-zh, all variants)

| Tag | Classes |
|---|---|
| intent | 31 groups: inline action tokens extracted from the vocabulary |

AI Analysis

Add summarize=<mode> to any batch transcription request. The diarized transcript and its per-segment metadata are sent to the configured LLM (Claude or Gemini) for analysis.

Preset: sales_coaching

Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.

curl -F "file=@call.mp3" -F "diarize=true" \
    -F "summarize=sales_coaching" http://localhost:8001/transcribe

# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)

Preset: collections

Compliance scoring for debt collection calls: identity verification, reason stated, amount mentioned, no harassment. Also reports the call outcome and the next action.

-F "summarize=collections"

# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action

Preset: true (general)

Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.

-F "summarize=true"

Custom prompt

Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.

-F "summarize=You are a medical call quality analyst. \
    Score this call on: 1) empathy (0-10), \
    2) medical history completeness, 3) HIPAA compliance. \
    Return JSON with scores and recommendations."

# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.

TTS — Text-to-Speech

Kokoro 82M: non-autoregressive, single forward pass, 55 voices, 10 languages. Time to first byte (TTFB) is under 200 ms on CPU.

WebSocket ws://localhost:8003/stream

# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak:  {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete

Python example

import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))

        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg           # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break

        with wave.open("out.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))
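
Since the stream is raw PCM s16le at 24 kHz mono, the duration of the received audio can be computed directly from the byte count. A short sketch, with a hypothetical helper name:

```python
def pcm_duration_seconds(num_bytes, sample_rate=24000, sample_width=2, channels=1):
    """Duration of raw PCM audio given its size in bytes."""
    return num_bytes / (sample_rate * sample_width * channels)


print(pcm_duration_seconds(48000))   # 1.0
print(pcm_duration_seconds(120000))  # 2.5
```

In the example above, `pcm_duration_seconds(len(audio))` gives the length of the synthesized clip before it is written to out.wav.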

TTS Config

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| voice | string | af_heart | Voice ID (see /voices endpoint) |
| language | string | en | Language for phonemization |
| speed | float | 1.0 | Speech speed (0.5 – 2.0) |
| temperature | float | 0.6 | Sampling temperature (0.0 – 2.0) |
| top_k | int | 50 | Top-k sampling (0 – 200) |
| exaggeration | float | 0.0 | Voice exaggeration (prosody variation) |

Speak message fields:

| Field | Type | Description |
|---|---|---|
| type | string | Must be 'speak' |
| text | string | Text to synthesize |
| instruct | string | Optional style instruction (e.g. 'speak slowly and calmly') |
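
The documented ranges for speed, temperature, and top_k can be checked on the client before a message is sent. The sketch below builds the config and speak messages as JSON strings. The helper names are hypothetical, and how the server itself handles out-of-range values is not specified here.

```python
import json

# Ranges as listed in the TTS config fields
RANGES = {"speed": (0.5, 2.0), "temperature": (0.0, 2.0), "top_k": (0, 200)}


def tts_config(voice="af_heart", **params):
    """Build a config message, rejecting values outside the documented ranges."""
    for name, value in params.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}..{high}")
    return json.dumps({"type": "config", "voice": voice, **params})


def tts_speak(text, instruct=None):
    """Build a speak message; instruct is included only when given."""
    msg = {"type": "speak", "text": text}
    if instruct:
        msg["instruct"] = instruct
    return json.dumps(msg)


print(tts_config(speed=1.2, language="en"))
print(tts_speak("Hello!", instruct="speak slowly and calmly"))
```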

Voices

Query available voices via REST:

curl http://localhost:8003/voices

# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}

55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern: {accent}{gender}_{name}.
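
The voice ID pattern can be split mechanically. In the sketch below, the accent letters come from the prefixes listed above; reading `f` and `m` as female and male is an assumption based on those prefixes, and unknown letters are returned unchanged. `parse_voice_id` is a hypothetical helper.

```python
ACCENTS = {"a": "American", "b": "British"}  # from the af_/am_/bf_/bm_ prefixes
GENDERS = {"f": "female", "m": "male"}       # assumed meaning of the second letter


def parse_voice_id(voice_id):
    """Split a voice ID of the form {accent}{gender}_{name}."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    accent, gender = prefix
    return {
        "accent": ACCENTS.get(accent, accent),
        "gender": GENDERS.get(gender, gender),
        "name": name,
    }


print(parse_voice_id("af_heart"))
# {'accent': 'American', 'gender': 'female', 'name': 'heart'}
```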

Agent

Intelligent agent powered by Claude or Gemini. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.

Process (single-turn)

POST http://localhost:8765/process

curl -X POST http://localhost:8765/process \
    -H "Content-Type: application/json" \
    -H "X-Device-Id: user-1" \
    -d '{
      "transcript": "What meetings do I have today?",
      "user_id": "user-1",
      "language": "en"
    }'

# Response: mode, processed_text, is_command, etc.

Chat (multi-turn streaming)

POST http://localhost:8765/voice-agent/chat/stream

curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "Summarize the key takeaways."}
      ],
      "user_id": "user-1"
    }'

# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
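
On the client side, the SSE stream can be consumed by reading `data:` lines and concatenating the `text_chunk` contents until `done` arrives. A minimal sketch over an iterable of lines, with a hypothetical function name. It handles only the two event types shown above and ignores everything else.

```python
import json


def read_sse_text(lines):
    """Concatenate text_chunk contents from SSE data lines until 'done'."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        event = json.loads(line[len("data:"):])
        if event.get("type") == "text_chunk":
            parts.append(event.get("content", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)


stream_lines = [
    'data: {"type": "text_chunk", "content": "The key "}',
    "",
    'data: {"type": "text_chunk", "content": "takeaways are clear."}',
    "",
    'data: {"type": "done"}',
]
print(read_sse_text(stream_lines))  # The key takeaways are clear.
```

With a real connection, the same function can be fed the decoded lines of the HTTP response body as they arrive.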

Configuration

All configuration is done through environment variables passed with docker run -e.

| Variable | Default | Description |
|---|---|---|
| VARIANT | en-full | Model variant: hinglish, en-lite, en-full, multi-full, multi-zh, all |
| ASR_DEVICE | auto | Inference device: auto, cpu, cuda |
| LLM_PROVIDER | claude | LLM for agent + summarization: claude or gemini |
| ANTHROPIC_API_KEY | | Anthropic Claude API key |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Claude model ID |
| GEMINI_API_KEY | | Google Gemini API key |
| STORAGE_MODE | sqlite | Data storage: sqlite (local) or firestore (cloud) |
| TTS_ENGINE | kokoro | TTS engine (kokoro is default and baked in) |
| AUTH_MODE | local | Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote) |
| AUTH_DB_PATH | /data/auth/tokens.db | Path to SQLite token database |
| GATEWAY_INTERNAL_SECRET | auto | Shared secret for internal service-to-service auth |
| HIPAA_MODE | false | Enable HIPAA mode: 1h session TTL, encrypted DB |
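
When the container is launched from a script, the `-e` flags can be generated from a dict. The sketch below produces only the flag list; the image name and any port or volume options are not given in this reference and are left out. `docker_env_flags` is a hypothetical helper.

```python
def docker_env_flags(config):
    """Turn a settings dict into a flat list of `-e NAME=value` arguments."""
    flags = []
    for name, value in config.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # matches the true/false defaults listed
        flags += ["-e", f"{name}={value}"]
    return flags


flags = docker_env_flags({
    "VARIANT": "en-full",
    "ASR_DEVICE": "cpu",
    "LLM_PROVIDER": "gemini",
    "HIPAA_MODE": True,
})
print(flags)
# ['-e', 'VARIANT=en-full', '-e', 'ASR_DEVICE=cpu', '-e', 'LLM_PROVIDER=gemini', '-e', 'HIPAA_MODE=true']
```

The list can be spliced into a `subprocess.run(["docker", "run", *flags, ...])` call once the image name and remaining options are filled in.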

Need help?

See industry-specific solutions or contact us for custom integrations.