API Documentation

Complete reference for Whissle Gateway — ASR, TTS, voice calling, diarization, metadata, AI analysis, auth, and multi-tenancy.

Overview

Whissle Gateway runs seven services in a single Docker container, managed by supervisord:

Service	Port	Protocol	Function
PostgreSQL	5432	TCP	Persistent database for agents, calls, users, organizations
ASR	8001	REST + WebSocket	Speech recognition, diarization, metadata
Video	8002	REST + WebSocket	Video intelligence — face emotion, gaze, gestures, scene
TTS	8003	WebSocket	Text-to-speech (Kokoro 82M, 55 voices)
Pipecat	8000	REST + WebSocket	Voice calling platform (WebRTC + Twilio)
Agent	8765	REST + SSE	LLM processing, summarization, coaching
Gateway	9000	REST + WebSocket	Unified proxy (requires API token)

For development and testing, call the service ports directly (8000, 8001, 8002, 8003, 8765). The gateway proxy at 9000 adds authentication, rate limiting, and usage tracking for production use. Voice calling endpoints are proxied at /bot/*.

API Authentication

The gateway includes a built-in API token system for authenticating ASR, TTS, and Agent requests. An admin token is auto-generated on first start. For user-level auth (voice calling, organizations), see User Authentication.

Getting your admin token

# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"

# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt

Creating user tokens

# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
  -H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "my-app", "label": "My Application"}'

# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}

Using tokens

# REST — Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
  -H "Authorization: Bearer wh_YOUR_TOKEN" \
  -F "file=@call.mp3" -F "diarize=true"

# WebSocket — query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"
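The two conventions above can be wrapped in small Python helpers. This is a minimal sketch: the function names are illustrative, and the only details taken from the examples are the Authorization: Bearer header for REST and the token query parameter for WebSocket.

```python
from urllib.parse import urlencode

GATEWAY = "localhost:9000"  # gateway address used in the examples above


def rest_headers(token: str) -> dict:
    """Header for REST calls made through the gateway."""
    return {"Authorization": f"Bearer {token}"}


def ws_url(path: str, token: str, host: str = GATEWAY) -> str:
    """WebSocket URL with the token passed as a query parameter."""
    query = urlencode({"token": token})
    return f"ws://{host}{path}?{query}"
```

Passing the token through urlencode keeps the URL valid even if a token ever contains characters that need escaping.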

Token management

Method	Endpoint	Description
POST	/auth/tokens	Create a new API token (requires admin)
GET	/auth/tokens	List all active tokens (requires admin)
GET	/auth/tokens?user_id=X	List tokens for a specific user
DELETE	/auth/tokens/{id}	Revoke a token by ID (requires admin)
GET	/auth/usage?user_id=X&days=7	Usage logs for a user
GET	/auth/usage/summary	Aggregated usage stats

All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.
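The token endpoints in the table can be called from Python's standard library. The sketch below only builds the requests; the helper names are hypothetical, and the request bodies mirror the curl example above. Send any of them with urllib.request.urlopen.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:9000"  # gateway address from the examples above


def _admin_headers(admin_token):
    return {"Authorization": f"Bearer {admin_token}"}


def create_token_request(admin_token, user_id, label):
    """POST /auth/tokens with the same JSON body as the curl example."""
    body = json.dumps({"user_id": user_id, "label": label}).encode()
    headers = {**_admin_headers(admin_token), "Content-Type": "application/json"}
    return Request(f"{BASE}/auth/tokens", data=body, headers=headers, method="POST")


def revoke_token_request(admin_token, token_id):
    """DELETE /auth/tokens/{id}."""
    return Request(f"{BASE}/auth/tokens/{token_id}",
                   headers=_admin_headers(admin_token), method="DELETE")


def usage_request(admin_token, user_id, days=7):
    """GET /auth/usage?user_id=X&days=N."""
    query = urlencode({"user_id": user_id, "days": days})
    return Request(f"{BASE}/auth/usage?{query}",
                   headers=_admin_headers(admin_token), method="GET")
```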

ASR — Batch Transcription

Upload an audio file and receive a complete transcription with metadata.

Basic usage

curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3"

# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "itn=true" \
    -F "metadata_prob=true" \
    -F "word_timestamps=true" \
    -F "speech_analysis=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Parameters

Parameter	Type	Default	Description
file	file	required	Audio file — MP3, WAV, FLAC, OGG, M4A, WebM
language	string	auto	Language hint: en, hi, zh, ja, ko, es, fr, de, etc.
model	string	default	ASR model: en-tiny, en-in-tech-misc, hinglish-loans, zh, whissle-large
diarize	bool	false	Enable speaker diarization (ECAPA-TDNN + agglomerative clustering)
num_speakers	int	auto	Exact number of speakers (skips auto-detection)
min_speakers	int	1	Minimum speakers for auto-detection
max_speakers	int	10	Maximum speakers for auto-detection
punctuation	bool	true	Restore punctuation (commas, periods, question marks) and capitalization
itn	bool	true	Inverse text normalization — 'twenty three' → '23', 'पाँच सौ' → '500'
use_lm	bool	true	Use KenLM n-gram language model for beam search decoding
lm_mode	string	balanced	LM mode: greedy, balanced, strict — controls LM weight on decoding
beam_width	int	100	Beam width for KenLM beam search (higher = more accurate but slower)
lm_alpha	float	0.1	LM weight (alpha) for beam search scoring
lm_beta	float	0.5	Word insertion bonus (beta) for beam search
metadata_prob	bool	false	Include probability distributions for emotion, intent, age, gender
word_timestamps	bool	false	Include per-word start and end timestamps with confidence scores
speech_analysis	bool	false	Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace
speaker_embedding	bool	false	Return 192-dim speaker embedding vector
summarize	string	—	AI analysis: true, sales_coaching, collections, or custom prompt text
hotwords	string	—	Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI')
hotword_weight	float	10.0	Hotword boost weight in beam search
noise_reduce	bool	false	Apply noise reduction before transcription
gain_normalize	bool	false	Normalize audio gain before transcription
profanity_filter	bool	false	Mask profanity in transcript output
n_best	int	1	Number of alternative transcriptions to return
format	string	—	Output format: srt, vtt, segments (for subtitles)
structured_intents	bool	false	Two-tier intent classification: speech_act + domain
intent_labels	string	—	Filter intent predictions to specific labels
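When calling /transcribe from code, the parameters above have to be serialized as multipart form fields. A minimal sketch, assuming the server accepts the same string forms the curl examples use (true/false for booleans, comma-separated hotwords); the helper name and the speaker-range check are illustrative, not part of the gateway.

```python
def transcribe_fields(**options):
    """Convert Python values into form-field strings for POST /transcribe."""
    lo = options.get("min_speakers", 1)
    hi = options.get("max_speakers", 10)
    if lo > hi:
        raise ValueError("min_speakers must not exceed max_speakers")

    fields = {}
    for name, value in options.items():
        if value is None:
            continue                          # omit unset options entirely
        if isinstance(value, bool):           # check bool before other types
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            fields[name] = ",".join(value)    # e.g. hotwords
        else:
            fields[name] = str(value)
    return fields
```

With the third-party requests library, the result can be passed as data= next to files={"file": open("call.mp3", "rb")}.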

Response Format

{
  "transcript": "Hello, good morning. How are you?",
  "duration": 5.2,
  "inference_time": "0.823s",
  "model": "whissle-large",
  "decoder": "beam_search_kenlm",

  // Present when diarize=true
  "num_speakers": 2,
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "confidence": 0.96,

      // Always present — top-1 prediction per category
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "eval": "EVAL_NONE",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },

      // Present when metadata_prob=true — full distributions
      "metadata_probs": {
        "emotion": {
          "EMOTION_NEUTRAL": 0.82,
          "EMOTION_HAPPY": 0.12,
          "EMOTION_SAD": 0.03,
          "EMOTION_ANGRY": 0.02,
          "EMOTION_FEAR": 0.01
        },
        "age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
      },

      // Present when word_timestamps=true
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
        {"word": "good",  "start": 1.4, "end": 1.6, "confidence": 0.95},
        {"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
      ],

      "entities": [
        {"type": "PERSON", "value": "Diana"},
        {"type": "ORG", "value": "American Amicable"}
      ]
    }
  ],

  // Present when speech_analysis=true
  "speech_analysis": {
    "lexical_fluency": 0.92,
    "vocabulary_range": 0.85,
    "grammar_score": 0.88
  },

  // Present when summarize=... is set
  "summary": "...",           // text or markdown
  "analysis": { ... }         // structured JSON (sales_coaching, collections)
}
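Consuming the diarized response is mostly a matter of walking the segments array. The sketch below uses only fields shown in the response above (speaker, text, start, end); the function names are illustrative.

```python
from collections import defaultdict


def labelled_transcript(response):
    """One line per segment, prefixed with the speaker label."""
    segments = response.get("segments", [])
    return "\n".join(f"[{s['speaker']}] {s['text']}" for s in segments)


def speaker_talk_time(response):
    """Total seconds of speech per speaker, summed over segments."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

Both helpers return empty results when diarize was not requested and the segments key is absent.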

Other Batch Endpoints

Endpoint	Description
POST /transcribe	Full transcription with all metadata and analysis options
POST /transcribe/clean	Returns only clean transcript text — fastest, no metadata
POST /transcribe/raw	Returns raw model output including inline action tokens
POST /transcribe/pcm	Transcribe raw PCM s16le audio bytes (no file header needed)
POST /transcribe/long	Long audio with VAD segmentation — returns segments array
POST /transcribe/batch	Batch multiple files in one request
GET /models	List loaded ASR models and their metadata
GET /intents	List available intent labels for the loaded model
GET /languages	List supported languages and their KenLM availability
GET /status	Server status — loaded models, memory, uptime

ASR — WebSocket Streaming

Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.

WebSocket ws://localhost:8001/stream

# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize

Python example

import asyncio, json, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))

        # 2. Stream audio chunks (100ms = 3200 bytes at 16kHz s16le)
        with open(audio_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # each chunk holds 100 ms of audio, so this paces at real time

        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))

        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(f"FINAL: {data['text']}")
                print(f"  emotion={data.get('metadata', {}).get('emotion')}")  # metadata is absent for en-tiny
            elif data["type"] == "transcript":
                print(f"  interim: {data['text']}")
            elif data["type"] == "end":
                break

asyncio.run(stream("call.wav"))
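The chunk size in the example follows from the audio format: 16,000 samples per second, 2 bytes per sample, one channel. The arithmetic can be sketched as below; the function names are illustrative.

```python
def chunk_bytes(sample_rate=16000, chunk_ms=100, bytes_per_sample=2, channels=1):
    """Bytes of raw PCM needed to hold chunk_ms milliseconds of audio."""
    return sample_rate * bytes_per_sample * channels * chunk_ms // 1000


def pacing_delay(chunk_size, sample_rate=16000, bytes_per_sample=2, channels=1):
    """Seconds of audio in a chunk, i.e. the sleep that paces sending at real time."""
    return chunk_size / (sample_rate * bytes_per_sample * channels)
```

A 3200-byte chunk therefore carries 0.1 s of audio; sleeping for less than that between sends streams the file faster than real time.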

Streaming Config

Field	Type	Default	Description
type	string	required	Must be 'config'
sample_rate	int	16000	Audio sample rate in Hz
language	string	auto	Language hint
model	string	default	ASR model to use
interim_results	bool	true	Send interim (partial) transcripts
channel	string	microphone	Audio channel name (for multi-channel tagging)

Streaming Response

// Interim transcript (partial, updates as more audio arrives)
{
  "type": "transcript",
  "channel": "microphone",
  "text": "Hello how are",
  "is_final": false,
  "utterance_end": false
}

// Final transcript (complete utterance with metadata)
{
  "type": "transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "utterance_end": true,
  "process_ms": 510,
  "metadata": {
    "emotion": "EMOTION_NEUTRAL",
    "behavior": "BEHAVIOR_DIRECT",
    "role": "ROLE_INTERVIEWER",
    "age": "AGE_30_45",
    "gender": "GENDER_MALE"
  }
}

// Session end
{ "type": "end" }
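A client usually keeps only the final transcripts and stops at the end message. A minimal sketch of that filtering, using the message shapes shown above; collect_finals is a hypothetical helper, and metadata is read defensively because not every model returns it.

```python
import json


def collect_finals(messages):
    """Return (text, emotion) for each final transcript, stopping at 'end'."""
    finals = []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("type") == "end":
            break
        if msg.get("type") == "transcript" and msg.get("is_final"):
            emotion = msg.get("metadata", {}).get("emotion")
            finals.append((msg["text"], emotion))
    return finals
```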

Speaker Diarization

Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.

# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
    -F "file=@meeting.wav" \
    -F "diarize=true" \
    -F "num_speakers=3"

# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."

# Each segment has its own metadata (emotion, age, gender per speaker)

Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
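The last pipeline step, merging adjacent same-speaker segments, can be illustrated in a few lines. This is a sketch of the idea rather than the gateway's implementation; the max_gap threshold is an assumption introduced for the example.

```python
def merge_adjacent(segments, max_gap=0.5):
    """Join consecutive segments from the same speaker.

    max_gap (seconds) is a hypothetical limit on the silence allowed
    between two segments that are merged.
    """
    merged = []
    for seg in segments:
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["text"] += " " + seg["text"]
            prev["end"] = seg["end"]
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged
```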

Metadata

Every ASR segment includes metadata extracted in a single forward pass — no separate models or API calls. Tags available depend on which ASR model is loaded.

en-tiny (en-tiny variant)

Pure CTC model with no metadata classifier heads. Returns transcript only — no emotion, age, gender, or behavior tags. Fastest inference at 50x realtime on CPU. Punctuation and ITN are applied via shared post-processing models.

Common tags (all models except en-tiny)

Tag	Values	Description
emotion	EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE	Speaker emotion per utterance
age	AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+	Estimated speaker age range
gender	GENDER_MALE, GENDER_FEMALE	Speaker gender

en-in-tech-misc (en-full, en-lite variants)

Tag	Classes	Values
behavior	26 types	BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN
eval	8 types	EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP
role	3 types	ROLE_INTERVIEWER, ROLE_INTERVIEWEE

hinglish-loans (hinglish variant)

Tag	Classes	Values
intent	13 types	INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER
role	3 types	ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER

zh (multi-zh, all variants)

Tag	Classes
dialect	DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS

whissle-large (multi-full, multi-zh, all variants)

Tag	Classes
intent	31 groups — inline action tokens extracted from the vocabulary

AI Analysis

Add summarize=mode to any batch transcription request, where mode is a preset name or a custom prompt. The diarized transcript and per-segment metadata are sent to your configured LLM for analysis.

Preset: sales_coaching

Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.

curl -F "file=@call.mp3" -F "diarize=true" \
    -F "summarize=sales_coaching" http://localhost:8001/transcribe

# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)

Preset: collections

Compliance scoring for debt collection calls: identity verification, reason stated, amount mentioned, no harassment. Also reports the call outcome and next action.

-F "summarize=collections"

# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action

Preset: true (general)

Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.

-F "summarize=true"

Custom prompt

Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.

-F "summarize=You are a medical call quality analyst. \
    Score this call on: 1) empathy (0-10), \
    2) medical history completeness, 3) HIPAA compliance. \
    Return JSON with scores and recommendations."

# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.

Video Intelligence

Real-time visual context for voice calls. Dual-lane architecture: a fast lane (MediaPipe, ~40ms) for face emotion, gaze, gestures, and a semantic lane (any vision LLM, ~1.5s throttled) for scene understanding. Works with cloud APIs or self-hosted models. Visual context is injected into the LLM prompt automatically — zero blocking on the audio pipeline.

Visual Context (per-session)

Push video frames during a live session and retrieve the computed visual context. The Pipecat voice pipeline does this automatically when video=1 is passed to the offer endpoint.

Method	Endpoint	Description
POST	/video/context/{session_id}	Push a base64-encoded JPEG frame for analysis
GET	/video/context/{session_id}	Get current visual context (fast + semantic lanes)
DELETE	/video/context/{session_id}	Clear session context on disconnect
GET	/video/health	Health check — reports fast/semantic lane status and LLM backend

# Push a frame (direct to the Video service on port 8002, which omits the /video prefix shown in the table)
curl -X POST http://localhost:8002/context/session-123 \
    -F "image=<base64-encoded-jpeg>"

# Get current visual context
curl http://localhost:8002/context/session-123

# Response:
{
  "fast": {
    "faces": [{
      "emotion": "happy",
      "gaze": "center",
      "head_pose": {"pitch": -2.1, "yaw": 5.3, "roll": 0.8},
      "speaking": true
    }],
    "hands": [{"gesture": "open_palm", "confidence": 0.92}]
  },
  "semantic": {
    "scene": "modern office with whiteboard",
    "activity": "presenting to camera",
    "entities": ["laptop", "whiteboard", "coffee mug"]
  },
  "prompt_block": "[VISUAL CONTEXT]\nFace: happy, looking at camera..."
}
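The response can be flattened into a one-line description for logging or debugging. A minimal sketch that reads only fields present in the example response above; describe_context is a hypothetical helper and does not reproduce the format of prompt_block.

```python
def describe_context(ctx):
    """Summarize fast-lane and semantic-lane results in one line."""
    fast = ctx.get("fast", {})
    parts = []
    for face in fast.get("faces", []):
        parts.append(f"face: {face['emotion']}, gaze {face['gaze']}")
    for hand in fast.get("hands", []):
        parts.append(f"gesture: {hand['gesture']}")
    scene = ctx.get("semantic", {}).get("scene")
    if scene:
        parts.append(f"scene: {scene}")
    return "; ".join(parts)
```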

Streaming

Two WebSocket endpoints for real-time video processing:

Endpoint	Input	Output
WS /video/ws	JSON frames with base64 image	Visual context updates (fast + semantic)
WS /video/ws/av	Interleaved binary PCM audio + JSON video frames	ASR transcripts enriched with visual context

# Video-only streaming
wscat -c "ws://localhost:8002/ws?session_id=s1"
# Send: {"type":"frame","image":"<b64 jpeg>"}
# Receive: {"type":"fast","faces":[...],"hands":[...]}
# Receive: {"type":"semantic","scene":"...","activity":"..."}

# Fused audio-visual streaming (single WebSocket)
# Binary messages → forwarded to ASR as PCM audio
# JSON messages  → processed through vision lanes
# ASR transcripts are enriched with latest visual context
# before forwarding back to the client

The fused /ws/av endpoint runs audio and video on fully independent lanes with zero cross-blocking. Audio is proxied to the ASR streaming WebSocket internally, while video frames are processed through the fast and semantic vision pipelines.
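On the client side, the fused protocol comes down to sending two kinds of WebSocket messages on one connection. The sketch below builds the JSON frame message shown above and mirrors the routing rule (binary goes to audio, JSON goes to video); the function names are illustrative.

```python
import base64
import json


def frame_message(jpeg_bytes: bytes) -> str:
    """JSON text message carrying one base64-encoded JPEG frame."""
    image = base64.b64encode(jpeg_bytes).decode("ascii")
    return json.dumps({"type": "frame", "image": image})


def lane_for(message) -> str:
    """Binary messages are PCM audio; text messages are video frames."""
    return "audio" if isinstance(message, (bytes, bytearray)) else "video"
```

Because the message type alone decides the lane, audio chunks and video frames can be sent in any order without extra framing.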

Batch Analysis

Upload a video file for full analysis. Audio is extracted via FFmpeg and transcribed; video frames are sampled at configurable FPS for visual analysis.

curl -X POST http://localhost:8002/analyze \
    -F "file=@interview.mp4" \
    -F "fast_fps=5" \
    -F "semantic_fps=0.5" \
    -F "asr=true"

# Returns timeline of visual context + ASR transcript
# with fused audio-visual results per segment

Parameter	Type	Default	Description
file	file	required	Video file (MP4, WebM, MOV, AVI)
fast_fps	float	5	Frames per second for fast lane (MediaPipe)
semantic_fps	float	0.5	Frames per second for semantic lane (LLM Vision)
asr	bool	true	Also transcribe the audio track
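The two FPS settings determine how much work a batch job does. A rough estimate, assuming frames are sampled at a constant rate over the whole video (the exact count depends on how the service samples):

```python
import math


def sampled_frames(duration_s: float, fps: float) -> int:
    """Approximate number of frames analyzed for a video of duration_s seconds."""
    return math.floor(duration_s * fps)


def frame_interval(fps: float) -> float:
    """Seconds between consecutive sampled frames."""
    return 1.0 / fps
```

With the defaults, a 60-second clip yields about 300 fast-lane frames (one every 0.2 s) and about 30 semantic-lane frames (one every 2 s).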

TTS — Text-to-Speech

Kokoro 82M — non-autoregressive, single forward pass, 55 voices, 10 languages. Sub-200ms TTFB on CPU.

WebSocket ws://localhost:8003/stream

# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak:  {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete

Python example

import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))

        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg           # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break

        with wave.open("out.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))

TTS Config

Field	Type	Default	Description
type	string	required	Must be 'config'
voice	string	af_heart	Voice ID (see /voices endpoint)
language	string	en	Language for phonemization
speed	float	1.0	Speech speed (0.5 – 2.0)
temperature	float	0.6	Sampling temperature (0.0 – 2.0)
top_k	int	50	Top-k sampling (0 – 200)
exaggeration	float	0.0	Voice exaggeration (prosody variation)

Speak message fields:

Field	Type	Description
type	string	Must be 'speak'
text	string	Text to synthesize
instruct	string	Optional style instruction (e.g. 'speak slowly and calmly')
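Validating the config on the client avoids a round trip for out-of-range values. A minimal sketch using the defaults and ranges from the table above; the helper name is hypothetical, and how the server handles invalid values is not specified here.

```python
DEFAULTS = {"voice": "af_heart", "language": "en", "speed": 1.0,
            "temperature": 0.6, "top_k": 50, "exaggeration": 0.0}
RANGES = {"speed": (0.5, 2.0), "temperature": (0.0, 2.0), "top_k": (0, 200)}


def tts_config(**overrides):
    """Build a config message, rejecting unknown fields and out-of-range values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    cfg = {"type": "config", **DEFAULTS, **overrides}
    for field, (low, high) in RANGES.items():
        if not low <= cfg[field] <= high:
            raise ValueError(f"{field} must be between {low} and {high}")
    return cfg
```

The returned dict can be serialized with json.dumps and sent as the first message on the socket.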

Voices

Query available voices via REST:

curl http://localhost:8003/voices

# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}

55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern: {accent}{gender}_{name}.
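The ID pattern can be split mechanically. In this sketch the accent letters a and b come from the text above; reading f and m as female and male is an inference from IDs such as af_heart and am_adam, and any other letter is passed through unchanged.

```python
ACCENTS = {"a": "American", "b": "British"}
GENDERS = {"f": "female", "m": "male"}  # assumed meaning of the second letter


def parse_voice_id(voice_id: str) -> dict:
    """Split '{accent}{gender}_{name}' into its three parts."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    accent, gender = prefix
    return {"accent": ACCENTS.get(accent, accent),
            "gender": GENDERS.get(gender, gender),
            "name": name}
```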

Voice Calling

Built-in voice calling platform powered by Pipecat. Browser voice chat via WebRTC, phone calls via Twilio Media Streams. The full pipeline runs ASR → LLM → TTS inside the container with real-time emotion and intent metadata.

All voice calling endpoints are served by the Pipecat service on port 8000 and proxied through the gateway at /bot/api/*. Authentication uses HTTP-only JWT cookies (see User Authentication).

Voice Agents

Create and manage voice agent personas. Each agent has a system prompt, greeting message, voice, and LLM model configuration.

Method	Endpoint	Description
GET	/bot/api/agents	List all agents for the current organization
POST	/bot/api/agents	Create a new voice agent
GET	/bot/api/agents/{id}	Get agent by ID
PATCH	/bot/api/agents/{id}	Update agent settings
DELETE	/bot/api/agents/{id}	Delete an agent

# Create a voice agent
curl -X POST http://localhost:9000/bot/api/agents \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "name": "Sales Rep",
      "system_prompt": "You are a helpful sales assistant for Acme Corp.",
      "greeting": "Hi there! How can I help you today?",
      "voice": "tara",
      "voice_gender": "female",
      "llm_model": "gemini-2.5-flash",
      "video_enabled": true
    }'

# Response:
# {"id": "uuid", "name": "Sales Rep", "voice": "tara", ...}
# video_enabled: when true, camera is enabled by default for this agent.
# Can also be overridden per-session with ?video=1 on the offer endpoint.

WebRTC Signaling

Start a browser voice session by sending an SDP offer. The gateway returns an SDP answer and establishes a WebRTC connection to the Pipecat voice pipeline.

Method	Endpoint	Description
POST	/bot/api/offer	Send SDP offer to start a voice session
PATCH	/bot/api/offer	Send ICE trickle candidates

# Start a WebRTC voice session
curl -X POST http://localhost:9000/bot/api/offer \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "sdp": "<browser SDP offer string>",
      "type": "offer",
      "agent_id": "agent-uuid",
      "customer_id": "customer-uuid"
    }'

# Response: {"sdp": "<SDP answer>", "type": "answer", "pc_id": "..."}

# Start with camera enabled (video intelligence)
curl -X POST "http://localhost:9000/bot/api/offer?video=1" \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{"sdp": "...", "type": "offer", "agent_id": "..."}'

# Start with 3D avatar
curl -X POST "http://localhost:9000/bot/api/offer?avatar=simli&avatar_face_id=b9e5fba3-..." \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{"sdp": "...", "type": "offer", "agent_id": "..."}'

# Send ICE candidates as they arrive
curl -X PATCH http://localhost:9000/bot/api/offer \
    -H "Content-Type: application/json" \
    -d '{"pc_id": "...", "candidates": [{"candidate": "..."}]}'

Query params: video=1 enables camera capture and visual context injection (face emotion, gaze, gestures, scene into LLM). avatar=simli enables a lip-synced 3D avatar. Both can be combined. For Twilio phone calls, the gateway accepts Twilio Media Streams over WebSocket at /ws/twilio.
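Building the offer URL programmatically avoids quoting mistakes when the query parameters are combined. A minimal sketch using the three parameters named above; offer_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def offer_url(base="http://localhost:9000", video=False, avatar=None, avatar_face_id=None):
    """URL for POST /bot/api/offer with optional video and avatar parameters."""
    params = {}
    if video:
        params["video"] = "1"
    if avatar:
        params["avatar"] = avatar
    if avatar_face_id:
        params["avatar_face_id"] = avatar_face_id
    url = f"{base}/bot/api/offer"
    return f"{url}?{urlencode(params)}" if params else url
```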

Customers

Manage customer records linked to voice calls. Customers are scoped to the current organization.

Method	Endpoint	Description
GET	/bot/api/customers	List all customers
POST	/bot/api/customers	Create a customer record
GET	/bot/api/customers/{id}	Get customer by ID
PATCH	/bot/api/customers/{id}	Update customer details
DELETE	/bot/api/customers/{id}	Delete a customer

curl -X POST http://localhost:9000/bot/api/customers \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "name": "Jane Smith",
      "phone_number": "+14155551234",
      "email": "jane@example.com"
    }'

Call History

Query call records including transcripts, metadata, duration, and recordings.

Method	Endpoint	Description
GET	/bot/api/calls	List calls (filterable by agent, customer, date)
POST	/bot/api/calls/start	Start an outbound Twilio call
POST	/bot/api/calls/twilio-status	Twilio status callback webhook
POST	/bot/api/calls/twilio-amd	Twilio answering machine detection callback

Call records include the full transcript (with per-segment metadata), call duration, recording path, and AI-generated summary. Data is stored in PostgreSQL and persists in the /data volume.

Voice Pipeline

The Pipecat voice pipeline processes audio through a chain of intelligent processors:

transport.input (WebRTC / Twilio audio + video)
    → VideoContext       Captures video frames → POST to :8002 (~2.5 fps)
    → WhissleSTT         ASR via local :8001, returns text + metadata
    → SmartEndpointer    Adapts VAD from intent head (open question = longer wait)
    → EmotionAdaptive    Injects emotion/intent + visual context into LLM
    → Backchannel        Plays "mm-hmm" cues during user speech
    → UserAggregator     Combines multi-utterance user turns
    → LLM               Generates response with speech + visual context
    → ClosingPhrase      Detects goodbye → auto-hangup
    → NumberPronouncer   "4500" → "forty-five hundred"
    → TTS (Kokoro)       Text-to-speech via local :8003
    → AudioHumanizer     Adds room tone + breath sounds
    → SilenceWatchdog    2-strike: prompt after silence, hangup after second
    → transport.output   WebRTC / Twilio audio out
    → AudioBuffer        Records both directions for call recording
    → AssistantAgg       Combines multi-chunk assistant turns

The pipeline runs at 24kHz for WebRTC and resamples from 8kHz µ-law for Twilio. All ASR metadata (emotion, intent, behavior) and visual context (face emotion, gaze, gestures, scene) are available in real-time and injected into the LLM context as developer messages. Video processing is fire-and-forget — it never blocks the audio path.
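For readers unfamiliar with the Twilio audio format, the sketch below shows the standard G.711 µ-law expansion from one 8-bit code to a 16-bit linear sample. It illustrates the decoding step only; resampling is not shown, and this is not necessarily how the pipeline implements it.

```python
def ulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~code & 0xFF                  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> list:
    """Decode a mu-law payload into a list of linear samples."""
    return [ulaw_to_pcm16(b) for b in payload]
```

Each µ-law byte becomes one sample, so a 20 ms packet at 8 kHz (160 bytes) decodes to 160 samples.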

Agent

Intelligent agent powered by any LLM — cloud APIs or self-hosted. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.

Process (single-turn)

POST http://localhost:8765/process

curl -X POST http://localhost:8765/process \
    -H "Content-Type: application/json" \
    -H "X-Device-Id: user-1" \
    -d '{
      "transcript": "What meetings do I have today?",
      "user_id": "user-1",
      "language": "en"
    }'

# Response: mode, processed_text, is_command, etc.

Chat (multi-turn streaming)

POST http://localhost:8765/voice-agent/chat/stream

curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "Summarize the key takeaways."}
      ],
      "user_id": "user-1"
    }'

# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
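The SSE stream can be consumed line by line. A minimal sketch that handles only the two event types shown above (text_chunk and done); read_sse_text is a hypothetical helper that takes any iterable of decoded lines.

```python
import json


def read_sse_text(lines):
    """Concatenate text_chunk contents until a done event arrives."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alive lines and comments
        event = json.loads(line[len("data:"):])
        if event.get("type") == "text_chunk":
            parts.append(event["content"])
        elif event.get("type") == "done":
            break
    return "".join(parts)
```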

User Authentication

The voice calling platform includes a full user authentication system with email/password signup, JWT sessions, guest access, Google OAuth, and password reset flows. Auth state is stored in PostgreSQL and JWT cookies are HTTP-only with automatic refresh token rotation.

This is separate from the API token system used for ASR/TTS/Agent requests. User auth is used by the voice calling frontend and protects all /bot/api/* endpoints.

Auth Endpoints

Method	Endpoint	Description
POST	/bot/api/auth/signup	Register with email + password
POST	/bot/api/auth/login	Login — sets access_token + refresh_token cookies
POST	/bot/api/auth/logout	Clear session cookies and revoke refresh token
POST	/bot/api/auth/refresh	Rotate access + refresh tokens (cookie-based)
GET	/bot/api/auth/me	Get current user profile + active organization
POST	/bot/api/auth/guest	Create a guest session (no email required)
POST	/bot/api/auth/forgot-password	Send password reset email
POST	/bot/api/auth/reset-password	Reset password with token from email
POST	/bot/api/auth/verify-email	Verify email with token from email
POST	/bot/api/auth/resend-verification	Resend email verification
GET	/bot/api/auth/config	Auth configuration (Google OAuth enabled, etc.)
GET	/bot/api/auth/google/start	Initiate Google OAuth flow
GET	/bot/api/auth/google/callback	Google OAuth callback handler

# Sign up
curl -X POST http://localhost:9000/bot/api/auth/signup \
    -H "Content-Type: application/json" \
    -d '{"email": "user@example.com", "password": "securepass123"}'

# Login — response sets HTTP-only cookies
curl -X POST http://localhost:9000/bot/api/auth/login \
    -H "Content-Type: application/json" \
    -c cookies.txt \
    -d '{"email": "user@example.com", "password": "securepass123"}'

# Use cookies for authenticated requests
curl http://localhost:9000/bot/api/auth/me -b cookies.txt

# Guest access (no credentials needed)
curl -X POST http://localhost:9000/bot/api/auth/guest -c cookies.txt

Organizations

Multi-tenant support via organizations. Each organization has its own agents, customers, and call history. Users can belong to multiple organizations with different roles (owner, admin, member).

Method	Endpoint	Description
GET	/bot/api/orgs	List organizations the current user belongs to
POST	/bot/api/orgs	Create a new organization
GET	/bot/api/orgs/active	Get the currently active organization
POST	/bot/api/orgs/switch	Switch active organization
GET	/bot/api/orgs/{id}	Get organization details
PATCH	/bot/api/orgs/{id}	Update organization (name, slug)
DELETE	/bot/api/orgs/{id}	Delete organization (owner only)
GET	/bot/api/orgs/{id}/members	List organization members
PATCH	/bot/api/orgs/{id}/members/{user_id}	Change member role

# Create an organization
curl -X POST http://localhost:9000/bot/api/orgs \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"name": "Acme Corp", "slug": "acme"}'

# Switch active organization
curl -X POST http://localhost:9000/bot/api/orgs/switch \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"organization_id": "org-uuid"}'

Roles: owner (full control + delete), admin (manage members + settings), member (use agents + view calls).
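The role descriptions suggest a simple ranking. The sketch below assumes the roles are cumulative (an admin can do everything a member can, an owner everything an admin can), which the text implies but does not state; the action names are hypothetical labels for the capabilities listed above.

```python
ROLE_RANK = {"member": 0, "admin": 1, "owner": 2}

# Minimum role per action; action names are illustrative only.
REQUIRED_ROLE = {
    "use_agents": "member",
    "view_calls": "member",
    "manage_members": "admin",
    "update_settings": "admin",
    "delete_org": "owner",
}


def allowed(role: str, action: str) -> bool:
    """True if the role meets the minimum rank required for the action."""
    return ROLE_RANK[role] >= ROLE_RANK[REQUIRED_ROLE[action]]
```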

Invitations

Invite users to join an organization via email. Invitations include a secure token link that can be accepted from the frontend.

Method	Endpoint	Description
GET	/bot/api/orgs/{id}/invitations	List pending invitations for an organization
POST	/bot/api/orgs/{id}/invitations	Create an invitation (sends email)
DELETE	/bot/api/orgs/{id}/invitations/{inv_id}	Revoke a pending invitation
GET	/bot/api/invitations/preview?token=...	Preview invitation details (public)
POST	/bot/api/invitations/accept	Accept an invitation and join the organization

# Invite a user
curl -X POST http://localhost:9000/bot/api/orgs/{org_id}/invitations \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"email": "colleague@example.com", "role": "member"}'

# Accept invitation (the invited user)
curl -X POST http://localhost:9000/bot/api/invitations/accept \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"token": "invitation-token-from-email"}'

Database Schema

PostgreSQL runs embedded inside the container on port 5432. Data persists in the /data/pgdata directory on the Docker volume. Schema migrations run automatically on first start.

Core tables

Table	Key Columns	Description
agents	name, system_prompt, greeting, voice, llm_model, organization_id	Voice agent configurations
customers	name, phone_number, email, organization_id	Customer records linked to calls
calls	agent_id, customer_id, transcript (JSONB), metadata (JSONB), recording_path, duration, status	Call history with full transcripts and recordings

Auth tables

Table	Key Columns	Description
users	email (CITEXT), password_hash, auth_provider, device_id	User accounts
user_sessions	user_id, refresh_token_hash, device_info, replaced_by	JWT sessions with token rotation chain
oauth_accounts	user_id, provider, provider_user_id	Google OAuth linked accounts
password_resets	user_id, token_hash, expires_at	Password reset tokens (1h expiry)
email_verifications	user_id, token_hash, expires_at	Email verification tokens
audit_logs	user_id, action, ip_address, user_agent	Security audit trail
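The token_hash and expires_at columns reflect a common pattern: store only a hash of the token and check expiry at verification time. The sketch below illustrates that pattern under assumptions; SHA-256 and the helper names are not confirmed by this document, and only the 1-hour expiry for password resets comes from the table.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone


def _hash(token: str) -> str:
    return hashlib.sha256(token.encode()).hexdigest()  # assumed hash function


def issue_token(now=None, ttl=timedelta(hours=1)):
    """Return the raw token (sent to the user) and the row to store."""
    now = now or datetime.now(timezone.utc)
    token = secrets.token_urlsafe(32)
    return token, {"token_hash": _hash(token), "expires_at": now + ttl}


def verify_token(token: str, row: dict, now=None) -> bool:
    """Constant-time hash comparison plus expiry check."""
    now = now or datetime.now(timezone.utc)
    matches = secrets.compare_digest(_hash(token), row["token_hash"])
    return matches and now < row["expires_at"]
```

Storing the hash means a leaked database row cannot be replayed as a valid token.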

Multi-tenancy tables

Table	Key Columns	Description
organizations	name, slug (CITEXT), owner_user_id	Organization accounts
organization_memberships	user_id, organization_id, role (owner/admin/member)	User-org membership with roles
organization_invitations	email, organization_id, role, token_hash, expires_at	Pending invitations

Configuration

All configuration is done through environment variables passed with docker run -e.

Variable	Default	Description
VARIANT	en-full	Model variant: en-tiny, hinglish, en-lite, en-full, multi-full, multi-zh, all
ASR_DEVICE	auto	Inference device: auto, cpu, cuda
LLM_PROVIDER	claude	LLM provider: claude, gemini, groq, or local (llama.cpp / vLLM / Ollama)
ANTHROPIC_API_KEY	—	Anthropic Claude API key
ANTHROPIC_MODEL	claude-sonnet-4-6	Claude model ID
GEMINI_API_KEY	—	Google Gemini API key
STORAGE_MODE	sqlite	Data storage: sqlite (local) or firestore (cloud)
TTS_ENGINE	kokoro	TTS engine (kokoro is default and baked in)
AUTH_MODE	local	Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote)
AUTH_DB_PATH	/data/auth/tokens.db	Path to SQLite token database
GATEWAY_INTERNAL_SECRET	auto	Shared secret for internal service-to-service auth
HIPAA_MODE	false	Enable HIPAA mode: 1h session TTL, encrypted DB
VIDEO_CONTEXT_URL	http://127.0.0.1:8002	Video intelligence service URL (internal)
VIDEO_LLM_BACKEND	auto	Vision LLM: anthropic, gemini, or custom (any vision endpoint). Auto-detects
VIDEO_LLM_MODEL	auto	Vision model ID (e.g. claude-haiku-4-5, gemini-2.0-flash)
FRONTEND_URL	http://localhost:3000	Frontend origin for CORS (comma-separated for multiple)
TWILIO_ACCOUNT_SID	—	Twilio account SID (for outbound phone calls)
TWILIO_AUTH_TOKEN	—	Twilio auth token
TWILIO_PHONE_NUMBER	—	Twilio phone number for outbound calls
GOOGLE_CLIENT_ID	—	Google OAuth client ID (optional)
GOOGLE_CLIENT_SECRET	—	Google OAuth client secret (optional)
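A deployment script can check the environment before starting the container. The sketch below applies a few defaults from the table and flags a missing API key for the selected provider; the check itself is an illustration, not documented gateway behavior, and only the claude and gemini key variables listed above are covered.

```python
import os

DEFAULTS = {"VARIANT": "en-full", "ASR_DEVICE": "auto", "LLM_PROVIDER": "claude",
            "TTS_ENGINE": "kokoro", "AUTH_MODE": "local", "HIPAA_MODE": "false"}
PROVIDER_KEYS = {"claude": "ANTHROPIC_API_KEY", "gemini": "GEMINI_API_KEY"}


def load_config(env=os.environ):
    """Return (effective settings, list of missing required variables)."""
    cfg = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    key_var = PROVIDER_KEYS.get(cfg["LLM_PROVIDER"])
    missing = [key_var] if key_var and not env.get(key_var) else []
    return cfg, missing
```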

Need help?

See industry-specific solutions or contact us for custom integrations.
