🔒 Service Notice: Cloud services temporarily down, reinforcing our on-prem AI. Contact: hello@whissle.ai

API Documentation

Complete reference for Whissle Gateway — ASR, TTS, voice calling, diarization, metadata, AI analysis, auth, and multi-tenancy.

Overview

Whissle Gateway runs seven services in a single Docker container, managed by supervisord:

| Service | Port | Protocol | Function |
|---|---|---|---|
| PostgreSQL | 5432 | TCP | Persistent database for agents, calls, users, organizations |
| ASR | 8001 | REST + WebSocket | Speech recognition, diarization, metadata |
| Video | 8002 | REST + WebSocket | Video intelligence: face emotion, gaze, gestures, scene |
| TTS | 8003 | WebSocket | Text-to-speech (Kokoro 82M, 55 voices) |
| Pipecat | 8000 | REST + WebSocket | Voice calling platform (WebRTC + Twilio) |
| Agent | 8765 | REST + SSE | LLM processing, summarization, coaching |
| Gateway | 9000 | REST + WebSocket | Unified proxy (requires API token) |

For development and testing, call the service ports directly (ASR 8001, Video 8002, TTS 8003, Pipecat 8000, Agent 8765). For production use, go through the gateway proxy at 9000, which adds authentication, rate limiting, and usage tracking. Voice calling endpoints are proxied at /bot/*.

API Authentication

The gateway includes a built-in API token system for authenticating ASR, TTS, and Agent requests. An admin token is auto-generated on first start. For user-level auth (voice calling, organizations), see User Auth.

Getting your admin token

# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"

# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt

Creating user tokens

# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
  -H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "my-app", "label": "My Application"}'

# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}

Using tokens

# REST — Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
  -H "Authorization: Bearer wh_YOUR_TOKEN" \
  -F "file=@call.mp3" -F "diarize=true"

# WebSocket — query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"

Token management

| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/tokens | Create a new API token (requires admin) |
| GET | /auth/tokens | List all active tokens (requires admin) |
| GET | /auth/tokens?user_id=X | List tokens for a specific user |
| DELETE | /auth/tokens/{id} | Revoke a token by ID (requires admin) |
| GET | /auth/usage?user_id=X&days=7 | Usage logs for a user |
| GET | /auth/usage/summary | Aggregated usage stats |

All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.
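The token-creation call above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names (build_create_token_request, create_token) are illustrative and not part of the gateway, and the response fields are assumed to match the example response shown earlier ("success", "token").

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:9000"  # gateway port from the overview table


def build_create_token_request(admin_token: str, user_id: str, label: str) -> urllib.request.Request:
    """Build (but do not send) the POST /auth/tokens request."""
    body = json.dumps({"user_id": user_id, "label": label}).encode("utf-8")
    return urllib.request.Request(
        f"{GATEWAY_URL}/auth/tokens",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


def create_token(admin_token: str, user_id: str, label: str) -> str:
    """Send the request and return the new token string.

    Assumes the response shape shown in the curl example above.
    """
    req = build_create_token_request(admin_token, user_id, label)
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    if not payload.get("success"):
        raise RuntimeError(f"token creation failed: {payload}")
    return payload["token"]
```

Calling create_token("wh_YOUR_ADMIN_TOKEN", "my-app", "My Application") against a running gateway should return a string starting with wh_, if the response matches the documented example.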

ASR — Batch Transcription

Upload an audio file and receive a complete transcription with metadata.

Basic usage

curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3"

# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
    -F "file=@call.mp3" \
    -F "diarize=true" \
    -F "num_speakers=2" \
    -F "punctuation=true" \
    -F "itn=true" \
    -F "metadata_prob=true" \
    -F "word_timestamps=true" \
    -F "speech_analysis=true" \
    -F "summarize=sales_coaching" \
    -o result.json

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file: MP3, WAV, FLAC, OGG, M4A, WebM |
| language | string | auto | Language hint: en, hi, zh, ja, ko, es, fr, de, etc. |
| model | string | default | ASR model: en-tiny, en-in-tech-misc, hinglish-loans, zh, whissle-large |
| diarize | bool | false | Enable speaker diarization (ECAPA-TDNN + agglomerative clustering) |
| num_speakers | int | auto | Exact number of speakers (skips auto-detection) |
| min_speakers | int | 1 | Minimum speakers for auto-detection |
| max_speakers | int | 10 | Maximum speakers for auto-detection |
| punctuation | bool | true | Restore punctuation (commas, periods, question marks) and capitalization |
| itn | bool | true | Inverse text normalization: 'twenty three' → '23', 'पाँच सौ' → '500' |
| use_lm | bool | true | Use KenLM n-gram language model for beam search decoding |
| lm_mode | string | balanced | LM mode: greedy, balanced, or strict; controls the LM weight during decoding |
| beam_width | int | 100 | Beam width for KenLM beam search (higher = better, slower) |
| lm_alpha | float | 0.1 | LM weight (alpha) for beam search scoring |
| lm_beta | float | 0.5 | Word insertion bonus (beta) for beam search |
| metadata_prob | bool | false | Include probability distributions for emotion, intent, age, gender |
| word_timestamps | bool | false | Include per-word start and end timestamps with confidence scores |
| speech_analysis | bool | false | Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace |
| speaker_embedding | bool | false | Return 192-dim speaker embedding vector |
| summarize | string | | AI analysis: true, sales_coaching, collections, or custom prompt text |
| hotwords | string | | Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI') |
| hotword_weight | float | 10.0 | Hotword boost weight in beam search |
| noise_reduce | bool | false | Apply noise reduction before transcription |
| gain_normalize | bool | false | Normalize audio gain before transcription |
| profanity_filter | bool | false | Mask profanity in transcript output |
| n_best | int | 1 | Number of alternative transcriptions to return |
| format | string | | Output format: srt, vtt, segments (for subtitles) |
| structured_intents | bool | false | Two-tier intent classification: speech_act + domain |
| intent_labels | string | | Filter intent predictions to specific labels |
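When calling /transcribe from code rather than curl, the parameters above have to be serialized as multipart form strings. The helper below is a small sketch of that conversion; to_form_fields is a hypothetical name, and the assumption that booleans are sent as lowercase "true"/"false" is taken from the curl examples, not from a formal spec.

```python
def to_form_fields(**options) -> dict[str, str]:
    """Convert Python values into form-field strings for POST /transcribe."""
    fields: dict[str, str] = {}
    for name, value in options.items():
        if value is None:
            continue  # omit unset options so the server defaults apply
        if isinstance(value, bool):
            # Assumed encoding, mirroring -F "diarize=true" in the curl examples
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # e.g. hotwords are documented as a comma-separated string
            fields[name] = ",".join(str(v) for v in value)
        else:
            fields[name] = str(value)
    return fields


fields = to_form_fields(diarize=True, num_speakers=2, hotwords=["LoanTap", "EMI"], language=None)
print(fields)
```

The resulting dict can be passed as the form data argument of whichever HTTP client is in use, alongside the audio file part.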

Response Format

{
  "transcript": "Hello, good morning. How are you?",
  "duration": 5.2,
  "inference_time": "0.823s",
  "model": "whissle-large",
  "decoder": "beam_search_kenlm",

  // Present when diarize=true
  "num_speakers": 2,
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "confidence": 0.96,

      // Always present — top-1 prediction per category
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "eval": "EVAL_NONE",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },

      // Present when metadata_prob=true — full distributions
      "metadata_probs": {
        "emotion": {
          "EMOTION_NEUTRAL": 0.82,
          "EMOTION_HAPPY": 0.12,
          "EMOTION_SAD": 0.03,
          "EMOTION_ANGRY": 0.02,
          "EMOTION_FEAR": 0.01
        },
        "age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
      },

      // Present when word_timestamps=true
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
        {"word": "good",  "start": 1.4, "end": 1.6, "confidence": 0.95},
        {"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
      ],

      "entities": [
        {"type": "PERSON", "value": "Diana"},
        {"type": "ORG", "value": "American Amicable"}
      ]
    }
  ],

  // Present when speech_analysis=true
  "speech_analysis": {
    "lexical_fluency": 0.92,
    "vocabulary_range": 0.85,
    "grammar_score": 0.88
  },

  // Present when summarize=... is set
  "summary": "...",           // text or markdown
  "analysis": { ... }         // structured JSON (sales_coaching, collections)
}
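As an illustration of consuming this response, the sketch below totals talk time per speaker from the segments array and picks the most likely label from a metadata_probs distribution. The function names are illustrative; the field names (segments, speaker, start, end) follow the example response above.

```python
from collections import defaultdict


def talk_time_by_speaker(response: dict) -> dict[str, float]:
    """Sum (end - start) per speaker over the diarized segments."""
    totals: dict[str, float] = defaultdict(float)
    for seg in response.get("segments", []):  # absent when diarize is off
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(seconds, 3) for speaker, seconds in totals.items()}


def top_label(probs: dict[str, float]) -> str:
    """Return the label with the highest probability in one distribution."""
    return max(probs, key=probs.get)


sample = {
    "segments": [
        {"speaker": "SPEAKER_00", "start": 1.0, "end": 1.9},
        {"speaker": "SPEAKER_01", "start": 2.0, "end": 3.5},
        {"speaker": "SPEAKER_00", "start": 4.0, "end": 5.0},
    ]
}
print(talk_time_by_speaker(sample))
print(top_label({"EMOTION_NEUTRAL": 0.82, "EMOTION_HAPPY": 0.12}))
```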

Other Batch Endpoints

| Endpoint | Description |
|---|---|
| POST /transcribe | Full transcription with all metadata and analysis options |
| POST /transcribe/clean | Returns only the clean transcript text; fastest, no metadata |
| POST /transcribe/raw | Returns raw model output including inline action tokens |
| POST /transcribe/pcm | Transcribe raw PCM s16le audio bytes (no file header needed) |
| POST /transcribe/long | Long audio with VAD segmentation; returns a segments array |
| POST /transcribe/batch | Batch multiple files in one request |
| GET /models | List loaded ASR models and their metadata |
| GET /intents | List available intent labels for the loaded model |
| GET /languages | List supported languages and their KenLM availability |
| GET /status | Server status: loaded models, memory, uptime |

ASR — WebSocket Streaming

Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.

WebSocket ws://localhost:8001/stream

# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize
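Step 3 requires fixed-size binary chunks, and the chunk size follows directly from the audio format. The sketch below shows the arithmetic for 16-bit mono PCM and a generator that slices a buffer into chunks; both function names are illustrative.

```python
def chunk_size_bytes(sample_rate: int = 16000, chunk_ms: int = 100,
                     sample_width: int = 2, channels: int = 1) -> int:
    """Bytes in one chunk: frames per chunk x bytes per sample x channels."""
    frames = sample_rate * chunk_ms // 1000
    return frames * sample_width * channels


def iter_chunks(pcm: bytes, size: int):
    """Yield consecutive slices of `size` bytes; the last one may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


size = chunk_size_bytes()  # 16 kHz, 100 ms, s16le mono
print(size)
print([len(c) for c in iter_chunks(b"\x00" * 7000, size)])
```

To pace a file at real time, sleep for chunk_ms milliseconds between sends.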

Python example

import asyncio, json, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))

        # 2. Stream audio chunks (100 ms = 3200 bytes at 16 kHz s16le).
        #    The file is read as-is, so for a .wav the header bytes are sent too.
        with open(audio_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # one 100 ms chunk per 100 ms = real time

        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))

        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(f"FINAL: {data['text']}")
                # metadata may be absent (e.g. en-tiny has no classifier heads)
                print(f"  emotion={data.get('metadata', {}).get('emotion')}")
            elif data["type"] == "transcript":
                print(f"  interim: {data['text']}")
            elif data["type"] == "end":
                break

asyncio.run(stream("call.wav"))

Streaming Config

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| language | string | auto | Language hint |
| model | string | default | ASR model to use |
| interim_results | bool | true | Send interim (partial) transcripts |
| channel | string | microphone | Audio channel name (for multi-channel tagging) |

Streaming Response

// Interim transcript (partial, updates as more audio arrives)
{
  "type": "transcript",
  "channel": "microphone",
  "text": "Hello how are",
  "is_final": false,
  "utterance_end": false
}

// Final transcript (complete utterance with metadata)
{
  "type": "transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "utterance_end": true,
  "process_ms": 510,
  "metadata": {
    "emotion": "EMOTION_NEUTRAL",
    "behavior": "BEHAVIOR_DIRECT",
    "role": "ROLE_INTERVIEWER",
    "age": "AGE_30_45",
    "gender": "GENDER_MALE"
  }
}

// Session end
{ "type": "end" }
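A client usually only keeps the final transcripts and discards interim updates. The sketch below filters a sequence of received messages accordingly; collect_final_transcripts is an illustrative helper, and it assumes the message shapes shown above (type, is_final, text, optional metadata).

```python
import json


def collect_final_transcripts(messages) -> list[dict]:
    """Keep final utterances, skip interim ones, stop at the session end."""
    finals = []
    for raw in messages:
        if isinstance(raw, bytes):
            continue  # not expected from the ASR stream; ignored defensively
        msg = json.loads(raw)
        if msg.get("type") == "end":
            break
        if msg.get("type") == "transcript" and msg.get("is_final"):
            finals.append({
                "text": msg["text"],
                # metadata can be missing for models without classifier heads
                "emotion": msg.get("metadata", {}).get("emotion"),
            })
    return finals


received = [
    '{"type": "transcript", "text": "Hello how are", "is_final": false}',
    '{"type": "transcript", "text": "Hello, how are you?", "is_final": true,'
    ' "metadata": {"emotion": "EMOTION_NEUTRAL"}}',
    '{"type": "end"}',
]
print(collect_final_transcripts(received))
```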

Speaker Diarization

Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.

# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
    -F "file=@meeting.wav" \
    -F "diarize=true" \
    -F "num_speakers=3"

# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."

# Each segment has its own metadata (emotion, age, gender per speaker)

Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
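The last pipeline step, merging adjacent same-speaker segments, can be sketched as follows. This is not the gateway's implementation: merge_adjacent and the max_gap threshold are assumptions chosen for illustration, and only the speaker, start, end, and text fields from the response format are used.

```python
def merge_adjacent(segments: list[dict], max_gap: float = 0.5) -> list[dict]:
    """Merge consecutive segments from the same speaker.

    max_gap (seconds) is a hypothetical threshold: segments further apart
    than this stay separate even if the speaker is the same.
    """
    merged: list[dict] = []
    for seg in sorted(segments, key=lambda s: s["start"]):
        prev = merged[-1] if merged else None
        if (prev is not None
                and prev["speaker"] == seg["speaker"]
                and seg["start"] - prev["end"] <= max_gap):
            prev["end"] = max(prev["end"], seg["end"])
            prev["text"] = f'{prev["text"]} {seg["text"]}'
        else:
            merged.append(dict(seg))  # copy so the input is not mutated
    return merged


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0, "text": "Hello,"},
    {"speaker": "SPEAKER_00", "start": 1.2, "end": 2.0, "text": "good morning."},
    {"speaker": "SPEAKER_01", "start": 2.1, "end": 3.0, "text": "Hi, how are you?"},
    {"speaker": "SPEAKER_00", "start": 3.1, "end": 4.0, "text": "I'm doing well."},
]
for seg in merge_adjacent(segments):
    print(seg["speaker"], seg["start"], seg["end"], seg["text"])
```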

Metadata

Every ASR segment includes metadata extracted in a single forward pass — no separate models or API calls. Tags available depend on which ASR model is loaded.

en-tiny (en-tiny variant)

Pure CTC model with no metadata classifier heads. Returns transcript only — no emotion, age, gender, or behavior tags. Fastest inference at 50x realtime on CPU. Punctuation and ITN are applied via shared post-processing models.

Common tags (all models except en-tiny)

| Tag | Values | Description |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | Speaker emotion per utterance |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | Estimated speaker age range |
| gender | GENDER_MALE, GENDER_FEMALE | Speaker gender |

en-in-tech-misc (en-full, en-lite variants)

| Tag | Classes | Values |
|---|---|---|
| behavior | 26 types (24 listed) | BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN |
| eval | 8 types (7 listed) | EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP |
| role | 3 types (2 listed) | ROLE_INTERVIEWER, ROLE_INTERVIEWEE |

hinglish-loans (hinglish variant)

| Tag | Classes | Values |
|---|---|---|
| intent | 13 types | INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER |
| role | 3 types | ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER |

zh (multi-zh, all variants)

| Tag | Classes |
|---|---|
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS |

whissle-large (multi-full, multi-zh, all variants)

| Tag | Classes |
|---|---|
| intent | 31 groups (inline action tokens extracted from the vocabulary) |

AI Analysis

Add summarize=mode to any batch transcription. The diarized transcript + per-segment metadata is sent to your configured LLM for analysis.

Preset: sales_coaching

Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.

curl -F "file=@call.mp3" -F "diarize=true" \
    -F "summarize=sales_coaching" http://localhost:8001/transcribe

# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)

Preset: collections

Compliance scoring for debt collection calls: identity verification, reason stated, amount mentioned, no harassment. The analysis also reports the call outcome and the next action.

-F "summarize=collections"

# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action

Preset: true (general)

Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.

-F "summarize=true"

Custom prompt

Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.

-F "summarize=You are a medical call quality analyst. \
    Score this call on: 1) empathy (0-10), \
    2) medical history completeness, 3) HIPAA compliance. \
    Return JSON with scores and recommendations."

# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.

Video Intelligence

Real-time visual context for voice calls. Dual-lane architecture: a fast lane (MediaPipe, ~40ms) for face emotion, gaze, gestures, and a semantic lane (any vision LLM, ~1.5s throttled) for scene understanding. Works with cloud APIs or self-hosted models. Visual context is injected into the LLM prompt automatically — zero blocking on the audio pipeline.

Visual Context (per-session)

Push video frames during a live session and retrieve the computed visual context. The Pipecat voice pipeline does this automatically when video=1 is passed to the offer endpoint.

| Method | Endpoint | Description |
|---|---|---|
| POST | /video/context/{session_id} | Push a base64-encoded JPEG frame for analysis |
| GET | /video/context/{session_id} | Get current visual context (fast + semantic lanes) |
| DELETE | /video/context/{session_id} | Clear session context on disconnect |
| GET | /video/health | Health check; reports fast/semantic lane status and LLM backend |

# Push a frame
curl -X POST http://localhost:8002/context/session-123 \
    -F "image=<base64-encoded-jpeg>"

# Get current visual context
curl http://localhost:8002/context/session-123

# Response:
{
  "fast": {
    "faces": [{
      "emotion": "happy",
      "gaze": "center",
      "head_pose": {"pitch": -2.1, "yaw": 5.3, "roll": 0.8},
      "speaking": true
    }],
    "hands": [{"gesture": "open_palm", "confidence": 0.92}]
  },
  "semantic": {
    "scene": "modern office with whiteboard",
    "activity": "presenting to camera",
    "entities": ["laptop", "whiteboard", "coffee mug"]
  },
  "prompt_block": "[VISUAL CONTEXT]\nFace: happy, looking at camera..."
}

Streaming

Two WebSocket endpoints for real-time video processing:

| Endpoint | Input | Output |
|---|---|---|
| WS /video/ws | JSON frames with base64 image | Visual context updates (fast + semantic) |
| WS /video/ws/av | Interleaved binary PCM audio + JSON video frames | ASR transcripts enriched with visual context |

# Video-only streaming
wscat -c "ws://localhost:8002/ws?session_id=s1"
# Send: {"type":"frame","image":"<b64 jpeg>"}
# Receive: {"type":"fast","faces":[...],"hands":[...]}
# Receive: {"type":"semantic","scene":"...","activity":"..."}

# Fused audio-visual streaming (single WebSocket)
# Binary messages → forwarded to ASR as PCM audio
# JSON messages  → processed through vision lanes
# ASR transcripts are enriched with latest visual context
# before forwarding back to the client

The fused /ws/av endpoint runs audio and video on fully independent lanes with zero cross-blocking. Audio is proxied to the ASR streaming WebSocket internally, while video frames are processed through the fast and semantic vision pipelines.
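On the client side, the fused endpoint means two kinds of outgoing messages on one socket: binary audio and JSON video frames. The sketch below builds the frame message in the shape shown above and mirrors the binary-versus-JSON routing rule; frame_message and lane_for are illustrative names, not part of the service.

```python
import base64
import json


def frame_message(jpeg_bytes: bytes) -> str:
    """Wrap raw JPEG bytes in the {"type": "frame", "image": <b64>} message."""
    return json.dumps({
        "type": "frame",
        "image": base64.b64encode(jpeg_bytes).decode("ascii"),
    })


def lane_for(message) -> str:
    """Mirror the routing rule: binary goes to audio, JSON text goes to video."""
    return "audio" if isinstance(message, (bytes, bytearray)) else "video"


fake_jpeg = b"\xff\xd8\xff\xe0 not a real image \xff\xd9"  # placeholder bytes
msg = frame_message(fake_jpeg)
print(lane_for(msg), lane_for(b"\x00\x01"))
```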

Batch Analysis

Upload a video file for full analysis. Audio is extracted via FFmpeg and transcribed; video frames are sampled at configurable FPS for visual analysis.

curl -X POST http://localhost:8002/analyze \
    -F "file=@interview.mp4" \
    -F "fast_fps=5" \
    -F "semantic_fps=0.5" \
    -F "asr=true"

# Returns timeline of visual context + ASR transcript
# with fused audio-visual results per segment

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Video file (MP4, WebM, MOV, AVI) |
| fast_fps | float | 5 | Frames per second for fast lane (MediaPipe) |
| semantic_fps | float | 0.5 | Frames per second for semantic lane (LLM Vision) |
| asr | bool | true | Also transcribe the audio track |

TTS — Text-to-Speech

Kokoro 82M — non-autoregressive, single forward pass, 55 voices, 10 languages. Sub-200ms TTFB on CPU.

WebSocket ws://localhost:8003/stream

# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak:  {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete

Python example

import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))

        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg           # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break

        with wave.open("out.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(2)
            wf.setframerate(24000)
            wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))

TTS Config

| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| voice | string | af_heart | Voice ID (see /voices endpoint) |
| language | string | en | Language for phonemization |
| speed | float | 1.0 | Speech speed (0.5 to 2.0) |
| temperature | float | 0.6 | Sampling temperature (0.0 to 2.0) |
| top_k | int | 50 | Top-k sampling (0 to 200) |
| exaggeration | float | 0.0 | Voice exaggeration (prosody variation) |

Speak message fields:

| Field | Type | Description |
|---|---|---|
| type | string | Must be 'speak' |
| text | string | Text to synthesize |
| instruct | string | Optional style instruction (e.g. 'speak slowly and calmly') |

Voices

Query available voices via REST:

curl http://localhost:8003/voices

# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}

55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern: {accent}{gender}_{name}.
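The naming pattern can be split mechanically. The sketch below is a guess at how to decode it: it assumes the first letter is the accent (a = American, b = British, the only two named above) and the second is the gender (f = female, m = male), which is inferred from the prefixes rather than stated by the service. Unknown codes are passed through unchanged.

```python
ACCENTS = {"a": "American", "b": "British"}  # only the accents named in the text
GENDERS = {"f": "female", "m": "male"}       # assumed meaning of the second letter


def parse_voice_id(voice_id: str) -> dict[str, str]:
    """Split '{accent}{gender}_{name}' into its three parts."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    accent_code, gender_code = prefix
    return {
        "accent": ACCENTS.get(accent_code, accent_code),
        "gender": GENDERS.get(gender_code, gender_code),
        "name": name,
    }


print(parse_voice_id("af_heart"))
print(parse_voice_id("am_adam"))
```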

Voice Calling

Built-in voice calling platform powered by Pipecat. Browser voice chat via WebRTC, phone calls via Twilio Media Streams. The full pipeline runs ASR → LLM → TTS inside the container with real-time emotion and intent metadata.

All voice calling endpoints are served by the Pipecat service on port 8000 and proxied through the gateway at /bot/api/*. Authentication uses HTTP-only JWT cookies (see User Auth).

Voice Agents

Create and manage voice agent personas. Each agent has a system prompt, greeting message, voice, and LLM model configuration.

| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/agents | List all agents for the current organization |
| POST | /bot/api/agents | Create a new voice agent |
| GET | /bot/api/agents/{id} | Get agent by ID |
| PATCH | /bot/api/agents/{id} | Update agent settings |
| DELETE | /bot/api/agents/{id} | Delete an agent |

# Create a voice agent
curl -X POST http://localhost:9000/bot/api/agents \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "name": "Sales Rep",
      "system_prompt": "You are a helpful sales assistant for Acme Corp.",
      "greeting": "Hi there! How can I help you today?",
      "voice": "tara",
      "voice_gender": "female",
      "llm_model": "gemini-2.5-flash",
      "video_enabled": true
    }'

# Response:
# {"id": "uuid", "name": "Sales Rep", "voice": "tara", ...}
# video_enabled: when true, camera is enabled by default for this agent.
# Can also be overridden per-session with ?video=1 on the offer endpoint.

WebRTC Signaling

Start a browser voice session by sending an SDP offer. The gateway returns an SDP answer and establishes a WebRTC connection to the Pipecat voice pipeline.

| Method | Endpoint | Description |
|---|---|---|
| POST | /bot/api/offer | Send SDP offer to start a voice session |
| PATCH | /bot/api/offer | Send ICE trickle candidates |

# Start a WebRTC voice session
curl -X POST http://localhost:9000/bot/api/offer \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "sdp": "<browser SDP offer string>",
      "type": "offer",
      "agent_id": "agent-uuid",
      "customer_id": "customer-uuid"
    }'

# Response: {"sdp": "<SDP answer>", "type": "answer", "pc_id": "..."}

# Start with camera enabled (video intelligence)
curl -X POST "http://localhost:9000/bot/api/offer?video=1" \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{"sdp": "...", "type": "offer", "agent_id": "..."}'

# Start with 3D avatar
curl -X POST "http://localhost:9000/bot/api/offer?avatar=simli&avatar_face_id=b9e5fba3-..." \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{"sdp": "...", "type": "offer", "agent_id": "..."}'

# Send ICE candidates as they arrive
curl -X PATCH http://localhost:9000/bot/api/offer \
    -H "Content-Type: application/json" \
    -d '{"pc_id": "...", "candidates": [{"candidate": "..."}]}'

Query params: video=1 enables camera capture and visual context injection (face emotion, gaze, gestures, scene into LLM). avatar=simli enables a lip-synced 3D avatar. Both can be combined. For Twilio phone calls, the gateway accepts Twilio Media Streams over WebSocket at /ws/twilio.

Customers

Manage customer records linked to voice calls. Customers are scoped to the current organization.

| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/customers | List all customers |
| POST | /bot/api/customers | Create a customer record |
| GET | /bot/api/customers/{id} | Get customer by ID |
| PATCH | /bot/api/customers/{id} | Update customer details |
| DELETE | /bot/api/customers/{id} | Delete a customer |

curl -X POST http://localhost:9000/bot/api/customers \
    -H "Content-Type: application/json" \
    -b "access_token=..." \
    -d '{
      "name": "Jane Smith",
      "phone_number": "+14155551234",
      "email": "jane@example.com"
    }'

Call History

Query call records including transcripts, metadata, duration, and recordings.

| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/calls | List calls (filterable by agent, customer, date) |
| POST | /bot/api/calls/start | Start an outbound Twilio call |
| POST | /bot/api/calls/twilio-status | Twilio status callback webhook |
| POST | /bot/api/calls/twilio-amd | Twilio answering machine detection callback |

Call records include the full transcript (with per-segment metadata), call duration, recording path, and AI-generated summary. Data is stored in PostgreSQL and persists in the /data volume.

Voice Pipeline

The Pipecat voice pipeline processes audio through a chain of intelligent processors:

transport.input (WebRTC / Twilio audio + video)
    → VideoContext       Captures video frames → POST to :8002 (~2.5 fps)
    → WhissleSTT         ASR via local :8001, returns text + metadata
    → SmartEndpointer    Adapts VAD from intent head (open question = longer wait)
    → EmotionAdaptive    Injects emotion/intent + visual context into LLM
    → Backchannel        Plays "mm-hmm" cues during user speech
    → UserAggregator     Combines multi-utterance user turns
    → LLM               Generates response with speech + visual context
    → ClosingPhrase      Detects goodbye → auto-hangup
    → NumberPronouncer   "4500" → "forty-five hundred"
    → TTS (Kokoro)       Text-to-speech via local :8003
    → AudioHumanizer     Adds room tone + breath sounds
    → SilenceWatchdog    2-strike: prompt after silence, hangup after second
    → transport.output   WebRTC / Twilio audio out
    → AudioBuffer        Records both directions for call recording
    → AssistantAgg       Combines multi-chunk assistant turns

The pipeline runs at 24kHz for WebRTC and resamples from 8kHz µ-law for Twilio. All ASR metadata (emotion, intent, behavior) and visual context (face emotion, gaze, gestures, scene) are available in real-time and injected into the LLM context as developer messages. Video processing is fire-and-forget — it never blocks the audio path.
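For readers unfamiliar with the Twilio audio format, the sketch below decodes 8 kHz µ-law bytes into 16-bit linear PCM using the standard G.711 expansion. It illustrates the format only and is not the pipeline's own code; the subsequent resampling from 8 kHz to the pipeline rate is a separate step and is not shown.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84  # 0x84 = bias
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> bytes:
    """Decode a mu-law payload to little-endian s16 PCM at the same rate."""
    samples = [ulaw_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


pcm = decode_ulaw(bytes([0xFF, 0x00, 0x80]))
print(struct.unpack("<3h", pcm))
```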

Agent

Intelligent agent powered by any LLM — cloud APIs or self-hosted. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.

Process (single-turn)

POST http://localhost:8765/process

curl -X POST http://localhost:8765/process \
    -H "Content-Type: application/json" \
    -H "X-Device-Id: user-1" \
    -d '{
      "transcript": "What meetings do I have today?",
      "user_id": "user-1",
      "language": "en"
    }'

# Response: mode, processed_text, is_command, etc.

Chat (multi-turn streaming)

POST http://localhost:8765/voice-agent/chat/stream

curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [
        {"role": "user", "content": "Summarize the key takeaways."}
      ],
      "user_id": "user-1"
    }'

# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
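Reading the stream from code amounts to parsing each "data:" line as JSON and concatenating the text chunks until the done event. The sketch below works on any iterable of lines (for example, the lines of a streaming HTTP response); read_sse_text is an illustrative name, and the event shapes are the two shown above.

```python
import json


def read_sse_text(lines) -> str:
    """Concatenate text_chunk contents until a done event arrives."""
    parts = []
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") == "done":
            break
        if event.get("type") == "text_chunk":
            parts.append(event.get("content", ""))
    return "".join(parts)


stream = [
    b'data: {"type": "text_chunk", "content": "The key "}\n',
    b"\n",
    b'data: {"type": "text_chunk", "content": "takeaways..."}\n',
    b'data: {"type": "done"}\n',
]
print(read_sse_text(stream))
```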

User Authentication

The voice calling platform includes a full user authentication system with email/password signup, JWT sessions, guest access, Google OAuth, and password reset flows. Auth state is stored in PostgreSQL and JWT cookies are HTTP-only with automatic refresh token rotation.

This is separate from the API token system used for ASR/TTS/Agent requests. User auth is used by the voice calling frontend and protects all /bot/api/* endpoints.

Auth Endpoints

| Method | Endpoint | Description |
|---|---|---|
| POST | /bot/api/auth/signup | Register with email + password |
| POST | /bot/api/auth/login | Log in; sets access_token + refresh_token cookies |
| POST | /bot/api/auth/logout | Clear session cookies and revoke refresh token |
| POST | /bot/api/auth/refresh | Rotate access + refresh tokens (cookie-based) |
| GET | /bot/api/auth/me | Get current user profile + active organization |
| POST | /bot/api/auth/guest | Create a guest session (no email required) |
| POST | /bot/api/auth/forgot-password | Send password reset email |
| POST | /bot/api/auth/reset-password | Reset password with token from email |
| POST | /bot/api/auth/verify-email | Verify email with token from email |
| POST | /bot/api/auth/resend-verification | Resend email verification |
| GET | /bot/api/auth/config | Auth configuration (Google OAuth enabled, etc.) |
| GET | /bot/api/auth/google/start | Initiate Google OAuth flow |
| GET | /bot/api/auth/google/callback | Google OAuth callback handler |

# Sign up
curl -X POST http://localhost:9000/bot/api/auth/signup \
    -H "Content-Type: application/json" \
    -d '{"email": "user@example.com", "password": "securepass123"}'

# Login — response sets HTTP-only cookies
curl -X POST http://localhost:9000/bot/api/auth/login \
    -H "Content-Type: application/json" \
    -c cookies.txt \
    -d '{"email": "user@example.com", "password": "securepass123"}'

# Use cookies for authenticated requests
curl http://localhost:9000/bot/api/auth/me -b cookies.txt

# Guest access (no credentials needed)
curl -X POST http://localhost:9000/bot/api/auth/guest -c cookies.txt

Organizations

Multi-tenant support via organizations. Each organization has its own agents, customers, and call history. Users can belong to multiple organizations with different roles (owner, admin, member).

| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/orgs | List organizations the current user belongs to |
| POST | /bot/api/orgs | Create a new organization |
| GET | /bot/api/orgs/active | Get the currently active organization |
| POST | /bot/api/orgs/switch | Switch active organization |
| GET | /bot/api/orgs/{id} | Get organization details |
| PATCH | /bot/api/orgs/{id} | Update organization (name, slug) |
| DELETE | /bot/api/orgs/{id} | Delete organization (owner only) |
| GET | /bot/api/orgs/{id}/members | List organization members |
| PATCH | /bot/api/orgs/{id}/members/{user_id} | Change member role |

# Create an organization
curl -X POST http://localhost:9000/bot/api/orgs \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"name": "Acme Corp", "slug": "acme"}'

# Switch active organization
curl -X POST http://localhost:9000/bot/api/orgs/switch \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"organization_id": "org-uuid"}'

Roles: owner (full control + delete), admin (manage members + settings), member (use agents + view calls).

Invitations

Invite users to join an organization via email. Invitations include a secure token link that can be accepted from the frontend.

| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/orgs/{id}/invitations | List pending invitations for an organization |
| POST | /bot/api/orgs/{id}/invitations | Create an invitation (sends email) |
| DELETE | /bot/api/orgs/{id}/invitations/{inv_id} | Revoke a pending invitation |
| GET | /bot/api/invitations/preview?token=... | Preview invitation details (public) |
| POST | /bot/api/invitations/accept | Accept an invitation and join the organization |

# Invite a user
curl -X POST http://localhost:9000/bot/api/orgs/{org_id}/invitations \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"email": "colleague@example.com", "role": "member"}'

# Accept invitation (the invited user)
curl -X POST http://localhost:9000/bot/api/invitations/accept \
    -H "Content-Type: application/json" \
    -b cookies.txt \
    -d '{"token": "invitation-token-from-email"}'

Database Schema

PostgreSQL runs embedded inside the container on port 5432. Data persists in the /data/pgdata directory on the Docker volume. Schema migrations run automatically on first start.

Core tables

| Table | Key Columns | Description |
|---|---|---|
| agents | name, system_prompt, greeting, voice, llm_model, organization_id | Voice agent configurations |
| customers | name, phone_number, email, organization_id | Customer records linked to calls |
| calls | agent_id, customer_id, transcript (JSONB), metadata (JSONB), recording_path, duration, status | Call history with full transcripts and recordings |

Auth tables

| Table | Key Columns | Description |
|---|---|---|
| users | email (CITEXT), password_hash, auth_provider, device_id | User accounts |
| user_sessions | user_id, refresh_token_hash, device_info, replaced_by | JWT sessions with token rotation chain |
| oauth_accounts | user_id, provider, provider_user_id | Google OAuth linked accounts |
| password_resets | user_id, token_hash, expires_at | Password reset tokens (1h expiry) |
| email_verifications | user_id, token_hash, expires_at | Email verification tokens |
| audit_logs | user_id, action, ip_address, user_agent | Security audit trail |

Multi-tenancy tables

| Table | Key Columns | Description |
|---|---|---|
| organizations | name, slug (CITEXT), owner_user_id | Organization accounts |
| organization_memberships | user_id, organization_id, role (owner/admin/member) | User-org membership with roles |
| organization_invitations | email, organization_id, role, token_hash, expires_at | Pending invitations |

Configuration

All configuration is done through environment variables passed with docker run -e.

| Variable | Default | Description |
|---|---|---|
| VARIANT | en-full | Model variant: en-tiny, hinglish, en-lite, en-full, multi-full, multi-zh, all |
| ASR_DEVICE | auto | Inference device: auto, cpu, cuda |
| LLM_PROVIDER | claude | LLM provider: claude, gemini, groq, or local (llama.cpp / vLLM / Ollama) |
| ANTHROPIC_API_KEY | | Anthropic Claude API key |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Claude model ID |
| GEMINI_API_KEY | | Google Gemini API key |
| STORAGE_MODE | sqlite | Data storage: sqlite (local) or firestore (cloud) |
| TTS_ENGINE | kokoro | TTS engine (kokoro is default and baked in) |
| AUTH_MODE | local | Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote) |
| AUTH_DB_PATH | /data/auth/tokens.db | Path to SQLite token database |
| GATEWAY_INTERNAL_SECRET | auto | Shared secret for internal service-to-service auth |
| HIPAA_MODE | false | Enable HIPAA mode: 1h session TTL, encrypted DB |
| VIDEO_CONTEXT_URL | http://127.0.0.1:8002 | Video intelligence service URL (internal) |
| VIDEO_LLM_BACKEND | auto | Vision LLM: anthropic, gemini, or custom (any vision endpoint). Auto-detects |
| VIDEO_LLM_MODEL | auto | Vision model ID (e.g. claude-haiku-4-5, gemini-2.0-flash) |
| FRONTEND_URL | http://localhost:3000 | Frontend origin for CORS (comma-separated for multiple) |
| TWILIO_ACCOUNT_SID | | Twilio account SID (for outbound phone calls) |
| TWILIO_AUTH_TOKEN | | Twilio auth token |
| TWILIO_PHONE_NUMBER | | Twilio phone number for outbound calls |
| GOOGLE_CLIENT_ID | | Google OAuth client ID (optional) |
| GOOGLE_CLIENT_SECRET | | Google OAuth client secret (optional) |
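As a sketch of how these variables reach the container, the helper below turns a settings dict into a docker run argument list and checks VARIANT against the documented values. The image name is a placeholder, the volume name is hypothetical, and publishing only port 9000 is an assumption; the container name whissle-gateway and the /data mount point come from earlier sections.

```python
import shlex

VARIANTS = {"en-tiny", "hinglish", "en-lite", "en-full", "multi-full", "multi-zh", "all"}


def docker_run_args(image: str, env: dict[str, str],
                    name: str = "whissle-gateway",
                    volume: str = "whissle-data:/data",  # hypothetical volume name
                    ports: tuple[int, ...] = (9000,)) -> list[str]:
    """Build the argument list for `docker run` with one -e flag per variable."""
    variant = env.get("VARIANT", "en-full")
    if variant not in VARIANTS:
        raise ValueError(f"unknown VARIANT: {variant}")
    args = ["docker", "run", "-d", "--name", name, "-v", volume]
    for port in ports:
        args += ["-p", f"{port}:{port}"]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


# "example/whissle-gateway:latest" is a placeholder, not a published image name.
cmd = docker_run_args("example/whissle-gateway:latest",
                      {"VARIANT": "en-full", "LLM_PROVIDER": "gemini"})
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run, which avoids shell quoting problems with values such as API keys.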

Need help?

See industry-specific solutions or contact us for custom integrations.