API Documentation
Complete reference for Whissle Gateway — ASR, TTS, voice calling, diarization, metadata, AI analysis, auth, and multi-tenancy.
Overview
Whissle Gateway runs six services in a single Docker container, managed by supervisord:
| Service | Port | Protocol | Function |
|---|---|---|---|
| PostgreSQL | 5432 | TCP | Persistent database for agents, calls, users, organizations |
| ASR | 8001 | REST + WebSocket | Speech recognition, diarization, metadata |
| TTS | 8003 | WebSocket | Text-to-speech (Kokoro 82M, 55 voices) |
| Pipecat | 8000 | REST + WebSocket | Voice calling platform (WebRTC + Twilio) |
| Agent | 8765 | REST + SSE | LLM processing, summarization, coaching |
| Gateway | 9000 | REST + WebSocket | Unified proxy (requires API token) |
For development and testing, hit the service ports directly (8001, 8003, 8000, 8765). The gateway proxy at 9000 adds authentication, rate limiting, and usage tracking for production use. Voice calling endpoints are proxied at /bot/*.
API Authentication
The gateway includes a built-in API token system for authenticating ASR, TTS, and Agent requests. An admin token is auto-generated on first start. For user-level auth (voice calling, organizations), see User Authentication.
Getting your admin token
# Check Docker startup logs for the admin token:
docker logs whissle-gateway 2>&1 | grep "Token:"
# Or read it from the persistent data volume:
docker exec whissle-gateway cat /data/auth/admin_token.txt
Creating user tokens
# Create a token for an application or user
curl -X POST http://localhost:9000/auth/tokens \
-H "Authorization: Bearer wh_YOUR_ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"user_id": "my-app", "label": "My Application"}'
# Response:
# {"success": true, "token": "wh_a1b2c3...", "user_id": "my-app", ...}
Using tokens
# REST: Authorization header
curl -X POST http://localhost:9000/asr/transcribe \
-H "Authorization: Bearer wh_YOUR_TOKEN" \
-F "file=@call.mp3" -F "diarize=true"
# WebSocket: token as a query parameter
wscat -c "ws://localhost:9000/listen?token=wh_YOUR_TOKEN"
wscat -c "ws://localhost:9000/asr/stream?token=wh_YOUR_TOKEN"
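The same two conventions can be wrapped in small Python helpers. This is a minimal sketch, assuming the gateway is reachable at localhost:9000 as in the examples above; the function names `rest_headers` and `ws_url` are illustrative and not part of the gateway.

```python
from urllib.parse import urlencode, urlunsplit

GATEWAY = "localhost:9000"  # assumed host:port, taken from the examples above


def rest_headers(token: str) -> dict:
    """Authorization header for REST calls through the gateway."""
    return {"Authorization": f"Bearer {token}"}


def ws_url(path: str, token: str, host: str = GATEWAY) -> str:
    """WebSocket URL with the token passed as a query parameter."""
    return urlunsplit(("ws", host, path, urlencode({"token": token}), ""))


print(rest_headers("wh_YOUR_TOKEN"))
print(ws_url("/listen", "wh_YOUR_TOKEN"))
```

Using `urlencode` keeps the URL valid even if a token ever contains characters that need escaping.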
Token management
| Method | Endpoint | Description |
|---|---|---|
| POST | /auth/tokens | Create a new API token (requires admin) |
| GET | /auth/tokens | List all active tokens (requires admin) |
| GET | /auth/tokens?user_id=X | List tokens for a specific user |
| DELETE | /auth/tokens/{id} | Revoke a token by ID (requires admin) |
| GET | /auth/usage?user_id=X&days=7 | Usage logs for a user |
| GET | /auth/usage/summary | Aggregated usage stats |
All /auth/* endpoints require the admin token. Tokens and usage data persist in /data/auth/tokens.db and survive container restarts when using a Docker volume.
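For scripted token management, the endpoints in the table can be expressed as standard-library request objects. The sketch below only builds the requests and does not send them; the base URL and helper names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:9000"  # gateway address used in the examples


def create_token_request(admin_token: str, user_id: str, label: str) -> Request:
    """POST /auth/tokens with a JSON body, authenticated as admin."""
    body = json.dumps({"user_id": user_id, "label": label}).encode()
    return Request(
        f"{BASE}/auth/tokens",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


def usage_request(admin_token: str, user_id: str, days: int = 7) -> Request:
    """GET /auth/usage?user_id=...&days=... for one user."""
    query = urlencode({"user_id": user_id, "days": days})
    return Request(
        f"{BASE}/auth/usage?{query}",
        headers={"Authorization": f"Bearer {admin_token}"},
    )


req = create_token_request("wh_YOUR_ADMIN_TOKEN", "my-app", "My Application")
print(req.get_method(), req.full_url)
```

Against a running gateway, each request could be sent with `urllib.request.urlopen(req)`.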
ASR — Batch Transcription
Upload an audio file and receive a complete transcription with metadata.
Basic usage
curl -X POST http://localhost:8001/transcribe \
-F "file=@call.mp3"
# With diarization + metadata + AI analysis
curl -X POST http://localhost:8001/transcribe \
-F "file=@call.mp3" \
-F "diarize=true" \
-F "num_speakers=2" \
-F "punctuation=true" \
-F "itn=true" \
-F "metadata_prob=true" \
-F "word_timestamps=true" \
-F "speech_analysis=true" \
-F "summarize=sales_coaching" \
-o result.json
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | required | Audio file — MP3, WAV, FLAC, OGG, M4A, WebM |
| language | string | auto | Language hint: en, hi, zh, ja, ko, es, fr, de, etc. |
| model | string | default | ASR model: en-in-tech-misc, hinglish-loans, zh, whissle-large |
| diarize | bool | false | Enable speaker diarization (ECAPA-TDNN + agglomerative clustering) |
| num_speakers | int | auto | Exact number of speakers (skips auto-detection) |
| min_speakers | int | 1 | Minimum speakers for auto-detection |
| max_speakers | int | 10 | Maximum speakers for auto-detection |
| punctuation | bool | true | Restore punctuation (commas, periods, question marks) and capitalization |
| itn | bool | true | Inverse text normalization — 'twenty three' → '23', 'पाँच सौ' → '500' |
| use_lm | bool | true | Use KenLM n-gram language model for beam search decoding |
| lm_mode | string | balanced | LM mode: greedy, balanced, strict — controls LM weight on decoding |
| beam_width | int | 100 | Beam width for KenLM beam search (higher = better, slower) |
| lm_alpha | float | 0.1 | LM weight (alpha) for beam search scoring |
| lm_beta | float | 0.5 | Word insertion bonus (beta) for beam search |
| metadata_prob | bool | false | Include probability distributions for emotion, intent, age, gender |
| word_timestamps | bool | false | Include per-word start and end timestamps with confidence scores |
| speech_analysis | bool | false | Include speech pattern analysis: lexical fluency, vocabulary range, grammar, pace |
| speaker_embedding | bool | false | Return 192-dim speaker embedding vector |
| summarize | string | — | AI analysis: true, sales_coaching, collections, or custom prompt text |
| hotwords | string | — | Comma-separated hotwords for boosting (e.g. 'LoanTap,GoBoult,EMI') |
| hotword_weight | float | 10.0 | Hotword boost weight in beam search |
| noise_reduce | bool | false | Apply noise reduction before transcription |
| gain_normalize | bool | false | Normalize audio gain before transcription |
| profanity_filter | bool | false | Mask profanity in transcript output |
| n_best | int | 1 | Number of alternative transcriptions to return |
| format | string | — | Output format: srt, vtt, segments (for subtitles) |
| structured_intents | bool | false | Two-tier intent classification: speech_act + domain |
| intent_labels | string | — | Filter intent predictions to specific labels |
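When calling /transcribe from Python, the parameters above have to be sent as string form fields, as in the curl examples (`diarize=true`, `num_speakers=2`). The helper below is a sketch of that conversion; `form_fields` is a hypothetical name, and joining a list with commas follows the hotwords format described in the table.

```python
def form_fields(**params) -> dict:
    """Convert Python values to the string form fields used by /transcribe."""
    fields = {}
    for name, value in params.items():
        if value is None:
            continue  # omit unset parameters so the server default applies
        if isinstance(value, bool):
            fields[name] = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            fields[name] = ",".join(str(v) for v in value)  # e.g. hotwords
        else:
            fields[name] = str(value)
    return fields


fields = form_fields(diarize=True, num_speakers=2,
                     hotwords=["LoanTap", "EMI"], language=None)
print(fields)
```

The resulting dictionary can be passed as the form data of a multipart upload alongside the `file` field, using whichever HTTP client you prefer.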
Response Format
{
  "transcript": "Hello, good morning. How are you?",
  "duration": 5.2,
  "inference_time": "0.823s",
  "model": "whissle-large",
  "decoder": "beam_search_kenlm",
  // Present when diarize=true
  "num_speakers": 2,
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Hello, good morning.",
      "start": 1.0,
      "end": 1.9,
      "confidence": 0.96,
      // Always present: top-1 prediction per category
      "metadata": {
        "emotion": "EMOTION_NEUTRAL",
        "behavior": "BEHAVIOR_DIRECT",
        "role": "ROLE_INTERVIEWER",
        "eval": "EVAL_NONE",
        "age": "AGE_30_45",
        "gender": "GENDER_MALE"
      },
      // Present when metadata_prob=true: full distributions
      "metadata_probs": {
        "emotion": {
          "EMOTION_NEUTRAL": 0.82,
          "EMOTION_HAPPY": 0.12,
          "EMOTION_SAD": 0.03,
          "EMOTION_ANGRY": 0.02,
          "EMOTION_FEAR": 0.01
        },
        "age": { "AGE_30_45": 0.65, "AGE_18_30": 0.25, ... }
      },
      // Present when word_timestamps=true
      "words": [
        {"word": "Hello", "start": 1.0, "end": 1.3, "confidence": 0.98},
        {"word": "good", "start": 1.4, "end": 1.6, "confidence": 0.95},
        {"word": "morning", "start": 1.6, "end": 1.9, "confidence": 0.97}
      ],
      "entities": [
        {"type": "PERSON", "value": "Diana"},
        {"type": "ORG", "value": "American Amicable"}
      ]
    }
  ],
  // Present when speech_analysis=true
  "speech_analysis": {
    "lexical_fluency": 0.92,
    "vocabulary_range": 0.85,
    "grammar_score": 0.88
  },
  // Present when summarize=... is set
  "summary": "...", // text or markdown
  "analysis": { ... } // structured JSON (sales_coaching, collections)
}
Other Batch Endpoints
| Endpoint | Description |
|---|---|
| POST /transcribe | Full transcription with all metadata and analysis options |
| POST /transcribe/clean | Returns only clean transcript text — fastest, no metadata |
| POST /transcribe/raw | Returns raw model output including inline action tokens |
| POST /transcribe/pcm | Transcribe raw PCM s16le audio bytes (no file header needed) |
| POST /transcribe/long | Long audio with VAD segmentation — returns segments array |
| POST /transcribe/batch | Batch multiple files in one request |
| GET /models | List loaded ASR models and their metadata |
| GET /intents | List available intent labels for the loaded model |
| GET /languages | List supported languages and their KenLM availability |
| GET /status | Server status — loaded models, memory, uptime |
ASR — WebSocket Streaming
Real-time transcription over WebSocket. Send raw PCM audio chunks, receive interim and final transcripts with metadata as they arrive.
WebSocket ws://localhost:8001/stream
# Protocol:
# 1. Connect
# 2. Send JSON config
# 3. Send binary PCM chunks (16kHz, 16-bit, mono)
# 4. Receive JSON transcripts (interim + final)
# 5. Send {"type": "end"} to finalize
Python example
import asyncio, json, websockets

async def stream(audio_path):
    async with websockets.connect("ws://localhost:8001/stream") as ws:
        # 1. Config
        await ws.send(json.dumps({
            "type": "config",
            "sample_rate": 16000,
            "language": "en",
            "model": "en-in-tech-misc",  # optional
            "interim_results": True
        }))
        # 2. Stream audio chunks (100ms = 3200 bytes at 16kHz s16le).
        #    The file must hold raw PCM; a WAV header, if present, is sent as audio.
        with open(audio_path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
                await asyncio.sleep(0.1)  # pace to real-time: one 100ms chunk per 100ms
        # 3. Signal end
        await ws.send(json.dumps({"type": "end"}))
        # 4. Receive results
        async for msg in ws:
            data = json.loads(msg)
            if data.get("is_final"):
                print(f"FINAL: {data['text']}")
                print(f"  emotion={data['metadata']['emotion']}")
            elif data.get("type") == "transcript":
                print(f"  interim: {data['text']}")
            elif data.get("type") == "end":
                break

asyncio.run(stream("call.wav"))
Streaming Config
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| language | string | auto | Language hint |
| model | string | default | ASR model to use |
| interim_results | bool | true | Send interim (partial) transcripts |
| channel | string | microphone | Audio channel name (for multi-channel tagging) |
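The streaming protocol expects raw PCM (16kHz, 16-bit, mono), while recordings are usually stored as WAV files whose header is not audio. The sketch below uses the standard `wave` module to read only the sample data and split it into 100ms chunks; `pcm_chunks` is an illustrative helper, and the format check assumes the 16kHz default from the table above.

```python
import io
import wave


def pcm_chunks(wav_file, chunk_ms: int = 100):
    """Yield raw PCM s16le chunks from a 16 kHz mono 16-bit WAV file."""
    with wave.open(wav_file, "rb") as wf:
        fmt = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
        if fmt != (1, 2, 16000):
            raise ValueError("expected 16 kHz, 16-bit, mono audio")
        frames_per_chunk = wf.getframerate() * chunk_ms // 1000
        while True:
            chunk = wf.readframes(frames_per_chunk)
            if not chunk:
                break
            yield chunk


# Self-check with one second of silence generated in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)
buf.seek(0)

chunks = list(pcm_chunks(buf))
print(len(chunks), len(chunks[0]))  # prints: 10 3200
```

Each yielded chunk can be sent as one binary WebSocket message, followed by a 100ms pause to keep the stream at real-time pace.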
Streaming Response
// Interim transcript (partial, updates as more audio arrives)
{
  "type": "transcript",
  "channel": "microphone",
  "text": "Hello how are",
  "is_final": false,
  "utterance_end": false
}
// Final transcript (complete utterance with metadata)
{
  "type": "transcript",
  "text": "Hello, how are you?",
  "is_final": true,
  "utterance_end": true,
  "process_ms": 510,
  "metadata": {
    "emotion": "EMOTION_NEUTRAL",
    "behavior": "BEHAVIOR_DIRECT",
    "role": "ROLE_INTERVIEWER",
    "age": "AGE_30_45",
    "gender": "GENDER_MALE"
  }
}
// Session end
{ "type": "end" }
Speaker Diarization
Multi-speaker separation using ECAPA-TDNN 192-dim speaker embeddings + agglomerative clustering. Available in both batch and streaming modes.
# Batch with diarization
curl -X POST http://localhost:8001/transcribe \
-F "file=@meeting.wav" \
-F "diarize=true" \
-F "num_speakers=3"
# Response includes per-speaker segments:
# [SPEAKER_00]: "Hello, good morning."
# [SPEAKER_01]: "Hi, how are you?"
# [SPEAKER_00]: "I'm doing well, thank you."
# [SPEAKER_02]: "Let's get started."
# Each segment has its own metadata (emotion, age, gender per speaker)
Pipeline: Silero VAD segmentation → per-segment ASR with speaker embedding → agglomerative clustering → merge adjacent same-speaker segments.
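Once a diarized response is available, per-speaker statistics follow directly from the `segments` array. As a small sketch, the function below sums talk time per speaker label; the sample timestamps are made up for illustration and `talk_time` is not part of the API.

```python
from collections import defaultdict


def talk_time(response: dict) -> dict:
    """Sum segment durations (seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in response.get("segments", []):
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return {speaker: round(t, 3) for speaker, t in totals.items()}


# Illustrative response fragment; real responses carry more fields per segment.
sample = {
    "segments": [
        {"speaker": "SPEAKER_00", "text": "Hello, good morning.", "start": 1.0, "end": 1.9},
        {"speaker": "SPEAKER_01", "text": "Hi, how are you?", "start": 2.1, "end": 3.0},
        {"speaker": "SPEAKER_00", "text": "I'm doing well, thank you.", "start": 3.2, "end": 4.7},
    ]
}
print(talk_time(sample))
```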
Metadata
Every ASR segment includes metadata extracted in a single forward pass — no separate models or API calls. Tags available depend on which ASR model is loaded.
Common tags (all models)
| Tag | Values | Description |
|---|---|---|
| emotion | EMOTION_NEUTRAL, EMOTION_HAPPY, EMOTION_SAD, EMOTION_ANGRY, EMOTION_FEAR, EMOTION_SURPRISE | Speaker emotion per utterance |
| age | AGE_0_18, AGE_18_30, AGE_30_45, AGE_45_60, AGE_60+ | Estimated speaker age range |
| gender | GENDER_MALE, GENDER_FEMALE | Speaker gender |
en-in-tech-misc (en-full, en-lite variants)
| Tag | Classes | Values |
|---|---|---|
| behavior | 26 types | BEHAVIOR_EXPLAIN, BEHAVIOR_QUESTION_OPEN, BEHAVIOR_QUESTION_CLOSED, BEHAVIOR_ACKNOWLEDGE, BEHAVIOR_DIRECT, BEHAVIOR_EVALUATE, BEHAVIOR_STRUCTURE, BEHAVIOR_THINK_ALOUD, BEHAVIOR_INFORM, BEHAVIOR_REASON, BEHAVIOR_ABILITY, BEHAVIOR_COMMIT, BEHAVIOR_ADVISE, BEHAVIOR_REFLECT, BEHAVIOR_AFFIRM, BEHAVIOR_FACILITATE, BEHAVIOR_FILLER, BEHAVIOR_EXPRESS, BEHAVIOR_FOLLOW_NEUTRAL, BEHAVIOR_RAISE_CONCERN, BEHAVIOR_REFRAME, BEHAVIOR_SUPPORT, BEHAVIOR_CONFRONT, BEHAVIOR_WARN |
| eval | 8 types | EVAL_NONE, EVAL_CORRECT, EVAL_PROBE, EVAL_PARTIAL, EVAL_INCORRECT, EVAL_HINT, EVAL_SKIP |
| role | 3 types | ROLE_INTERVIEWER, ROLE_INTERVIEWEE |
hinglish-loans (hinglish variant)
| Tag | Classes | Values |
|---|---|---|
| intent | 13 types | INTENT_GREETING, INTENT_IDENTITY_VERIFY, INTENT_PAYMENT_REMINDER, INTENT_PAYMENT_INSTRUCTION, INTENT_CLAIMS_PAID, INTENT_PROMISE_TO_PAY, INTENT_PAYMENT_QUERY, INTENT_AMOUNT_DISPUTE, INTENT_FINANCIAL_HARDSHIP, INTENT_COMPLAINT, INTENT_URGENCY_PRESSURE, INTENT_ACKNOWLEDGMENT, INTENT_OTHER |
| role | 3 types | ROLE_AGENT, ROLE_CUSTOMER, ROLE_OTHER |
zh (multi-zh, all variants)
| Tag | Classes |
|---|---|
| dialect | DIALECT_NORTH, DIALECT_SOUTH, DIALECT_OTHERS |
whissle-large (multi-full, multi-zh, all variants)
| Tag | Classes |
|---|---|
| intent | 31 groups — inline action tokens extracted from the vocabulary |
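When `metadata_prob=true` is set, each tag comes with a full probability distribution, and client code often needs just the most likely label in a display-friendly form. The helpers below are a sketch of that step; the names are illustrative, and the prefix stripping assumes labels follow the `TAG_VALUE` pattern shown in the tables above.

```python
def top_label(distribution: dict) -> tuple:
    """Return (label, probability) for the most likely class."""
    label = max(distribution, key=distribution.get)
    return label, distribution[label]


def short_name(label: str) -> str:
    """Drop the tag prefix: 'EMOTION_NEUTRAL' becomes 'neutral'."""
    return label.split("_", 1)[1].lower()


emotion = {
    "EMOTION_NEUTRAL": 0.82,
    "EMOTION_HAPPY": 0.12,
    "EMOTION_SAD": 0.03,
    "EMOTION_ANGRY": 0.02,
    "EMOTION_FEAR": 0.01,
}
label, prob = top_label(emotion)
print(short_name(label), prob)
```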
AI Analysis
Add summarize=mode to any batch transcription. The diarized transcript + per-segment metadata is sent to the configured LLM (Claude or Gemini) for analysis.
Preset: sales_coaching
Scores 8 sales best practices: greeting, discovery questions, active listening, objection handling, value proposition, urgency creation, next steps, closing.
curl -F "file=@call.mp3" -F "diarize=true" \
-F "summarize=sales_coaching" http://localhost:8001/transcribe
# Returns analysis.overall_score, analysis.practices,
# analysis.buyer_outcome, analysis.highlights,
# analysis.behaviors (per-segment labels)
Preset: collections
Scores compliance for debt collection calls (identity verification, reason stated, amount mentioned, no harassment) and reports the call outcome and next action.
-F "summarize=collections"
# Returns analysis.compliance, analysis.call_outcome,
# analysis.customer_sentiment, analysis.next_action
Preset: true (general)
Markdown summary with overview, participants, key topics, emotional dynamics, entities, and outcome.
-F "summarize=true"
Custom prompt
Pass any string — it becomes the LLM prompt with the full transcript + metadata appended.
-F "summarize=You are a medical call quality analyst. \
Score this call on: 1) empathy (0-10), \
2) medical history completeness, 3) HIPAA compliance. \
Return JSON with scores and recommendations."
# The LLM receives your prompt + every segment with
# emotion, intent, behavior, age, gender, role labels.
TTS — Text-to-Speech
Kokoro 82M — non-autoregressive, single forward pass, 55 voices, 10 languages. Sub-200ms TTFB on CPU.
WebSocket ws://localhost:8003/stream
# Protocol:
# 1. Connect
# 2. Send config: {"type": "config", "voice": "af_heart"}
# 3. Send speak: {"type": "speak", "text": "Hello!"}
# 4. Receive binary PCM s16le chunks (24kHz mono)
# 5. Receive {"type": "done"} when complete
Python example
import asyncio, json, wave, websockets

async def speak(text, voice="af_heart"):
    async with websockets.connect("ws://localhost:8003/stream") as ws:
        await ws.send(json.dumps({"type": "config", "voice": voice}))
        await ws.send(json.dumps({"type": "speak", "text": text}))
        audio = b""
        async for msg in ws:
            if isinstance(msg, bytes):
                audio += msg  # PCM s16le, 24kHz
            elif json.loads(msg).get("type") == "done":
                break
    with wave.open("out.wav", "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(audio)

asyncio.run(speak("Hello from Whissle Gateway."))
TTS Config
| Field | Type | Default | Description |
|---|---|---|---|
| type | string | required | Must be 'config' |
| voice | string | af_heart | Voice ID (see /voices endpoint) |
| language | string | en | Language for phonemization |
| speed | float | 1.0 | Speech speed (0.5 – 2.0) |
| temperature | float | 0.6 | Sampling temperature (0.0 – 2.0) |
| top_k | int | 50 | Top-k sampling (0 – 200) |
| exaggeration | float | 0.0 | Voice exaggeration (prosody variation) |
Speak message fields:
| Field | Type | Description |
|---|---|---|
| type | string | Must be 'speak' |
| text | string | Text to synthesize |
| instruct | string | Optional style instruction (e.g. 'speak slowly and calmly') |
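The two message types can be built with a little client-side validation so that out-of-range values fail before they reach the server. This is a sketch under the ranges listed in the config table; the function names are hypothetical, and how the server itself treats out-of-range values is not documented here.

```python
import json


def tts_config(voice="af_heart", language="en", speed=1.0,
               temperature=0.6, top_k=50) -> str:
    """Build the config message, checking the documented ranges."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be within 0.5 to 2.0")
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be within 0.0 to 2.0")
    if not 0 <= top_k <= 200:
        raise ValueError("top_k must be within 0 to 200")
    return json.dumps({
        "type": "config", "voice": voice, "language": language,
        "speed": speed, "temperature": temperature, "top_k": top_k,
    })


def tts_speak(text: str, instruct: str = None) -> str:
    """Build the speak message; instruct is included only when given."""
    msg = {"type": "speak", "text": text}
    if instruct:
        msg["instruct"] = instruct
    return json.dumps(msg)


print(tts_config(speed=1.2))
print(tts_speak("Hello!", instruct="speak slowly and calmly"))
```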
Voices
Query available voices via REST:
curl http://localhost:8003/voices
# Returns: {"voices": {"af_heart": {...}, "am_adam": {...}, ...}}
55 voices across American (af_, am_), British (bf_, bm_), and other accents. Voice IDs follow the pattern: {accent}{gender}_{name}.
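The voice ID pattern can be decoded on the client, for example to group voices in a picker. The sketch below covers only the two accent prefixes named above and assumes that `f` and `m` stand for female and male; unknown letters are passed through unchanged.

```python
ACCENTS = {"a": "American", "b": "British"}  # prefixes named above
GENDERS = {"f": "female", "m": "male"}       # assumed meaning of f/m


def parse_voice_id(voice_id: str) -> dict:
    """Split '{accent}{gender}_{name}' into its three parts."""
    prefix, _, name = voice_id.partition("_")
    if len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID: {voice_id!r}")
    accent, gender = prefix
    return {
        "accent": ACCENTS.get(accent, accent),
        "gender": GENDERS.get(gender, gender),
        "name": name,
    }


print(parse_voice_id("af_heart"))
print(parse_voice_id("am_adam"))
```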
Voice Calling
Built-in voice calling platform powered by Pipecat. Browser voice chat via WebRTC, phone calls via Twilio Media Streams. The full pipeline runs ASR → LLM → TTS inside the container with real-time emotion and intent metadata.
All voice calling endpoints are served by the Pipecat service on port 8000 and proxied through the gateway at /bot/api/*. Authentication uses HTTP-only JWT cookies (see User Authentication).
Voice Agents
Create and manage voice agent personas. Each agent has a system prompt, greeting message, voice, and LLM model configuration.
| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/agents | List all agents for the current organization |
| POST | /bot/api/agents | Create a new voice agent |
| GET | /bot/api/agents/{id} | Get agent by ID |
| PATCH | /bot/api/agents/{id} | Update agent settings |
| DELETE | /bot/api/agents/{id} | Delete an agent |
# Create a voice agent
curl -X POST http://localhost:9000/bot/api/agents \
-H "Content-Type: application/json" \
-b "access_token=..." \
-d '{
"name": "Sales Rep",
"system_prompt": "You are a helpful sales assistant for Acme Corp.",
"greeting": "Hi there! How can I help you today?",
"voice": "tara",
"voice_gender": "female",
"llm_model": "gemini-2.5-flash"
}'
# Response:
# {"id": "uuid", "name": "Sales Rep", "voice": "tara", ...}
WebRTC Signaling
Start a browser voice session by sending an SDP offer. The gateway returns an SDP answer and establishes a WebRTC connection to the Pipecat voice pipeline.
| Method | Endpoint | Description |
|---|---|---|
| POST | /bot/api/offer | Send SDP offer to start a voice session |
| PATCH | /bot/api/offer | Send ICE trickle candidates |
# Start a WebRTC voice session
curl -X POST http://localhost:9000/bot/api/offer \
-H "Content-Type: application/json" \
-b "access_token=..." \
-d '{
"sdp": "<browser SDP offer string>",
"type": "offer",
"agent_id": "agent-uuid",
"customer_id": "customer-uuid"
}'
# Response: {"sdp": "<SDP answer>", "type": "answer", "pc_id": "..."}
# Send ICE candidates as they arrive
curl -X PATCH http://localhost:9000/bot/api/offer \
-H "Content-Type: application/json" \
-d '{"pc_id": "...", "candidates": [{"candidate": "..."}]}'
For Twilio phone calls, the gateway accepts Twilio Media Streams over WebSocket at /ws/twilio. Configure your Twilio TwiML to point to this endpoint.
Customers
Manage customer records linked to voice calls. Customers are scoped to the current organization.
| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/customers | List all customers |
| POST | /bot/api/customers | Create a customer record |
| GET | /bot/api/customers/{id} | Get customer by ID |
| PATCH | /bot/api/customers/{id} | Update customer details |
| DELETE | /bot/api/customers/{id} | Delete a customer |
curl -X POST http://localhost:9000/bot/api/customers \
-H "Content-Type: application/json" \
-b "access_token=..." \
-d '{
"name": "Jane Smith",
"phone_number": "+14155551234",
"email": "jane@example.com"
}'
Call History
Query call records including transcripts, metadata, duration, and recordings.
| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/calls | List calls (filterable by agent, customer, date) |
| POST | /bot/api/calls/start | Start an outbound Twilio call |
| POST | /bot/api/calls/twilio-status | Twilio status callback webhook |
| POST | /bot/api/calls/twilio-amd | Twilio answering machine detection callback |
Call records include the full transcript (with per-segment metadata), call duration, recording path, and AI-generated summary. Data is stored in PostgreSQL and persists in the /data volume.
Voice Pipeline
The Pipecat voice pipeline processes audio through a chain of intelligent processors:
transport.input        (WebRTC / Twilio audio)
  → WhissleSTT         ASR via local :8001, returns text + metadata
  → SmartEndpointer    Adapts VAD from intent head (open question = longer wait)
  → EmotionAdaptive    Injects emotion/intent into LLM context
  → Backchannel        Plays "mm-hmm" cues during user speech
  → UserAggregator     Combines multi-utterance user turns
  → LLM (Gemini)       Generates response with speech metadata context
  → ClosingPhrase      Detects goodbye → auto-hangup
  → NumberPronouncer   "4500" → "forty-five hundred"
  → TTS (Kokoro)       Text-to-speech via local :8003
  → AudioHumanizer     Adds room tone + breath sounds
  → SilenceWatchdog    2-strike: prompt after silence, hangup after second
  → transport.output   WebRTC / Twilio audio out
  → AudioBuffer        Records both directions for call recording
  → AssistantAgg       Combines multi-chunk assistant turns
The pipeline runs at 24kHz for WebRTC and resamples from 8kHz µ-law for Twilio. All ASR metadata (emotion, intent, behavior) is available in real-time and injected into the LLM context as developer messages.
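To make the Twilio path more concrete: µ-law audio stores each sample as one compressed byte, which has to be expanded to 16-bit linear PCM before it can be resampled or fed to ASR. The sketch below implements the standard G.711 µ-law expansion in pure Python. It illustrates the format only and is not the pipeline's own code; resampling from 8kHz is a separate step that is not shown.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit sample."""
    u = ~byte & 0xFF                       # mu-law bytes are stored inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode a mu-law payload to little-endian 16-bit PCM."""
    samples = [ulaw_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


pcm = ulaw_to_pcm16(bytes([0xFF, 0x80, 0x00]))
print(struct.unpack("<3h", pcm))  # silence, positive peak, negative peak
```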
Agent
Intelligent agent powered by Claude or Gemini. Handles summarization, coaching analysis, conversational AI, and tool-augmented workflows.
Process (single-turn)
POST http://localhost:8765/process
curl -X POST http://localhost:8765/process \
-H "Content-Type: application/json" \
-H "X-Device-Id: user-1" \
-d '{
"transcript": "What meetings do I have today?",
"user_id": "user-1",
"language": "en"
}'
# Response: mode, processed_text, is_command, etc.
Chat (multi-turn streaming)
POST http://localhost:8765/voice-agent/chat/stream
curl -N -X POST http://localhost:8765/voice-agent/chat/stream \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Summarize the key takeaways."}
],
"user_id": "user-1"
}'
# SSE response:
# data: {"type": "text_chunk", "content": "The key takeaways..."}
# data: {"type": "done"}
User Authentication
The voice calling platform includes a full user authentication system with email/password signup, JWT sessions, guest access, Google OAuth, and password reset flows. Auth state is stored in PostgreSQL and JWT cookies are HTTP-only with automatic refresh token rotation.
This is separate from the API token system used for ASR/TTS/Agent requests. User auth is used by the voice calling frontend and protects all /bot/api/* endpoints.
Auth Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /bot/api/auth/signup | Register with email + password |
| POST | /bot/api/auth/login | Login — sets access_token + refresh_token cookies |
| POST | /bot/api/auth/logout | Clear session cookies and revoke refresh token |
| POST | /bot/api/auth/refresh | Rotate access + refresh tokens (cookie-based) |
| GET | /bot/api/auth/me | Get current user profile + active organization |
| POST | /bot/api/auth/guest | Create a guest session (no email required) |
| POST | /bot/api/auth/forgot-password | Send password reset email |
| POST | /bot/api/auth/reset-password | Reset password with token from email |
| POST | /bot/api/auth/verify-email | Verify email with token from email |
| POST | /bot/api/auth/resend-verification | Resend email verification |
| GET | /bot/api/auth/config | Auth configuration (Google OAuth enabled, etc.) |
| GET | /bot/api/auth/google/start | Initiate Google OAuth flow |
| GET | /bot/api/auth/google/callback | Google OAuth callback handler |
# Sign up
curl -X POST http://localhost:9000/bot/api/auth/signup \
-H "Content-Type: application/json" \
-d '{"email": "user@example.com", "password": "securepass123"}'
# Login — response sets HTTP-only cookies
curl -X POST http://localhost:9000/bot/api/auth/login \
-H "Content-Type: application/json" \
-c cookies.txt \
-d '{"email": "user@example.com", "password": "securepass123"}'
# Use cookies for authenticated requests
curl http://localhost:9000/bot/api/auth/me -b cookies.txt
# Guest access (no credentials needed)
curl -X POST http://localhost:9000/bot/api/auth/guest -c cookies.txt
Organizations
Multi-tenant support via organizations. Each organization has its own agents, customers, and call history. Users can belong to multiple organizations with different roles (owner, admin, member).
| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/orgs | List organizations the current user belongs to |
| POST | /bot/api/orgs | Create a new organization |
| GET | /bot/api/orgs/active | Get the currently active organization |
| POST | /bot/api/orgs/switch | Switch active organization |
| GET | /bot/api/orgs/{id} | Get organization details |
| PATCH | /bot/api/orgs/{id} | Update organization (name, slug) |
| DELETE | /bot/api/orgs/{id} | Delete organization (owner only) |
| GET | /bot/api/orgs/{id}/members | List organization members |
| PATCH | /bot/api/orgs/{id}/members/{user_id} | Change member role |
# Create an organization
curl -X POST http://localhost:9000/bot/api/orgs \
-H "Content-Type: application/json" \
-b cookies.txt \
-d '{"name": "Acme Corp", "slug": "acme"}'
# Switch active organization
curl -X POST http://localhost:9000/bot/api/orgs/switch \
-H "Content-Type: application/json" \
-b cookies.txt \
-d '{"organization_id": "org-uuid"}'
Roles: owner (full control + delete), admin (manage members + settings), member (use agents + view calls).
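A client can mirror these roles to hide actions the current user is not allowed to perform. The sketch below assumes the roles are hierarchical (each role includes the permissions of the one below it), which the description suggests but does not state outright; the action names are hypothetical, and the server remains the authority on what is permitted.

```python
ROLE_RANK = {"member": 0, "admin": 1, "owner": 2}

# Minimum role per action; the action names are hypothetical.
REQUIRED_ROLE = {
    "use_agents": "member",
    "view_calls": "member",
    "manage_members": "admin",
    "update_settings": "admin",
    "delete_organization": "owner",
}


def can(role: str, action: str) -> bool:
    """True if the role ranks at or above the action's minimum role."""
    return ROLE_RANK[role] >= ROLE_RANK[REQUIRED_ROLE[action]]


print(can("admin", "manage_members"))       # True
print(can("admin", "delete_organization"))  # False
```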
Invitations
Invite users to join an organization via email. Invitations include a secure token link that can be accepted from the frontend.
| Method | Endpoint | Description |
|---|---|---|
| GET | /bot/api/orgs/{id}/invitations | List pending invitations for an organization |
| POST | /bot/api/orgs/{id}/invitations | Create an invitation (sends email) |
| DELETE | /bot/api/orgs/{id}/invitations/{inv_id} | Revoke a pending invitation |
| GET | /bot/api/invitations/preview?token=... | Preview invitation details (public) |
| POST | /bot/api/invitations/accept | Accept an invitation and join the organization |
# Invite a user
curl -X POST http://localhost:9000/bot/api/orgs/{org_id}/invitations \
-H "Content-Type: application/json" \
-b cookies.txt \
-d '{"email": "colleague@example.com", "role": "member"}'
# Accept invitation (the invited user)
curl -X POST http://localhost:9000/bot/api/invitations/accept \
-H "Content-Type: application/json" \
-b cookies.txt \
-d '{"token": "invitation-token-from-email"}'
Database Schema
PostgreSQL runs embedded inside the container on port 5432. Data persists in the /data/pgdata directory on the Docker volume. Schema migrations run automatically on first start.
Core tables
| Table | Key Columns | Description |
|---|---|---|
| agents | name, system_prompt, greeting, voice, llm_model, organization_id | Voice agent configurations |
| customers | name, phone_number, email, organization_id | Customer records linked to calls |
| calls | agent_id, customer_id, transcript (JSONB), metadata (JSONB), recording_path, duration, status | Call history with full transcripts and recordings |
Auth tables
| Table | Key Columns | Description |
|---|---|---|
| users | email (CITEXT), password_hash, auth_provider, device_id | User accounts |
| user_sessions | user_id, refresh_token_hash, device_info, replaced_by | JWT sessions with token rotation chain |
| oauth_accounts | user_id, provider, provider_user_id | Google OAuth linked accounts |
| password_resets | user_id, token_hash, expires_at | Password reset tokens (1h expiry) |
| email_verifications | user_id, token_hash, expires_at | Email verification tokens |
| audit_logs | user_id, action, ip_address, user_agent | Security audit trail |
Multi-tenancy tables
| Table | Key Columns | Description |
|---|---|---|
| organizations | name, slug (CITEXT), owner_user_id | Organization accounts |
| organization_memberships | user_id, organization_id, role (owner/admin/member) | User-org membership with roles |
| organization_invitations | email, organization_id, role, token_hash, expires_at | Pending invitations |
Configuration
All configuration is done through environment variables, passed to the container with docker run -e.
| Variable | Default | Description |
|---|---|---|
| VARIANT | en-full | Model variant: hinglish, en-lite, en-full, multi-full, multi-zh, all |
| ASR_DEVICE | auto | Inference device: auto, cpu, cuda |
| LLM_PROVIDER | claude | LLM for agent + summarization: claude or gemini |
| ANTHROPIC_API_KEY | — | Anthropic Claude API key |
| ANTHROPIC_MODEL | claude-sonnet-4-6 | Claude model ID |
| GEMINI_API_KEY | — | Google Gemini API key |
| STORAGE_MODE | sqlite | Data storage: sqlite (local) or firestore (cloud) |
| TTS_ENGINE | kokoro | TTS engine (kokoro is default and baked in) |
| AUTH_MODE | local | Token auth mode: local (SQLite), remote (cloud backend), hybrid (local then remote) |
| AUTH_DB_PATH | /data/auth/tokens.db | Path to SQLite token database |
| GATEWAY_INTERNAL_SECRET | auto | Shared secret for internal service-to-service auth |
| HIPAA_MODE | false | Enable HIPAA mode: 1h session TTL, encrypted DB |
| FRONTEND_URL | http://localhost:3000 | Frontend origin for CORS (comma-separated for multiple) |
| TWILIO_ACCOUNT_SID | — | Twilio account SID (for outbound phone calls) |
| TWILIO_AUTH_TOKEN | — | Twilio auth token |
| TWILIO_PHONE_NUMBER | — | Twilio phone number for outbound calls |
| GOOGLE_CLIENT_ID | — | Google OAuth client ID (optional) |
| GOOGLE_CLIENT_SECRET | — | Google OAuth client secret (optional) |
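When many variables are set, it can help to generate the docker run command from a dictionary instead of typing each -e flag by hand. The sketch below does only that; the image name `IMAGE` is a placeholder because the image is not named in this reference, and port and volume flags are left out on purpose.

```python
import shlex


def docker_run_command(image: str, env: dict,
                       name: str = "whissle-gateway") -> str:
    """Build a shell-safe `docker run` command with one -e flag per variable."""
    args = ["docker", "run", "-d", "--name", name]
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return shlex.join(args)  # quotes any value containing spaces or symbols


cmd = docker_run_command("IMAGE", {"VARIANT": "en-full", "LLM_PROVIDER": "gemini"})
print(cmd)
```

`shlex.join` quotes values as needed, so secrets containing spaces or shell metacharacters survive being pasted into a terminal.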
