Author: Chaitanya Kalbhairav
Department: CSE (AI), Vishwakarma Institute of Technology, Pune, India
Email: [email protected]
- Research Paper: https://drive.google.com/file/d/13clEW0UfqT2kV2ssnedwLidx83AVvtXD/view?usp=sharing
The Neural Conversational Telephony Pipeline (NCTP) is a low-latency, autonomous voice AI system designed for real-time telephony conversations. It integrates streaming Automatic Speech Recognition (ASR), Large Language Model (LLM) reasoning, and ultra-fast Text-to-Speech (TTS) to achieve human-like conversational flow with minimal perceptible delay.
The system uses commercial services such as Twilio/LiveKit, Deepgram, OpenAI GPT, and Cartesia TTS, orchestrated via FastAPI, and leverages MongoDB Atlas for memory and context retrieval.
Additionally, a research paper detailing the architecture, methodology, performance analysis, and future directions is included in this repository.
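The streaming flow described above (ASR feeding the LLM, which feeds TTS) can be sketched with plain `asyncio` queues. This is a minimal illustration, not the repository's implementation: the three stage functions are hypothetical stubs that stand in for Deepgram, OpenAI GPT, and Cartesia, and no real service is called.

```python
import asyncio

END = None  # sentinel marking the end of a stream


async def asr_stage(audio_chunks, transcripts):
    """Stub ASR (hypothetical): pretends each audio chunk decodes to one word."""
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # yield control, as a network call would
        await transcripts.put(chunk.decode("utf-8"))
    await transcripts.put(END)


async def llm_stage(transcripts, replies):
    """Stub LLM (hypothetical): collects the utterance, then streams reply tokens."""
    words = []
    while (word := await transcripts.get()) is not END:
        words.append(word)
    for token in ["You", "said:", *words]:
        await replies.put(token)
    await replies.put(END)


async def tts_stage(replies, spoken):
    """Stub TTS (hypothetical): 'synthesizes' each token as soon as it arrives."""
    while (token := await replies.get()) is not END:
        spoken.append(f"<audio:{token}>")


async def run_pipeline(audio_chunks):
    """Run the three stages concurrently, connected by queues."""
    transcripts, replies = asyncio.Queue(), asyncio.Queue()
    spoken = []
    await asyncio.gather(
        asr_stage(audio_chunks, transcripts),
        llm_stage(transcripts, replies),
        tts_stage(replies, spoken),
    )
    return spoken
```

Calling `asyncio.run(run_pipeline([b"hello", b"there"]))` returns `['<audio:You>', '<audio:said:>', '<audio:hello>', '<audio:there>']`. Because each stage only reads from one queue and writes to the next, a stub can be swapped for a real streaming client without touching the other stages.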
- Real-time bidirectional voice interactions over telephony networks
- Ultra-low response latency (< 800 ms average; P95 < 1000 ms)
- Streaming ASR with Deepgram for continuous transcription
- GPT-powered agent for contextual reasoning, intent recognition, and tool execution
- High-speed RAG context retrieval using MongoDB Atlas
- Streaming TTS with Cartesia for immediate audio output
- Modular, containerized microservices for horizontal scalability
- Accompanying research paper for academic reference
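To check a set of measured response times against the latency targets listed above (average below 800 ms, P95 below 1000 ms), a small helper like the following could be used. It is a sketch under assumptions: the function names are hypothetical, the percentile uses the nearest-rank method (other definitions give slightly different values), and the sample data in the usage note is illustrative, not measured.

```python
import math
import statistics

AVG_BUDGET_MS = 800.0   # average target from the feature list
P95_BUDGET_MS = 1000.0  # P95 target from the feature list


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(samples_ms):
    """Summarize latency samples and flag whether both budgets are met."""
    avg = statistics.fmean(samples_ms)
    p95 = percentile_nearest_rank(samples_ms, 95)
    return {
        "avg_ms": avg,
        "p95_ms": p95,
        "ok": avg < AVG_BUDGET_MS and p95 < P95_BUDGET_MS,
    }
```

For example, `latency_report([600, 650, 700, 720, 750, 780, 800, 820, 900, 950])` reports an average of 767.0 ms and a P95 of 950 ms, so both budgets are met.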
The system uses a microservice-based architecture built on the “Smart Endpoints, Dumb Pipes” principle:
- Media Gateway (Twilio/LiveKit) – Handles voice ingress/egress via low-latency WebRTC streaming
- Deepgram ASR – Streaming speech-to-text with chunked audio processing (~100 ms buffers)
- Orchestration (FastAPI + LiveKit Agents) – Manages conversation state and streaming flow
- GPT Agent (OpenAI) – Contextual reasoning, tool execution, and response generation
- MongoDB Atlas – Session memory and RAG context retrieval for fast, grounded reasoning
- Cartesia TTS – Ultra-low-latency text-to-speech streaming
Refer to Figure 1 in the research paper for a detailed visual overview.
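The ~100 ms buffers mentioned for the ASR component can be illustrated by splitting a raw audio byte stream into fixed-duration frames. The sketch below assumes 8 kHz, 8-bit, mono audio (a common telephony format); the actual format used by NCTP is not stated here, and the function names are hypothetical.

```python
def frame_size_bytes(sample_rate_hz, bytes_per_sample, frame_ms):
    """Number of bytes in one mono frame of the given duration."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def iter_frames(audio, frame_bytes):
    """Yield consecutive frames; the last one may be shorter."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]


# Assumed format: 8000 samples/s, 1 byte per sample, 100 ms per frame.
FRAME_BYTES = frame_size_bytes(8000, 1, 100)  # 800 bytes
```

Under these assumptions each frame is 800 bytes, so 250 ms of audio (2000 bytes) splits into frames of 800, 800, and 400 bytes. Smaller frames reduce buffering delay before transcription starts, at the cost of more messages per second.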
- Python 3.11+
- FastAPI
- MongoDB Atlas account (for RAG context)
- Twilio/LiveKit account (for telephony streaming)
- Deepgram account (for ASR)
- Cartesia account (for TTS)
- Node.js (for the Next.js frontend)
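Since several of these prerequisites are external accounts, it can help to verify that credentials are present before starting the backend. The following is a hedged sketch: the environment variable names are hypothetical placeholders, not names defined by this repository, and should be adapted to the project's actual configuration.

```python
import os

# Hypothetical variable names, one per external service listed above.
REQUIRED_VARS = (
    "MONGODB_URI",
    "LIVEKIT_API_KEY",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_settings()` at startup and refusing to launch when the returned list is non-empty gives a clearer failure than an authentication error in the middle of a call.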
# Clone the repository
git clone https://github.com/<your-username>/NCTP.git
cd NCTP
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install backend dependencies
pip install -r requirements.txt
# Navigate to frontend and install dependencies
cd frontend
npm install