Daily API: Developer Tips to Build Voice, Video, and AI into Apps

New Twilio Voice native integration in Daily Bots, the Open Source cloud for Voice AI
Mon, 16 Sep 2024

Today we’re excited to share our Twilio integration in Daily Bots, where developers build adaptive conversational voice AI on the world’s leading global real-time infrastructure and open source SDKs.

Daily Bots is architected to give enterprises and developers the flexibility they need as they build adaptive voice AI. Now you can use your Twilio numbers and voice workflows directly with real-time conversational voice AI and LLM workflows powered by Daily Bots.

  • With this release, Daily Bots natively supports Twilio WebSockets.
  • Right in your Daily Bots dashboard (or via REST API), developers can create real-time voice AI agents configured with a TwiML output.
  • Attach the generated TwiML to a TwiML Bin and use it in any Twilio workflow, such as Twilio Flex, Twilio Studio, and IVR. Both dial-in and dial-out are supported.

Daily Bots-powered agents are state-of-the-art: they talk naturally in conversation, with ultra low latency, while retrieving structured data using any LLM, handling tasks like scheduling appointments, answering back-and-forth questions about a policy, completing intake, and triaging and routing calls more effectively.

Agents can be spun up and down as needed, to support goals like dynamic, streamlined operations and happier, engaged customers.

Start here to try it out. Below, we discuss real-time AI agents in voice workflows, how Daily Bots enables voice AI, and how the integration works.

Improving telephony workflows

Telephony is a bedrock of customer communications and omnichannel UX. Yet companies face what McKinsey calls "a perfect storm of challenges," including increasing call volumes and staffing issues.

Meanwhile, today’s customers are vocal about hold times, phone trees, voice mail, and unhappy experiences.

Over the last year, step function improvements in a suite of AI technologies — spanning orchestration, function calling, LLM and voice models, and more — have enabled agents that can bring AI’s structured data insights into real-time conversation.

Learn how Daily Bots brings state-of-the-art conversational abilities and flexibility:

  • Avoid vendor lock-in. The Daily Bots hosted offering is built on top of open source client SDKs and Pipecat, the fastest-growing Open Source voice AI framework.
  • Better customize for enterprise workflows. Daily Bots leverages function calling, tool use, and structured data generation. Integrate with headless knowledge base APIs and existing back-end systems.
  • Run on proven infrastructure. We have managed infrastructure at global scale since 2016, with 99.99% uptime across 75 points of presence around the world, SOC 2 Type 2 certification, and HIPAA and GDPR compliance. Daily’s edge network delivers 13ms median first-hop latency to a coverage footprint of 5 billion end users.
  • Deploy into your VPCs or private networks as needed. Daily can help you with on-premises or VPC infrastructure if your use case requires it. Our product and engineering leadership has extensive experience working with customers in high-security contexts.
Daily Bots playground

Right in your dashboard, you can build unified voice-to-voice applications with any LLM. Daily’s modular approach provides both a best-in-class developer experience and simple, transparent billing and pricing. Our partners include leading AI model providers such as Anthropic, Cartesia, Deepgram, OpenAI, and Together AI; or you can bring your own keys and use your preferred models (including custom models).

Our goal is for enterprise customers to use Daily Bots flexibly across their technology stack. Today’s release furthers that goal, allowing you to leverage your existing Twilio Voice assets together with the Daily Bots adaptive voice feature set.

Supporting Twilio Voice Workflows, building on Twilio WebSockets

Twilio is the leader in programmable voice, supporting over 10 million developers from startups to the enterprise. It has developed a complete suite of voice products — like Twilio Flex, Twilio Studio, and hosted Twilio apps — which lets its customers build across use cases like IVR, alerts and notifications, call tracking, and more.

Telephony is integral to voice conversations, and Daily Bots has always supported a full suite of telephone and SIP features. With this release, your customers can dial your Twilio number and talk to a Daily Bot. Use cases we have seen include:

  • Business messaging platforms. A platform uses Twilio Studio to provision phone numbers for trades and services. When a customer dials one of these numbers, instead of being connected to a front desk or reaching voicemail, they immediately talk to an AI agent who can complete scheduling. The agent asks questions in a back-and-forth conversation with the customer and, based on some qualifying answers, can book an appointment immediately and/or route the customer for elevated support.
  • Professional services platforms. To help patients achieve better outcomes after care, a provider's office checks in with them across channels. While outreach like SMS can remind a patient to call, an AI agent can handle this much more flexibly: it calls the patient and, based on responses in conversation, can flag the patient for further follow-up.
  • Financial services and similar industries. Companies that build Twilio workflows extensively in conjunction with Google Dialogflow are transitioning previous-generation NLP workflows to conversational voice AI. Today’s LLMs deliver higher customer satisfaction, a higher percentage of calls handled without human intervention, and better call deflection metrics, at lower cost than older call center technologies.

You can learn more about how to build all of these things in the Daily Bots docs.

How it works

Daily Bots generates Twilio TwiML code for you. TwiML, the Twilio Markup Language, is what developers use to build Twilio workflows. You can create a Daily Bots voice AI session wherever you use TwiML.
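For context, TwiML that hands a call's audio to a bot over a WebSocket is typically built around Twilio's Connect and Stream verbs. The snippet below is only an illustration using Twilio's Node.js helper library; the actual TwiML comes from your Daily Bots dashboard, and the WebSocket URL shown is a placeholder.

// Illustration only: the real TwiML is generated for you by Daily Bots.
// The WebSocket URL below is a placeholder, not a real Daily endpoint.
const { twiml } = require('twilio');

const response = new twiml.VoiceResponse();
const connect = response.connect();
connect.stream({ url: 'wss://example.invalid/daily-bots-stream' });

// Prints TwiML along the lines of:
// <Response><Connect><Stream url="wss://example.invalid/daily-bots-stream"/></Connect></Response>
console.log(response.toString());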

WebRTC and WebSockets infrastructure

Daily Bots supports both WebRTC and WebSockets depending on your use case:

  1. Daily Bots uses WebSockets for Twilio – and other telephony and SIP – connectivity. Twilio’s WebSockets implementation is robust and delivers good performance for two-way audio streams over server-to-server network connections.
  2. Daily Bots sessions can also run on top of Daily’s global WebRTC infrastructure. WebRTC provides lower-latency, higher-bandwidth, higher-quality connectivity and is the best protocol for delivering audio directly to end-user devices such as web browsers and mobile phones. Daily Bots sessions can be accessed by Daily WebRTC transport, PSTN telephone dial-in, telephone dial-out, Web browser apps, and native apps on iOS, Android, and other platforms. WebRTC also supports video and multi-participant real-time AI applications. 

Developer Workflow

To set up a Twilio + Daily Bots integration:

  1. Configure a Daily Bot using the Daily Bots dashboard. This generates both a Daily Bot configuration and TwiML code.
  2. Open your Twilio console and create a TwiML Bin with the generated code. Save the Bin and assign it to a phone number.
  3. You can create a Daily Bot session in any of your Twilio workflows, using this TwiML code.
Generate a Daily Bot configuration and TwiML code

For dial-in, calling the Twilio number automatically routes to the corresponding Daily Bot.

For dial-out, use the Twilio REST API or Twilio CLI to call the target phone number. Learn more here.
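As an illustration (not Daily-specific code), a dial-out call can be placed with Twilio's Node.js helper library; the phone numbers and TwiML Bin URL below are placeholders:

// Sketch: place an outbound call that executes your TwiML Bin (placeholder values throughout).
const twilio = require('twilio');
const client = twilio(process.env.TWILIO_ACCOUNT_SID, process.env.TWILIO_AUTH_TOKEN);

client.calls
  .create({
    to: '+15550100001',                                  // the number to dial out to
    from: '+15550100002',                                // your Twilio number
    url: 'https://handler.twilio.com/twiml/EHXXXXXXXX',  // the TwiML Bin that connects the call to your Daily Bot
  })
  .then((call) => console.log('Started call', call.sid));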

Get Started

Given the centrality of voice, leveraging the benefits of real-time conversational voice AI is a key consideration for our enterprise customers and developer community. 

Get started here. Below are a few more links if you'd like to dive in more. We're excited to partner with you.

Follow us on social. Our technical cofounder Kwindla regularly posts about real-time conversational voice AI. You can follow Daily on LinkedIn and Twitter.

Daily Bots: Build Real-Time Voice, Vision, and Video AI Agents
Tue, 20 Aug 2024

September 2024 Update: Daily Bots now supports native Twilio Voice integration. Learn more.

Today we’re sharing Daily Bots, a hosted AI bots platform.

Developers can ship voice-to-voice AI with any LLM; build with Open Source SDKs; and run on Daily’s real-time global infrastructure:

  • Create AI agents that talk naturally.
  • Design voice-to-voice AI flexibly, with leading commercial and open models. We’ve partnered with Anthropic, Cartesia, Deepgram, and Together AI. You can also use any LLM that supports OpenAI-compatible APIs.
  • Build ultra low latency experiences for desktop, mobile, and telephone.
  • Use the leading Open Source tooling for voice-to-voice and multimodal video AI. Daily Bots implements the RTVI standard for real-time inference, and is built on the Pipecat server-side framework.
  • Launch quickly and scale on Daily’s global WebRTC infrastructure.

In this post, we’ll talk about why we built Daily Bots and what it does, how we're excited to work with our partners, and some of the fun demos you can play with.

If you’d like to jump straight in, here are docs and demos — our playground demo with configurable LLMs; function calling demo and vision with Anthropic; and iOS and Android. Sign up here (with a $10 credit during launch week).

Why Daily Bots

At Daily, we’ve been building real-time audio and video infrastructure since 2016. Our customers have been developers building conversational experiences – it started with people talking to each other.

Now, with generative AI, the definition of conversational experiences has expanded. Today’s Large Language Models are very good at open-ended conversations. They can follow scripts and perform multi-step tasks. They can call out to external systems and APIs.

Voice-driven LLM workflows are starting to have a big impact in healthcare and education. LLMs are improving the customer support experience and enterprise workflows. Virtual characters will transform video games and entertainment. And this is just the start of the impact of AI.

Building experiences in which humans can have useful, natural, real-time conversations with AI models involves:

  • Choosing and writing code for the right generative AI models for your specific use case.
  • Orchestrating the human -> AI -> human conversation loop, incorporating prompting, state management, data flow between models, and calling out to external systems.
  • Standing up both audio/video infrastructure and AI/orchestration infrastructure – service discovery, routing, autoscaling, fault tolerance, observability.
  • Having good client SDKs for all the platforms you need to support.

Over the past year and a half, as we’ve been helping our customers stand up new AI-powered real-time features, we’ve put together a complete set of tools that check all the boxes above.

We’ve rolled these tools and best practices into two big Open Source projects: Pipecat for server-side AI orchestration and the RTVI open standard for real-time inference clients. These are truly vendor neutral efforts, with a growing community and contributors from a wide range of stakeholders.

Now we’re filling another gap in the voice-to-voice and real-time ecosystem with Daily Bots.

  • Daily Bots lets you run your RTVI/Pipecat AI agents end-to-end on Daily’s infrastructure.
  • Start a real-time AI session with a single call to /api/v1/bots/start. Launch fast. Scale without limits. If your needs evolve beyond Daily Bots, you can take your code to another platform or stand up your own infrastructure.
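As a rough sketch of what that single call can look like from a Node.js server (the request body and response fields here are illustrative assumptions; see the Daily Bots docs for the exact schema):

// Illustrative sketch: the /api/v1/bots/start path is from this post; the base URL,
// request body, and response fields are assumptions. Check the Daily Bots docs.
const DAILY_BOTS_API = process.env.DAILY_BOTS_API_URL; // your Daily Bots API base URL

async function startBotSession() {
  const resp = await fetch(`${DAILY_BOTS_API}/api/v1/bots/start`, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.DAILY_BOTS_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      services: { llm: 'anthropic', tts: 'cartesia' }, // example service selection
      config: [{ service: 'llm', options: [{ name: 'initial_messages', value: [] }] }],
    }),
  });
  return resp.json(); // e.g. connection details to hand to your RTVI client
}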

AI that talks naturally

Human conversation is complicated!

We interrupt each other. We know when someone finishes speaking and expects us to talk. We change topics and go off on tangents.

And, most of all, we almost always respond quickly. Long pauses make conversations feel so unnatural that most people will just opt out. It’s critical to have voice-to-voice response times faster than 1 second. (Faster than 800ms is better!)

Daily Bots implements best practices for all of the hard, low-level challenges that voice AI product teams face. With a few lines of code, developers can leverage:

  • A modular architecture that enables easy switching between different LLMs and voice models. Use state-of-the-art LLMs with large parameter counts where needed. Or use models optimized for conversational response times.
  • Multi-turn context management, with tool calling and vision input.
  • Voice-to-voice response times as low as 500ms.
  • Interruption handling with word-level context accuracy.
  • Phrase endpointing that combines voice activity detection, semantic cues, and noise-level averaging.
  • Echo cancellation and background noise reduction.
  • Metrics and observability down to the level of individual media streams from every session.

Flexibility to use the best models, and the best models for your use case

Daily Bots developers can use both commercial and open models. You can use our integrated LLMs, or "Bring Your Own (API) Key" (BYOK) for your preferred service.

We’ve directly integrated with Anthropic, Cartesia, Deepgram, and Together AI. 

  • Anthropic’s Claude 3.5 Sonnet is an excellent multi-turn conversational model. Daily Bots includes support for Sonnet’s vision input, tool calling, and the brand new context caching feature.
  • Cartesia’s Sonic voice model has raised the bar for voice quality at extremely low latencies. Cartesia offers a wide range of excellent voices, plus the ability to create your own voices.
  • Deepgram is a long-time Daily partner and the long-time leader in real-time speech-to-text accuracy and multi-language support.
  • Together AI delivers fast, high quality inference for all three sizes of Meta’s Llama 3.1 LLMs: 8B, 70B, and 405B.  

With all of these partners, we do consolidated billing. You get just one bill from Daily, with line items showing your usage of each model. Also, it’s likely that you will benefit from higher rate limits and lower pricing when you use our partners’ services through Daily Bots. See Daily Bots pricing here.

Of course, you can always BYOK for both our partners and other services.

We can support any LLM provider that offers OpenAI-compatible APIs. We work regularly with OpenAI, Groq, and Fireworks, for example.

If you need custom models, our partners offer fine-tuned models and inference services for enterprise customers.

Daily Bots infrastructure can also be deployed inside your Virtual Private Cloud. If you manage your own inference, co-locating orchestration compute with inference has latency, cost, and compliance benefits.

Build now, and for the future

Our goal with Daily Bots is to accelerate the development of real-time, multimodal AI.

With a few lines of code, configure bots that scale on demand, on Daily’s infrastructure, automatically keeping pace as your application’s usage increases.

Write clients for iOS, Android, and the Web using the RTVI Open Source SDKs and Daily Bots helper libraries.

Buy phone numbers from Daily and make your bots accessible via dial-in.

All of this runs on Daily’s Global Mesh Network. Our distributed points of presence deliver 13ms first-hop latency to 5 billion people on six continents. (A little more, on average, if you happen to be in Antarctica.)

It’s also worth noting that Daily Bots is only one of your options if you’re building real-time AI agents on the Open Source toolkits we use at Daily.

Definitely go check out Vapi and Tavus, for example. They’ve developed specialized technology, and best practices, to support different applications of multimodal inference. Vapi has great voice APIs, with user-friendly dashboards and excellent telephony support. Tavus’s Conversational Video Interface powers AI apps that can speak, hear, and see naturally. We’re proud these innovative platforms also leverage Daily’s WebRTC infrastructure.

If you’re interested in real-time AI, you can leverage Tavus or Vapi; build on the Daily Bots Open Source cloud; or strike out on your own and stand up your own Pipecat-based infrastructure!

Demos, demos, demos & starting out

We’ve had a ton of fun building out Daily Bots.


AI is moving fast! Check out Vapi and Tavus. Join the Daily community on Discord. Let us know if you find Daily Bots, RTVI, and Pipecat useful. We’re excited to build the future with you.

The World’s Fastest Voice Bot
Wed, 26 Jun 2024

UPDATE, August 2024: Voice AI moves fast! We’ve updated our demo since this post was published a few weeks ago. The links below are edited to point to our updated ultra low latency demo. We built the original demo on Cerebrium’s excellent serverless infrastructure.

Speed is important for voice AI interfaces. Humans expect fast responses in normal conversation – a response time of 500ms is typical. Pauses longer than 800ms feel unnatural.

Source code for the bot is here. And here is a demo you can interact with.

Try the demo: https://demo.dailybots.ai/

Technical tl;dr

Today’s best transcription models, LLMs, and text-to-speech engines are very good. But it’s tricky to put these pieces together so that they operate at human conversational latency. The technical drivers that matter most when optimizing for fast voice-to-voice response times are:

  • Network architecture
  • AI model performance
  • Voice processing logic

Today’s state-of-the-art components for the fastest possible time to first byte are:

  1. WebRTC for sending audio from the user’s device to the cloud
  2. Deepgram’s fast transcription (speech-to-text) models
  3. Llama 3 70B or 8B
  4. Deepgram’s Aura voice (text-to-speech) model

In our original demo, we self-host all three AI models – transcription, LLM, and voice generation – together in the same Cerebrium container. Self-hosting allows us to do several things to reduce latency.

  • Tune the LLM for latency (rather than throughput).
  • Avoid the overhead of making network calls out to any external services.
  • Precisely configure the timings we use for things like voice activity detection and phrase end-pointing.
  • Pipe data between the models efficiently.

We are targeting an 800ms median voice-to-voice response time. This architecture hits that target and in fact can achieve voice-to-voice response times as low as 500ms.

Optimizing for low latency: models, networking, and GPUs

The very low latencies we are targeting here are only possible because we are:

  • Using AI models chosen and tuned for low latency, running on fast hardware in the cloud.
  • Sending audio over a latency-optimized WebRTC network.
  • Colocating components in our cloud infrastructure so that we make as few external network requests as possible.

AI models and latency

All of today’s leading LLMs, transcription models, and voice models generate output faster than humans speak (throughput or tokens per second). So we don’t usually have to worry much about our models having fast enough throughput.

On the other hand, most AI models today have fairly high latency relative to our target voice-to-voice response time of 500ms. When we are evaluating whether a model is fast enough to use for a voice AI use case, the kind of fast we’re measuring and optimizing is the latency kind.

We are using Deepgram for both transcription and voice generation, because in both those categories Deepgram offers the lowest-latency models available today. Additionally, Deepgram’s models support “on premises” operation, meaning that we can run them on hardware we configure and manage. This gives us even more leverage to drive down latency. (More about running models on hardware we manage, below.)

Deepgram’s Nova-2 transcription model can deliver transcript fragments to us in as little as 100ms. Deepgram’s Aura voice model running in our Cerebrium infrastructure has a time to first byte as low as 80ms. These latency numbers are very good! The state of the art in both transcription and voice generation is rapidly evolving, though. We expect lots of new features, new commercial competitors, and new open source models to ship in 2024 and 2025.

Llama 3 70B is among the most capable LLMs available today. We’re running Llama 3 70B on NVIDIA H100 hardware, using the vLLM inference engine. This configuration can deliver a median time to first token (TTFT) latency of 80ms. The fastest hosted Llama 3 70B services have latencies approximately double that number. (Mostly because there is significant overhead in making a network request to a hosted service.) Typical TTFT latencies from larger-parameter SOTA LLMs are 300-400ms.

WebRTC networking for voice AI

WebRTC is the fastest, most reliable way to send audio and video over the Internet. WebRTC connections prioritize low latency and the ability to adapt quickly to changing network conditions (for example, packet loss spikes). For more information about the WebRTC protocol and how WebRTC and WebSockets complement each other, read this short explainer.

Connecting users to nearby servers is also important. Sending a data packet round-trip between San Francisco and New York takes about 70ms. Sending that same packet from San Francisco to, say, San Jose takes less than 10ms.


In a perfect world, we would have voice bots running everywhere, close to all users. This may not be possible, though, for a variety of reasons. The next best option is to design our network infrastructure so that the “first hop” from the user to the WebRTC cloud is as short as possible. (Routing data packets over long-haul Internet connections is significantly slower and more variable than routing data packets internally over private cloud backbone connections.) This is called edge or mesh networking, and is important for delivering reliable audio at low latency to real-world users. If you’re interested in this topic, here’s a deep dive into WebRTC mesh networking.


Where the components run – self-hosting the LLM and voice models

The code for an AI voice bot is usually not terribly large or complicated. The bot code manages the orchestration of transcription, LLM context-management and inference, and text-to-speech voice generation. (In many applications, the bot code will also read and write data from external systems.)

But, while voice bot logic is often simple enough to run locally on a user’s mobile device or in a web browser process, it almost always makes sense to run voice bot code in the cloud.

  • High-quality, low-latency transcription requires cloud computing horsepower.
  • Making multiple requests to AI services – transcription, LLM, text-to-speech – is faster and more reliable from a server in the cloud than from a user’s local machine.
  • If you are using external AI services, you need to proxy them or access them only from the cloud, to avoid baking API keys into client applications.
  • Bots may need to perform long-running processes, or may need to be accessible via telephone as well as browser/app.

Once you are running your bot code in the cloud, the next step in reducing latency is to make as few requests out to external AI services as possible. We can do this by running the transcription, LLM, and text-to-speech (TTS) models ourselves, on the same computing infrastructure where we run the voice bot code.

Colocating voice bot code, the LLM, and TTS in the same infrastructure saves us 50-200ms of latency from network requests to external AI services. Managing the LLM and TTS models ourselves also allows us to tune and configure them to squeeze out even more latency gains.

The downside of managing our own AI infrastructure is additional cost and complexity. AI models require GPU compute. Managing GPUs is a specific devops skill set, and cloud GPU availability is more constrained than general compute (CPU) availability.

Voice AI latency summary – adding up all the milliseconds

So, if we’re aiming for 800ms median voice-to-voice latency (or better) what are the line items in our latency “budget?”

Here’s a list of the processing steps in the voice-response loop. These are the operations that have to be performed each time a human talks and a voice bot responds. The numbers in this table are typical metrics from our reasonably well optimized demo running on NVIDIA containers hosted by Cerebrium.

(Table: typical latencies for each processing step in the voice-to-voice response loop.)

Next steps

For some voice agent applications, the cost and complexity of managing AI infrastructure won’t be worth taking on. It’s relatively easy today to achieve voice-to-voice latency in the neighborhood of two to four seconds using hosted AI services. If latency is not a priority, there are many LLMs that can be accessed via APIs and have time to first token metrics of 500-1500ms. Similarly, there are several good options for transcription and voice generation that are not as fast as Deepgram, but deliver very high quality text-to-speech and speech-to-text.

However, if fast, conversational voice responsiveness is a primary goal, the best way to achieve that with today’s technology is to optimize and colocate the major voice AI components together.

If this is interesting to you, definitely try out the demo, read the demo source code (and experiment with the Pipecat open source voice AI framework), and learn more about Cerebrium's fast and scalable AI infrastructure.

Introducing Daily Adaptive Bitrate
Thu, 25 Apr 2024

Unlock enhanced video quality and performance with Daily Adaptive Bitrate, combining ultra-reliable calls and the best visual experience your network can offer—automatically adjusting in real-time to suit fluctuating network conditions.


Since its launch, Daily has been at the forefront of providing the most advanced simulcast APIs of any WebRTC provider. Developers who choose Daily benefit from complete flexibility and control, enabling them to optimize the performance of their video applications across various devices and network conditions. This level of customization allows developers to fine-tune the user experience to match their specific goals.

Simulcast is an invaluable feature for video applications, enabling developers to balance quality and reliability. It can, however, be challenging to implement when seeking the perfect middle ground for all connected peers in real-world, changing network conditions. Adjusting simulcast settings in real time to maintain performance requires ongoing peer network monitoring and client-side logic, and can be tricky to debug.

Imagine a scenario where developers no longer need to worry about optimizing for quality, and can instead have full confidence that video will automatically look its best within the available bandwidth. This would greatly simplify development, allowing engineers to focus on other aspects of their application.

Daily Adaptive Bitrate (ABR)

Daily Adaptive Bitrate is an industry-first innovation that automatically adjusts the quality of video to ensure maximum performance without compromising reliability.

When the network is constrained, the bitrate and resolution are dropped to ensure that the call remains connected (leaving enough throughput for audio). When there’s bandwidth headroom, the bitrate and resolution are increased to deliver higher quality video.

It doesn’t require any pre-configured video settings or client-side network monitoring and adjustment.

  • In 1:1 calls, only a single dynamic layer of video is sent, saving bandwidth and allowing for higher overall video quality.
  • In multi-party calls, the top layer is always adaptive based on network conditions and lower layers are used for smaller UI elements (such as sidebars or large grids) or as a fallback for poor network conditions.

Let’s take a look at an example...

Here are some typical simulcast settings that are somewhat conservative to optimize for reliability:

  • High layer: { maxBitrate: 700 kbps, targetResolution: 640x360, maxFramerate: 30 fps }
  • Medium layer: { maxBitrate: 200 kbps, targetResolution: 427x240, maxFramerate: 15 fps }
  • Low layer: { maxBitrate: 100 kbps, targetResolution: 320x180, maxFramerate: 15 fps }
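Expressed with daily-js send settings, a hard-coded three-layer configuration along these lines might look roughly like the following (a sketch; the exact bitrates, scaling factors, and room URL are illustrative):

// Sketch: a fixed three-layer simulcast configuration passed at join time (values illustrative).
const call = Daily.createCallObject();
call.join({
  url: 'https://your-domain.daily.co/your-room', // placeholder room URL
  sendSettings: {
    video: {
      encodings: {
        low:    { maxBitrate: 100000, scaleResolutionDownBy: 4, maxFramerate: 15 },
        medium: { maxBitrate: 200000, scaleResolutionDownBy: 3, maxFramerate: 15 },
        high:   { maxBitrate: 700000, scaleResolutionDownBy: 2, maxFramerate: 30 },
      },
    },
  },
});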

A user joins a 1:1 call from a WiFi network with speeds of 5 Mbps down, 1 Mbps up. Upon joining the call, due to the network’s slow upload speed, the user is unable to send all three layers. The available options are to either: a) drop the framerate, or b) drop the highest layer.

As a result, this user will:

  • ⬇️ Send - 360p video @ 15 fps, or 240p video @ 15 fps
  • ⬆️ Receive - 360p video @ 30 fps

With Daily Adaptive Bitrate enabled, Daily will automatically optimize the experience based on available bandwidth. Given the 5 Mbps down / 1 Mbps up network, the user will:

  • ⬇️ Send - 540p video @ 30 fps (around 800 kbps)
  • ⬆️ Receive - 720p video @ 30 fps (around 2 Mbps)

Compared to a hardcoded simulcast configuration, this is a dramatic increase in call quality for the user that doesn't sacrifice call reliability.

Getting started

Daily Adaptive Bitrate has been rigorously tested at scale for some time now. It is enabled by default for all 1:1 calls, or can be manually configured by following these steps:

  • Set the enable_adaptive_simulcast property to true for either your domain (e.g. all calls) or room (e.g. specific calls); see the REST sketch after the code below.
  • If you’re a Prebuilt user, no additional configuration is needed.
  • If you’ve built a custom app, please update to daily-js version 0.60.0, or to version 0.61.0 for daily-react-native.
  • The only code change required is to set the allowAdaptiveLayers within the sendSettings property to true at join time:
const call = Daily.createCallObject();
call.join({
  sendSettings: {
    video: {
      allowAdaptiveLayers: true
    }
  }
});
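For reference, the room-level property can be set with a plain REST call to Daily's /rooms endpoint; a minimal sketch, with an illustrative room name:

// Sketch: create a room with adaptive simulcast enabled (room name is illustrative).
fetch('https://api.daily.co/v1/rooms', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: 'abr-demo-room',
    properties: { enable_adaptive_simulcast: true },
  }),
})
  .then((r) => r.json())
  .then(console.log);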

Please note that Daily Adaptive Bitrate currently works best on Chrome and Safari (both desktop and mobile). Firefox support will ship mid-year, although Firefox users can still join calls unimpeded, sending video using 3-layer simulcast.

Multi-party calls (>2 participants) are currently in beta – please contact us if you'd like to take part in testing.


For more information regarding Daily Adaptive Bitrate, please refer to our documentation here. We’re excited to see how this feature improves both the developer and end-user experience on the Daily platform. As always, for any questions or feedback, feel free to reach out.

The weird economics of web page recording
Fri, 15 Mar 2024

I’ve talked before on this blog about the surprising complexities of recording WebRTC calls. Today I’d like to discuss the case of web page recording more widely.

The solutions explained here also have relevance to web application UI architecture in general. I’m going to show you some ideas for layered designs that enable content-specific acceleration paths. This is something web developers typically don’t need to think about very often, but it can be crucial to unlocking high performance. Examining the case of web page recording can help to understand why.

Daily’s video rendering engine VCS is designed for this kind of layered architecture, so you can apply these ideas directly on Daily’s platform. At the end of this post, I’ll talk more about VCS specifically and how web page recording fits into the picture.

Three types of web capture

First we should define more precisely what’s meant by “recording”. We’re talking about using a web browser application running on a server in a so-called headless configuration (i.e., no display or input devices are connected).

With such a setup, there are several ways to capture content from a web page:

  • Extracting text and images, and perhaps downloading embedded files like video. This is usually called scraping. The idea is to capture only the content that is of particular interest to you, rather than a full snapshot of a web application’s execution. Crawling, as done by search engines, is an adjacent technique (traditionally, crawlers didn’t execute JavaScript or produce an actual browser DOM tree, but today a lot of content lives in web applications that can’t be parsed without doing so).
  • Taking screenshots of the browser at regular intervals or after specific events. This is usually in the context of UI testing, and more generally, falls under the umbrella of browser automation. You might use the produced screenshots to compare them with an expected state (i.e., a test fails if the browser’s output doesn’t match). Or the screenshots could be consumed by a “robot” that recognizes text content, infers where UI elements are located, and triggers events to use the application remotely.
  • Capturing a full A/V stream of everything the browser application outputs, both video and audio, and encoding it into a video file. This is effectively a remote screen capture on a server computer that doesn’t have a display connected.

In this post, I’ll focus on the last kind of headless web page recording: capturing the full A/V stream from the browser. Because we’re capturing the full state of the browser’s A/V output, it is much more performance-intensive than the more commonplace scraping or browser automation.

Those use cases can get away with capturing images only when needed. But for the A/V stream, the browser’s rendering needs to be captured at a stable 30 frames per second. Everything that can happen within the web page needs to be included: CSS animations and effects, WebRTC, WebGL, WebGPU, WebAudio, and so on. With the more common browser automation scenario, you have the luxury of disabling browser features that don’t affect the visual state snapshots you care about. But this is not an option for remote screen capture.

So why would you even want this? Clearly it’s more of a niche case compared to scraping and browser automation, which are massively deployed and well understood. The typical scenario for full A/V capture is that you have a web application with rich content like video and dynamic graphics, and you want to make a recording of everything that happens in the app without requiring a user to activate screen capture themselves. When you’ve already developed the web UI, it seems like the easy solution would be to just capture a screen remotely. Surely that’s a solved problem because the browser is so ubiquitous…?

Unfortunately it’s not quite that simple. But before we look at the details, let’s do a small dissection of an example app.

What to record in a web app

The following UI wireframe shows a hypothetical web-based video meeting app, presumably implemented on WebRTC:

UI mock-up of a meeting app

There are up to five participant video streams displayed. In the bottom-right corner, a live chat view is available to all users. In the bottom-left row, we find standard video call controls and a “React” button that lets users send emojis. When that happens, an animated graphic is rendered on top of the video feed (shown here by the two floating hearts).

Recording a meeting like this means you probably want a neutral viewpoint. In other words, the content shown in the final recording should be that of a passive participant who doesn’t have any of the UI controls.

The content actually needed for the headless recording is marked with a blue highlight in this drawing:

UI mock-up with recording content marked in blue

We can see that the majority of UI elements on the page should actually be excluded from the recording. So in fact, the “web page recording” we seek isn’t quite as straightforward as just running a screen capture. We’ll clearly need to do some front-end engineering work to create a customized view of the application for the recording target.

Assuming this development work is done, where can we then run these remote screen capture jobs? Here is the real rub.

Competing for the hottest commodity in tech

The web browser offers a very rich palette of visual capabilities to application developers. CSS includes animations, transitions, 3D layer transformations, blending modes, effects like Gaussian blur, and more — all of which can be applied together. On top of that we’ve got high-performance rendering APIs like Canvas, WebGL and today also WebGPU. If you want to capture real web apps at 30 fps, you can’t easily pick a narrow subset of the capabilities to record. It’s all or nothing.

Intuitively the browser feels like a commodity application because it runs well on commodity clients like cheap smartphones, low-end laptops, and other devices. But this is achieved by extensive optimization for the modern client platform. The browser relies on client device GPUs for all of its visual power. A mid-range smartphone that can run Chromium with the expected CSS bells and whistles has an ARM CPU and an integrated GPU, both fully available to the browser application.

Commodity servers are a very different hardware proposition. An ordinary server has a fairly high-end Intel/AMD CPU, but it’s typically virtualized and shared by many isolated programs running on the same hardware. Crucially, there is no GPU on this commodity server. This means that all of Chromium’s client-oriented rendering optimizations are unavailable.

It’s possible to get a server with a GPU, but these computers are nothing like the simple smartphone or laptop for which Chromium is optimized. GPU servers are designed for the massive number crunching required by machine learning and AI applications. These special GPUs can cost tens of thousands of dollars and they include large amounts of expensive VRAM. All this special hardware goes largely unused if you use such a GPU to render CSS effects and some video layers that a Chromebook could handle.

At the time of this writing, the situation is even worse because these GPU servers happen to be the hottest commodity in the entire tech industry. Everybody wants to do AI. The demand is so massive that Nvidia, the main provider of these chips, has taken an active role in picking which customers actually get access. This was reported by The Information:

Nvidia plays favorites with its newest, much-sought-after chips for artificial intelligence, steering them to small cloud providers who compete with the likes of Amazon Web Services and Google. Now Nvidia is also asking some of those small cloud providers for the names of their customers—and getting them—according to two people with direct knowledge.
It's reasonable that Nvidia would want to know who’s using its chips. But the unusual move also could allow it to play favorites among AI startups to further its own business. It’s the latest sign that the chipmaker is asserting its dominance as the major supplier of graphics processing units, which are crucial for AI but in short supply.

In this situation, using server GPUs for web page recording would be like taking a private jet to go to work every morning. It’s technically possible, but you’d need an awfully good reason and some deep pockets.

There are ways to increase efficiency by packing multiple browser capture jobs on one GPU server. But you’d still be wasting most of the expensive hardware’s capabilities. Nvidia’s AI/ML GPUs are designed for high-VRAM computing jobs, not the browser’s GUI-oriented graphics tasks where memory access is relatively minimal.

Let’s think back to the private jet analogy. If you have a jet engine but your commute is only five city blocks, it doesn’t really help at all if you ask all your neighbors to join you on the plane trip — it’s still the wrong vehicle to get you to work. Similarly, with the server GPUs, there’s a fundamental mismatch between your needs and the hardware spec.

Why generic hardware needs specialized software

Is there a way we could render the web browser’s output on those commodity CPU-only servers instead? The problem here lies in the generic nature of the browser platform combined with the implicit assumptions of the commodity client hardware.

I noted above that capturing a web app ends up being “all or nothing” — a narrow subset of CSS is as good as useless. A browser automation system has more freedom here. It can execute on the CPU because it has great latitude for trade-offs across several dimensions of time and performance: 

  • When to take its screenshots
  • How much rendering latency it tolerates
  • Which expensive browser features to disable

In other words, browser automation can afford wait states and it can skip animations, but remote screen capture can’t. It must be real-time.

Fundamentally we have here a performance sinkhole created from combining two excessively generic systems. Server CPUs are a generic computing solution, not optimized for any particular application. The web browser is the most generic application platform available. Multiplying compromise by another compromise is like multiplying small fractions — the product is less than the individual components. Without any specialized acceleration on either the hardware or software side, we’re left in a situation where 30 frames per second on arbitrary user content remains an elusive dream.

Maybe we just use more CPU cores in the cloud? That’s a common solution, but it quickly becomes expensive, and it’s still susceptible to web content changes that kill performance.

For example, you can turn any DOM element within a web page into a 3D plane by adding perspective and rotateX CSS transform functions to it. Now your rendering pipeline suddenly has to figure out questions like texture sampling and edge antialiasing for this one layer. Even with many CPU cores, this will be a massive performance hit. And if you try to prevent web developers from using this feature, there’s always the next one. It becomes an endless whack-a-mole of CSS properties which will frustrate developers using your platform as the list of restrictions grows with everything they try.

Given that we can’t provide acceleration on the hardware side (those elusive server GPUs…), the other option left is to accelerate the software.

Designing for layered acceleration

At this point, let’s take another look at the hypothetical web app whose output we wanted to capture up above.

Within the UI area to be recorded, we can identify three different types of content. They are shown here in blue, pink, and green highlights:

UI mock-up with the three content types highlighted in blue, pink, and green

In blue we have participant videos. These are real-time video feeds decoded from data received over WebRTC.

In pink we have animated overlay graphics. In this simplified example, it’s only the emoji heart reactions. In a real application we would probably identify other graphical elements, such as labels or icons that are rendered on top of participant videos.

In green we have the shared chat view. This is a good example of web content that doesn’t require full 30 fps screen capture to be rendered satisfactorily. The chat view only updates every second at most, is not real-time latency sensitive, and doesn’t depend on CSS animations, video playback, or fancy WebGL. We can render this content on a much more limited browser engine than what’s required for complete web page recording.

Identifying these layers is key to unlocking the software-side acceleration I mentioned earlier. If we could split each of these three content types onto separate rendering paths and put them together at the last moment, we could optimize the three paths separately for reasonable performance on commodity servers.

The accelerated engine

Could we do this by modifying the web browser itself? Forking Chromium would be a massive development effort and we’d be scrambling to keep up with updates. But more fundamentally, technologies like CSS are simply too good at enabling developers to make tiny code changes that will completely break any acceleration path we can devise. 

At Daily, we provide a solution in the form of VCS, the Video Component System. It’s a “front-end adjacent” platform that adopts techniques of modern web development, but is explicitly designed for this kind of layered acceleration.

With VCS, you can create React-based applications using a set of built-in components that always fall on the right acceleration path. For example, video layers in VCS are guaranteed to be composited together in their original video-specific color space, unlocking higher quality and guaranteed performance. There's no way a developer can accidentally introduce an unwanted color space conversion.

For content like the green-highlighted chat box in the previous illustration, VCS includes a WebFrame component that lets you embed arbitrary web pages inside the composition, very much like an HTML <iframe>. It can be scaled individually and remote-controlled with keyboard events. This way you can reuse dynamic parts of your existing web app within a VCS composition without wrecking the acceleration benefits for video and graphics.
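To make the layering concrete, here is a rough sketch of what a VCS composition with an embedded WebFrame could look like. The import path and component props are assumptions for illustration, not the exact VCS API; consult the VCS docs for the real component signatures.

// Sketch only: import path and props are illustrative assumptions, not the exact VCS API.
import * as React from 'react';
import { Box, Video, WebFrame } from '#vcs-react/components';

export default function RecordingComposition() {
  return (
    <Box id="root">
      {/* Low-frequency web content (the chat view) rendered on the embedded-browser path */}
      <WebFrame src="https://example.com/chat" />
      {/* Real-time participant video composited on the accelerated video path */}
      <Video src={0} />
      {/* Overlay graphics (reactions, labels) would be drawn on the graphics layer */}
    </Box>
  );
}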

VCS is available on Daily’s cloud for server-side rendering of recordings and live streams. With the acceleration-centric design, we can offer VCS as an integral part of our infrastructure, available for any and all recordings. It’s not a separately siloed service with complex pricing. That means you always save money, and we can guarantee that it will scale even if your needs grow quickly.

This post has more technical detail on how you might implement a layered web page recording on VCS. Look under the heading “Live WebFrame backgrounds”.

One more thing to consider… The benefits of layered acceleration can be more widely useful than just for servers. If you can structure your application UI this way, why not run the same code on clients too? For that purpose, we offer a VCS web renderer that can be embedded into any web app regardless of what framework you’re using. This lets you isolate performance-intensive video rendering into a managed “acceleration box” and focus on building the truly client-specific UI parts that are not shared with server-side rendering.

For a practical example of how to use the VCS web renderer, see the Daily Studio sample application. It’s a complete solution for Interactive Live Streaming where the same content can be rendered on clients or servers as needed.

Summary

In this post, we discussed the challenges and solutions to web page recording. Hardware acceleration is an expensive commodity, so Daily provides an alternative, more cost-effective solution in the form of our Video Component System which can include layered web content.

If you have any questions about VCS, don't hesitate to reach out to our support team or head over to our WebRTC community.

Porting a Telehealth Application From Twilio Video
Thu, 25 Jan 2024

Twilio recently announced that the Twilio Video WebRTC service will be turned off in December 2024. 

On January 22 we hosted a live webinar for Twilio Video customers who are beginning the process of porting their products over to other video platforms. 

Daily is a seamless migration option for Twilio Video customers. We provide all of the features of Twilio Video plus many more, have a long history of operating our industry’s most innovative WebRTC developer platform, and offer dedicated engineering resources to customers porting from Twilio. For more information about Daily, please visit our Twilio Migration Resources hub.

Below, we’ve embedded a video of the webinar, including the Q&A section. Underneath the video is a transcript.

Introduction to Twilio Video Migration webinar

Hi. Welcome to this conversation about porting telehealth applications from Twilio Video to other video platforms.

If you're here, it's likely you were impacted by the announcement in December that Twilio is leaving the video developer tools space.

Twilio is giving customers a year to transition.

We know how disruptive it is to have to change platforms. My hope is that our notes today will make things a little bit easier. At Daily, we've seen what approaches work well when porting between video platforms, and what approaches create risks. 

Today, we'll:

  • cover high-level best practices
  • talk through the major choices you'll need to make, and
  • try to give you a sense of timelines and resource requirements

If your engineering team knows your product codebase well and understands your Twilio Video implementation, you will be able to transition to a new platform without too much difficulty. If you need engineering support, the new platform vendor you are moving to should be able to help you, and there are excellent independent dev shops that specialize in video implementation.


So, let's talk about the big tasks involved in porting your code. Here's how we break down a project:

Major components of a porting project 

  1. choosing a new platform 
  2. planning and allocating resources
  3. writing the code and testing it
  4. moving traffic

Choosing a platform


Build or buy

Twilio Video is a fully managed service. What our industry calls a PaaS, or Platform-as-a-Service.

When you migrate off of Twilio Video, you could migrate to another managed platform, or you could stand up, and manage, your own video infrastructure.

This is a classic software engineering build vs buy decision. I'm not going to spend too much time on this today, because the majority of the Twilio Video customers that we've talked to plan to stay in the managed platform world. 


But we do often get asked about build vs buy in our space, so I'll cover this briefly, and at a high level. I really like talking about video tech, so if you want to dive deeper on this topic, come find me on Twitter or LinkedIn.


Here are the three most important things to consider when thinking through a build vs buy decision. We've found this is sometimes new information for engineering teams thinking about standing up their own video infrastructure.

First, managed services operate at significant economies of scale, so it's very hard to save money compared to paying someone else to run your video infrastructure. The rough rule of thumb is that you'll need to be paying a video platform about $3m/year before you can approach the break-even cost of paying for your own infrastructure directly.

Second, there are no off-the-shelf devops, autoscaling, observability, or multi-region components available for real-time video infrastructure. There are good open source WebRTC media servers, but they are building blocks for operating in production at scale, not complete solutions for operating in production at scale. It's easy to deploy an open source WebRTC server for testing. But for production, you'll need to build out quite a lot of custom devops tooling.

Third, real-time video infrastructure is a specific kind of high-throughput content delivery network. Putting media servers in every geographic region where you have users turns out to be critical to making sure video calls work well on every kind of real-world network connection. You really want servers close to the edge of the network, as our industry puts it.

All of the major video platforms — Vonage, Chime, Zoom, Agora, Daily — have servers in two dozen or more data centers. And, today, the platforms that benchmark the best for video quality — Zoom, Agora, and Daily — have built out sophisticated "mesh network" infrastructure that routes video and audio packets from the edge of the network across fast internal backbones.

For all these reasons, I think a managed service is the right choice in almost all cases. I'm biased, of course, because Daily is a managed service. But I've walked through the numbers and the technical requirements here with dozens of customers of different shapes and sizes. 

The big five: key factors in platform evaluation

So let's assume today that you are migrating from Twilio Video to another managed platform. 

How do you choose which one?


Our recommended approach is to divide your platform evaluation into five topics.

  • Reliability
  • Quality
  • Features and requirements
  • Compliance
  • Support

Reliability


There are two aspects of reliability: first, overall service uptime. Second, whether video sessions reliably connect and remain connected for any user, anywhere in the world, on any network.

The importance of uptime is obvious.

I won't say much more about this, other than that every vendor should have a status page and good due diligence here is to look back through all listed production incidents on the status page to get a sense of how your vendor approaches running infrastructure at scale.

Every large-scale production system has incidents. The goal for those of us who run these systems is to minimize both overall issues and the impact of any one issue. Everything should be — as much as possible — redundant, heavily monitored, and over-provisioned.

In general, I think that system reliability is not a distinguishing factor for the major providers in our space. Those of us who have been running real-time video systems at scale for 6, 8, 10 years all have a track record of reliability and uptime.

If you're considering a newer vendor, though, it's definitely worth doing extra diligence about uptime.

The other aspect of reliability is a distinguishing factor between vendors. This is the connection success rate – will your users' video sessions connect and stay connected, on any given day, for all of your users?


The big things to ask about here are:

  • Does a vendor have infrastructure in every region where you have users? If not, many of those users will have to route via long-haul, public internet routes, and some of those calls will fail.
  • Does a vendor heavily test and benchmark on a wide variety of real-world networks? 
  • Does a vendor heavily test and optimize on a wide variety of real-world devices, including older devices and mobile Safari? Device support is, surprisingly, not something that every video vendor prioritizes.

Quality

That last set of questions is a good transition to talking about video and audio quality.

Video and audio quality are about adapting to a wide range of real-world network conditions.


Delivering the best possible video and audio quality to every user, on every device, everywhere in the world is the second most important job your video vendor should be doing, second only to uptime.

Video and audio quality are critical because bad video experiences lead directly to user churn.


It turns out that platforms have made different amounts of investment in network adaptation, CPU adaptation, and infrastructure over the past five years. Video quality is an area where platform performance does differ.


It's worth understanding what each platform's focus and level of investment in infrastructure and SDK optimizations is. AWS Chime, for example, focuses on the customer support call center use case, and is not very good at adaptive video delivery to the full range of real-world networks and devices. Vonage has not invested in building out infrastructure at the edge of the cloud. Zoom and Agora both focus on native applications and don't fully support web browsers. 

Telehealth video and audio quality: real-world conditions

It turns out that video and audio quality for telehealth use cases are challenging in three ways.

Many patients will be on cellular data networks. Many patients will be using mobile devices as opposed to laptop computers. And because installing software introduces significant friction for most users, many telehealth applications are built to run inside web browsers.

A platform vendor should be able to show you benchmarks that tell you how their platform performs on, for example, a poor cellular data or wifi network.

The goal should be to send good quality video whenever possible, and to gracefully degrade the video quality whenever there are network issues, while maintaining audio quality.

Benchmarking video quality

Here's a benchmark that shows video quality adapting to constrained network conditions. We do a lot of benchmarking at Daily. We benchmark our own performance intensively. And we benchmark against our competitors. Our benchmarks are transparent and replicable – we make our benchmarking code available because we think benchmarking is so important.

If you want to do your own independent testing, you can use our benchmarking code. Or you can work with an independent testing partner like testRTC or TestDevLab.

We'll come back to the topic of testing a little bit later.

Now let's talk about a close cousin to testing: observability.

Part of delivering a reliable, best-possible-quality experience to users on any network, any device, anywhere in the world is having actionable metrics and monitoring for every session.

Observability and metrics

Good metrics and monitoring tools are important for two reasons. First, if you're not measuring quality and reliability, it's very hard to make improvements.

Second, you will want to provide customer support to users on an individual basis. So your support team needs tools that allow them to help users debug video issues.

A video platform should give developers and product teams at least three things:

  • easy-to-use dashboards
  • REST API access to all metrics and SDK action logs
  • enterprise tooling integrations

Here's how we check these three boxes at Daily.

Everyone on your team can use our standard dashboard, which gives you point-and-click access to logs and basic metrics from every session.

For customers with larger support and data science teams, our enterprise plans come with access to more complex dashboards built in the industry standard Looker BI tool. These Looker dashboards have more views, more aggregates, and are customizable for your team.

And all of the data that feeds our dashboards is available via our REST APIs.
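As a rough sketch of pulling that data programmatically, here's what a request against Daily's REST API might look like. This assumes a Daily API key in an environment variable and uses the /v1/meetings listing endpoint as an example; the exact endpoints and response fields for logs and metrics are documented in the REST API reference.

// Node 18+ (built-in fetch), run as an ES module. DAILY_API_KEY is your Daily API key.
const resp = await fetch('https://api.daily.co/v1/meetings?limit=10', {
  headers: { Authorization: `Bearer ${process.env.DAILY_API_KEY}` },
});
const { data } = await resp.json();

for (const meeting of data) {
  // Field names here are illustrative; check the response shape in the API reference.
  console.log(meeting.room, meeting.start_time, meeting.duration);
}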

Features and requirements

Features are important! So let's talk about how to do a feature comparison between vendors.

Every vendor will give you a feature checklist. If you're porting an existing application from Twilio Video, our recommendation is to – at least initially – narrow the focus to what you're using today.

I have to note that this advice cuts against one of our strengths at Daily — we have the biggest feature set of any video platform!

But here's why we recommend focusing first on the features you use today, rather than on your roadmap. We've seen over and over that customers will have an ambitious roadmap and will want to see every feature on that roadmap supported by a platform. Which is totally understandable.

However, roadmaps and requirements change over time. And a good platform, a platform that is under active development, will add new features every quarter.

So I think it makes sense to approach a features comparison exercise in two steps.

First, prioritize evaluating how difficult your port from Twilio Video will be. Are there any features missing that will require writing workarounds or changing your product? Because workarounds and product changes introduce real costs and risks.

Second, look at a vendor's history of adding new features. Talk to that vendor. Tell them your roadmap and try to get a sense of whether your roadmap aligns with the platform's roadmaps and commercial goals.

The reality is that Twilio Video's feature set is not very large. It's unlikely that you will have trouble porting a Twilio Video application to another established platform because of any missing features.

Zoom Web SDK: missing features

Ironically, the only established vendor that this is not true of is Zoom. Twilio recommended Zoom as a new home for Twilio's video customers. But Zoom's developer tools are much less mature than their consumer product, and in particular, Zoom's web browser SDK is missing many features.

Everyone in the video space respects Zoom's core technology and infrastructure. But almost everyone was surprised by the partnership with Zoom. Twilio has a history of being a developer-focused company. In this case, however, Zoom paid Twilio to recommend a solution that is not the right fit for most of Twilio's developer customers.

Compliance

Compliance is just as important as reliability, call quality, and features. But we've left it until next-to-last in the evaluation sequence, simply because in the video space all the established vendors are HIPAA-compliant, operate in ISO 27001 certified data centers, and offer data geo-fencing.

I think there are just two nuances to note here.

First, generally speaking, healthcare applications in the US can't use Agora. Agora is a strong player in the video space, but most enterprise compliance departments won't sign off on Agora handling healthcare video and audio traffic.

Agora's headquarters, executives, and engineering team are physically located in China. This means that the Chinese Government has access to all data that Agora handles, and even though Agora has self-certified as HIPAA compliant, Chinese laws about data privacy are generally considered to be incompatible with HIPAA.

The second nuance is that some applications have a hard requirement for true end-to-end encryption. Today, the only way to do true, auditable end-to-end encryption with WebRTC in a web browser is by using peer-to-peer calls. Peer-to-peer calls come with a number of quality disadvantages and feature limitations. For example, no recording or cloud transcription is possible in peer-to-peer calls.

In general, all established WebRTC vendors take data privacy seriously and use strong encryption for every leg of media transport. So a hard requirement for true end-to-end encryption is pretty unusual. But if your application does have this requirement, you'll need to talk to your vendor to make sure they support peer-to-peer calls.

Engineering support

Finally, support. Good engineering support will, for a lot of customers, make the difference between an easy port and smooth scaling as you grow, versus struggling to ship an application that works well for all users.

At Daily, we find that we can often accelerate implementation, testing, and scaling in three complementary ways.

First, we can often save you time by offering best practices and sample code. We have sample code repositories for many common use cases and features. In an average quarter, we publish a dozen or so tutorials and explainer posts on our blog.

Second, if your use case is anything other than 1:1 video calls, it's worth a quick conversation about how to tune your video settings to maximize quality and reliability for your use case. Daily supports use cases ranging from 1:1 calls, to social apps with 1,000 participants all moving around in a virtual environment, to 100,000-person real-time live streams. The best video settings for these different use cases are all a little bit different.
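As one small illustration of what that tuning can look like, here's a hedged sketch using daily-js send settings. Treat the method and option names as assumptions to confirm against the current daily-js reference; the right values depend entirely on your use case.

// Example: in a large session, cap the video quality this client sends so that
// bandwidth and CPU are spent where they matter most.
// Assumes daily-js's updateSendSettings(); confirm the options against the reference.
await call.updateSendSettings({
  video: {
    maxQuality: 'medium',
  },
});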

Third, we can often help you debug your application code, even if your problems aren't  directly related to our APIs or real-time video. We've helped hundreds of customers scale on Daily and we've seen where the sharp edges are in front-end frameworks like React. We've seen where apps tend to hit scaling bottlenecks as they grow. And we've seen how Apple app store approvals work for apps that do video and audio. We want to be a resource for our customers. We want to save your product team time.

My advice here is to talk to a vendor's developer support engineers as much as you can during your eval process. We've sometimes had customers say to us, during their vendor evaluation, "we don't want to talk to you too much because we want to make an independent decision." I always tell them the same thing: you can certainly take everything we tell you with as big a grain of salt as you want to. But not talking to eng support just makes it impossible for you to evaluate one of the important things that you should be getting from a vendor, both early on as you build or port and over the long term as you scale. You should expect great developer support. And great support can save you meaningful amounts of time and help you make meaningful improvements to your customers' video experience.

So, that's it for the five major components of vendor evaluation: reliability, quality, features, compliance, and support.

Next we'll talk about planning your implementation and evaluating resources.

Planning and resource allocation

First, it's worth thinking through whether your team knows your existing Twilio Video implementation well, and if not, what to do about that.

If engineers on your team wrote or actively maintain your Twilio Video implementation, then you're in good shape.

You almost certainly want the engineers who know your current video code just to do the port. The APIs for all major video platforms other than Zoom are similar enough that your engineers will be able to translate their existing code to the new APIs in a pretty straightforward way.
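For a sense of what that translation looks like in practice, here's a rough side-by-side of the core join flow in Twilio Video and in daily-js. Room names, URLs, and tokens are placeholders, and event handling in a real app will be more involved.

// Twilio Video (existing code)
import { connect } from 'twilio-video';

const room = await connect(twilioAccessToken, { name: 'exam-room-123' });
room.on('participantConnected', (participant) => {
  console.log('joined:', participant.identity);
});

// Daily (ported code)
import DailyIframe from '@daily-co/daily-js';

const call = DailyIframe.createCallObject();
call.on('participant-joined', (ev) => {
  console.log('joined:', ev.participant.user_name);
});
await call.join({
  url: 'https://your-domain.daily.co/exam-room-123',
  token: dailyMeetingToken,
});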

On the other hand, if your current video code was written by someone who isn't on the team anymore, consider getting help from a consultant with Twilio Video experience.

Learning the Twilio Video APIs, how your code uses them, and the APIs of another platform is ... more work. It's certainly doable, but a good consultant will likely save you significant time.

We work with several good independent dev shops at Daily. And for Twilio Video customers operating at scale, we offer a $30,000 migration credit to offset the cost for you of making the transition to Daily.

Here's my rule of thumb — my starting point — for planning a migration. The implementation work involved in porting each major part of your video tech stack will be about 2 FTE weeks.

So, for example, you have a fairly standard 1:1 telehealth app feature set. You do real-time video, a little bit of in-call messaging, and of course you have a user flow that moves patients and providers into and out of the call. But you don't do any recording and you haven't built out any BI data or observability system integrations.

If your team knows your current video implementation, porting should take one engineer about two weeks, or two engineers that work well together about a week. 

On the other hand, if your video feature set is complex and includes multiple different video use cases, including recording, and you've built out custom integrations at the metrics and data layer, then a port is likely to be much closer to six weeks of FTE work.

Implementing and testing

Once you have time from engineers — and maybe an engineering manager — blocked out, you can dive into writing code!

Here are two things that I think are very important. If you only remember two things from this presentation, these two things should be it!

  1. Initially, do a straight port of your video implementation, changing as little as possible. Don't add new features to your application during a port. Don't do any big architectural rewrites. 
  2. Don't build an abstraction layer. Write directly against your platform's APIs.

This advice sometimes surprises people. So let's talk about both of these in a little more detail.

Over the past eight years at Daily, we've helped thousands of customers scale, and worked closely with dozens of those customers during their initial implementations.

The projects that have struggled to ship and scale have almost all either been ports combined with rewrites and new features, or projects that tried to build an abstraction layer to isolate application code from vendor APIs.

Don't port and rewrite application code

Combining a porting project with a rewrite is, almost always, a recipe for a slower, riskier implementation. The more code you change at one time, the harder it is to debug, QA, and evaluate.

Someone on your engineering team might argue that if you're touching all the video code anyway, you should clean it up and fix all the technical debt and little issues that are on the backlog. 

But don't do it. Empirically, I can tell you that the fastest, lowest-risk approach is to port to a new platform while changing as little of your application code as possible. Make sure everything is working as expected in production. After that, turn your attention to architecture improvements and new features!

The challenge of abstraction layers

So, if you're writing new video code anyway, why not create an abstraction layer that makes it easy to switch between platforms down the road? It turns out that designing, building, debugging, and maintaining an abstraction layer is a lot more work than doing several ports. An abstraction layer sounds like a good idea. I've had this conversation with many product teams over the years.

But every single customer we've had who set out to build an abstraction layer has abandoned the effort, either before ever getting to production with the abstraction layer, or later on, in order to improve code maintainability.

Also, in addition to adding risk and time to a project, using an abstraction layer prevents you from leveraging the specific strengths of a video platform. This can be okay for very simple apps. But even for simple apps, using the lowest common denominator feature set across multiple vendors is not usually a path to success.

Testing a new video implementation

Now let's talk about testing your new video implementation.

In general, if you use our recommended settings at Daily, we can confidently tell you that your video and audio will be delivered reliably to users on a very wide range of real-world networks.

But that's because we do a whole lot of testing ourselves, all the time.

We also know, because we do a lot of competitive benchmarking, and we've helped a lot of customers port to Daily, that this same level of testing isn't done by every video platform. 

So I think it's worth diving into the topic of testing, just a little bit. If your vendor doesn't do this testing, your team will need to.

First, it's important to test on simulated bad networks. Developers tend to have fast machines and good network connections.

Just because things work well for your engineering team, during development, does not mean that they will work well for all of your real-world users. Your platform vendor should be able to help you test your application on simulated bad network connections. You can also work with an independent testing service like testRTC.

Here are three basic tests that are worth doing regularly all the way along during development. Together, these three tests will help you make sure that your application is ready for production.

Networking testing: regular tests with real-world conditions

First, test regularly on a cellular data connection.

Second, test in a spot in your office or house that you know has sketchy wifi coverage.

And third, test consistently with a network simulation tool that can mimic a high packet loss, low throughput network connection.

This may seem like a lot, but you will have real-world users that match all of these profiles. In all of these cases, video should degrade gracefully and audio should remain clear.
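While you run these tests, it also helps to log the client's own view of network conditions alongside your observations. Here's a small sketch using daily-js's network quality event and stats call; treat the exact event payloads as assumptions to verify against the current reference.

// Log network quality changes and periodic stats during a test call.
call.on('network-quality-change', (ev) => {
  console.log('network quality:', ev.threshold, ev.quality);
});

setInterval(async () => {
  const { stats } = await call.getNetworkStats();
  console.log('latest network stats:', stats.latest);
}, 5000);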

Device testing

It's also important to test on a variety of real-world devices. Android phones, iPhones, older laptops. Again, your engineering team will likely have fast machines and – for browser-based apps – will tend to test on their laptops during development, not on their phones. But for telehealth, many of your users will be on mobile devices.

Load testing

Finally, it can be valuable to do load testing of your application. This is less of a concern when porting, because you know your app already works well in production. (This is, again, another reason to do a straight port, rather than a bigger rewrite, when moving between platforms. Less new code means less surface area that you will need to test from scratch.)

If you do want to do load testing, there are some unique things about testing video apps. Most load testing tools can't instantiate video sessions.

Here again you can get help from a video-focused test platform like testRTC. Daily also has a set of infrastructure features that make testing with automated and headless video clients easy, which we affectionately call our robot APIs.

Moving traffic

When you've tested internally and are ready to move production traffic, we recommend the following sequence:

  1. Set up monitoring
  2. Train your support staff
  3. Move 10 opt-in customer accounts
  4. While monitoring, move 10%, then half, then all of your traffic

You can obviously customize these steps to your particular needs and organizational best practices.

In our experience, you will usually find at least one or two bugs in step 3, while testing with a handful of accounts that have opted into being beta testers for your new video implementation. But, generally, if you've worked closely with your vendor to test on a variety of networks and devices, step 4 goes smoothly.
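One lightweight way to implement the staged rollout in step 4 is deterministic bucketing keyed on account ID, so the same account always gets the same provider and you can ramp the percentage up over time. A minimal sketch:

const crypto = require('crypto');

// Returns true if this account falls inside the current rollout percentage.
// Hashing keeps the assignment stable as you move from 10% to 50% to 100%.
function inNewVideoRollout(accountId, rolloutPercent) {
  const hash = crypto.createHash('sha256').update(String(accountId)).digest();
  const bucket = hash.readUInt32BE(0) % 100; // 0 to 99
  return bucket < rolloutPercent;
}

// Example: route 10% of accounts to the new video provider.
const useNewProvider = inNewVideoRollout('account-1234', 10);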

Enterprise firewalls

One wrinkle here is enterprise firewalls.

If you have customers who are behind locked-down enterprise firewalls, you've probably worked with your customers' IT staff to set up configurations that allow Twilio Video traffic on their networks. Getting approval for firewall changes, and implementing and testing those changes, can be time-consuming.

So if you will need your customers to make firewall changes, start this process early.

Twilio's STUN and TURN services are not going away. These two services are a big part of firewall traversal, and they are used by multiple Twilio core products (not just Twilio Video). Daily supports using Twilio STUN and TURN services in combination with Daily. This can minimize the need to make changes to firewall configurations. If you are evaluating multiple vendors, ask about keeping your Twilio STUN and TURN configuration.
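For reference, Twilio's Network Traversal Service issues short-lived STUN and TURN credentials through the Tokens endpoint, and the response is already shaped like a standard WebRTC iceServers list. How you hand those servers to your new video vendor is vendor specific, so treat this as a sketch and confirm the details with your vendor.

// Request short-lived STUN/TURN credentials from Twilio (Node 18+).
const resp = await fetch(
  `https://api.twilio.com/2010-04-01/Accounts/${accountSid}/Tokens.json`,
  {
    method: 'POST',
    headers: {
      Authorization:
        'Basic ' + Buffer.from(`${accountSid}:${authToken}`).toString('base64'),
    },
  }
);
const token = await resp.json();

// token.ice_servers contains entries like
// { urls: 'stun:global.stun.twilio.com:3478' } plus TURN entries with credentials.
console.log(token.ice_servers);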

Wrapping up

Timelines for migrating a Twilio video application

So how long will all of this take? Of course the most accurate answer is, "it depends."

But, as a rule of thumb, from start to finish a port usually takes between 1 and 3 months. For a typical telehealth app, you can roughly plan on 2 weeks for each of the four phases: vendor selection, implementation planning, implementation and testing, and moving traffic. Which adds up to two months.

We have customers who have ported to Daily and moved all traffic in under a week. We also have customers who have taken 18 months to port and move all traffic.

The big thing I want to leave you with today is that porting to a new vendor is not overwhelming — it's a very manageable process.

Closing thoughts

Break the porting process down into defined phases. And do a straight port, initially, changing as little code outside your video implementation as possible.

Finally, lean on your vendor. We're here to help. If you can't get good support from us when we're trying to win your business and help you move your traffic over, you probably won't get it later. Evaluate our engineering support just like you evaluate our infrastructure and feature set.

Real-time video is a small world and I know most people in our space. All of us — Daily and all our competitors — want every customer to succeed and want to over-deliver on engineering support.

Thanks for listening.

A Q&A followed the live webinar. If you are interested in the Q&A content, please watch the video embedded at the top of this post. For more content about migrating from Twilio Video, see our Migration Resources hub. Also feel free to contact us directly at help@daily.co, on Twitter, or by posting in our community forum.



]]>
<![CDATA[Rich overlay graphics and live backgrounds in VCS]]>https://www.daily.co/blog/rich-overlay-graphics-and-live-backgrounds-in-vcs/658163d8c4d304000136b0f8Thu, 21 Dec 2023 17:00:43 GMT

VCS is the Video Component System, Daily's developer toolkit that lets you build dynamic video compositions and multi-participant live streams. We provide VCS in our cloud-based media pipeline so you can access it easily for recording and live streaming via Daily's API. There's also an open-source web renderer package that lets you render with VCS directly in your web app — this way, you can use any combination of client and server rendering that best fits your app's needs.

Since its introduction on Daily as a beta feature last year, VCS has gained dozens of features based on requests and ideas from our customers. We're now getting ready to take off the beta label. That means the API and feature set are very close to stable, and it's a good time to take a look at all the new stuff that the platform has added.

What you're reading now is the first in a four-part series of posts covering new features in VCS. In this initial outing, we'll be looking at two new overlay graphics options for richer dynamic visuals. We'll see how a new "highlight lines" data source can be used to provide content to the various graphics overlays. We'll also look at ways to use content behind video elements. You can now use VCS WebFrame as a full-screen live background, which can make it easier to port your existing web-based UIs to VCS for recording and streaming.

So, this first post is about new tools and options for motion graphics and background content. The upcoming second post will focus on customizing video elements and their layout, and will also explain some new debugging tools related to rendering Daily rooms in VCS.

The third post in the series will cover new ways to link data sources into a VCS composition. These will let you render chat, transcript, and emoji reactions automatically. Finally, to conclude this series, we'll be unveiling new open-source libraries and code releases that make it possible to use VCS in your own real-time apps and even offline media pipelines and AI-based automation. So it's going to be worth checking back here regularly!

New graphics overlays

Before we look at the new graphics options, let's have a quick refresher. VCS is a React-based toolkit designed specifically for video compositions. It is open source. There are two parts to the project: the core and the compositions.

The VCS Core project interfaces with the React runtime and provides built-in components like Image, Video, Text, and so on. But it's just a framework. To render something, you need a program that uses VCS. These programs are called compositions.

You could write your own composition from scratch using the VCS tools, but to make life easier, we at Daily provide a baseline composition which includes a rich set of layout and graphics features. When you start a recording or live stream on Daily's cloud, the baseline composition is enabled by default. You can control it by sending parameter values for things like changing layout modes and toggling graphics overlays.

The VCS SDK includes a GUI tool called VCS Simulator which lets you control a composition's params interactively. We provide the baseline composition's simulator online, so you can easily experiment with the various params available.

There are a lot of params already, so the simulator is quite dense! But don't be discouraged from experimenting. In this blog post, when you see a mention of a composition param such as "showBannerOverlay", you can always try it out interactively within the simulator linked above. It's the best way to get a handle on how these options really work.

Now let's take a look at the new visual components in the baseline composition, starting with Banner.

The Banner is an overlay with a title and a subtitle. In its default configuration, the Banner is placed in a spot commonly known as "lower third" in television graphics. It's a staple of TV design because it's often used to show the identity of the person speaking, like this:

You can see how "lower third" got its name — the overlay box is located in the lower third of the screen. Following the usual rules for TV videography composition, this leaves enough room above the overlay for important content like the person's face.

The above screenshot shows the default styles of the Banner component. The two fonts, text colors and background color can all be configured separately.

To display this component and set the two text values, you can use the following composition params:

call.startRecording({
  layout: {
    preset: "custom",
    composition_params: {
      "showBannerOverlay": true,
      "banner.title": "Seymour Daylee",
      "banner.subtitle": "CEO, Acme Widget Corporation"
    }
  }
});

For more information on how to pass VCS composition params when starting/updating a recording or live stream on Daily, see this guide.

For details on all the composition params that are available for the Banner component, see the startRecording reference under the heading "Group: Banner".

It's possible to omit the subtitle entirely if you want just a big title. For that configuration, pass an empty string for banner.subtitle.

The Banner component can also contain an optional icon, and you can change its size:

    composition_params: {
      "showBannerOverlay": true,
      "banner.title": "Seymour Daylee",
      "banner.subtitle": "CEO, Acme Widget Corporation",
      "banner.showIcon": true,
      "banner.icon.size_gu": 5,
      "banner.icon.emoji": "🎉"
    }

The icon shown in Banner can be either an emoji or an image asset. The above example shows how you can set the emoji using the param. To use an image instead, you should leave banner.icon.emoji empty and pass an image name in banner.icon.assetName. Your custom image must be uploaded using the VCS session assets API.

Banner also supports fade in/out animations. These are on by default, so you'll automatically get a short fade-in transition when toggling the showBannerOverlay param. This is generally more pleasing for a viewer because it's less jarring. If you want to disable the animation, you can set the param  banner.enableTransition to false.

While the Banner's default configuration is a TV-style lower third, you can change its position to anywhere on the screen:

    composition_params: {
      "showBannerOverlay": true,
      "banner.title": "Seymour Daylee",
      "banner.subtitle": "CEO, Acme Widget Corporation",
      "banner.showIcon": true,
      "banner.icon.size_gu": 4.5,
      "banner.position": "top-right",
      "banner.margin_y_gu": 1,
      "banner.margin_x_gu": 1,
      "banner.title.fontSize_gu": 2,
      "banner.subtitle.fontSize_gu": 1.3,
      "banner.pad_gu": 1.3
    }

The Banner is a flexible component for many kinds of designs. It can expand in both directions to fit the text, but you have control over this behavior to ensure it stays within a designated screen area. The VCS Banner component has several params that let you decide how large it can grow horizontally.

Below, the maximum width of the component is set to 30% of the viewport size, and we're displaying some unusually long text in the subtitle:

    composition_params: {
      "showBannerOverlay": true,
      "banner.title": "Seymour Daylee",
      "banner.subtitle": "CEO, Acme Widget Corporation, with some long example text here",
      "banner.showIcon": true,
      "banner.icon.size_gu": 4.5,
      "banner.position": "top-right",
      "banner.margin_y_gu": 1,
      "banner.margin_x_gu": 1,
      "banner.title.fontSize_gu": 2,
      "banner.subtitle.fontSize_gu": 1.1,
      "banner.pad_gu": 1.3,
      "banner.maxW_pct_default": 30
    }

Note how the component expands vertically to fit all the text specified for the title and subtitle. This way you don't have to worry about whether dynamic text will fit inside the Banner graphic.

Whereas the Banner component is reminiscent of TV graphic design, the sidebar is probably familiar from desktop UI design. In its default configuration, the new Sidebar component in VCS renders as an overlay on the right-hand side of the screen:

call.startRecording({
  layout: {
    preset: "custom",
    composition_params: {
      "showSidebar": true,
    }
  }
});

If you don't want the sidebar to overlap the video elements, you can easily change it so that the video layout "shrinks" to make room for the sidebar. Here's a four-person video grid next to the sidebar:

    composition_params: {
      "showSidebar": true,
      "sidebar.shrinkVideoLayout": true
    }

These days it's common for live streams to be in portrait mode (and sometimes even square, for those of us who cherish compromise). The Sidebar component automatically adapts if the viewport is not landscape and moves the bar to the bottom instead. This placement is clearly better for fitting text on narrow screens:

Using params "sidebar.height_pct_portrait" and "sidebar.width_pct_landscape", you can control the sidebar's size separately for the landscape and portrait modes:

    composition_params: {
      "showSidebar": true,
      "sidebar.fontSize_gu": 1.3,
      "sidebar.height_pct_portrait": 15,
      "sidebar.width_pct_landscape": 30
    }

Here the height in portrait mode is set to only 15%, while in landscape mode we're using a width of 30%.

The sidebar component of course also has params to change its font, text color, background color, and other settings familiar from the other visual components in the VCS baseline composition.

For details on all the composition params that are available for the Sidebar component, see the startRecording reference under the heading "Group: Sidebar".

If the content within the Sidebar component is longer than would fit on screen, it automatically scrolls to the end so that the last item stays on screen. (You can see this in the last screenshot above: it's only showing three lines because all the text doesn't fit in the narrow bar when it's set to 15% height.)

This brings up an important question: how does the Sidebar component get its content? You can see in the above screenshots that the sidebar contains a list of items. There are two text styles being used — a base style, complemented by a special highlight style applied to one selected item.

Yet there isn't a param directly in Sidebar's settings for specifying the list of items. So where does the list come from?

Highlight lines

The secret behind the Sidebar component's highlighting capability is a new feature called highlight lines. It is actually implemented as two params in the baseline composition: highlightLines.items and highlightLines.position.

The value for highlightLines.items is multiple lines of text (separated by newline characters). By updating the value for highlightLines.position you can change the highlight, as seen in the above screenshots of Sidebar where the position was set to a value of 1, thus highlighting the second item. (If you don't want the highlight, pass a value of -1. That ensures all the items will be displayed with the base color.)

You can use this data source to display any kind of list consisting of individual text items, with or without the highlight. If a line is longer than the component's display width, it will flow automatically to the next line, so text never gets truncated.
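For example, reusing the Sidebar setup from above, the list and the highlighted item could be set like this (the item text is just an example):

    composition_params: {
      "showSidebar": true,
      "highlightLines.items": "Welcome\nProduct demo\nCustomer stories\nQ&A",
      "highlightLines.position": 1
    }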

Sidebar isn't the only component that supports display of this data. You can also display it in the TextOverlay component by specifying "highlightLines.items" for the source param. The output will look like this:

    composition_params: {
      "showTextOverlay": true,
      "text.source": "highlightLines.items",
      "text.align_horizontal": "left",
      "text.align_vertical": "top",
      "text.fontSize_gu": 1.9,
      "text.offset_x_gu": 1,
      "text.offset_y_gu": 0.2
    }

Using TextOverlay rather than Sidebar gives you more freedom to place the list on-screen as it's not constrained within the sidebar. Of course it obeys the various text layout options available on TextOverlay like centering, etc.

Similarly, the Banner component has a source param that provides the ability to display items from the highlight lines source. Because space is more constrained in Banner, it opts for a different kind of rendering. Instead of showing the entire list, it shows the current item and the following one, like this:

    composition_params: {
      "showBannerOverlay": true,
      "banner.source": "highlightLines.items",
      "banner.showIcon": false,
      "banner.title.fontSize_gu": 2.1,
      "banner.subtitle.fontSize_gu": 1.4
    }

Data sources are a powerful feature because they separate data from presentation. In an upcoming blog post, we'll see how you can drive these same visual components (Banner, TextOverlay and Sidebar) using various real time data sources too, not just the values from the highlight lines params.

Emojis as images

Previously when discussing Banner, we saw that the icon displayed inside the banner can be specified using an emoji string, like this:

    composition_params: {
      "showBannerOverlay": true,
      "banner.showIcon": true,
      "banner.icon.emoji": "🎉"
    }

The same feature is now also available for the ImageOverlay and Toast components.

For example, to render a big thumbs-up emoji as an overlay, pass it using the new image.emoji param and enable the image overlay component. Also set the aspect ratio to 1 because the emoji is a square image:

    composition_params: {
      "showImageOverlay": true,
      "image.emoji": "👍",
      "image.aspectRatio": 1
    }

The Toast component has a toast.icon.emoji param that works just like the one in Banner.

Composition backgrounds

Until now, the only option available for controlling the composition's background style was the backgroundColor property available as a configuration option when starting a recording or live stream.

That property still works, but new settings in the VCS baseline composition give you easy access to two other ways to specify a background: you can use a custom image or a WebFrame embedded browser.

Still images

With the new "image.zPosition" param, you can place an image in the background. Typically you'd want it to be full-screen, as shown here:

    composition_params: {
      "showImageOverlay": true,
      "image.zPosition": "background",
      "image.fullScreen": true,
      "image.fullScreenScaleMode": "fill"
    }
💡
What if you're using the image overlay as a background, but you also want to have an image in the foreground? In that case, a file named CustomOverlay.js within the baseline composition is your friend. It's a VCS component that gets displayed in the foreground, and it's empty by default. You can fill it out with your own image-rendering code and upload your customized version of CustomOverlay.js using the session assets API when starting a recording or live stream. See this tutorial for more details.

Live WebFrame background

In addition to a still image, you can also use a web page as a background to your composition. This opens a lot of creative possibilities.

We introduced WebFrame back in June. It's a component that lets you embed live web content into a VCS composition, quite similar to how an <iframe> element works in HTML.

The baseline composition now has a "webFrame.zPosition" param that lets you place the WebFrame component behind all video elements. It works exactly like the "image.zPosition" param discussed in the previous section.

In combination with "webFrame.fullScreen", you can use this to render a web page that covers the entire background behind video layers.
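Following the pattern of the earlier snippets, a full-screen WebFrame background combines those two params roughly like this. The params that enable WebFrame and point it at your URL are omitted here; take their exact names from the startRecording reference.

    composition_params: {
      "webFrame.zPosition": "background",
      "webFrame.fullScreen": true
    }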

This can be particularly useful if you have an existing web-based UI that you want to transform into a VCS-rendered stream/recording. The workflow is roughly as follows:

  • In your web app, create some kind of "server rendering" display mode where you can customize which elements get displayed.
  • In this mode, the app should only connect to your backend for whatever data is needed when streaming or recording. Disable all code that connects to media streams from a Daily room. Those will be rendered as VCS video layers, so the web app doesn't need to render them itself.
  • In the UI, leave an empty space for the video layout.
  • When starting a stream/recording, use the WebFrame composition params to load your web app in this display mode as the VCS background.
  • Using composition params, adapt the VCS video layout to fit into the empty space in your UI.

For typical apps, this would get you to a place where you have the basic composition working on the server. But some tuning will most likely be needed to get the appearance you need.

More detailed instructions would be specific to how your app works. If you have animated elements that happen in the foreground (e.g., emoji reactions), you should replace them with a matching VCS component. We'd be happy to help you figure out the details of this adaptation!

At this point you might be asking: Why should video layers be separately rendered? Why not just render the entire UI in WebFrame? It's a very reasonable question! The short answer is that WebFrame doesn't support real-time rendering of video layers. It's meant for embedded documents and widgets like chat, not media playback.

The somewhat longer answer is that it's all about trade-offs. A web browser was never designed to be a server-side media engine. The VCS compositing model leverages the best technologies for each type of media, which enables us at Daily to offer this service in a scalable and affordable way.

And if you want a really long answer, you're in luck — I've got a blog post coming soon about this topic. It will explain what "web page recording" really means in different contexts; the various solutions available in the market and why they tend to have fundamental reliability problems; and will also go into more detail on how you can structure an application to leverage VCS effectively.

What's next

In this post we saw several new ways to use graphics overlays and WebFrame. If you have questions or need help implementing these features in your app, don't hesitate to contact us for support.

The next post in the series will look at enhancements that apply to video layers, and some new ways to access room-level data. We'll see how we can use VCS for customizations like changing the design of a live video item based on whether the person has their audio muted. See you soon on this blog!

]]>
<![CDATA[A technical guide to the Zoom Web SDK]]>https://www.daily.co/blog/zoom-web-sdk-technical-notes/657fb65bc4d304000136af0fMon, 18 Dec 2023 23:36:24 GMT

Zoom has impressive market share in the video conferencing space. Zoom’s infrastructure and tech stack is very good. The Zoom desktop clients for macOS and Windows are easy to download and work well.

More recently, Zoom launched developer SDKs. These developer SDKs are less mature than the Zoom end-user products. In particular, the Zoom Web SDK has important feature gaps and major performance issues that developers should be aware of before attempting to port web applications to Zoom.

We've broken this post up into feature gaps relative to more established developer SDKs, performance issues relative to native WebRTC, and SDK maturity.

  • Feature gaps
  • Performance issues
  • SDK maturity and developer tooling

How video works on the Web

Web developers today can build video, audio, and messaging applications that work on almost every computer and mobile device in the world, with no application downloads or installs required. This is made possible by an Internet standard called WebRTC, which all the major web browsers support.

Zoom’s Web SDK ports parts of Zoom’s proprietary video stack into JavaScript and WebAssembly code. Zoom does not use WebRTC. This mismatch between Zoom’s technology and video on the web means that Zoom’s Web SDK will perform poorly in browsers compared to best-in-class WebRTC video implementations.

Here is a brief overview of WebRTC. And here is a technical deep dive into three important video standards: WebRTC, RTMP, and HLS.

Web applications like Google Meet and Microsoft Teams use WebRTC. WebRTC is also used outside the web browser by native mobile applications like WhatsApp and Snap.

In fact, most real-time video and audio calls today run on WebRTC. Zoom is one of the few exceptions.

Zoom’s proprietary video vs WebRTC

Zoom’s proprietary video stack uses Zoom’s own specific implementation of the H.264 video codec, designed to run efficiently in Zoom’s macOS and Windows applications. Expertise at the video codec level gave Zoom an important advantage in the early days of consumer video conferencing. Zoom developed a reputation for “it just works” when other tools struggled to deliver a reliable video experience.

Advantages of Zoom’s proprietary approach to using H.264 include:

  • The ability to make fine-grained decisions about trade-offs between video resolution, frame rate, network bandwidth, and CPU usage.
  • Flexibility to optimize infrastructure and client implementations together, which can lead to significant operational efficiencies at scale.

Disadvantages include:

  • Relying so heavily on a specific H.264 implementation limits options on some platforms, particularly the web browser. The WebRTC specification mandates the use of a different variant of H.264 than Zoom uses.
  • Zoom is locked out of the ecosystem benefits that come from using a standard like WebRTC. Open standards generally outpace proprietary stacks in performance, features, flexibility, and security over the long term.

Today, there is no longer a gap between Zoom's proprietary H.264 implementation and WebRTC. In fact, WebRTC usage now far outpaces usage of proprietary stacks, including Zoom. WebRTC platforms accommodate a much wider variety of use cases than Zoom is capable of. WebRTC is used for 1:1 video sessions and 100,000-participant live streams. WebRTC can deliver good video quality on resource-constrained devices like entry-level Android phones, or can deliver 4k video at 120 frames per second on more powerful devices.

Zoom Web SDK feature gaps

Perhaps because Zoom has always prioritized development of its own Windows and macOS applications, Zoom’s Web Video SDK is relatively feature poor. In addition, technology mismatch between Zoom’s video stack and how video is implemented in web browsers limits the Zoom Web SDK’s functionality.

Developers porting from WebRTC platforms will find that many things they consider “table stakes” are missing.

Standard HTML video and audio elements aren’t supported

The Zoom Web SDK only supports video rendering via drawing to a single canvas element. The SDK automatically plays all audio streams internally through a WebAudio pipeline.

This means that you can’t use <video> and <audio> elements to play video and audio, as you would in a normal web app. Video can’t be styled using CSS. Each video tile must be drawn as a 16:9 rectangle on a single, shared canvas. (The Zoom consumer web app uses multiple canvas elements, but the Web SDK only supports drawing to a single canvas.)
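For contrast, here's roughly what rendering looks like on a standard WebRTC platform, where each remote track gets its own video element that can be styled and positioned with plain CSS. This is a generic sketch, not Zoom Web SDK code:

// Attach a remote video track to its own <video> element.
function renderRemoteVideoTrack(videoTrack, containerId) {
  const videoEl = document.createElement('video');
  videoEl.autoplay = true;
  videoEl.playsInline = true;
  videoEl.srcObject = new MediaStream([videoTrack]);
  videoEl.style.borderRadius = '12px'; // ordinary CSS, no canvas tricks needed
  document.getElementById(containerId).appendChild(videoEl);
}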

This use of a canvas for video rendering also creates performance and responsiveness issues. Here is video showing how the official Zoom Web SDK demo application looks when its window is resized.

Zoom canvas resize issues

Styling and positioning video tiles requires writing a lot of complex code

Because all inbound video streams must be drawn on a single canvas and can only be drawn as 16:9 rectangles, creating anything other than a very simple UX requires a lot of code.

For example, implementing square or round video tiles — or even rounded corners — requires using techniques like drawing to an offscreen canvas.

Zoom does not provide any library support for implementing these kinds of custom, multi-pass canvas rendering operations. For example, you will need to write code by hand for double buffering, aligning pixels on the <canvas> element with other DOM elements, responding to resize events, and more.

Maximum video resolution is 720p

Zoom sets a hard limit of 720p on video resolution. This makes use cases that require high quality live streaming or cloud recording impossible.

No high-fidelity audio support

Zoom's consumer applications have support for sending higher fidelity audio, intended for music use cases. This is called "music mode" or "original sound" in the UX.

The Zoom Web SDK does not allow the audio stream to be configured for higher fidelity. The audio stream is locked to a configuration appropriate for low-bandwidth speech streams. Here is a Zoom developer forum post raising this issue.

To support music use cases, WebRTC platforms generally implement audio presets, expose multiple low-level audio parameters, or both.

Virtual backgrounds and background blur are not available in Safari

This is presumably a limitation that Zoom will fix at some point. But as of December 17, 2023, virtual backgrounds and background blur are not supported in Safari.

No custom video or audio tracks

The Zoom Web SDK only allows video and audio input from system devices or a URL. Custom tracks are not supported. So it is impossible to do any local video or audio processing on a camera or mic stream before sending a track into a session.

You can’t bring your own or third-party background replacement or noise suppression solutions into your web app. You are limited to the Zoom Web SDK’s features, which are significantly less capable than offerings from, for example, Krisp.ai and Banuba.

See also No access to raw media tracks, below.
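To illustrate what this rules out, here's a generic WebRTC-style sketch (not Zoom Web SDK code) that builds a custom video track by drawing processed camera frames to a canvas. Most WebRTC platforms accept a track like this as a video source; in daily-js, for example, the join options accept a MediaStreamTrack, though you should confirm the exact option name against the current reference.

// Build a processed video track from the camera using a canvas.
async function createProcessedTrack() {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');
  canvas.width = 1280;
  canvas.height = 720;

  const camera = await navigator.mediaDevices.getUserMedia({ video: true });
  const cameraVideo = document.createElement('video');
  cameraVideo.srcObject = camera;
  await cameraVideo.play();

  function drawFrame() {
    // Apply any local processing here (filters, overlays, analysis) before sending.
    ctx.drawImage(cameraVideo, 0, 0, canvas.width, canvas.height);
    requestAnimationFrame(drawFrame);
  }
  drawFrame();

  // The canvas stream's video track can be handed to a platform that supports custom tracks.
  return canvas.captureStream(30).getVideoTracks()[0];
}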

No low-level video simulcast control

Zoom provides default send- and receive-side bandwidth and video quality management. In the Zoom native clients, the algorithms for this work quite well. They do not work as well in the Zoom Web SDK and are not flexible enough to deliver the best possible user experience across the full range of real-world use cases.

For example, live streaming scenarios often require sending very high quality video layers that are appropriate for cloud recording and for users on fast network connections, alongside lower-quality fallback layers for users on slower connections. The Zoom Web SDK does not support this.

Limited debugging information

Experienced WebRTC developers rely heavily on a combination of standard tools, such as the browser's built-in WebRTC inspectors (for example, chrome://webrtc-internals) and the metrics exposed by the WebRTC getStats() API.

Zoom’s proprietary approach means that standard video debugging and performance optimization tools mostly aren’t useful. And the Zoom platform does not offer any detailed post-session logs or metrics data.

No end-to-end encryption

In 2020 the Federal Trade Commission accused Zoom of making substantive misrepresentations about security and encryption, including in HIPAA documentation. Zoom entered into an agreement with the FTC that mandated security improvements and that Zoom stop falsely claiming to support end-to-end encryption. In 2021 Zoom settled a related class action lawsuit for $85m.

Today, Zoom offers optional end-to-end encryption in their native macOS, Windows, and Zoom Room applications. This encryption is proprietary and it’s not possible to verify independently that Zoom is encrypting all data end-to-end.

End-to-end encryption is not supported at all in the Zoom Video SDKs for developers.

WebRTC platforms can build on top of WebRTC’s excellent, standards-based support for auditable end-to-end encryption. When a WebRTC connection is configured so that data is routed peer-to-peer, it is possible for any third party (including tech-savvy end users) to independently verify that data is encrypted end-to-end.

No HLS live streaming or recording

Zoom offers RTMP live streaming and MP4 cloud recording. Zoom does not offer HLS live streaming or recording. HLS has a number of advantages over both RTMP and MP4 for many of today’s live streaming and recording use cases.

With HLS, you can live stream directly to an audience of any size (millions of viewers). No transcoding or rebroadcasting services are needed.

Using HLS also gives you multi-bitrate recordings that are immediately playable on any device and any network connection. Again, no transcoding is needed for production-ready, on-demand streaming. Just set up the CDN of your choice in front of your HLS recordings bucket to create a cost-effective video streaming solution that’s compatible with any hosting stack.
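To give a sense of how simple the playback side is, here's a minimal sketch using the open source hls.js player, pointed at a placeholder playlist URL served from your CDN:

import Hls from 'hls.js';

const video = document.getElementById('playback');
const playlistUrl = 'https://cdn.example.com/recordings/session-abc/index.m3u8'; // placeholder

if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(playlistUrl);
  hls.attachMedia(video);
} else if (video.canPlayType('application/vnd.apple.mpegurl')) {
  // Safari plays HLS natively.
  video.src = playlistUrl;
}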

For more information on how WebRTC, RTMP, and HLS compare, see our technical deep dive into these three widely used video protocols.

No access to raw media tracks

The Zoom Web SDK does not give developers access to raw audio or video data. This makes it impossible to build applications that do any processing of inbound audio or video. For example, you can’t do any filtering or analysis of audio, can’t implement client-side transcription, and can’t build AI-powered video features like face filters.

No React helper libraries

The React front-end framework is widely used for dynamic, single-page web apps. React offers sophisticated state management features and a powerful virtual DOM abstraction.

Some of React’s abstractions are tricky to use efficiently and safely in combination with real-time video and audio elements. For this reason, many WebRTC platforms offer React-specific helper libraries. For example: Daily’s daily-react and Vonage’s opentok-react.
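As a flavor of what those helpers give you, here's a small sketch using daily-react's provider and one of its hooks. Component and hook names follow daily-react's public API, but treat the details as a sketch and check the library's docs for current usage.

import React from 'react';
import DailyIframe from '@daily-co/daily-js';
import { DailyProvider, useParticipantIds } from '@daily-co/daily-react';

const callObject = DailyIframe.createCallObject();

function ParticipantCount() {
  // The hook keeps the participant list in sync with call state for you.
  const ids = useParticipantIds();
  return <p>{ids.length} participants in the call</p>;
}

export default function App() {
  return (
    <DailyProvider callObject={callObject}>
      <ParticipantCount />
    </DailyProvider>
  );
}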

Zoom Web SDK performance issues

The Zoom Web SDK uses some components of Zoom’s proprietary video stack, combined with some parts of the web browser’s native WebRTC support. This is a creative approach. But it results in high CPU usage, video quality problems, and call scaling issues.

Video quality

Zoom encodes and decodes video and audio using custom WebAssembly modules rather than the web’s standard codecs. This means that the Zoom Web SDK uses more CPU than the native browser WebRTC stack does. Zoom’s web video resolution is limited to 720p and is often lower in real-world situations, especially on older devices and most mobile phones (even current-generation iPhones).

Here’s a video showing pixelated, low resolution video quality in the Zoom Web SDK sample app running in Safari on an iPhone 15. This test is easy to replicate. Simply run the sample app and join a call from both an iPhone and a laptop.

Zoom iPhone video quality

For video and audio transport, Zoom uses WebRTC data channels rather than WebRTC media tracks.

Zoom's combination of using both non-standard encoding and non-standard media transport makes it impossible to “shape” the bitrate used for video as effectively as a native WebRTC solution.

These limitations show up as jerky video — freezes and inconsistent framerates — any time there are variable network conditions or local packet loss. For a simple, real-world test, start a video call and then walk away from your WiFi router until the signal starts to degrade. A good video calling implementation should handle moderate packet loss with very little visual impact. The Zoom Web SDK exhibits freezes and jerky video even for users on fairly good WiFi networks.

High CPU usage

Efficient CPU usage is critical for video applications. The Zoom Web SDK can’t make use of the highly optimized H.264 and VP8 codecs that are built into today’s web browsers.

As a result, on older computers and phones, the Zoom Web SDK has issues with high CPU usage and low video quality. Even on newer laptops and phones, Zoom in a web browser can’t display multiple videos in grid mode with acceptable visual performance.

Here are CPU usage tests on a fairly typical older laptop, a 2.6 GHz Dual-Core Intel Core i5 macOS machine manufactured in 2020.

In a 2-person test call, the Zoom Web SDK sample application delivers video at 360p resolution. The Safari process uses 90-100% CPU as measured by Activity Monitor. The video frame rate is inconsistent and the machine overall feels heavily loaded and laggy.

Here is the same 2-person test using Daily’s native WebRTC SDK. Configured to deliver the same video resolution as Zoom (360p), the Safari process uses 25-40% CPU. The video frame rate is consistent at 30fps. The machine feels responsive. Daily can also deliver 720p video on this machine, but CPU usage goes up to 80% and if other applications are running at the same time, the machine may start to lag. So we generally don’t recommend trying to send and receive 720p video on older devices.

Here is a four-person call on the same machine. With the Zoom Web SDK, Safari CPU usage is 120%. The machine is very laggy. Audio and video are out of sync by several seconds. The Zoom sample application has gotten confused about the pixel resolution of the local video stream.

Here is the same four-person call using Daily’s native WebRTC SDK. Configured to deliver the same resolution as the Zoom Web SDK (360p), CPU usage is 70%, the frame rate is steady, and the machine is responsive.

Scaling calls

The Zoom Web SDK is limited to a maximum call size of 1,000 participants. This puts interactive streaming use cases like live auctions, events with audience participation, and social games out of reach.

SDK maturity and developer tooling

Zoom has historically focused on consumer desktop applications. The Zoom Web SDK is less mature, has fewer features, and performs poorly compared to the company’s core products. It has not been widely used in browser-oriented embedded video applications.

As of the first week of December, 2023, the Zoom Web SDK shows fewer than 4,000 downloads per week on npmjs.com. Daily's npmjs download stats average about 10 times Zoom's downloads, week over week.

Daily's npmjs download counts average 10x greater than Zoom's, week over week

Zoom’s guides and official sample application for the Web SDK are incomplete and sometimes misleading. Code workarounds are required for the SDK to work properly on Safari. The Zoom guide for migrating from Twilio Video recommends implementing a precall test in a way that won’t be helpful for a real-time video application. Zoom’s code samples don’t cover basic topics like how to listen for important browser events.

Here’s a video showing the official Zoom web sample app leaving stale video participants in a session for more than 2 minutes.

Zoom demo app stale participants

Zoom provides little support for event logging, load testing, session analytics, integration with BI data systems, and many other things that are helpful for production coding.

Development teams building video applications that need to run in a web browser should carefully consider all of these issues before committing to building and maintaining applications using the Zoom Web SDK.

]]>
<![CDATA[Talk to Santa Cat Live: Announcing the world's first AI-powered Santa calling app]]>https://www.daily.co/blog/talk-to-santa-cat-live-announcing-the-worlds-first-ai-powered-santa-calling-app/656e68e9525e90000110e5acTue, 05 Dec 2023 19:19:08 GMT

We're excited to release “Talk to Santa Cat” this week, a holiday-themed mobile app designed to bring festive cheer to families and children. Featuring an AI-powered animated cat in Santa's workshop, the app lets children engage in live, voice-driven conversations with Santa Cat, adding a sprinkle of holiday magic to your family's celebrations.

The app mimics a video call to the North Pole. Just turn on your microphone and you'll connect with Santa Cat, who asks what you're hoping to get for Christmas this year. These chats, often filled with lots of fun, charm, and humor, are driven by advanced LLM (Large Language Model) technologies, ensuring each interaction is uniquely engaging.

Try it out yourself! Download the app for iOS or Android.

Creating Santa Cat for families

Designed by Daily engineers who are also parents, Santa Cat sets out to be a safe and family-friendly experience. We fine-tuned the AI components so the app is both imaginative and age-appropriate, providing a space where children can safely immerse themselves in the wonders of the holiday season.

Originally, we developed the experience as a fun proof of concept. We started using it with our own families, and experienced how fun it was to share with our kids and nephews and nieces. We thought it'd be fun to put it out in the world, to share some holiday cheer. 

Talk to Santa Cat Live: Announcing the world's first AI-powered Santa calling app

Privacy

The app is designed with the highest level of privacy. It does not require a phone number or login information, nor does it record video or audio. It contains no ad monetization or ad analytics tracking. 

Beyond creating a neat experience for families, we built this app primarily as a showcase for the sorts of AI experiences developers can build.

We'd love for you to share your favorite Santa Cat moments with us, either directly or on social media! Please tag us!

How does it work?

There's a lot that goes into humans having a conversation! Getting an AI-powered virtual character to sound natural, in a live voice conversation, represents several technological challenges. 

Key among these are transport speed and accuracy in Speech-To-Text (STT) and Text-To-Speech (TTS) inference. Developers also have to optimize apps for real-world conditions, where there's a wide range of devices and connectivity environments. We've written about this before — read a technical deep dive here.

Voice-driven virtual characters like Santa Cat also have to accommodate natural conversational elements, like interruptions, pauses, and varying speech patterns, especially with children. Engineers need to consider advanced voice activity detection (VAD), 'barge-in' interruption capabilities, and the fine-tuning of transcription models to handle diverse and spontaneous speech patterns effectively. 

Of course, all of this must happen quickly and accurately, and it must be ready to scale to a large number of users. There's an awful lot for engineers to consider, even if it isn't immediately obvious.

How Daily is making it easier for developers to build live virtual characters

The easiest way to get started building AI apps on Daily today is by taking a look at our StoryBot repo here. Inspired by projects such as LangChain and LlamaIndex, we’re currently evolving this repo into a new open-source framework, which handles the hard parts in a highly configurable and pluggable way. Stay tuned for more on that soon.

If you'd like to learn more about how Talk to Santa Cat was made, or are interested in learning more about Daily's AI toolkits for voice and video, we'd love to hear from you! Please reach out to us or head on over to our developer community on Discord.

]]>
<![CDATA[Build a real-time AI video meeting assistant with Daily and OpenAI]]>https://www.daily.co/blog/build-a-real-time-ai-video-meeting-assistant-with-daily-and-openai/6565c8c45cfa4b0001f9071fThu, 30 Nov 2023 15:00:06 GMT
UPDATE: Learn about our updated AI offerings. We've released Pipecat, the Open Source framework for voice and multimodal AI. Daily Bots is a hosted Pipecat offering, for developers to build with any LLM and Open Source SDKs, on our global infrastructure.
Build a real-time AI video meeting assistant with Daily and OpenAI

How many times have you been in a video meeting and had to figure out who would take notes, keep track of who said what, and come up with action items during the call?

Often, the mental gymnastics of coordinating discussion points and action items hampers productivity and the ability to be truly present in a conversation.

To enhance collaboration, Zoom recently launched a real-time summarization feature enabled by the Zoom AI Companion.

Let's take a look at how this works and show you how to build your own AI meeting assistant using best-in-class infrastructure from Daily, Deepgram, and OpenAI.

In this post, Christian and I will show you how we created a real-time LLM-powered meeting assistant with Daily. We’ll cover:

  • How to create video meeting sessions with the help of Daily’s REST API.
  • How to create an AI assistant bot that joins your meeting with Daily’s Python SDK.
  • How to ask your AI assistant questions about the ongoing meeting through OpenAI.
  • How to manage the lifecycle of your bot and session.

What we’re building

We’re building an application that call participants can use to

  1. Access a real-time summary of the video call.
  2. Pose custom prompts to the AI assistant.
  3. Generate a clean transcript of the call.
  4. Generate real-time closed captions.

The demo contains a server and client component. When the user opens the web app in their browser, they’re faced with a button to create a new meeting or join an existing one:

Build a real-time AI video meeting assistant with Daily and OpenAI

When a meeting is created, the user joins the video call. Shortly thereafter, another participant named “Daily AI Assistant” joins alongside them:

Build a real-time AI video meeting assistant with Daily and OpenAI

The AI assistant begins transcription automatically, and as you speak with others in the call (or just to yourself), you should see live captions come up:

Build a real-time AI video meeting assistant with Daily and OpenAI

When you or another user wants to catch up on what’s been said so far, they can click on the “AI Assistant” button and request either a general summary or input a custom prompt with their own question:

Build a real-time AI video meeting assistant with Daily and OpenAI

The user can also request a cleaned-up transcript of the meeting by clicking the “Transcript” button:

Build a real-time AI video meeting assistant with Daily and OpenAI

Here’s a small GIF showing the custom query feature:

Build a real-time AI video meeting assistant with Daily and OpenAI

Now, let’s take a look at how to run the application locally.

Running the demo

Prepare the repository and dependencies:

git clone git@github.com:daily-demos/ai-meeting-assistant.git
cd ai-meeting-assistant
git checkout v1.0
python3 -m venv venv
source venv/bin/activate

Inside your virtual environment (which should now be active if you ran the source command above), install the server dependencies and start the server:

pip install -r server/requirements.txt
quart --app server/main.py --debug run

Now, navigate into the client directory and serve the frontend:

cd client
yarn install
yarn dev

Open the localhost URL shown in your client terminal in your browser.

With the demo up and running, let’s take a look at the core components.

Core components

Core server components

All the AI operations happen on the server. The core components of the backend are as follows:

Build a real-time AI video meeting assistant with Daily and OpenAI
  • The Operator class is responsible for keeping track of all assistant sessions currently in progress. It is also the entry point to any of the sessions when a query is made using an HTTP endpoint.
  • The Session class encapsulates a single running assistant session. This includes creating a Daily room with Daily’s REST API and instantiating an AI assistant for it, joining the Daily room with a daily-python bot, keeping track of any cached summaries, and handling relevant Daily events. The Session class also inherits from the daily-python EventHandler, which enables it to start listening for relevant Daily events (such as meeting joins, app messages, incoming transcription messages, and more).
  • The Assistant base class defines the methods any assistant needs to implement for a Session to work with it.
  • The OpenAIAssistant class is our example assistant implementation. It handles all interactions with OpenAI and keeps track of the context to send for each prompt.
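
For orientation, here is a rough sketch of what that contract looks like. The method names register_new_context() and query() are the ones used later in this post, but the exact signatures in the repo may differ slightly:

from abc import ABC, abstractmethod


class Assistant(ABC):
    """Minimal sketch of the assistant contract a Session relies on."""

    @abstractmethod
    def register_new_context(self, new_text: str, metadata: list[str] = None):
        """Store a new piece of context (usually a transcription line)."""

    @abstractmethod
    def query(self, custom_query: str = None) -> str:
        """Answer a custom query, or produce a general summary if none is given."""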

Core client components

  • The AIAssistant React component connects to the server, maintains the chat history, and processes user input.
  • The Transcript React component maintains a cleaned-up transcript of the conversation.
  • The App component sets up the Daily iframe, renders the AIAssistant and Transcript components, configures the custom buttons, and renders closed captions.

Now that we have an overview of the core components, let’s dig into the session creation flow.

Server implementation

Session creation

A session is created when the client makes a POST request to the server’s /session endpoint. This endpoint invokes the operator’s create_session() method:

def create_session(self, room_duration_mins: int = None,
                  room_url: str = None) -> str:
   """Creates a session, which includes creating a Daily room."""


   # If an active session for given room URL already exists,
   # don't create a new one
   if room_url:
       for s in self._sessions:
           if s.room_url == room_url and not s.is_destroyed:
               return s.room_url


   # Create a new session
   session = Session(self._config, room_duration_mins, room_url)
   self._sessions.append(session)
   return session.room_url

Above, the operator first checks if a session for the provided room URL (if any) already exists. If not, or if an existing room URL has not been provided, it creates a Session instance and then appends it to its own list of sessions. Then, it returns the Daily room URL of the session back to the endpoint handler (which returns it to the user).

A few things happen during session creation. I won’t show all the code in-line, but provide links to the relevant parts below:

The client will have received a response with the new Daily room URL right after step 1 above, meaning it can go ahead and join the room in its own time.
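
One of those setup steps is creating the Daily room itself through Daily’s REST API. The demo wraps this in its own helper, but a minimal stand-in could look like the sketch below (the exp property sets the room’s expiry; everything else about this helper is an assumption, not the repo’s exact code):

import time

import requests


def create_daily_room(api_key: str, duration_mins: int = 15) -> str:
    """Create a Daily room that expires after duration_mins and return its URL."""
    resp = requests.post(
        "https://api.daily.co/v1/rooms",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"properties": {"exp": int(time.time()) + duration_mins * 60}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["url"]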

Now that we know how a session is created, let’s go through how transcription messages are handled.

Handling transcription events and building the OpenAI context

Daily partners with Deepgram to power our built-in transcription features. Each time a transcription message is received during a Daily video call, our EventHandler (i.e., the Session class) instance’s on_transcription_message() callback gets invoked.

Here, the Session instance formats some metadata that we want to include with each message and sends it off to the assistant instance:

server/call/session.py:

def on_transcription_message(self, message):
   """Callback invoked when a transcription message is received."""
   user_name = message["user_name"]
   text = message["text"]
   timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
   metadata = [user_name, 'voice', timestamp]
   self._assistant.register_new_context(text, metadata)

The self._assistant.register_new_context() method then takes the text and metadata information and formats it into a single OpenAI ChatCompletionUserMessageParam, which it adds to its context collection:

server/assistant/openai_assistant.py:

def register_new_context(self, new_text: str, metadata: list[str] = None):
   """Registers new context (usually a transcription line)."""
   content = self._compile_ctx_content(new_text, metadata)
   user_msg = ChatCompletionUserMessageParam(content=content, role="user")
   self._context.append(user_msg)

The final context message will look something like this:

[Liza | voice | 2023-11-13 23:24:10] Hello! I’m speaking.
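
The _compile_ctx_content() helper isn’t shown above; conceptually it just joins the metadata and the text into that bracketed form. A minimal version of that idea (a sketch, not the repo’s exact implementation) might be:

def compile_ctx_content(new_text: str, metadata: list[str] = None) -> str:
    """Prefix a transcription line with its "[name | source | timestamp]" metadata."""
    if not metadata:
        return new_text
    return f"[{' | '.join(metadata)}] {new_text}"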

Now that we know how transcription messages are being registered in our assistant implementation context, let’s take a look at how to actually use them by querying the assistant.

Using the OpenAI assistant

There are two primary ways to use the configured AI assistant: to generate a generic meeting summary or to issue custom queries. Both of these types of query can be performed through HTTP endpoints or “app-message” events.

Querying entry points: HTTP and “app-message”

To see the querying entry points implemented for the HTTP flow, refer to the /summary and /query routes. When a request is made to one of these routes, the Operator instance is instructed to find the relevant Session instance and invoke its query_assistant() method.

To see the querying entry point implemented for the “app-message” flow, refer to the on_app_message() method within the Session class.

If the assistant is queried through HTTP requests, the answer is sent back directly in the server’s response. If it’s queried through “app-message” events, the response is transmitted back through the call client’s send_app_message() method provided by daily-python.
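
For the HTTP side, the route handlers are thin wrappers around the Operator. The following is a hedged Quart sketch rather than the demo’s actual route code; the request field name and the operator helper used here are assumptions:

from quart import Quart, jsonify, request

app = Quart(__name__)
# `operator` would be the demo's module-level Operator instance (see "Core server components").


@app.route("/summary", methods=["POST"])
async def summary():
    """Return a general summary for the session matching the given room URL."""
    data = await request.get_json()
    room_url = data.get("room_url")  # request field name is an assumption
    answer = operator.query_assistant(room_url)  # hypothetical Operator lookup/query helper
    if answer is None:
        return jsonify({"error": "No active session for that room"}), 404
    return jsonify({"summary": answer})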

Custom query or general summary

There are two types of queries an end user can make to the AI assistant:

  • The user can ask for a “general summary” of the meeting.
  • The user can pass in a custom query with their own question.

A “general summary” is what is produced by default if no custom query is provided to the assistant. Here’s the default prompt we opted to use to generate this general summary:

_default_prompt = ChatCompletionSystemMessageParam(
        content="""
         Based on the provided meeting transcript, please create a concise summary. Your summary should include:


            1. Key discussion points.
            2. Decisions made.
            3. Action items assigned.


        Keep the summary within six sentences, ensuring it captures the essence of the conversation. Structure it in clear, digestible parts for easy understanding. Rely solely on information from the transcript; do not infer or add information not explicitly mentioned. Exclude any square brackets, tags, or timestamps from the summary.
        """,
        role="system")

When the session’s query_assistant() method is invoked without a custom query, it will produce a general summary by default. A general summary will be regenerated no more than once every 30 seconds. If a summary that is newer than 30 seconds already exists, the session will just return that instead of sending a prompt to OpenAI again:

def query_assistant(self, recipient_session_id: str = None,
                   custom_query: str = None) -> [str | Future]:
   """Queries the configured assistant with either the given query, or the
   configured assistant's default"""


   want_cached_summary = not bool(custom_query)
   answer = None


   # If we want a generic summary, and we have a cached one that's less than 30 seconds old,
   # just return that.
   if want_cached_summary and self._summary:
       seconds_since_generation = time.time() - self._summary.retrieved_at
       if seconds_since_generation < 30:
           self._logger.info("Returning cached summary")
           answer = self._summary.content
    # The rest of the method below…

If we don’t want a general summary or a cached summary doesn’t exist yet, the session goes ahead and queries OpenAI:

def query_assistant(self, recipient_session_id: str = None,
                   custom_query: str = None) -> [str | Future]:
    
    # …Previously-covered logic above…
    
    # If we don't have a cached summary, or it's too old, query the
    # assistant.
    if not answer:
       self._logger.info("Querying assistant")
       try:
           answer = self._assistant.query(custom_query)
           # If there was no custom query provided, save this as cached
           # summary.
           if want_cached_summary:
               self._logger.info("Saving general summary")
               self._summary = Summary(
                   content=answer, retrieved_at=time.time())
       except NoContextError:
           answer = ("Sorry! I don't have any context saved yet. Please try speaking to add some context and "
                     "confirm that transcription is enabled.")
    # Rest of the method below…

Above, we query the assistant and then, if relevant, cache the returned summary for subsequent returns. If we encounter a NoContextError, it means a summary or custom query was requested before any transcription messages have been registered, so a generic error message is returned.

Finally, the retrieved answer is sent to the client either in the form of a string (which then gets propagated to and returned by the relevant request handler) or an ”app-message” event:

def query_assistant(self, recipient_session_id: str = None,
                   custom_query: str = None) -> [str | Future]:
    # …Previously-covered logic above…


    # If no recipient is provided, this was probably an HTTP request through the operator
    # Just return the answer string in that case.
    if not recipient_session_id:
       return answer


    # If a recipient is provided, this was likely a request through Daily's app message events.
    # Send the answer as an event as well.
    self._call_client.send_app_message({
       "kind": "assist",
       "data": answer
    }, participant=recipient_session_id, completion=self.on_app_message_sent)

Now that we know how sessions are created and queried, let’s look at the final important piece of the puzzle: ending a session.

Ending a session

This demo creates sessions which expire in 15 minutes by default, but this expiry can be overridden with a room_duration_mins /session request parameter. Once a room expires, all participants (including the assistant bot) will be ejected from the session.

But what if users are done with a session before it expires, or the server as a whole is shut down? It is important to properly clean up after each session. And the most important thing to keep in mind is that you must explicitly tell your daily-python bot to leave the room unless you want it to hang around indefinitely.

In this demo, there’s no point having the bot hanging around and keeping the session alive if there is no one in the actual Daily room. What’s there to assist with if there’s no one actually there?! So, our rules for the session are as follows:

  • An assistant bot only joins the room when there is at least one present participant already there.
  • The Session instance waits up to 5 minutes for at least one person to show up after creating a Daily room. If no one shows up within that time, the session is flagged as destroyed and will eventually be cleaned up by the Operator instance.
  • Once a session has started and a bot has joined, the session pays attention to participants leaving the call. When no more participants are present, a shutdown process begins. The session waits for 1 minute before completing this process, allowing some time for users to rejoin and continue the session.
  • Once the 1-minute shutdown timer runs out, the session instructs the assistant bot to leave the room with the call client’s leave() method.
  • Once the bot has successfully left the room (confirmed via invocation of our specified leave callback), the session is flagged as destroyed.
  • Every 5 seconds, the Operator instance runs an operation which removes any destroyed sessions from its session collection.
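
As a rough illustration of those last few rules (illustrative only, using a plain threading.Timer rather than whatever the repo actually uses), the shutdown flow might be sketched like this:

import threading

SHUTDOWN_DELAY_SECS = 60


class SessionShutdownSketch:
    """Sketch of the 'no participants left' shutdown flow described above."""

    def on_participant_left(self, remaining_count: int):
        if remaining_count > 0:
            return
        # Nobody left: give people a minute to rejoin before tearing things down.
        self._shutdown_timer = threading.Timer(SHUTDOWN_DELAY_SECS, self._shutdown)
        self._shutdown_timer.start()

    def on_participant_joined(self):
        # Someone came back in time; cancel the pending shutdown.
        if getattr(self, "_shutdown_timer", None):
            self._shutdown_timer.cancel()
            self._shutdown_timer = None

    def _shutdown(self):
        # Tell the daily-python bot to leave; the leave callback then flags the
        # session as destroyed so the Operator's cleanup loop can remove it.
        self._call_client.leave(completion=self.on_left)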

And now our cleanup is complete!

Considerations for production

Rate limiting and authorization

One important thing to consider is that this demo does not contain any rate limiting or authorization logic. The HTTP endpoints to query any meeting on the configured Daily domain can be freely used by anyone with a room URL and invoked as many times as one wishes. Before deploying something like this to production, consider the following:

  • Who should be allowed to query information about an ongoing meeting?
  • How often should they be permitted to do so?

You can ensure that only authorized users have access to meeting information by either gating it behind your own auth system or using Daily’s meeting tokens. A meeting token can be issued on a per-room basis, either by retrieving one from Daily’s REST API or self-signing a token yourself using your Daily API key. Meeting tokens can also contain claims indicating certain privileges and permissions for the holder. For example, you could make it so that only a user with an owner token is able to send custom queries to the assistant, but users with regular tokens are able to query for a general summary. Read more about obtaining, handling, and validating meeting tokens in your application.
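
As a hedged example, issuing a per-room owner token through Daily’s REST API could look roughly like this; your server would then only allow custom queries when the caller presents a token whose owner claim checks out:

import requests


def create_owner_token(api_key: str, room_name: str) -> str:
    """Issue a Daily meeting token scoped to one room, with owner privileges."""
    resp = requests.post(
        "https://api.daily.co/v1/meeting-tokens",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"properties": {"room_name": room_name, "is_owner": True}},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["token"]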

Context token limits

By default, the backend uses OpenAI’s GPT-3.5 Turbo model, which can handle a context length of 4,096 tokens. You can specify another model name in the OPENAI_MODEL_NAME environment variable, keeping potential tradeoffs in mind. For example, GPT-4 Turbo supports a whopping 128,000-token context, but we’ve found it to be more sluggish than GPT-3.5 Turbo.

Additionally, consider optimizing the context itself. We went for the most straightforward approach of storing context in memory exactly as it comes from Daily’s transcription events, but for a production use case you may consider deprecating older context if appropriate, or replacing older context with previously generated, more concise summary output. The approach depends entirely on your specific use case.
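
As the simplest possible illustration of that first option, capping the in-memory context before each query could be as small as the sketch below; the threshold is an assumption you would tune (ideally by counting actual tokens) for your chosen model:

MAX_CONTEXT_MESSAGES = 200  # assumption: tune to stay under your model's token limit


def trim_context(context: list) -> list:
    """Keep only the most recent messages so the prompt stays within the context window.

    A production version would count real tokens (e.g. with a tokenizer) or fold
    older messages into a previously generated summary instead of discarding them.
    """
    if len(context) <= MAX_CONTEXT_MESSAGES:
        return context
    return context[-MAX_CONTEXT_MESSAGES:]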

Client implementation

With the server set up and ready to run your personal AI assistant, all that’s left is a client that calls the endpoints we’ve outlined and wraps them in a slick application. We’ll use Daily Prebuilt to take care of all the core video call functionality, which lets us keep the demo code small and focused on integrating the AI assistant’s server component.

Setting up the app

We’ll build the demo app on top of Next.js, but it can be built using any JavaScript framework or no framework at all.

The main application is rendered on the index page route and renders the App component. The App component instantiates the Daily iframe and creates the server session. A Daily room URL can be provided through an input field or through a url query parameter, in which case the bot participant will join the given room. When no URL is provided, the client joins the room URL returned from the /create-session endpoint.

The app doesn’t send HTTP requests to the Python server directly: instead HTTP requests are routed through Next.js API routes to circumvent potential CORS issues.

In setting up the Daily iframe we’ll also mount the AIAssistant and Transcript components and configure custom buttons to:

  • toggle the AI Assistant view
  • toggle the Transcript view
  • toggle on-screen captions

As of this writing, Daily Prebuilt doesn’t support on-screen captions out of the box, but rendering captions on screen helps users see how spoken words were transcribed to text. These transcriptions ultimately form the context for the AI Assistant.

Building the AI Assistant UI

The AI Assistant is rendered next to the Prebuilt iframe. The AIAssistant component renders a split-view consisting of a top area for a meeting summary and bottom area with a simplified chat-like UI.

The Summary button is connected to the /summary endpoint on the Python server. Once clicked, it will request a summary from the server and render it in the top area of the AIAssistant view. Since the timing can vary, depending on whether the server returns a cached response or generates a new summary, we’ll disable the button while a summary is being fetched.

The input field and Submit button allow users to ask individual questions and connect to the /query endpoint. Similarly to the Summary button, the input field and Submit button are disabled while a query is being processed. The user’s question and the assistant’s answer are then rendered in the message stream.

Finally, the Transcript component automatically requests a cleaned-up transcript using the /query endpoint. The component updates the rendered transcript every 30 seconds, in case new transcription lines were captured. When transcript lines arrive from Deepgram, sentences may be broken into fragments, so providing a cleaned-up transcript drastically improves readability for the end user. The Transcript component is always rendered, so that the useTranscription hook keeps receiving the transcription app data events needed to maintain the cleaned-up transcript state; we simply hide the component using display: none.

Closing the cycle

When the meeting ends, the Daily Prebuilt iframe is torn down and the user is returned to the start screen.

Conclusion

In this post, Christian and I showed you how to build your own live AI-powered meeting assistant with Daily and OpenAI. If you have any feedback or questions about the demo, please don’t hesitate to reach out to our support team or head over to our Discord community.

]]>
<![CDATA[The modern-day Babel Fish: AI-powered live translation of video calls]]>https://www.daily.co/blog/the-modern-day-babelfish-ai-powered-live-translation-of-video-calls/655b81df2543c30001c915ddTue, 21 Nov 2023 20:23:50 GMT

Last month, we introduced you to Storybot, a really fun demo of talking to an LLM. But when it comes to what’s possible with Daily’s AI toolkits, we’re just getting started.

Communicating across language barriers has been a problem since long before computers even existed. In The Hitchhiker’s Guide to The Galaxy, Douglas Adams solves the age-old problem with the Babel Fish, a creature that translates any speech it hears into your native language for you.

Fortunately, we can use Daily’s AI toolkits to solve this problem without having to jam anything into your ear! (Well, you can use earbuds if you want, but that’s up to you.) We’ve built a demo of live language translation using the architecture from the Storybot app. In this post, we’ll dig into the details of how it works.

If you’re looking at the scroll bar on this post and thinking you’d rather start with a high-level overview, we’ve made a video just for you:

Conceptually, the problem of live translation breaks down into four steps:

  1. Convert each phrase of spoken audio into text (speech-to-text).
  2. Translate the phrase into the desired language(s).
  3. Generate spoken audio of the translated phrase.
  4. Play back that audio for participants in the call that want to hear the translated language.

Additionally, some users may want subtitles in the original and/or translated language. More on that later.

In the Storybot post, we mentioned these three important ideas when building your own voice-driven LLM app:

  1. Run everything in the cloud (if you can afford to).
  2. Don’t use web sockets for audio or video transport (if you can avoid them).
  3. Squeeze every bit of latency you can out of your data flow (because users don’t like to wait).

The daily-python app

You can find the source code for the daily-python app in the server directory of the llm-translator repo.

We’ll build this with daily-python so we can run and test it locally to start, but eventually, you’ll want to deploy this to your web host of choice, or use a cloud AI platform like Cerebrium.

Step 1: Speech to text

The Daily platform handles speech-to-text for us. A simple call to start_transcription() enables Daily’s call transcription, powered by Deepgram. The transcription phrases are available as app messages, and daily-python lets us set up an event handler that gets called whenever we receive an app message. In fact, there’s even a convenience method that specifically looks for transcription messages:

# daily-llm.py
self.client.join(self.room_url, self.token, completion=self.call_joined)

# ...

def call_joined(self, join_data, client_error):
    self.client.start_transcription()

# ...

def on_transcription_message(self, message):
  # Ignore translators, whose names start with "tb-"
  if not re.match(r"tb\-.*", message['user_name']):
    print(f"💼 Got transcription: {message['text']}")
    self.orchestrator.handle_user_speech(message)
  else:
    print(f"💼 Got transcription from translator {message['user_name']}, ignoring")

View on GitHub

When our app receives a transcription message, it’s calling a function named handle_user_speech() in our Orchestrator instance. This is an object that, well, orchestrates the functionality of the app: It interacts with the Daily call client, as well as the various AI services we’re using in the call.

Step 2: Translation

The next step is converting the received transcript to the desired language. The orchestrator does that by creating a new thread and calling its own method, request_llm_response():

def handle_user_speech(self, message):
  # TODO Need to support overlapping speech here!
  print(f"👅 Handling user speech: {message}")
  Thread(target=self.request_llm_response, args=(message,)).start()

def request_llm_response(self, message):
  try:
    msgs = [{"role": "system", "content": f"You will be provided with a sentence in English, and your task is to translate it into {self.language.capitalize()}."}, {"role": "user", "content": message['text']}]
    message['response'] = self.ai_llm_service.run_llm(msgs)
    self.handle_translation(message)
  except Exception as e:
    print(f"Exception in request_llm_response: {e}")

View on GitHub

request_llm_response() builds an array of messages that serve as the context for GPT-4, the large language model (LLM) we’re using in this app. In this app, we only have two messages in our context: The system instruction that tells the LLM to respond by translating the next message, and the user message containing the phrase we want translated. We call the run_llm() function of our configured service, which actually makes the API call to GPT-4.
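
The run_llm() wrapper itself isn’t shown in this post. A minimal stand-in using the (pre-1.0) OpenAI Python SDK’s streaming mode, which yields the dict-shaped chunks that handle_translation() iterates over below, might look like this sketch:

import os

import openai

openai.api_key = os.environ["OPENAI_API_KEY"]


def run_llm(messages):
    """Send the translation context to GPT-4 and return a streamed response iterator."""
    return openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        stream=True,
    )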

The LLM response is passed to the handle_translation() function, which we’ll look at next.

def handle_translation(self, message):
  # Do this all as one piece, at least for now
  llm_response = message['response']
  out = ''
  for chunk in llm_response:
    if len(chunk["choices"]) == 0:
      continue
    if "content" in chunk["choices"][0]["delta"]:
      if chunk["choices"][0]["delta"]["content"] != {}: #streaming a content chunk
        next_chunk = chunk["choices"][0]["delta"]["content"]
        out += next_chunk
  #sentence = self.ai_tts_service.run_tts(out)
  message['translation'] = out
  message['translation_language'] = self.language
  self.enqueue(TranslatorScene, message=message)

View on GitHub

The OpenAI Python SDK supports streamed responses, where instead of waiting to generate an entire sentence or paragraph, we get a single word at a time. If you’ve used ChatGPT in the browser, you’re probably familiar with the way the words appear quickly, one after another; the chunked API response works the same way. We want to generate the entire translated phrase as a single chunk of audio, though, so we wait until we’ve received all of the chunks before the next step.

Step 3: Text to speech

Generating audio is where we introduce one of the most important parts of the architecture: Scene playback.

daily-python allows us to do lots of things asynchronously—LLM completions, audio generation, image generation and more—but the scene architecture makes sure we don’t try to play back three different audio chunks at once. Let’s dig into how it works.

When the orchestrator has a new piece of content it wants to play back, it enqueues a new Scene. Each scene has two methods: prepare() and perform(). As soon as a new Scene instance is enqueued, it starts running its prepare() method asynchronously. This is where we do things like fetching audio from the text-to-speech API, or Storybot asking DALL-E to generate an image based on the prompt we’ve created.

The orchestrator can queue a bunch of scenes in quick succession, and they’ll all invoke prepare() immediately, so they’re hopefully ready to go by the time their perform() method is called.

Step 4: Audio playback

Speaking of which, as soon as the orchestrator starts scene playback, it will grab the first Scene in the queue and call its perform method. The perform method runs synchronously, one at a time in the order the scenes were queued. Each scene’s perform method waits for its prepare method to complete, if necessary, and then plays its video and audio.
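
In outline, the pattern looks like the sketch below (illustrative only, not the repo’s exact classes): prepare() kicks off on its own thread as soon as a scene is enqueued, while a single playback loop calls perform() strictly in queue order.

import threading
from queue import Queue


class Scene:
    """Sketch of the scene pattern: asynchronous prepare, serialized perform."""

    def __init__(self):
        self._prepared = threading.Event()

    def enqueue(self, queue: Queue):
        queue.put(self)
        threading.Thread(target=self._prepare_and_flag, daemon=True).start()

    def _prepare_and_flag(self):
        self.prepare()
        self._prepared.set()

    def prepare(self):
        """Fetch TTS audio, generate images, etc. Runs asynchronously."""

    def perform(self):
        """Play back the prepared audio/video. Runs one scene at a time."""

    def wait_until_prepared(self):
        self._prepared.wait()


def playback_loop(queue: Queue):
    """Single consumer: performs scenes in exactly the order they were enqueued."""
    while True:
        scene = queue.get()
        scene.wait_until_prepared()
        scene.perform()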

In this demo, we’re taking advantage of the fact that Deepgram returns transcriptions separately for each participant in the call. The translator app supports using Azure Speech or PlayHT for text-to-speech generation, and both services support many different simulated voices, so we’re using a different voice for each unique participant ID. Since one Python process handles each different translated language, the Scene architecture prevents the translated voices from speaking over each other, even if the original conversation had a bit of overlap.

Now, if you were to start three different translators for French, Spanish, and Japanese, and then join a call in Daily Prebuilt, you’d hear pure chaos—every time someone spoke, you’d hear it repeated in three languages! We’ll handle that problem with some flexibility from daily-react in the client app.

The client app

You can find the source code for the client app in the client directory of the llm-translator repo.

The client app is built from our daily-react example app, with a few important changes. First, we’ve added dropdowns to the “hair check” screen to select the language you’re speaking, as well as the language you’d like for transcripts and audio:

The modern-day Babel Fish: AI-powered live translation of video calls

In App.js, we set up an event listener to send our language settings to everyone else on the call when we join the meeting, and to request that everyone else send us their language settings. This is similar to how we sync chat history in Daily Prebuilt:

// App.js
callObject.once('joined-meeting', () => {
      // Announce my language settings for everyone on the call,
      // since daily-python doesn't support session data yet
      callObject.sendAppMessage({ msg: 'participant', data: { lang: lang.local } });
      callObject.sendAppMessage({ msg: 'request-languages' });
    });

View on GitHub

This way, everyone knows what language I’m speaking, and I know what translators are available.

We added a Subtitle component to the Tile component. Subtitle watches for app messages containing translations in the desired language, and displays them on the speaker’s video tile:

// components/Subtitle/Subtitle.js
const sendAppMessage = useAppMessage({
    onAppMessage: (ev) => {
      if (lang.local?.subtitles === ev.data?.translation_language && ev.data?.session_id === id) {
        setText(ev.data.translation);
        if (textTimeout.current) {
          clearTimeout(textTimeout.current);
        }
        textTimeout.current = setTimeout(() => {
          setText('');
        }, 7000);
      }
    },
  });

View on GitHub

We handle the audio in a useEffect hook in the Call component. Since each participant (human or translator) gets their own audio element on the page, we can loop through them and set their volume levels, based on the language information we got from the app messages when we joined the call. Here’s what we do for each one:

  • If this is a person and they’re speaking the language I want to hear, set them to full volume. Otherwise, lower their volume so I can still hear when they’re talking, but I’ll be able to hear a translator over them.
  • If this is a translator and it’s outputting a language I want to hear, set it to full volume. Otherwise, mute it.

Here’s the code:

audioTags.forEach((t) => {
        if (t.dataset.sessionId) {
          if (lang.remote[t.dataset.sessionId]) {
            // this is an audio tag for a remote participant
            const langData = lang.remote[t.dataset.sessionId];

            // if their spoken language isn't what I want to hear, turn them down
            if (langData.spoken !== lang.local.audio) {
              t.volume = 0.1;
            } else {
              t.volume = 1;
            }
          } else if (lang.translators[t.dataset.sessionId]) {
            // This is the audio tag for a translator
            const langData = lang.translators[t.dataset.sessionId];
            if (langData.out === lang.local.audio) {
              t.volume = 1;
            } else {
              t.volume = 0;
            }
          }
        }
      });

View on GitHub

There’s still work to do to integrate this into your application. For example, you may want to adjust the speed of the generated audio to keep it from drifting too far behind the original speech. When using the Azure backend, you can adjust playback speed with the “prosody” element, as shown in the run_tts() function in services/azure_ai_service.py:

ssml = f"<speak version='1.0' xml:lang='{lang}' xmlns='http://www.w3.org/2001/10/synthesis' " \
           "xmlns:mstts='http://www.w3.org/2001/mstts'>" \
           f"<voice name='{voice}'>" \
           "<mstts:silence type='Sentenceboundary' value='20ms' />" \
           "<mstts:express-as style='lyrical' styledegree='2' role='SeniorFemale'>" \
           "<prosody rate='1.05'>" \
           f"{sentence}" \
           "</prosody></mstts:express-as></voice></speak> "

View on GitHub

Conclusion

We think the implications of this use case are massive. Live translation can democratize communication in all sorts of contexts: patient care, virtual events, education, and more. We hope this ‘deep dive’ post helps you understand how you can start using daily-python to bring AI into real-time video and audio in all sorts of interesting ways.

What new, voice-driven applications are you excited about? Our favorite thing at Daily is that we get to see all sorts of amazing things that developers create with the tools we’ve built. If you’ve got an app that uses real-time speech in a new way, or ideas you’re excited about, or questions, please ping us on social media, join us on Discord, or find us online or IRL at one of the events we host.

]]>
<![CDATA[Search your video content library with LlamaIndex and Chroma]]>https://www.daily.co/blog/search-your-video-content-library-with-llamaindex-and-chroma/655340a654537c00014f3b3dWed, 15 Nov 2023 17:00:06 GMT

If you are a virtual service provider, like an ed-tech, virtual events, or even telehealth platform, chances are you have a trove of video content you would like to classify and query. Imagine asking an AI assistant about something that happened at a recorded event and getting an answer back in seconds. This can save considerable time and energy, negating the need to trawl through the video or transcript and dig up information by hand.

Let’s take a look at this important AI application by showing you how to create conversational search functionality for a video content library. In this post and accompanying demo, I’ll show you how to upload your own video meetings or fetch them automatically using Daily’s REST API, and then ask your personal AI librarian questions about what was discussed within those videos.

We’ll do this by building an application that enables users to build and query a vector store using manually-uploaded video files or Daily cloud recordings. This post will cover:

  • Core concepts: what even is a vector store?
  • How the demo works from an end-user perspective
  • Running the demo locally
  • The tech stack and architecture
  • The core parts of the implementation

But first, let’s brush up on some basic terminology. If you’re already familiar with the concept of vector stores and retrieval-augmented generation (RAG), feel free to skip this next part.

The basics: What is this even?

What is a vector?

For the purposes of this post, a vector is a representation of data in a format that AI can parse and “understand”. In the context of AI applications, we often talk about vector embeddings. Vector embeddings are just a way to capture semantically useful data in vector format.
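
For a concrete (hedged) example using the same embedding model this demo configures later on, turning a sentence into a vector is a couple of lines with LlamaIndex:

from llama_index.embeddings import HuggingFaceEmbedding

# Same model the demo's vector store is configured with later in this post.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

vector = embed_model.get_text_embedding("We agreed to ship the feature next Tuesday.")
print(len(vector))  # dimensionality of the embedding (768 for this model)
print(vector[:5])   # the first few floats in the vector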

What is a vector store?

A vector store, also known as a vector database, is a place to store vectors. This can be pretty much anything, but for this demo I’ve used Chroma, a leading open-source embedding database that is specifically designed to power AI applications. We like Chroma for several reasons, including:

  • The API design is clean and easy to use. This enables streamlined direct usage as well as integration with Retrieval-Augmented Generation (RAG) frameworks like LlamaIndex.
  • All of its core functions are flexible and customizable.

What is retrieval-augmented generation (RAG)?

Large language models like GPT-4 make many new kinds of natural language interaction possible. But what if you want to leverage the capabilities of LLMs in conjunction with your own data – in our example today, your video recordings?

Retrieval-augmented generation (RAG, for short) is a strategy for using large language models to search, summarize, and otherwise process new data that wasn’t in the LLM’s training set.

The retrieval in RAG refers to fetching relevant data from a data store. Here we’re using Chroma. The generation in RAG refers to prompting an LLM to generate useful text. The LLM we’re using is GPT-4.

To implement the RAG pattern, we’re using a data framework called LlamaIndex. LlamaIndex provides helper functions and useful abstractions that make it easier to ingest, index, and query data of various kinds.

Now that we know the basics, let’s dig in.

What we’re building

The main interface of the demo is a small front-end component that looks like this:

Search your video content library with LlamaIndex and Chroma
Demo UI with an uninitialized index

I’ll focus on the Daily cloud recordings store population flow in this post because I think it’s the most convenient for the user.

When the user opens the app in their browser, they start by creating a vector store in the left-hand column. They specify how many cloud recordings they want to index for this initial store creation and, optionally, a Daily room name for which to pull recordings. If a Daily room name is not specified, the server will pull in recordings from all rooms on the domain. Once index creation has been started, the relevant status will be shown on the client:

Search your video content library with LlamaIndex and Chroma

Depending on the size of your recordings, this step can take a while! I suggest starting with 5-10 recordings if you just want to try the store out quickly. You can update it with more data later!

Once the initial store creation is done, the store status is updated in the UI and the “Ask me anything” button on the right-hand side is enabled:

Search your video content library with LlamaIndex and Chroma

The user can then ask a question about the indexed recordings and, if relevant data exists, see an answer:

Search your video content library with LlamaIndex and Chroma

At this point, the user can also choose to index more recordings in the store to build up the collection of usable data. While the store is being updated, it can also be queried:

Search your video content library with LlamaIndex and Chroma

On the backend, the store is persisted to disk—so you can restart the server and load in your existing store.

Ok, what about manual uploads?

I’ve configured the server to permit manual uploads of up to 60MB per file for the purposes of this demo. You can configure this if desired, but I suggest sticking with small files if using manual uploads because you’ll have a much more responsive result for demonstration purposes!

The manual upload workflow is in two steps: You first upload the files and then index the files. Once a file is fully uploaded to the server, it will be shown in a file list within the UI. When you click “Index Pending Uploads”, all of those pending files will be indexed in bulk.

Search your video content library with LlamaIndex and Chroma

Now that we know the usage basics, let’s take a look at how to run the demo locally.

Running the demo locally

Prepare the dependencies

To run the demo locally, be sure to have Python 3.11 and FFmpeg installed. You will also need an OpenAI API key.

Once you’ve got those, run the following commands (replacing the python3 and pip3 commands with your own aliases to Python and pip as needed):

# Clone the git repository 
git clone https://github.com/daily-demos/recording-vector-store.git 
cd recording-vector-store
git checkout v1.0 
# Configure and activate a virtual environment 
python3 -m venv venv 
source venv/bin/activate 
# Install dependencies 
pip3 install -r requirements.txt

Configure the environment

Once the requirements are installed, copy .env.sample into an .env file in the root of the repository. Do not submit the .env file to source control!

The only variable you have to set here is OPENAI_API_KEY. The rest are optional, but I really recommend you try this demo out with Daily cloud recording integration and Deepgram transcription by setting the DAILY_API_KEY and DEEPGRAM_API_KEY environment variables.

Start the server and client

Finally, start your server and client by running the following commands in two separate terminals within your virtual environment:

# Start the vector store management server
quart --app server/index.py --debug run
# Serve the front-end 
python -m http.server --directory client

Open your web browser of choice (I suggest Chrome) to the localhost address shown in the second terminal window above. It will probably be http://localhost:8000.

Now that we’ve got the app running, let’s see what’s happening under the hood.

The tech stack

  • Vanilla JavaScript for the client.
  • Python for the server.

Demo architecture

For this post, I’ll focus on the server component of the demo as that is primarily where the fun AI stuff happens. I will briefly cover how this is all tied together on the client at the end, and also encourage you to check out the client-side implementation on GitHub.

API routes

Below are all the routes the client will use to interact with the server.

  • POST /db/index: Creates a new vector store and initializes a vector index, or updates the existing index with new data.
  • POST /db/query: Queries the existing index with the user’s input and returns an answer.
  • POST /upload: Uploads video files to the server, which can later be used to index the files (note that file upload and indexing are separate operations).
  • GET /status/capabilities: Returns the capabilities of the server, in this case used to determine whether Daily cloud recording indexing is available (i.e., whether the Daily API key has been configured in the server environment).
  • GET /status/db: Returns the status of the vector store (i.e., whether it is initialized, updating, ready for indexing, or in a failure state).
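
For instance, once the store is ready, querying it from any HTTP client is a single POST. This is a hedged sketch: the port and the request/response field names are assumptions here, so check the client code on GitHub for the exact shapes:

import requests

resp = requests.post(
    "http://127.0.0.1:5000/db/query",  # default Quart port; adjust to your setup
    json={"query": "What action items came out of last week's planning call?"},
    timeout=60,
)
print(resp.json())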

Core components

  • The primary workhorse of this demo is the Store class, which handles all vector store operations.
  • The Config class handles all environment configuration (third-party API keys, destination directories for uploads, transcripts, index storage, etc.).
  • The Transcriber base class defines some abstract methods that the Deepgram and Whisper transcription classes implement. This also enables you to add your own transcriber.
  • The daily module handles fetching of Daily recordings.
  • The media module handles things like manual file upload and audio extraction.

Store operations

All vector store operations take place in src/server/store.py. They include:

Before I get into the index creation itself, let’s take a look at what data we’re actually going to be basing our vector store on, and how it is obtained.

Data preparation

The raw data we start out with is a video file. The final output that is inserted into the index is a transcript of the discussion for each video. The way this happens is slightly different depending on whether you use Deepgram or a local Whisper model.

Transcribing Daily recordings with Deepgram

Deepgram provides the ability to transcribe a video recording straight from a publicly-retrievable URL. So when indexing Daily cloud recordings with Deepgram, I opted not to download the recording at all for this demo—instead, I fetch the recording’s access link from Daily’s REST API and feed that to Deepgram’s Python SDK for transcription:

def transcribe_from_url(self, api_key: str, recording_url: str) -> str:
    """Transcribers recording from URL."""
    print("transcribing from URL:", recording_url)
    deepgram = Deepgram(api_key)
    source = {'url': recording_url}
    res = deepgram.transcription.sync_prerecorded(
        source, self.get_transcription_options()
    )
    return self.get_transcript(res)

Transcribing Daily recordings with Whisper

If you opt for the local Whisper transcriber, you’ll need the video file to be available locally on the server for transcription. In that case, we’ll also fetch the recording’s access link from Daily’s REST API, but instead of sending it off to a third-party API, we’ll download the file locally.

Once the video file is on the server, I strip the audio into its own WAV file and send that off to my Whisper transcriber.
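
As a rough sketch of that step (assuming the openai-whisper package and FFmpeg on the path, which the setup section above already requires), the extraction and transcription could look like:

import subprocess

import whisper


def transcribe_locally(video_path: str, wav_path: str, model_name: str = "base") -> str:
    """Extract the audio track to a mono 16 kHz WAV with FFmpeg, then transcribe it with Whisper."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    model = whisper.load_model(model_name)
    result = model.transcribe(wav_path)
    return result["text"]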

Transcribing manually-uploaded files

If using manually uploaded files with Deepgram as opposed to Daily cloud recordings, the workflow followed is the same as the Whisper workflow described above.

Once transcripts are created, they are stored in a directory on the server (of course in a production environment you may choose to use another method of storage) and inserted into the vector store.

Creating an index

So—you start with a totally uninitialized vector store and no index to use with it. You go to the front-end and initialize store creation. Now what?

The index creation or update route invokes the following index creation or update method in the Store class:

    async def initialize_or_update(self, source: Source):
        """Initializes or updates a vector store from given source"""

        # If an index does not yet exist, create it.
        create_index = not self.ready()
        if create_index:
            self.update_status(State.CREATING, "Creating index")
        else:
            self.update_status(State.UPDATING, "Updating index")

        try:
            # Transcribe videos from given source.
            if source == Source.DAILY:
                await self.process_daily_recordings()
            elif source == Source.UPLOADS:
                await self.process_uploads()

            # If index creation is required, do so.
            if create_index:
                self.create_index()
            self.index.storage_context.persist(self.config.index_dir_path)
            self.update_status(State.READY, "Index ready to query")
        except Exception as e:
            msg = "Failed to create or update index"
            print(f"{msg}: {e}", file=sys.stderr)
            traceback.print_exc()
            self.update_status(State.ERROR,  msg)

Above, I set a boolean indicating whether I am creating an index or not. create_index will be True if an index does not already exist on the Store instance.

Next, I update the store status appropriately to indicate whether the index is being created or updated.

From there, depending on the given source (Daily recordings or manual uploads), I process the relevant videos. “Processing” here refers to the data preparation I mentioned above: transcribing each relevant video and saving it to a transcripts folder on the server. I suggest checking out this implementation on GitHub, to see how multiple recordings are downloaded (if needed) and transcribed using a ThreadPoolExecutor.

Finally, once all the transcripts are generated, I call self.create_index():

def create_index(self):
    """Creates a new index
     See: https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/ChromaIndexDemo.html
    """
    # Get all documents from the present transcripts
    documents = SimpleDirectoryReader(
        self.config.transcripts_dir_path
    ).load_data()

    vector_store = self.get_vector_store()
    storage_context = StorageContext.from_defaults(
        vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents, storage_context=storage_context
    )
    self.index = index

Let’s take a quick look at the get_vector_store() method mentioned above:

def get_vector_store(self):
    """Returns vector store with desired Chroma client, collection, and embed model"""
    chroma_client = chromadb.PersistentClient(
        path=self.config.index_dir_path)
    chroma_collection = chroma_client.get_or_create_collection(
        self.collection_name)
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
    vector_store = ChromaVectorStore(
        chroma_collection=chroma_collection, embed_model=embed_model)
    return vector_store

Basically, you create a Chroma client and a collection. You then specify your embed model and, finally, instantiate ChromaVectorStore().

So now that a vector store is created and persisted, how do you query it?

Querying our new vector store

Querying our new index is pretty straightforward. I do this in the query() method on my Store class, which takes a query string as a parameter:

def query(self, query: str) -> Response:
    """Queries the existing index, if one exists."""
    if not self.ready():
        raise Exception("Index not yet initialized. Try again later")
    engine = self.index.as_query_engine()
    response = engine.query(query)
    return response

Above, I take the currently loaded index and retrieve it as a query engine. This is a LlamaIndex interface that lets you ask questions about your stored data. The query value here will be whatever the user’s input was on the frontend.

The return type of engine.query() is Response, which contains the actual answer along with some metadata. For this demo, I am only consuming the Response.response property (i.e., the actual textual answer to the query).

Updating the vector store

The initialize_or_update() method we covered above handles store updates as well. It does so by checking if an index exists after transcribing each recording and invoking index.insert() if so:

def save_and_index_transcript(
        self,
        transcript_file_path: str,
        transcript: str):
    """Save the given transcript and index it if the store is ready"""

    # Save transcript to given file path
    with open(transcript_file_path, 'w+', encoding='utf-8') as f:
        f.write(transcript)
        # If the index has been loaded, go ahead and index this transcript
        if self.ready() is True:
            print("Indexing transcript:", transcript_file_path)
            doc = Document(text=transcript)
            self.index.insert(doc)

So if this step is run during index creation, the transcript will simply be saved to disk (and then all transcripts will be indexed in one go as a final step). If the step is running during index update, the transcript will be indexed right away. Once all transcripts are indexed during an update, the index and underlying Chroma store will be persisted to disk.

Speaking of persistence… What happens when you restart the server?

Loading an existing vector index

On startup, the server kicks off a background task that attempts to load an existing index:

config = Config()
config.ensure_dirs()
store = Store(config=config, max_videos=10)

@app.before_serving
async def init():
    """Initialize the index before serving"""
    # Start loading the index right away, in case one exists.
    app.add_background_task(store.load_index)

The load_index() method on the Store class attempts to load the index from our configured index persistence directory:

def load_index(self) -> bool:
    """Attempts to load existing index"""
    self.update_status(State.LOADING, "Loading index")
    try:
        save_dir = self.config.index_dir_path
        vector_store = self.get_vector_store()
        storage_context = StorageContext.from_defaults(
            vector_store=vector_store,
            docstore=SimpleDocumentStore.from_persist_dir(
                persist_dir=save_dir),
            index_store=SimpleIndexStore.from_persist_dir(
                persist_dir=save_dir),
        )
        index = load_index_from_storage(storage_context)
        if index is not None:
            self.index = index
            self.update_status(
                State.READY, "Index loaded and ready to query")
            return True
    except FileNotFoundError:
        print("Existing index not found. Store will not be loaded.")
    except ValueError as e:
        print("Failed to load index; collection likely not found", e)
    self.update_status(State.UNINITIALIZED)
    return False

If the index is loaded successfully above, I update the state to READY, which will permit the client to query the index. Otherwise, I update the state to UNINITIALIZED, indicating that an index has yet to be created.
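
For completeness, here is a minimal sketch of what the status plumbing referenced above could look like. The real demo likely tracks more states (for example, for index creation and updates), so treat the names and fields here as assumptions:

from enum import Enum

class State(Enum):
    UNINITIALIZED = "uninitialized"
    LOADING = "loading"
    READY = "ready"

# Sketch of the helpers on the Store class:
def update_status(self, state: State, message: str = ""):
    """Record the store's current state and a human-readable message."""
    self.state = state
    self.message = message

def ready(self) -> bool:
    """Querying is only allowed once an index has been loaded or created."""
    return self.state == State.READY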

Phew! That covers our primary server-side operations for creating, querying, updating, and loading a vector index. Now, all the client needs to do is talk to the thing! Let’s do a really quick rundown of client-side operations now.

How the client hooks into all this

The first thing my demo client does is retrieve the capabilities of the server. This tells the client whether it is allowed to try to index Daily recordings (which will be true if a Daily API key is specified). If you have a Daily domain with some cloud recordings on it, I suggest going this route, as it’s a more streamlined flow.

Next, the client starts polling the server’s vector store status. If the vector store is in “ready” state, querying is enabled. If the vector store is in a relevant state for initialization or update, the “Index Recordings” button is enabled.

The client also polls the server for what uploads are pending indexing and shows them in the UI. This way the user can click “Index Uploads” to index the pending files that were manually uploaded to the server.

And that’s pretty much it! The client can now create a new index, update it, and query it.

Conclusion

In this post, we looked at one approach to conversationally asking questions about the contents of video recordings, including Daily’s cloud recordings and manual uploads. If you have any questions or want to share your feedback, reach out to our support team or head over to our Discord community.

]]>
<![CDATA[AI video processing with Sieve + Daily]]>https://www.daily.co/blog/ai-video-processing-with-sieve-daily/654c0c4ddab0d7000169b746Fri, 10 Nov 2023 16:12:15 GMT

Here at Daily, we’re excited at the prospect of all the ways that AI can enhance video calls on the web, but one thing has become painfully clear: building production-ready AI apps that handle video and audio is no small feat. It can sometimes take months to get these applications and their architecture fully operational. That’s why we’ve been stoked to work with Sieve, the ultimate video and audio AI cloud service.

What is Sieve, and why should I use it?

There are hundreds of ways to utilize AI to manipulate video and audio data, but what if you had access to a whole library of AI functions and models running in the cloud? This is exactly what Sieve offers.

Users can choose from dozens of apps and pre-deployed models, such as audio enhancement, video dubbing (with lip syncing!), and transcript summarization.


With Sieve, the possibilities are endless.

Using Sieve with Daily video recordings

We built a demo featuring three examples of using Sieve functions to process Daily video recordings with incredible results, demonstrating just some of the possibilities Sieve opens up for Daily users.

The functions we selected were as follows:

To see these demos in action, be sure to watch the companion video to this blog post:

How to use Sieve functions

Applying any number of Sieve functions to your video or audio data always follows the same basic workflow.

  1. Upload your video or audio to Sieve
  2. Fetch the Sieve function of your choice
  3. Run the Sieve function

For example, in the case of using the audio_enhancement function:

import sieve

# Step 1: Upload your video/audio to Sieve
audio = sieve.Audio(url="https://storage.googleapis.com/sieve-prod-us-central1-public-file-upload-bucket/79543930-5a71-45d9-b690-77f4f0b2bfaa/1a704dda-d8be-4ae1-9894-b4ee63c69567-input-audio.mp3")

# Step 2: Fetch the Sieve function of your choice:
audio_enhancement = sieve.function.get("sieve/audio_enhancement")

# Step 3: Run the Sieve function (and capture the output)
filter_type = "all"
enhance_speed_boost = False
enhancement_steps = 50

output = audio_enhancement.run(audio, filter_type, enhance_speed_boost, enhancement_steps)

And that’s it! Any requirements for the input that each function accepts are clearly listed in their respective README on Sieve’s website.
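
To tie this back to Daily: one way to feed a Daily cloud recording into Sieve is to fetch a temporary access link from Daily's REST API and hand that URL to a Sieve function. A hedged sketch, assuming a DAILY_API_KEY environment variable and a known recording ID:

import os
import requests
import sieve

daily_api_key = os.environ["DAILY_API_KEY"]
recording_id = "YOUR_RECORDING_ID"  # placeholder

# Fetch a temporary download link for the recording from Daily's REST API
resp = requests.get(
    f"https://api.daily.co/v1/recordings/{recording_id}/access-link",
    headers={"Authorization": f"Bearer {daily_api_key}"},
)
resp.raise_for_status()
download_url = resp.json()["download_link"]

# Hand the URL to a Sieve function, as in the example above.
# Depending on the function, you may need to convert or extract the media
# first; check the function's README for accepted inputs.
audio_enhancement = sieve.function.get("sieve/audio_enhancement")
output = audio_enhancement.run(sieve.Audio(url=download_url), "all", False, 50)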

Conclusion

The sky is the limit with Sieve’s AI infrastructure at your fingertips, and here at Daily, we are super excited about the possibilities this opens up for building AI-powered workflows for recorded voice and video data.

To learn more about Sieve, check out their documentation, and you can find the codebase for this demo on GitHub.

]]>
<![CDATA[Daily and Vapi partner to deliver AI Voice Assistants as an API]]>https://www.daily.co/blog/daily-and-vapi-partner-to-deliver-ai-voice-assistants-as-an-api/654d4379dab0d7000169b791Fri, 10 Nov 2023 00:35:16 GMT

Today, we’re thrilled to partner with Vapi as they launch the first omni-platform AI voice assistant API in the market.

Daily’s mission has always been to help developers build powerful real-time communications experiences leveraging the power of WebRTC. The AI platform shift is happening faster than any previous technology wave. We recently shipped a toolkit designed to power real-time AI: voice-driven LLM apps, bots and characters, video and vision features, and speech-to-speech experiences.

Leveraging Daily's global audio infrastructure & real-time AI toolkit, Vapi’s platform delivers low-latency, customizable, and reliable real-time conversations with AI, with Vapi assistants available on all platforms supported by Daily.

Leveraging voice-enabled generative AI technology at scale

Over the past few years, audio and video communications between co-workers, between service providers and clients, and between companies and customers have become an everyday experience in almost every industry. Now, voice-enabled generative AI is poised to become the norm for many kinds of operational, educational, and commercial conversations. Conversational AI will become ubiquitous.

ScaleConvo, a YC W24 batch company, is an example of an early adopter that implemented Vapi to manage thousands of AI-driven conversations for property management. “Vapi does the legwork of immediately parsing unstructured conversations, turning it into action asynchronously, while still on a voice call. It lets us focus on building. It’s like having a high-performing customer success agent at a fraction of the cost.”

Vapi in action 


Tech Stack & Capabilities 

Vapi is built on top of the best media transport, speech-to-text, text-to-speech, and LLM technologies available.

Daily’s global WebRTC infrastructure and extensively tuned client SDKs are key to delivering the fastest response times and the best possible output from Vapi’s conversational agents. Daily’s real-time bandwidth management and very low average first-hop latency everywhere in the world (13ms) guarantee that audio packets are delivered to the cloud quickly and reliably. High-quality audio makes possible accurate speech-to-text transcriptions, which in turn ensures that Vapi’s LLMs perform optimally.

Low latency is critical in real-time conversation applications, so Vapi uses Deepgram at the start of their response pipeline to transcribe what's said in under 300ms. Deepgram is an industry leader in both overall accuracy and the flexibility of its Speech-to-text models.

Together, these technologies form a best-in-class tech stack for powerful generative voice AI. 

Adding Vapi to your site or application 

It's hard to create and scale voice AI experiences that feel natural to talk to. Vapi handles the complexity of managing the voice AI pipeline & real-time call infrastructure and makes this easy. It’s as simple as: 

  1. Write a prompt to create an assistant ("you're an assistant for....")
  2. Buy a phone number / Add a snippet to your website to deliver your assistant ("vapi.start()")
  3. That's it, your users can interact with your assistant with voice

The Vapi and Daily teams are excited to see what you build. If you have questions, or suggestions, or want to show off your real-time AI projects, feel free to post in our Discord community.

]]>
<![CDATA[AI-assisted removal of filler words from video recordings]]>https://www.daily.co/blog/ai-assisted-removal-of-filler-words-from-video-recordings/653a1c8925656e0001f70960Wed, 01 Nov 2023 14:00:17 GMT

With the ongoing evolution of LLM-powered workflows, the limits of what AI can do with real-time and recorded video are rapidly expanding. AI can now contribute to post-processing through contextualized parsing of video, audio, and transcription output. Some results are production-worthy while others are exploratory, benefiting from an additional human touch. In the end, it’s human intuition and ingenuity that enables LLM-powered applications to shine.

In this post, I’ll explore one use case and implementation for AI-assisted post-processing that can make video presenters’ lives a little easier. We’ll go through a small demo which lets you remove disfluencies, also known as filler words, from any MP4 file. These can include words like “um”, “uh”, and similar. I will cover:

  • How the demo works from an end-user perspective
  • A before and after example video
  • The demo’s tech stack and architecture
  • Running the demo locally
  • What’s happening under the hood as filler words are being removed

How the demo works

When the user opens the filler removal web application, they’re faced with a page that lets them either upload their own MP4 file or fetch the cloud recordings from their Daily domain:

[Screenshot: the front-end landing page]

For this demo, I’ve stuck with the default request size limit of 16MB in Quart, the server framework (feel free to configure this in your local installation). Once the user uploads an MP4 file, the back-end component of the demo starts processing the file to remove filler words. At this point, the client shows the status of the project in the app:

[Screenshot: a disfluency removal project being processed]

If the user clicks the “Fetch Daily recordings” button, all the Daily recordings on the configured Daily domain are displayed:

[Screenshot: a list of Daily cloud recordings for the configured domain]

The user can then click “Process” next to any of the recordings to begin removing filler words from that file. The status of the project will be displayed:

[Screenshot: one recording being processed]

Once a processing project is complete, a “Download Output” link is shown to the user, where they can retrieve their new, de-filler-ized video:

[Screenshot: a successfully-processed video with an output download link]

Here’s an example of a before and after result:

Before

[Video: before filler word removal]

After

[Video: after filler word removal]

As you can see, the output is promising but not perfect. I’ll leave some final impressions of both Deepgram and Whisper results at the end of this post.

Now that we’re familiar with the user flow, let’s look into the demo tech stack and architecture.

Tech stack and architecture

This demo is built using the following:

  • JavaScript for the client-side.
  • Python for the server component.
    • Quart for the processing server (similar to Flask, but designed to play nice with asynchronous programming).
    • moviepy to extract audio from, split, and then re-concatenate our original video files.
  • Deepgram and Whisper as two transcription and filler word detection options:
    • Deepgram’s Python SDK to implement Deepgram transcription with their Nova-tier model, which lets us get filler words in the transcription output. This transcriber relies on a Deepgram API key.
    • whisper-timestamped, which is a layer on top of the Whisper set of models enabling us to get accurate word timestamps and include filler words in transcription output. This transcriber downloads the selected Whisper model to the machine running the demo and no third-party API keys are required.
  • Daily’s REST API to retrieve Daily recordings and recording access links. If a Daily API key is not specified, the demo can still be used by uploading your own MP4 file manually.

On the server-side, the key concepts are:

  • Projects. The Project class is defined in server/project.py. Each instance of this class represents a single video for which filler words are being removed. When a project is instantiated, it takes an optional transcriber parameter.
  • Transcribers. Transcribers are the transcription implementations that power filler word detection. As mentioned before, I’ve implemented Deepgram and Whisper transcribers for this demo. You can also add your own by placing any transcriber you’d like into a new class within the server/transcription/ directory (I’ll talk a bit more about that later).

The steps an input video file goes through are as follows:

[Diagram: the steps an input video file goes through during processing]

Running the demo locally

To run the demo locally, be sure to have Python 3.11 and FFmpeg installed.

Then, run the following commands (replacing the python3 and pip3 commands with your own aliases to Python and pip as needed):

# Clone the git repository
git clone https://github.com/daily-demos/filler-word-removal.git
cd filler-word-removal
git checkout v1.0

# Configure and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Optionally, copy the .env.sample file and set your Deepgram and Daily API keys. Both keys are optional, but I think Deepgram’s results are usually superior to Whisper’s out of the box, so I’d suggest trying it out.

Now, run the following commands in two separate terminals within your virtual environment:

# Start the processing server
quart --app server/index.py --debug run
# Serve the front-end
python -m http.server --directory client

Open your web browser of choice (I suggest Chrome) to the localhost address shown in the second terminal window above. It will probably be http://localhost:8000.

Now that we’ve got the app running, let’s see what’s happening under the hood.

Under the hood of AI-powered video filler word removal

I’m going to mostly focus on the server side here, because that’s where all the magic happens. You can check out the source code for the client on GitHub to have a look at how it uses the server components below.

Server routes

All of the processing server routes are defined in server/index.py. They are:

  • POST /upload: Handles the manual upload of an MP4 file and begins processing the file to remove filler words.
  • POST /process_recording/<recording_id>: Downloads a Daily cloud recording by the provided ID and begins processing the file to remove disfluencies.
  • GET /projects/<project_id>: Reads the status file of the given filler-word-removal project and returns its contents. Enables the client to poll for status updates while processing is in progress.
  • GET /projects/<project_id>/download: Downloads the output file for the given filler-word-removal project ID, if one exists.
  • GET /recordings: Retrieves a list of all Daily recordings for the configured Daily domain.
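
As a quick illustration of how a client might exercise these routes, here's a hedged sketch using Python's requests library. The demo's actual client is JavaScript, and the host, port, file name, and the exact shape of the status JSON below are assumptions:

import time
import requests

BASE = "http://localhost:5000"  # assumes Quart's default dev-server port

# Upload an MP4 and kick off processing
with open("my-talk.mp4", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"file": f})
resp.raise_for_status()
project_id = resp.json()["project_id"]

# Poll the project's status for a while; the JSON printed here is
# whatever the server's status file contains
for _ in range(60):
    status = requests.get(f"{BASE}/projects/{project_id}").json()
    print(status)
    time.sleep(5)

# Once the status reports success, the output can be downloaded
out = requests.get(f"{BASE}/projects/{project_id}/download")
out.raise_for_status()
with open("output.mp4", "wb") as f:
    f.write(out.content)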

Let’s go through the manual upload flow and see how processing happens.

Processing an MP4 file with the /upload route

The /upload route looks as follows:

@app.route('/upload', methods=['POST'])
async def upload_file():
    """Saves uploaded MP4 file and starts processing.
    Returns project ID"""
    files = await request.files
    file = files["file"]
    project = Project()
    file_name = f'{project.id}.mp4'
    file_path = os.path.join(get_upload_dir_path(), file_name)
    try:
        await file.save(file_path)
        if not os.path.exists(file_path):
            raise Exception("uploaded file not saved", file_path)
    except Exception as e:
        return process_error('failed to save uploaded file', e)

    return process(project, file_path, file_name)

Above, I start by retrieving the file from the request. I then create an instance of Project(), which will generate a unique ID for itself when being instantiated as well as decide which transcriber to use. I’ll cover the Project instance setup shortly.

Next, I retrieve the path to which I’ll save the uploaded file based on the newly-created project ID. This directory can be configured in the application’s environment variables - check out the /server/config.py file for more information.

Once I have the file and the path to save it to, I save the file. If something goes wrong during this step, I return an error to the client. If the file saved successfully, I begin processing. I’ll dive into the processing step shortly. First, let’s take a quick look at the Project constructor I mentioned above:

Project setup

As mentioned above, the Project class constructor configures a unique ID for the project. It also decides which transcriber (Deepgram or Whisper) will be used:

class Project:
    """Class representing a single filler word removal project."""
    transcriber = None
    id = None

    def __init__(
            self,
            transcriber=None,
    ):
        if not transcriber:
            transcriber = Transcribers.WHISPER
            deepgram_api_key = os.getenv("DEEPGRAM_API_KEY")
            if deepgram_api_key:
                transcriber = Transcribers.DEEPGRAM
        self.transcriber = transcriber.value
        self.id = self.configure()

Above, if a transcriber argument is not passed in, Project will look for a DEEPGRAM_API_KEY environment variable. If a Deepgram API key has been configured, Deepgram will be used as the transcriber. Otherwise, it’ll fall back to a locally-downloaded Whisper model.

The project ID is a UUID generated in the configure() method, which checks for conflicts with any existing projects and sets up the temporary directory for this project instance:

def configure(self):
    """Generates a unique ID for this project and creates its temp dir"""
    proj_id = uuid.uuid4()
    temp_dir = get_project_temp_dir_path(proj_id)
    if os.path.exists(temp_dir):
        # Directory already exists, which indicates a conflict.
        # Pick a new UUID and try again
        return self.configure()
    os.makedirs(temp_dir)
    return proj_id

Now that we know how a project is configured, let’s dig into processing.

Beginning processing

The process() function in server/index.py takes the Project instance I created earlier, the path of the uploaded MP4 file, and the file name. It then processes the project in a Quart background task:

def process(project: Project, file_path: str, file_name: str) -> tuple[quart.Response, int]:
    """Runs filler-word-removal processing on given file."""
    try:
        app.add_background_task(project.process, file_path)

        response = {'project_id': project.id, 'name': file_name}
        return jsonify(response), 200
    except Exception as e:
        return process_error('failed to start processing file', e)

This way, the client’s request does not need to wait until the whole filler-word-removal process is complete, which can take a couple of minutes. The user will know right away that processing has started and receive a project ID which they can use to poll for status updates.

We’re now ready to dig into the critical part: What does project.process() do?

The processing step

The process() project instance method is responsible for all of the filler-word-removal operations and status updates on the project:

def process(self, source_video_path: str):
    """Processes the source video to remove filler words"""
    self.update_status(Status.IN_PROGRESS, '')
    try:
        self.update_status(Status.IN_PROGRESS, 'Extracting audio')
        audio_file_path = self.extract_audio(source_video_path)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to extract audio file')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Transcribing audio')
        result = self.transcribe(audio_file_path)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to transcribe audio')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Splitting video file')
        split_times = self.get_splits(result)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to get split segments')
        return

    try:
        self.update_status(Status.IN_PROGRESS, 'Reconstituting video file')
        self.resplice(source_video_path, split_times)
    except Exception as e:
        traceback.print_exc()
        print(e, file=sys.stderr)
        self.update_status(Status.FAILED, 'failed to resplice video')
        return

    self.update_status(Status.SUCCEEDED, 'Output file ready for download')

Aside from basic error handling and status updates, the primary steps being performed above are:

  1. extract_audio(): Extracting the audio from the uploaded video file and saving it to a WAV file.
  2. transcribe(): Transcribing the audio using the configured transcriber.
  3. get_splits(): Getting the split times we’ll use to split and reconstitute the video with filler words excluded. This also uses the configured transcriber, since the data format here may be different across different transcription models or services.
  4. resplice(): Cuts up and then splices the video based on the transcriber’s specified split times.

I’ve linked to each function in GitHub above. Let’s take a look at a few of them in more detail. Specifically, let’s focus on our transcribers, because this is where the LLM-powered magic happens.
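
One quick aside before the transcribers: the extract_audio() step boils down to a few lines of moviepy. Here's a rough sketch, not the demo's exact code (the audio file name is an assumption):

import os

from moviepy.editor import VideoFileClip

def extract_audio(self, source_video_path: str) -> str:
    """Sketch: pull the audio track out of the source video and save it
    as a WAV file in this project's temp directory."""
    audio_path = os.path.join(
        get_project_temp_dir_path(self.id), "audio.wav")
    with VideoFileClip(source_video_path) as clip:
        clip.audio.write_audiofile(audio_path)
    return audio_path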

Transcribing audio with filler words included using Deepgram

I’ll use Deepgram as the primary example for this post, but I encourage you to also check out the Whisper implementation to see how it varies.

In the server/transcription/dg.py module, I start by configuring some Deepgram transcription options:

DEEPGRAM_TRANSCRIPTION_OPTIONS = {
    "model": "general",
    "tier": "nova",
    "filler_words": True,
    "language": "en",
}

The two most important settings above are "tier" and "filler_words". By default, Deepgram omits filler words from the transcription result. To enable inclusion of filler words in the output, a Nova-tier model must be used. Currently, this is only supported with the English Nova model.

Let’s take a look at the dg module transcription step:

def transcribe(audio_path: str):
    """Transcribes give audio file using Deepgram's Nova model"""
    deepgram_api_key = os.getenv("DEEPGRAM_API_KEY")
    if not deepgram_api_key:
        raise Exception("Deepgram API key is missing")
    if not os.path.exists(audio_path):
        raise Exception("Audio file could not be found", audio_path)
    try:
        deepgram = Deepgram(deepgram_api_key)
        with open(audio_path, 'rb') as audio_file:
            source = {'buffer': audio_file, 'mimetype': "audio/wav"}
            res = deepgram.transcription.sync_prerecorded(
                source, DEEPGRAM_TRANSCRIPTION_OPTIONS
            )
        return res
    except Exception as e:
        raise Exception("failed to transcribe with Deepgram") from e

Above, I start by retrieving the Deepgram API key and raising an exception if it isn’t configured. I then confirm that the provided audio file actually exists and—you guessed it—raise an exception if not. Once we’re sure the basics are in place, we’re good to go with the transcription.

I then instantiate Deepgram, open the audio file, transcribe it via Deepgram’s sync_prerecorded() SDK method, and return the result.

Once the transcription is done, the result is returned back to the Project instance. With Deepgram, the result will be a JSON object that looks like this:

{
   "metadata":{
      //… Metadata properties here, not relevant for our purposes
   },
   "results":{
      "channels":[
         {
            "alternatives":[
               {
                  "transcript":"hello",
                  "confidence":0.9951172,
                  "words":[
                     {
                        "word":"hello",
                        "start":0.79999995,
                        "end":1.3,
                        "confidence":0.796875
                     }
                  ]
               }
            ]
         }
      ]
   }
}

The next step is to process this output to find relevant split points for our video.

Finding filler word split points in the transcription

After producing a transcription with filler words included, the same transcriber is also responsible for parsing the output and compiling all the split points we’ll need to remove the disfluencies. So, let’s take a look at how I do this in the dg module (I’ve left some guiding comments inline):

def get_splits(transcription) -> timestamp.Timestamps:
    """Retrieves split points with detected filler words removed"""
    filler_triggers = ["um", "uh", "eh", "mmhm", "mm-mm"]
    words = get_words(transcription)
    splits = timestamp.Timestamps()
    first_split_start = 0
    try:
        for text in words:
            word = text["word"]
            word_start = text["start"]
            word_end = text["end"]
            if word in filler_triggers:
                # If a non-filler tail already exists, set its end time
                # to the start of this filler word
                if splits.tail:
                    splits.tail.end = word_start

                    # If the previous non-filler's start time is not the same
                    # as the start time of this filler, add a new split.
                    if splits.tail.start != word_start:
                        splits.add(word_end, -1)
                else:
                    # If this is the very first word, be sure to start
                    # the first split _after_ this one ends.
                    first_split_start = word_end

            # If this is not a filler word and there are no other words
            # already registered, add the first split.
            elif splits.count == 0:
                splits.add(first_split_start, -1)
        splits.tail.end = words[-1]["end"]
        return splits
    except Exception as e:
        raise Exception("failed to split at filler words") from e

Above, I retrieve all the words from Deepgram’s transcription output by parsing the transcription JSON (check out the get_words() function source if you’re curious about that object structure).

I then iterate over each word entry and retrieve its “word”, “start”, and “end” properties. If the “word” value is a filler, I end the previous split at the beginning of the filler. I then add a new split at the end of the filler.
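
For reference, a get_words() helper that digs the word list out of a response shaped like the JSON above might look roughly like this (a sketch, not necessarily the demo's exact implementation):

def get_words(transcription) -> list:
    """Sketch: return the word-level entries from a Deepgram response."""
    channels = transcription["results"]["channels"]
    # The demo transcribes a single audio track, so a single channel and
    # a single alternative are assumed here.
    return channels[0]["alternatives"][0]["words"]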

The resulting splits could be visualized as follows:

[Diagram: the resulting split points, spanning the speech between detected filler words]

The collection of split points is then returned back to the Project class instance, where the original video gets cut and diced.

Cutting and reconstituting the original video

The remainder of the work happens entirely in the Project class, because none of it is specific to the chosen transcription API. Once we get the split points as a collection of Timestamp nodes, the project knows what to do with them in the resplice() function:

def resplice(self, source_video_path: str, splits: Timestamps):
    """Splits and then reconstitutes given video file at provided split points"""
    tmp = get_project_temp_dir_path(self.id)

    clips = []
    current_split = splits.head
    idx = 0

    # The rest of the function below...

Above, I start by getting the temp directory path for the project based on its ID. This is where all the individual clips will be stored.

I then initialize an array of clips and define a current_split variable pointing to the head node of the timestamp collection.

Finally, I define a starting index for our upcoming loop. The next step is to split up the video:

def resplice(self, source_video_path: str, splits: Timestamps):
    # ...Previously-covered logic above...

    try:
        while current_split:
            start = current_split.start
            end = current_split.end
            # Overarching safeguard against 0-duration and nonsensical splits
            if start >= end:
                current_split = current_split.next
                continue
            clip_file_path = os.path.join(tmp, f"{str(idx)}.mp4")
            ffmpeg_extract_subclip(source_video_path, start, end,
                                   targetname=clip_file_path)
            clips.append(VideoFileClip(clip_file_path))
            current_split = current_split.next
            idx += 1
    except Exception as e:
        raise Exception('failed to split clips') from e

    # The rest of the function below...

Above, I traverse through every split timestamp we have. For each timestamp, I extract a subclip and save it to the project’s temp directory. I append the clip to the previously-defined clips collection. I then move on to the next split point and do the same, until we’re at the end of the list of timestamps.

Now that we’ve got all the relevant subclips extracted, it’s time to put them back together:

def resplice(self, source_video_path: str, splits: Timestamps):
    # ...Previously-covered logic above...

    try:
        final_clip = concatenate_videoclips(clips)
        output_file_path = get_project_output_file_path(self.id)
        final_clip.write_videofile(
            output_file_path,
            codec='libx264',
            audio_codec='aac',
            fps=60,
        )
    except Exception as e:
        raise Exception('failed to reconcatenate clips') from e

    # Remove temp directory for this project
    shutil.rmtree(tmp)

Above, I concatenate all the clips I stored during splitting and write the result to the final output path. Feel free to play around with the codec, audio_codec, and fps parameters above.


Finally, I remove the temp directory associated with this project to avoid clutter.

And we’re done! We now have a shiny new video file with all detected filler words removed.

The client can now use the routes we covered earlier to upload a new file, fetch Daily recordings and start processing them, and fetch the latest project status from the server.

Final thoughts

Impressions of Deepgram and Whisper

I found that Whisper output seemed more aggressive than Deepgram’s in cutting out parts of valid words that aren’t disfluencies. I am confident that with some further tweaking and maybe selection of a different Whisper sub-model, the output could be refined.

Deepgram worked better out of the box in terms of not cutting out valid words, but also seemed to skip more filler words in the process. Both models ended up letting some disfluencies through.

Out of the box, I’d suggest starting with Deepgram. If you want more configuration or to try out models from HuggingFace, play around with Whisper instead.
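
If you do go the Whisper route, the core of the Whisper transcriber is a whisper-timestamped call along these lines (a sketch; the chosen model size is an assumption):

import whisper_timestamped as whisper

def transcribe(audio_path: str):
    """Sketch: transcribe with word-level timestamps and disfluency detection."""
    audio = whisper.load_audio(audio_path)
    model = whisper.load_model("base")
    # detect_disfluencies asks whisper-timestamped to mark filler words
    # in the word-level output
    return whisper.transcribe(model, audio, detect_disfluencies=True)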

Plugging in another transcriber

If you want to try another transcription method, you can do so by adding a new module to server/transcription. Just make sure to implement two functions:

  1. transcribe(), which takes a path to an audio file.
  2. get_splits(), which takes the output from transcribe() and returns an instance of timestamp.Timestamps().

With those two in place, the Project class will know what to do! You can add your new transcriber to the Transcribers enum and specify it when instantiating your project.
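
To make that concrete, a skeleton for a hypothetical new transcriber module (say, server/transcription/mymodel.py) might look like the following. The module name and import path are assumptions; only the two signatures matter:

# server/transcription/mymodel.py (hypothetical)
from server.transcription import timestamp  # assumed import path

def transcribe(audio_path: str):
    """Run your transcription model or service of choice on the audio
    file and return its raw output (whatever shape that takes)."""
    raise NotImplementedError

def get_splits(transcription) -> timestamp.Timestamps:
    """Walk the transcription output and return split points that skip
    the detected filler words."""
    splits = timestamp.Timestamps()
    # ...populate splits based on your transcriber's word timings...
    return splits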

Caveats for production use

Storage

This demo utilizes the file system to store uploads, temporary clip files, and output. No space monitoring or cleanup is implemented here (aside from removing temporary directories once a project is done). To use this in a production environment, be sure to implement appropriate monitoring measures and use a robust storage solution.

Security

This demo contains no authentication features. Processed videos are placed into a public folder that anyone can reach, associated with a UUID. Should a malicious actor guess or brute-force a valid project UUID, they can download processed output associated with that ID. For a production use case, access to output files should be gated.

Conclusion

Implementing powerful post-processing effects with AI has never been easier. Coupled with Daily’s comprehensive REST API, developers can easily fetch their recordings for further refinement with the help of an LLM. Disfluency removal is just one example of what’s possible. Keep an eye out for more demos and blog posts featuring video and audio recording enhancements with the help of AI workflows.

If you have any questions, don’t hesitate to reach out to our support team. Alternatively, hop over to our Discord community to chat about this demo.

]]>