A freeq AV agent joins a live voice/video call, hears every
participant, and can talk and show video back. Eliza
— the agent that transcribes calls, answers spoken questions, and renders
a live video tile — is one. This guide shows how to build your own.
An AV agent is an ordinary freeq bot plus three crates that handle the
call:
| Crate | Role |
|---|---|
freeq-sdk (freeq_sdk::av) |
Call signaling — start/join/leave over IRC. |
freeq-av |
The media session — connect to the SFU, publish the agent's audio + video, decode every participant. |
freeq-agent-kit |
Voice-agent helpers — turn a PCM stream into utterances, detect when the agent is addressed. |
The split mirrors the protocol: signaling is IRC, media is MoQ. If you
want the wire-level detail first, read the
AV Call Protocol. Otherwise, start here.
IRC connection (freeq-sdk)
│
├─ watch TAGMSGs ─▶ parse_av_state ─▶ "a call started"
│ │
│ AvSession::connect (freeq-av)
│ │
│ ┌───────────────┴───────────────┐
│ publish agent's recv() one
│ audio + video AvParticipant
│ │ per speaker
│ Speaker │
│ (enqueue to talk) VadSegmenter (agent-kit)
│ │
│ utterance ─┴─▶ do something
└──────────────────────────────────────────── (transcribe, answer, …)
Every agent has the same skeleton:
av-state TAGMSGs. When a call starts, decide whetherAvSession — it publishes the agent's broadcast and handsVadSegmenter.Speaker.freeq_sdk::av¶The agent connects to IRC like any bot (see the
Bot Quickstart) and watches every TAGMSG for
call state:
use freeq_sdk::av::{parse_av_state, AvAction};
while let Some(event) = events.recv().await {
if let Event::TagMsg { tags, .. } = &event {
if let Some(state) = parse_av_state(tags) {
match state.action {
AvAction::Started => {
// A call opened. Join it.
let instance = freeq_sdk::av::new_av_instance();
handle.av_join(&channel, &state.session_id, &instance).await?;
// …then open an AvSession (step 2).
}
AvAction::Ended => { /* tear the call down */ }
_ => {}
}
}
}
}
new_av_instance() mints the per-device id; handle.av_start /
av_join / av_leave send the signaling TAGMSGs. To initiate a call
rather than wait for one, probe GET /api/v1/channels/{channel}/sessions
first and av_start only if there's no active session — see
discover-or-start.
freeq-av¶AvSession is the whole media plane behind one handle. You give it where
to connect, an audio source for the agent's own voice, and a video
source; it connects to the SFU, publishes the agent's broadcast, watches
every other participant, decodes their audio, and reconnects on its own
if the transport drops.
use std::sync::Arc;
use std::sync::atomic::AtomicU32;
use freeq_av::{AvConfig, AvSession, Speaker, broadcast_path};
// A Speaker is how the agent talks; the paired source is what the
// session publishes. (The AtomicU32 is a loudness meter — wire it to a
// video tile, or ignore it.)
let (speaker, audio_source) = Speaker::new(Arc::new(AtomicU32::new(0)));
let config = AvConfig {
sfu_url: "https://irc.freeq.at:8080/av/moq".parse()?,
session_id: session_id.clone(),
our_broadcast: broadcast_path(&session_id, "myagent", &instance),
my_nick: "myagent".to_string(),
};
// `make_video` is called once per (re)connect — return a fresh
// iroh-live VideoSource each time. For a video-light agent, a static
// test pattern works; Eliza renders an animated tile (step 5).
let mut session = AvSession::connect(config, audio_source, move || {
iroh_live::media::test_sources::TestPatternSource::new(640, 360)
});
AvSession::recv() then yields one AvParticipant per remote speaker
the session starts tapping. Each carries a tokio::sync::mpsc::Receiver
of decoded PCM frames:
while let Some(mut participant) = session.recv().await {
println!("now hearing {}", participant.nick);
tokio::spawn(async move {
while let Some(frame) = participant.audio.recv().await {
// `frame.samples` is f32 PCM at `frame.format` — feed it on.
}
// The receiver closes when the participant leaves or the
// session reconnects. A reconnect re-announces everyone, so
// `recv()` will yield them again.
});
}
Dropping the AvSession ends the call: it stops publishing and closes
every participant stream.
Enqueue PCM on the Speaker and it goes out on the agent's broadcast:
speaker.enqueue(&tts_pcm, 24_000); // resampled to the broadcast rate
if speaker.is_speaking() { /* still draining the queue */ }
speaker.clear(); // stop immediately (barge-in)
freeq-agent-kit¶A participant's audio arrives as a stream of tiny PCM frames. A
transcriber wants whole utterances. VadSegmenter does the cut: it
accumulates while someone is talking and flushes at a natural pause.
use freeq_agent_kit::{VadConfig, VadSegmenter};
let mut segmenter = VadSegmenter::new(VadConfig::default());
while let Some(frame) = participant.audio.recv().await {
// Resample to 16 kHz mono for the recognizer first (see Eliza's
// `to_whisper_pcm`), then:
if let Some(utterance) = segmenter.push(&pcm_16k_mono) {
// `utterance` is one complete spoken turn — send it to STT.
}
}
push returns Some only on the frame that completes an utterance;
pre-speech silence and noise-only blips are dropped inside the segmenter,
so they never reach — and never hallucinate words out of — the recognizer.
The kit has the rest of the voice-agent glue too:
extract_addressed(text, "myagent") — was the agent addressed byis_hallucination(text) — drop the canonical Whisper silencesplit_speech_and_links(reply) — split a reply into speakable textfreeq-agent-kit is dependency-free — pull it in without dragging the
media stack along.
The full skeleton, with the call lifecycle wired up:
// On AvAction::Started (or after a discover-or-start probe):
let instance = freeq_sdk::av::new_av_instance();
handle.av_join(&channel, &session_id, &instance).await?;
let (speaker, audio_source) = Speaker::new(Arc::new(AtomicU32::new(0)));
let config = AvConfig {
sfu_url: sfu_url.clone(),
session_id: session_id.clone(),
our_broadcast: broadcast_path(&session_id, &nick, &instance),
my_nick: nick.clone(),
};
let mut session = AvSession::connect(config, audio_source, make_video);
// Keep `speaker` and `session` alive for the call. Dropping `session`
// ends it. One task per participant:
tokio::spawn(async move {
while let Some(mut p) = session.recv().await {
let speaker = speaker.clone();
tokio::spawn(async move {
let mut seg = VadSegmenter::new(VadConfig::default());
while let Some(frame) = p.audio.recv().await {
let pcm = to_whisper_pcm(&frame.samples, frame.format);
if let Some(utterance) = seg.push(&pcm) {
// transcribe → maybe answer → speaker.enqueue(...)
}
}
});
}
});
That is the entire agent loop. Everything past seg.push — the STT call,
the LLM, the text-to-speech, the video tile — is what makes your agent
different from the next one.
freeq-eliza is the worked example. On top of this skeleton she adds:
extract_addressed fires, an LLM answers andSpeaker.speaker.clear()iroh-live VideoSource that renders anHer code is the place to see every piece of this guide in production:
the av-state handler, the AvSession wiring, and the per-participant
VAD loop are all in freeq-eliza/src/irc.rs.