- Timeline key fetch now filters by sent_ts (max 60s age) to avoid
using keys from a previous call session
- After 3+ consecutive DEC_FAILED events, automatically re-fetches
key from timeline in case rotation happened
- Tracks DEC_FAILED count per participant, resets on OK
This should fix the issue where the bot picks up stale encryption keys
from previous calls and can't decrypt the current caller's audio.
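A minimal sketch of the two checks described above, for illustration only. The names `is_fresh` and `DecFailureTracker` are hypothetical, and `sent_ts` is assumed to be in milliseconds; only the 60s age limit and the threshold of 3 come from this message.

```python
from collections import defaultdict

MAX_KEY_AGE_S = 60          # from the message above
REFETCH_AFTER_FAILURES = 3  # from the message above


def is_fresh(sent_ts_ms: int, now_ms: int, max_age_s: int = MAX_KEY_AGE_S) -> bool:
    """True if a timeline key event is at most max_age_s old (timestamps assumed in ms)."""
    return now_ms - sent_ts_ms <= max_age_s * 1000


class DecFailureTracker:
    """Counts consecutive DEC_FAILED events per participant (hypothetical helper)."""

    def __init__(self, threshold: int = REFETCH_AFTER_FAILURES) -> None:
        self.threshold = threshold
        self._counts: dict[str, int] = defaultdict(int)

    def record(self, participant: str, state: str) -> bool:
        """Record an E2EE state; return True when the key should be re-fetched."""
        if state == "OK":
            self._counts[participant] = 0  # reset on successful decryption
            return False
        if state == "DEC_FAILED":
            self._counts[participant] += 1
            return self._counts[participant] >= self.threshold
        return False  # other states leave the counter untouched
```

The caller would re-fetch the key from the timeline whenever `record()` returns True, and accept only keys for which `is_fresh()` holds.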
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Plays immediate spoken feedback so the user knows the bot is processing
their screen share / camera before the vision API responds.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The on_e2ee_state callback crashed with NameError on time.monotonic()
when video tracks (screen share) arrived, preventing E2EE key re-derivation
and causing the bot to miss screen-share related questions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: aggressive video re-keying (set_key at 0.3/0.8/2/5s intervals)
briefly cleared encryption_key between SetKey and HKDF callback, causing
DEC_FAILED oscillation. Single set_key per track subscription is sufficient.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR #904 callback-based HKDF hack only fired for the first frame cryptor
(audio), leaving video frame cryptors on PBKDF2 and causing DEC_FAILED oscillation.
PR #921 integrates HKDF natively at the WebRTC C++ level, applying uniformly
to all frame cryptors (audio + video).
Also removes aggressive video re-keying workaround and adds 5s cooldown
to DEC_FAILED re-keying handler to prevent tight loops.
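The 5s cooldown could look roughly like the sketch below. `RekeyCooldown` is a hypothetical name, and the injectable clock exists only to make the behaviour easy to check; the actual handler may be structured differently.

```python
import time
from typing import Callable, Dict


class RekeyCooldown:
    """Allows at most one re-key per participant per cooldown window (hypothetical helper)."""

    def __init__(self, cooldown_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last: Dict[str, float] = {}

    def allow(self, participant: str) -> bool:
        """Return True if a re-key may run now, and remember the attempt."""
        now = self._clock()
        last = self._last.get(participant)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down: skip to avoid a tight loop
        self._last[participant] = now
        return True
```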
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Video frame cryptors may not be fully initialized when set_key() is
first called during on_track_subscribed. Audio works immediately but
video oscillates OK↔DEC_FAILED with the same key.
Add staggered re-keying at 0.3s, 0.8s, 2s, 5s after video track
subscription to ensure the key is applied after the frame cryptor
is fully ready.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KDF_PBKDF2=0 does NOT mean raw mode — libwebrtc applies its built-in
PBKDF2 on top of pre-derived keys, causing DEC_FAILED for audio too.
Revert to KDF_HKDF=1 (Rust applies HKDF, we pass raw base keys).
Keep diagnostic improvements:
- _derive_and_set_key() wrapper with logging
- Per-track type logging (audio vs video) in on_track_subscribed
- Frame size check in look_at_screen (detect E2EE failure)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Switch from Rust-side HKDF (KDF_HKDF=1) to Python-side HKDF derivation
with raw key mode (KDF_RAW=0). This eliminates potential HKDF implementation
mismatches between Rust FFI and Element Call JS that caused video frame
decryption failures (audio worked, video showed 8x8 garbage frames).
Changes:
- Add _derive_and_set_key() helper that pre-derives HKDF then calls set_key()
- Set key_derivation_function=KDF_RAW (proto 0 = no Rust-side derivation)
- Replace all direct set_key() calls with _derive_and_set_key()
- Add per-track diagnostic logging (audio vs video)
- Add frame size check in look_at_screen (detect E2EE failure early)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Video tracks (camera + screen share) were never getting E2EE keys set
via set_key() because the condition on track_subscribed only matched
audio tracks (kind==1). This caused DEC_FAILED for all video frames,
making look_at_screen return encrypted garbage or fail entirely.
Also added track source logging to distinguish camera vs screen share.
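As an illustration of the corrected condition, a sketch assuming the numeric track kinds reported in this log (1 for audio, 2 for video). The constant and function names are hypothetical.

```python
KIND_AUDIO = 1  # value taken from the condition described above
KIND_VIDEO = 2  # value reported elsewhere in this log


def needs_e2ee_key(kind: int) -> bool:
    """True for every track kind that carries encrypted media frames."""
    # The faulty condition was equivalent to `kind == KIND_AUDIO`,
    # so video tracks never received a key.
    return kind in (KIND_AUDIO, KIND_VIDEO)
```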
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise VAD thresholds (activation 0.65→0.75, min speech 0.4→0.6s,
min silence 0.55→0.65s) to reduce false triggers from background noise
- Add "focus on latest message" instruction to all prompts (voice + text)
- Add "greet and wait" behavior for new conversations instead of auto-continuing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Video track kind is 2 (not 0) in LiveKit Python SDK — camera was never captured
- Replace broken confluence_collab.create_page import with direct REST API call
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When browse_url fails with DNS resolution error (common with STT-misrecognized
domain names like "klicksports" instead of "clicksports"), automatically try a
web search to find the correct domain and retry. Applied to both text and voice bot.
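The retry flow can be sketched as follows. `DnsResolutionError`, `fetch`, and `search_domain` are hypothetical stand-ins for whatever the bots actually use; the sketch only shows the control flow.

```python
from typing import Callable, Optional


class DnsResolutionError(Exception):
    """Hypothetical error raised when a hostname cannot be resolved."""


def browse_with_fallback(
    url: str,
    fetch: Callable[[str], str],
    search_domain: Callable[[str], Optional[str]],
) -> str:
    """Fetch url; on a DNS failure, look up a corrected URL and retry once."""
    try:
        return fetch(url)
    except DnsResolutionError:
        corrected = search_domain(url)  # e.g. a web search for the spoken name
        if corrected is None or corrected == url:
            raise  # nothing better found: surface the original error
        return fetch(corrected)
```

Retrying only once keeps a bad search result from turning into a loop.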
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both bots can now fetch and read web pages via browse_url tool.
Uses httpx + BeautifulSoup to extract clean text from HTML.
Complements existing web_search (Brave) with full page reading.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MAT-58: Add recent_confluence_pages tool to both voice and text chat.
Shows last 5 recently modified pages so users can pick directly
instead of having to search every time.
MAT-59: Integrate sentry-sdk in all three entry points (agent.py,
bot.py, voice.py). SENTRY_DSN env var, traces at 10% sample rate.
Requires creating project in Sentry UI and setting DSN.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add create_confluence_page tool to voice mode (basic auth)
- Add confluence_update_page and confluence_create_page tools to text chat (OAuth)
- Fix update tool: wrap each paragraph in <p> tags instead of single wrapper
- Update system prompt to mention create capability
Previously only search/read were available. User reported bot couldn't
write to or create Confluence pages — because the tools didn't exist.
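The paragraph-wrapping fix might look like this sketch. It assumes paragraphs are separated by blank lines and that the text should be HTML-escaped; `paragraphs_to_storage_html` is a hypothetical name.

```python
from html import escape


def paragraphs_to_storage_html(text: str) -> str:
    """Wrap each blank-line-separated paragraph in its own <p> element."""
    paragraphs = [p.strip() for p in text.split("\n\n")]
    return "".join(f"<p>{escape(p)}</p>" for p in paragraphs if p)
```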
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three fixes for the bot going silent after ~10 messages:
1. STT artifact handler now returns early. Previously it detected noise
leaks ("Vielen Dank.", etc.) but still appended them to the transcript,
inflating context until the LLM timed out after 4 retries.
2. Context truncation — caps LLM chat context at 40 items and internal
transcript at 80 entries to prevent unbounded growth in long sessions.
3. LLM timeout recovery — watchdog detects when agent has been silent
for >60s despite user activity, sends a recovery reply asking user
to repeat their question instead of staying permanently silent.
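Fixes 2 and 3 can be sketched as below. The limits (40, 80, 60s) come from this message; the function names, and measuring silence from the user's last activity, are assumptions.

```python
MAX_CHAT_ITEMS = 40          # LLM chat context cap (from the message above)
MAX_TRANSCRIPT_ENTRIES = 80  # internal transcript cap (from the message above)
SILENCE_LIMIT_S = 60.0       # watchdog limit (from the message above)


def truncate_tail(items: list, limit: int) -> list:
    """Keep only the most recent `limit` items."""
    if limit <= 0:
        return []
    return items[-limit:] if len(items) > limit else items


def should_recover(last_user_activity: float, last_agent_reply: float,
                   now: float, silence_limit_s: float = SILENCE_LIMIT_S) -> bool:
    """True if the user spoke after the last reply and has waited too long."""
    user_is_waiting = last_user_activity > last_agent_reply
    return user_is_waiting and (now - last_user_activity) > silence_limit_s
```

A real implementation would probably keep the system prompt pinned while truncating; that detail is left out here.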
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Voice bot could read/update Confluence pages but could not search.
Users asking to search Confluence got a refusal. Now the voice bot
has search_confluence using CQL queries via the service account.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace inline regex section parser in voice.py with confluence_collab
library (BS4 parsing, 409 conflict retry). Bot now loads section outline
into LLM context when Confluence links are detected.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Voice bot can now see the user's camera or screen share when asked.
Captures a single frame, encodes as JPEG, sends to Sonnet vision
with full context (transcript + document). Triggered by phrases like
"schau mal", "siehst du das", "can you see this".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"bist du dir sicher" / "are you sure" / "stimmt das wirklich" now also
trigger Opus escalation for fact-checking the previous answer.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sonnet can now escalate complex questions to Opus via a function tool,
same pattern as search_web and read_confluence_page. Full context
(transcript + document) is passed automatically. Triggered by user
phrases like "denk genauer nach" / "think harder" or when Sonnet is
unsure about complex analysis.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a voice call ends and a document was loaded in the room, the bot
now analyzes the transcript for document-specific changes/corrections
and posts them as a structured "Dokument-Aenderungen" message. Returns
nothing if no document changes were discussed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Confluence tools default to active page from room context — no more
asking user for page_id
- Prompt allows roleplay/mock interviews when document context present
- Explicit instruction not to ask for page_id
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable realtime Confluence page editing during Element Call voice sessions.
- Add read_confluence_page and update_confluence_page function tools
- Detect Confluence URLs shared in Matrix rooms, store page ID for voice context
- Section-level updates via heading match + version-incremented PUT
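A rough sketch of a section-level update, under stated assumptions: the page dict mirrors what a Confluence content GET returns (`id`, `title`, `version.number`), and the regex-based `replace_section` is a simplification that treats any following heading as the end of the section. Both function names are hypothetical.

```python
import re


def replace_section(html: str, heading: str, new_html: str) -> str:
    """Replace the content under the first heading whose text matches `heading`."""
    pattern = re.compile(
        r"(<h([1-6])[^>]*>\s*" + re.escape(heading) + r"\s*</h\2>)"
        r"(.*?)(?=<h[1-6][\s>]|\Z)",
        re.DOTALL | re.IGNORECASE,
    )
    # Keep the heading itself, swap everything up to the next heading.
    return pattern.sub(lambda m: m.group(1) + new_html, html, count=1)


def build_update_payload(page: dict, new_body: str) -> dict:
    """Build the PUT body; the version number must be current + 1."""
    return {
        "id": page["id"],
        "type": "page",
        "title": page["title"],
        "version": {"number": page["version"]["number"] + 1},
        "body": {"storage": {"value": new_body, "representation": "storage"}},
    }
```

If another editor saved the page in the meantime, the incremented version no longer matches and the server rejects the PUT, so the caller has to re-read the page and retry.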
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pass PDF document context from room to voice session so the voice LLM
can answer questions about uploaded PDFs. Persist call transcripts and
post an LLM-generated summary to the room when the call ends.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Store user timezone as [PREF:timezone] in memory service
- Query timezone preference on session start, override default
- Add set_user_timezone tool so bot learns timezone from conversation
- On time-relevant questions, bot asks if user is still at stored location
- Seeded Europe/Nicosia for @christian.gick:agiliton.eu
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bot now knows the user's timezone (Europe/Berlin default) and which
LLM model it's running on, so it can answer questions about both.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Removing the blocking wait entirely caused DEC_FAILED: the rotated key
had not arrived via nio sync before the pipeline started. Restore a short
3s wait (down from 10s), which is enough for nio to deliver the rotated key.
Also fix on_mute/on_unmute arg order (participant, publication - not reversed).
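The short bounded wait described above can be sketched with an `asyncio.Event` that the sync handler sets when the rotated key arrives. The function and parameter names are hypothetical; only the 3s timeout comes from this message.

```python
import asyncio


async def wait_for_rotated_key(key_arrived: asyncio.Event,
                               timeout_s: float = 3.0) -> bool:
    """Wait up to timeout_s for the rotated key; return False on timeout."""
    try:
        await asyncio.wait_for(key_arrived.wait(), timeout=timeout_s)
        return True
    except asyncio.TimeoutError:
        return False  # caller proceeds with the key it already has
```

Waiting on an event returns as soon as the key is delivered, so the full 3s is only spent when no rotation arrives.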
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace _extract_voice_memories with _store_voice_exchange
- Store raw "User: ... / Assistant: ..." pairs directly
- No LLM call needed — faster, cheaper, no lost context
- Load as "Frühere Gespräche" with full thread context
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
flash_v2_5 had audible compression artifacts. multilingual_v2 has higher
fidelity while speed=1.15 via VoiceSettings still gives snappier delivery.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Model: eleven_multilingual_v2 → eleven_flash_v2_5 (lower latency)
- Speed: 1.15x via VoiceSettings
- Stability/similarity tuned for natural German speech
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EC rotates encryption key when bot joins LiveKit room. The rotated
key arrives via Matrix sync 3-5s later. Previous 2s wait was too
short, so frames hit DEC_FAILED before the new key arrived.
Extended wait to 10s. Added logging to bot.py to trace why late
key events were not being processed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The HKDF sed patch in Dockerfile was wrong — it swapped salt/info
based on incorrect analysis of minified JS. The original Rust FFI
parameters are correct: salt="LKFrameEncryptionKey", info=[0;128].
Also removed Python-side HMAC pre-ratcheting of keys. Element Call
uses explicit key rotation via Matrix events, not HMAC ratcheting.
Added diagnostic logging to trace exact key bytes during E2EE setup.
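For reference, a standard-library sketch of HKDF (RFC 5869) with the salt and info values named above. The hash (SHA-256) and output length (16 bytes) are assumptions not stated in this message, and this is not the Rust FFI code itself.

```python
import hashlib
import hmac

SALT = b"LKFrameEncryptionKey"  # from the message above
INFO = bytes(128)               # [0; 128] from the message above


def hkdf_sha256(ikm: bytes, salt: bytes = SALT, info: bytes = INFO,
                length: int = 16) -> bytes:
    """HKDF extract-then-expand per RFC 5869 (SHA-256 and length are assumed)."""
    # Extract: PRK = HMAC(salt, IKM)
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: T(n) = HMAC(PRK, T(n-1) || info || n)
    okm = b""
    block = b""
    counter = 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]
```

Swapping `salt` and `info`, as the reverted patch did, yields a different key, so every frame fails to decrypt.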
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Allows disabling E2EE for diagnostic purposes. When disabled, bot
connects to LiveKit without frame encryption.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inline E2EE options had 3 wrong values vs Element Call JS SDK:
- failure_tolerance=-1 (infinite, hid all DEC_FAILED) → 10
- key_ring_size=16 (too small, keys overflow) → 256
- ratchet_window_size=16 (wrong) → 10
Now uses _build_e2ee_options() which was already correct but never called.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ElevenLabs scribe_v2_realtime also produces non-asterisk artifacts like
"Untertitel: ARD Text im Auftrag von Funk (2017)" from TV/radio audio.
Add pattern matching for subtitle metadata, copyright notices, and
parenthetical/bracketed annotations.
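A sketch of such a filter, assuming whole-utterance matching is enough. The patterns and the name `is_stt_artifact` are illustrative guesses, not the patterns actually shipped.

```python
import re

_ARTIFACT_PATTERNS = [
    re.compile(r"^\*[^*]+\*$"),                     # *Störgeräusche*
    re.compile(r"^[\(\[][^\)\]]*[\)\]]$"),          # (Musik), [Applaus]
    re.compile(r"^untertitel\b", re.IGNORECASE),    # subtitle metadata
    re.compile(r"^(copyright\b|©)", re.IGNORECASE)  # copyright notices
]


def is_stt_artifact(text: str) -> bool:
    """True if the utterance looks like STT noise rather than user speech."""
    cleaned = text.strip()
    if not cleaned:
        return True  # empty transcripts carry nothing to respond to
    return any(p.search(cleaned) for p in _ARTIFACT_PATTERNS)
```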
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace Jack Marlowe (slow/raw) with Robert Ranger (deep/natural) for
a more pleasant conversational voice assistant experience.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace broken _VoiceAgent stt_node override with _NoiseFilterAgent that uses
on_user_turn_completed() + StopResponse. This operates downstream of VAD+STT
so no backpressure risk to the audio pipeline.
When ElevenLabs scribe_v2_realtime produces *Störgeräusche* etc., the agent
now silently suppresses them before the LLM responds. The prompt-based filter
is kept as defense-in-depth.
Fixes: MAT-41
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The _count_frames coroutine created a second rtc.AudioStream on the caller's
audio track, competing with AgentSession's internal pipeline for event loop
time. Under load, this caused VAD to miss speech → user_state stuck on "away".
- Remove _count_frames AudioStream (debugging artifact)
- Add VAD state diagnostics (speaking count, away duration)
- Add VAD watchdog: warns if user_state=away >30s (MAT-40 detection)
Fixes: MAT-40
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add LLM prompt rule to ignore *Störgeräusche* etc. annotations
instead of overriding stt_node (which broke VAD pipeline)
- Switch voice to vmVmHDKBkkCgbLVIOJRb per user preference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Threshold 0.60 was too strict: user speech was consistently not detected.
Back to default 0.50 with min_speech_duration=0.2 as noise guard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>