Commit Graph

95 Commits

Author SHA1 Message Date
Christian Gick
9e146da3b0 feat(CF-1812): Use confluence-collab for section-based page editing
Replace inline regex section parser in voice.py with confluence_collab
library (BS4 parsing, 409 conflict retry). Bot now loads section outline
into LLM context when Confluence links are detected.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 11:37:37 +02:00
Christian Gick
326a874aa7 feat: Add on-demand camera/screen vision via look_at_screen tool
Voice bot can now see the user's camera or screen share when asked.
Captures a single frame, encodes it as JPEG, and sends it to Sonnet vision
with full context (transcript + document). Triggered by phrases like
"schau mal", "siehst du das", "can you see this".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 06:36:52 +02:00
Christian Gick
cfb26fb351 feat: Add doubt triggers to think_deeper tool
"bist du dir sicher" / "are you sure" / "stimmt das wirklich" now also
trigger Opus escalation for fact-checking the previous answer.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 06:23:51 +02:00
Christian Gick
6081f9a7ec feat(MAT-46): Add think_deeper tool for Opus escalation in voice calls
Sonnet can now escalate complex questions to Opus via a function tool,
same pattern as search_web and read_confluence_page. Full context
(transcript + document) is passed automatically. Triggered by user
phrases like "denk genauer nach" / "think harder" or when Sonnet is
unsure about complex analysis.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 06:13:44 +02:00
Christian Gick
de66ba5eea feat(MAT-46): Extract and post document annotations after voice calls
When a voice call ends and a document was loaded in the room, the bot
now analyzes the transcript for document-specific changes/corrections
and posts them as a structured "Dokument-Aenderungen" message. Returns
nothing if no document changes were discussed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 20:18:00 +02:00
Christian Gick
6a6f9ef1c4 fix(voice): auto-use active Confluence page ID, allow roleplay on docs
- Confluence tools default to active page from room context — no more
  asking user for page_id
- Prompt allows roleplay/mock interviews when document context present
- Explicit instruction not to ask for page_id

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 14:31:49 +02:00
Christian Gick
c5e1c79e1b fix(voice): reduce phantom speech responses from ambient noise
- Raise VAD activation_threshold 0.50→0.65, min_speech_duration 0.2→0.4s
- Add ghost phrase filter: suppress 1-2 word hallucinations (Danke, Ja, etc.)
- Strengthen prompt: stay silent unless clearly addressed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:48:14 +02:00
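The ghost phrase filter from c5e1c79e1b could look roughly like the sketch below. The phrase set and the name `is_ghost_phrase` are assumptions for illustration; the commit message names only "Danke" and "Ja" as examples and does not show the actual implementation.

```python
import re

# Hypothetical phrase list: the commit names only "Danke" and "Ja" as examples.
GHOST_PHRASES = {"danke", "ja"}
MAX_GHOST_WORDS = 2  # the commit describes 1-2 word hallucinations


def is_ghost_phrase(transcript: str) -> bool:
    """Return True if a transcript looks like a short STT hallucination."""
    # Drop punctuation and case so "Danke." and "danke" compare equal.
    normalized = re.sub(r"[^\w\s]", "", transcript).strip().lower()
    words = normalized.split()
    if not words or len(words) > MAX_GHOST_WORDS:
        return False
    # Suppress only when every word is a known ghost word.
    return all(word in GHOST_PHRASES for word in words)
```

A caller would presumably skip the LLM turn whenever this returns True, so longer utterances that merely contain "Danke" or "Ja" still get through.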
Christian Gick
b275e7cb88 feat(voice): add Confluence read/write tools for voice sessions
Enable realtime Confluence page editing during Element Call voice sessions.
- Add read_confluence_page and update_confluence_page function tools
- Detect Confluence URLs shared in Matrix rooms, store page ID for voice context
- Section-level updates via heading match + version-incremented PUT

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:09:34 +02:00
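A minimal sketch of the "heading match + version-incremented PUT" idea from b275e7cb88. The function names are hypothetical, and the regex treats any following heading as the end of the section, which is a simplification. The payload shape follows the Confluence REST content-update format, where the new version number must be the current one plus one.

```python
import re


def replace_section(storage_html: str, heading: str, new_body_html: str) -> str:
    """Replace the content between a matching heading and the next heading."""
    pattern = re.compile(
        r"(<h([1-6])[^>]*>\s*" + re.escape(heading) + r"\s*</h\2>)"
        r"(.*?)(?=<h[1-6][\s>]|\Z)",
        re.DOTALL | re.IGNORECASE,
    )
    # A function replacement avoids backslash handling in new_body_html.
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_body_html, storage_html, count=1
    )
    if count == 0:
        raise ValueError(f"heading not found: {heading!r}")
    return updated


def build_update_payload(
    page_id: str, title: str, current_version: int, storage_html: str
) -> dict:
    """Build the JSON body for a page update with an incremented version."""
    return {
        "id": page_id,
        "type": "page",
        "title": title,
        "version": {"number": current_version + 1},
        "body": {"storage": {"value": storage_html, "representation": "storage"}},
    }
```

The resulting dictionary would be sent as the JSON body of the PUT request; a stale `current_version` is what leads to the 409 conflicts mentioned in later commits.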
Christian Gick
e81aa79396 fix: increase voice PDF context to 40k chars, fix language detection sanity
- Voice context per-document limit 10k→40k chars (was cutting off at page 6)
- Language detection: reject results >30 chars (LLM returning sentences)
- voice.py: generalize "PDF" label to "Dokumente"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 12:40:13 +02:00
Christian Gick
90e662be96 feat(voice): PDF context in voice calls + call transcript summary (MAT-10)
Pass PDF document context from room to voice session so the voice LLM
can answer questions about uploaded PDFs. Persist call transcripts and
post an LLM-generated summary to the room when the call ends.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 11:21:31 +02:00
Christian Gick
1ec63b93f2 feat(voice): per-user timezone via memory preferences
- Store user timezone as [PREF:timezone] in memory service
- Query timezone preference on session start, override default
- Add set_user_timezone tool so bot learns timezone from conversation
- On time-relevant questions, bot asks if user is still at stored location
- Seeded Europe/Nicosia for @christian.gick:agiliton.eu

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 11:02:25 +02:00
Christian Gick
e84260f839 feat(prompt): add user timezone and LLM model to voice prompt
Bot now knows the user's timezone (Europe/Berlin default) and which
LLM model it's running on, so it can answer questions about both.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 10:56:40 +02:00
Christian Gick
277d6b5fe4 fix(e2ee): restore 3s key rotation wait, fix mute callback arg order
Removing the blocking wait entirely caused DEC_FAILED - the rotated key
had not arrived via nio sync before the pipeline started. Restore a short
3s wait (down from 10s) which is enough for nio to deliver the rotated key.

Also fix on_mute/on_unmute arg order (participant, publication - not reversed).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 10:43:38 +02:00
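One way to express the bounded key-rotation wait from 277d6b5fe4 is sketched below. It assumes the sync handler sets an `asyncio.Event` once the rotated key has arrived. That event and the function name are hypothetical; the commit may equally use a fixed sleep.

```python
import asyncio


async def wait_for_rotated_key(key_event: asyncio.Event, timeout: float = 3.0) -> bool:
    """Wait up to `timeout` seconds for the rotated key; return True if it arrived."""
    try:
        await asyncio.wait_for(key_event.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        # Continue without the rotated key; a later key event can still apply it.
        return False
```

Compared with a fixed sleep, this returns as soon as the key is delivered and only spends the full 3 seconds in the worst case.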
Christian Gick
a11cafc1d6 feat(memory): store full conversation exchanges instead of LLM-extracted facts
- Replace _extract_voice_memories with _store_voice_exchange
- Store raw "User: ... / Assistant: ..." pairs directly
- No LLM call needed — faster, cheaper, no lost context
- Load as "Frühere Gespräche" with full thread context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 10:40:59 +02:00
Christian Gick
150df19be1 fix(tts): revert to multilingual_v2 for better quality, keep speed 1.15x
flash_v2_5 had audible compression artifacts. multilingual_v2 has higher
fidelity while speed=1.15 via VoiceSettings still gives snappier delivery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 10:38:46 +02:00
Christian Gick
294fbac913 feat(tts): switch to flash model + speed 1.15x for snappier voice
- Model: eleven_multilingual_v2 → eleven_flash_v2_5 (lower latency)
- Speed: 1.15x via VoiceSettings
- Stability/similarity tuned for natural German speech

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 10:33:27 +02:00
Christian Gick
c532f4678d fix(e2ee): consolidate key timing + noise filtering (MAT-40, MAT-41)
- set_key() only called after frame cryptor exists (on_track_subscribed / late arrival)
- Remove 10s blocking key rotation wait; keys applied asynchronously
- Add DEC_FAILED (state 3) to e2ee_state recovery triggers
- VAD watchdog re-applies all E2EE keys on >30s stuck as recovery
- Expand STT artifact patterns (English variants, double-asterisk)
- Add NOISE_LEAK diagnostic logging at STT level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 08:33:40 +02:00
Christian Gick
4b4a150fbf fix(e2ee): extend key rotation wait to 10s, debug late key events
EC rotates encryption key when bot joins LiveKit room. The rotated
key arrives via Matrix sync 3-5s later. Previous 2s wait was too
short - DEC_FAILED before new key arrived.

Extended wait to 10s. Added logging to bot.py to trace why late
key events were not being processed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 21:54:27 +02:00
Christian Gick
230c083b7b fix(e2ee): revert incorrect HKDF patch, remove pre-ratcheting
The HKDF sed patch in Dockerfile was wrong — it swapped salt/info
based on incorrect analysis of minified JS. The original Rust FFI
parameters are correct: salt="LKFrameEncryptionKey", info=[0;128].

Also removed Python-side HMAC pre-ratcheting of keys. Element Call
uses explicit key rotation via Matrix events, not HMAC ratcheting.

Added diagnostic logging to trace exact key bytes during E2EE setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 21:44:11 +02:00
Christian Gick
ea52236880 feat(e2ee): make E2EE configurable via E2EE_ENABLED env var
Allows disabling E2EE for diagnostic purposes. When disabled, bot
connects to LiveKit without frame encryption.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:14:06 +02:00
Christian Gick
e3be4512d9 fix(e2ee): use correct Element Call E2EE parameters
Inline E2EE options had 3 wrong values vs Element Call JS SDK:
- failure_tolerance=-1 (infinite, hid all DEC_FAILED) → 10
- key_ring_size=16 (too small, keys overflow) → 256
- ratchet_window_size=16 (wrong) → 10

Now uses _build_e2ee_options() which was already correct but never called.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:00:55 +02:00
Christian Gick
7b7079352f fix(noise): expand STT artifact filter to catch subtitle metadata leaks
ElevenLabs scribe_v2_realtime also produces non-asterisk artifacts like
"Untertitel: ARD Text im Auftrag von Funk (2017)" from TV/radio audio.
Add pattern matching for subtitle metadata, copyright notices, and
parenthetical/bracketed annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:43:22 +02:00
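The artifact patterns described in 7b7079352f and fa9e95b250 might be approximated as follows. The pattern list and the name `is_stt_artifact` are assumptions modelled on the examples quoted in the commit messages, not the actual filter.

```python
import re

# Hypothetical patterns based on the examples quoted in the commit messages.
_NOISE_PATTERNS = [
    re.compile(r"^\*+[^*]+\*+$"),                    # *Störgeräusche*, **noise**
    re.compile(r"^[\(\[][^\)\]]*[\)\]]$"),           # (Musik), [applause]
    re.compile(r"^untertitel\b", re.IGNORECASE),     # subtitle credits
    re.compile(r"^(©|copyright\b)", re.IGNORECASE),  # copyright notices
]


def is_stt_artifact(transcript: str) -> bool:
    """Return True if the whole transcript looks like an STT annotation."""
    text = transcript.strip()
    return any(pattern.search(text) for pattern in _NOISE_PATTERNS)
```

Per fa9e95b250, a check of this kind runs in `on_user_turn_completed()`, downstream of VAD and STT, so a positive match suppresses the reply without touching the audio pipeline.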
Christian Gick
c38ab96054 chore(voice): switch to Robert Ranger voice
Replace Jack Marlowe (slow/raw) with Robert Ranger (deep/natural) for
a more pleasant conversational voice assistant experience.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:34:54 +02:00
Christian Gick
fa9e95b250 fix(noise): filter STT noise annotations via on_user_turn_completed
Replace broken _VoiceAgent stt_node override with _NoiseFilterAgent that uses
on_user_turn_completed() + StopResponse. This operates downstream of VAD+STT
so no backpressure risk to the audio pipeline.

When ElevenLabs scribe_v2_realtime produces *Störgeräusche* etc., the agent
now silently suppresses them before the LLM responds. The prompt-based filter
is kept as defense-in-depth.

Fixes: MAT-41

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:07:31 +02:00
Christian Gick
6c1073e79d fix(vad): remove competing AudioStream that caused intermittent VAD failures
The _count_frames coroutine created a second rtc.AudioStream on the caller's
audio track, competing with AgentSession's internal pipeline for event loop
time. Under load, this caused VAD to miss speech → user_state stuck on "away".

- Remove _count_frames AudioStream (debugging artifact)
- Add VAD state diagnostics (speaking count, away duration)
- Add VAD watchdog: warns if user_state=away >30s (MAT-40 detection)

Fixes: MAT-40

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 19:02:39 +02:00
Christian Gick
a8d4663f10 fix(tts): revert to Jack Marlowe voice, vmVmHDKBkkCgbLVIOJRb not accessible
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:52:06 +02:00
Christian Gick
06b588f313 fix(voice): add noise annotation filter to prompt + switch voice
- Add LLM prompt rule to ignore *Störgeräusche* etc. annotations
  instead of overriding stt_node (which broke VAD pipeline)
- Switch voice to vmVmHDKBkkCgbLVIOJRb per user preference

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 18:49:31 +02:00
Christian Gick
e926908af7 test: revert to base Agent to check if stt_node override breaks VAD
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:45:56 +02:00
Christian Gick
fb09808a8c fix(vad): lower activation threshold 0.60→0.50
Threshold 0.60 too strict, user speech consistently not detected.
Back to default 0.50 with min_speech_duration=0.2 as noise guard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:42:21 +02:00
Christian Gick
8f80e7d543 fix(tts): switch to Jack Marlowe - native German voice
Replace George (British EN) with Jack Marlowe (Gng1FdSGZlhs6jKgzAxL),
the only native German voice in the library. Fixes garbled number/date
pronunciation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:37:05 +02:00
Christian Gick
125b0f5d2e fix(tts): spell out numbers in words for German TTS
George (British) voice mangles German digit strings. Force LLM to
write all numbers as German words so TTS pronounces them correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:35:52 +02:00
Christian Gick
1b08683c17 fix(vad): lower activation threshold 0.75→0.60
0.75 too strict, user voice not detected. 0.60 with min_speech_duration=0.2
should balance noise rejection vs speech detection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:15:06 +02:00
Christian Gick
8445c9325c revert(tts): remove pcm_24000 encoding, keep language=de
pcm_24000 caused silent playback through livekit. Reverting to
plugin default encoding which is known working.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:12:35 +02:00
Christian Gick
e090c60c19 feat(tts): upgrade to pcm_24000 encoding + language=de
Switch from mp3_22050_32 (default) to lossless PCM 24kHz for cleaner
voice output. Add language=de for German text normalization.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 18:08:23 +02:00
Christian Gick
1e1911995f fix(stt): filter ElevenLabs noise annotations before LLM
scribe_v2_realtime annotates background audio as *Störgeräusche*,
*Fernsehgeräusche* etc. Override stt_node to drop these so the LLM
only receives actual speech transcripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 17:59:17 +02:00
Christian Gick
02a7c91eaf fix(vad): raise activation threshold to reduce noise triggers
activation_threshold 0.5→0.75, min_speech_duration 0.05→0.2s
Prevents ambient noise from triggering STT and producing
'Schlechte Qualität' transcripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 17:52:38 +02:00
Christian Gick
39ef4e0054 fix(stt): pass http_session to ElevenLabs STT plugin
Plugin requires explicit aiohttp session; livekit http_context not available
in this job setup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 17:45:42 +02:00
Christian Gick
2dce8419d4 fix(stt): set scribe_v2_realtime model with language_code for streaming STT
- Add model_id="scribe_v2_realtime" (already set) + language_code from STT_LANGUAGE env (default "de")
- Remove _stt_session from cleanup loop (plugin uses livekit http_context)
- Remove _stt_session stub from __init__

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 17:26:51 +02:00
Christian Gick
4012950197 fix: Use scribe_v2_realtime model for ElevenLabs STT (streaming mode)
scribe_v1 (REST) sets streaming=False, incompatible with livekit-agents 1.4 AgentSession.
scribe_v2_realtime uses WebSocket streaming (confirmed working with Starter plan).
Removes separate _stt_session aiohttp client.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 17:24:16 +02:00
Christian Gick
52f8cb569c feat(voice): add cross-call memory and Brave Search tool
- Query user memories at call start and inject into agent system prompt
- Extract new facts after each exchange using claude-haiku via LiteLLM
- Add Brave Search tool (@function_tool) for current data queries
- Pass memory client and caller_user_id through VoiceSession constructor
- Pre-compute 8 HMAC-ratcheted EC keys for reliable E2EE decryption

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 15:27:59 +02:00
Christian Gick
2b8744de6e fix(voice): full E2EE bidirectional audio pipeline working
- bot.py: track active callers per room; only stop session when last
  caller leaves (fixes premature cancellation when Playwright browser
  hangs up while real app is still in call)

- voice.py: pre-compute 8 HMAC-ratcheted keys from EC's base key so
  decryption works immediately without waiting ~30s for Matrix to
  deliver EC's key-rotation event (root cause of user→bot silence)

- voice.py: fix set_key() argument order (identity, key, index) at all
  call sites — was (identity, index, key) causing TypeError

- voice.py: add audio frame monitor (AUDIO_FLOW) and mute/unmute event
  handlers for diagnostics

- voice.py: update livekit-agents 1.4.2 event names: user_state_changed,
  user_input_transcribed, conversation_item_added

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 15:17:35 +02:00
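The per-room caller tracking described for bot.py in 2b8744de6e could be sketched like this. The class and method names are hypothetical; the commit states only that the session stops when the last caller leaves.

```python
from collections import defaultdict


class CallerTracker:
    """Track active callers per room so a session ends only with the last one."""

    def __init__(self) -> None:
        self._callers: dict[str, set[str]] = defaultdict(set)

    def join(self, room_id: str, caller_id: str) -> None:
        self._callers[room_id].add(caller_id)

    def leave(self, room_id: str, caller_id: str) -> bool:
        """Remove a caller; return True if the voice session should stop."""
        callers = self._callers.get(room_id)
        if not callers:
            return False  # unknown room: nothing to stop
        callers.discard(caller_id)
        if callers:
            return False  # someone is still in the call
        del self._callers[room_id]
        return True
```

With this shape, a test browser hanging up returns False while the real app is still connected, which is the premature-cancellation case the commit describes.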
Christian Gick
c379064f80 fix(voice): set caller key in on_track_subscribed — frame cryptor must exist for HKDF to apply
Root cause: C++ set_key() only applies HKDF when impl_->GetKey(pid) returns a valid
handler, which requires the frame cryptor for that participant to be initialized.
Frame cryptors are created at track subscription time, not at connect time.

Calling set_key(caller_identity, key) immediately after connect() skips HKDF
derivation (impl_->GetKey returns null) → raw key stored → DEC_FAILED.

Fix: move caller key setting to on_track_subscribed where frame cryptor definitely exists.
Also update on_encryption_key to use set_key() for key rotation updates.
2026-02-22 14:05:54 +02:00
Christian Gick
190b35945c fix(voice): guard e2ee_manager access when E2EE disabled (diagnostic mode)
2026-02-22 13:46:51 +02:00
Christian Gick
c188a2daf6 test(voice): disable E2EE entirely — check if EC sends plaintext vs encrypted
If VAD triggers → EC audio reaches pipeline without decryption (plaintext or format issue).
If VAD silent → E2EE encryption on EC side but key/format mismatch on our side.
Note: bot greeting will be unencrypted so EC may not hear it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 13:34:26 +02:00
Christian Gick
3d05b503c6 test(voice): pre-derive HKDF in Python, use set_shared_key to bypass Rust FFI HKDF
Diagnostic: if Rust FFI HKDF produces different result than EC JS HKDF,
set_key(caller) would always fail (DEC_FAILED). Test: pre-derive AES key
in Python matching livekit-client-sdk-js params (SHA-256, salt=LKFrameEncryptionKey,
info=128-zeros, 16-byte output), pass to set_shared_key() which stores raw (no KDF).
If user→bot decryption now works, root cause = Rust HKDF mismatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 12:24:57 +02:00
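For reference, the Python-side pre-derivation described in 3d05b503c6 can be reproduced with the standard library alone. This is a sketch using the parameters quoted in the commit message (SHA-256, salt "LKFrameEncryptionKey", 128 zero bytes of info, 16-byte output); the function names are assumptions.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract, then expand."""
    if length > 255 * hashlib.sha256().digest_size:
        raise ValueError("requested output too long for HKDF-SHA256")
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract step
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand step
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_frame_key(base_key: bytes) -> bytes:
    """Derive a 16-byte AES key with the parameters quoted in the commit."""
    return hkdf_sha256(
        base_key, salt=b"LKFrameEncryptionKey", info=bytes(128), length=16
    )
```

Comparing this output with the key actually in use on the other side is what the diagnostic relies on. The later commit 230c083b7b concludes that the original Rust FFI parameters were correct after all.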
Christian Gick
7adeebfe05 fix(voice): restore set_shared_key fallback + failure_tolerance=10 from working commit e3ede3f
The confirmed-working Feb 21 commit (e3ede3f) used:
- kp.set_shared_key(caller_key) as fallback for incoming audio decryption
- failure_tolerance=10 (not -1) so DEC_FAILED state changes are visible

Per-participant kp.set_key() alone is insufficient — the patched Rust FFI
appears to fall back to shared_key for incoming track decryption.
failure_tolerance=-1 was masking the DEC_FAILED state making diagnosis hard.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 11:46:32 +02:00
Christian Gick
2a799f5760 fix(voice): set caller E2EE key on participant_connected + for all remote LK identities
Two race conditions when bot joins first (remote=0):
1. Key arrives before participant joins LK → on_participant_connected now applies stored keys
2. Key arrives after session start → on_encryption_key now sets key for all remote_participants by LK identity

Fixes identity mismatch between Matrix device_id (from key event) and LK participant identity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 11:30:23 +02:00
Christian Gick
5d31886192 debug(voice): add VAD start/stop events to trace where audio pipeline breaks
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 11:18:51 +02:00
Christian Gick
f74a11fde8 fix(voice): separate aiohttp sessions for STT and TTS
Sharing one session between ElevenLabs STT (WebSocket) and TTS (HTTP)
can cause connection conflicts. Use dedicated sessions for each.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 11:15:46 +02:00
Christian Gick
e3c1ded328 feat(voice): inject datetime into prompt, respond in DE/EN
- Add VOICE_TIMEZONE env var (default: Europe/Berlin) for local time
- Bot knows exact date/time at call start via _build_voice_prompt()
- Respond in user language (DE or EN) instead of always German

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-22 11:02:56 +02:00