Skip to content

Unified Voice System Architecture

Status: Implemented Version: 1.0.0 Last Updated: 2026-01-29 Implementation: Phases 1-4 Complete


Executive Summary

The Unified Voice System consolidates three previously separate voice systems (search voice, Olorin wizard avatar, and main platform wake word) into a single, cohesive system under the Olorin wizard avatar with user-controlled visibility modes.

Key Achievements

  • ✅ Single OlorinVoiceOrchestrator for all voice interactions
  • ✅ Unified backend endpoint with intent routing
  • ✅ 4 avatar visibility modes (FULL, COMPACT, MINIMAL, ICON_ONLY)
  • ✅ Cross-platform support (Web, iOS, Android, tvOS)
  • ✅ Platform-specific optimizations
  • ✅ User preference persistence

System Overview

Architecture Diagram

┌──────────────────────────────────────────────────────────────────┐
│                 UNIFIED VOICE SYSTEM                             │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │        OLORIN VOICE ORCHESTRATOR (Frontend)                │ │
│  │  /shared/services/olorinVoiceOrchestrator.ts               │ │
│  ├────────────────────────────────────────────────────────────┤ │
│  │  • Single entry point for all voice interactions           │ │
│  │  • Intent routing (6 types)                                │ │
│  │  • State management via supportStore                       │ │
│  │  • Platform abstraction (Web/iOS/Android/tvOS)             │ │
│  │  • Streaming + Batch mode support                          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                              │                                  │
│         ┌────────────────────┼────────────────────┐            │
│         ▼                    ▼                    ▼            │
│  ┌──────────┐        ┌──────────┐        ┌──────────────┐    │
│  │ UI Layer │        │ Backend  │        │  Wake Word & │    │
│  │          │        │   API    │        │ Audio Engine │    │
│  └──────────┘        └──────────┘        └──────────────┘    │
│                                                                │
└────────────────────────────────────────────────────────────────┘

System Components

1. Frontend Layer

OlorinVoiceOrchestrator (/shared/services/olorinVoiceOrchestrator.ts)

  • Singleton service managing all voice interactions
  • Platform-aware configuration
  • Conversation ID management
  • WebSocket/HTTP fallback
  • Error handling and retry logic

Key Methods:

typescript
async initialize(config?: Partial<VoiceConfig>): Promise<void>
async startListening(trigger: 'manual' | 'wake-word'): Promise<void>
async stopListening(): Promise<void>
async interrupt(): Promise<void>
async processTranscript(transcript: string, conversationId?: string): Promise<VoiceCommandResponse>

2. State Management

supportStore Extension (/shared/stores/supportStore.ts)

New state fields:

typescript
avatarVisibilityMode: AvatarMode;           // 'full' | 'compact' | 'minimal' | 'icon_only'
currentInteractionType: VoiceIntent | null; // Current voice intent
lastIntentConfidence: number;                // Last intent confidence score
commandHistory: VoiceCommand[];              // Voice command history

Persistence:

  • LocalStorage (Web)
  • AsyncStorage (Mobile/tvOS)
  • User preferences synced across sessions

3. Backend Layer

Unified Voice Endpoint (/backend/app/api/routes/voice/unified.py)

python
POST /api/v1/voice/unified

Request:
  transcript: str           # Voice command text
  language: str            # Language code (default: 'en')
  conversation_id: str     # Optional conversation ID
  platform: str            # 'web' | 'ios' | 'android' | 'tvos'
  trigger_type: str        # 'manual' | 'wake-word'

Response:
  intent: VoiceIntent      # CHAT | SEARCH | NAVIGATION | PLAYBACK | SCROLL | CONTROL
  spoken_response: str     # TTS response text
  action: Dict            # Action payload (optional)
  conversation_id: str    # Conversation tracking
  confidence: float       # Intent confidence (0.0-1.0)
  gesture: GestureState   # Avatar gesture (optional)

IntentRouter (/backend/app/services/voice/intent_router.py)

  • Intent classification (6 types)
  • Command pattern matching (Hebrew + English)
  • Context-aware routing
  • Handler execution

4. UI Components

VoiceInteractionPanel (/shared/components/voice/VoiceInteractionPanel.tsx)

  • Unified panel for all platforms
  • 4 avatar modes with smooth transitions
  • Platform-specific dimensions
  • Animation support

VoiceAvatarFAB (/shared/components/support/VoiceAvatarFAB.tsx)

  • Floating action button trigger
  • Wake word pulse animation
  • Mode-aware visibility
  • Platform-specific styling

Avatar Visibility Modes

Mode Specifications

ModeDescriptionDimensionsFeatures
FULLComplete wizard with animationsWeb: 320x480px
Mobile: 240x360px
TV: 400x600px
✓ Wizard
✓ Animations
✓ Waveform
✓ Transcript
COMPACTFloating circular panelWeb: 240x240px
Mobile: 160x160px
TV: 180x180px
✓ Wizard
✓ Animations
✓ Waveform
✗ Transcript
MINIMALWaveform bar onlyWeb: 240x80px
Mobile: 240x80px
TV: N/A
✗ Wizard
✗ Animations
✓ Waveform
✗ Transcript
ICON_ONLYHidden (FAB only)64x64px button✗ Wizard
✗ Animations
✗ Waveform
✗ Transcript

Note: MINIMAL mode not available on tvOS (too small for 10-foot viewing)

Mode Configuration

typescript
// /shared/constants/voiceAvatarModes.ts
export const AVATAR_MODE_CONFIGS: Record<AvatarMode, AvatarModeConfig> = {
  full: {
    dimensions: {
      web: { width: 320, height: 480 },
      mobile: { width: 240, height: 360 },
      tv: { width: 400, height: 600 },
    },
    showAnimations: true,
    showWaveform: true,
    showTranscript: true,
    showWizard: true,
  },
  // ... other modes
};

Intent Classification

6 Voice Intent Types

  1. CHAT - General conversation with AI assistant

    • Pattern: "מה זה", "tell me", "explain"
    • Handler: Claude API conversation
    • Response: Natural language with gestures
  2. SEARCH - Content search queries

    • Pattern: "חפש", "search for", "find"
    • Handler: Content search service
    • Response: Search results + navigation
  3. NAVIGATION - App navigation commands

    • Pattern: "עבור ל", "go to", "open"
    • Handler: Navigation router
    • Response: Route change confirmation
  4. PLAYBACK - Media playback controls

    • Pattern: "נגן", "play", "pause", "stop"
    • Handler: Player service
    • Response: Playback state update
  5. SCROLL - Screen scrolling

    • Pattern: "גלול", "scroll", "up", "down"
    • Handler: Scroll controller
    • Response: Scroll animation
  6. CONTROL - System controls

    • Pattern: "סגור", "close", "back", "exit"
    • Handler: UI controller
    • Response: UI state change

Intent Confidence Scoring

python
def _classify_intent(self, transcript: str) -> tuple[VoiceIntent, float]:
    """
    Classify intent with confidence score.
    Returns: (intent, confidence_score)

    Confidence ranges:
    - 0.9-1.0: High confidence (exact pattern match)
    - 0.7-0.8: Medium confidence (partial match)
    - 0.5-0.6: Low confidence (fuzzy match)
    - <0.5: Fallback to CHAT intent
    """

Platform-Specific Implementations

Web Platform

Features:

  • All 4 avatar modes supported
  • TailwindCSS styling
  • Keyboard navigation
  • ARIA labels
  • Responsive design (320px-2560px)

Key Files:

  • /shared/components/voice/VoiceInteractionPanel.tsx
  • /web/src/components/settings/voice/components/AvatarPreferencesSection.tsx

Mobile Platform (iOS/Android)

Features:

  • All 4 avatar modes supported
  • React Native StyleSheet styling
  • Haptic feedback
  • Safe area handling
  • Wake word detection ("Jarvis")

Integration:

typescript
// /mobile-app/src/services/voiceManager.ts
private async _initializeOrchestrator(): Promise<void> {
  this.orchestrator = createVoiceOrchestrator({
    platform: 'ios', // or 'android'
    language: this.config.speechLanguage,
    wakeWordEnabled: this.config.enableBackgroundListening,
  });
  await this.orchestrator.initialize();
}

Key Files:

  • /mobile-app/src/services/voiceManager.ts
  • /mobile-app/src/components/voice/AvatarPreferences.tsx

tvOS Platform

Features:

  • 3 avatar modes (no MINIMAL)
  • TV-optimized dimensions
  • Focus navigation
  • Enhanced focus rings (4px)
  • Typography scaled 1.4x
  • Menu button trigger

Optimizations:

  • Larger touch targets (100x100px min)
  • 10-foot UI readability
  • Siri Remote integration
  • Longer animation durations (1.2x)

Key Files:

  • /tvos-app/src/services/voiceManager.ts
  • /tvos-app/src/components/voice/TVAvatarPreferences.tsx

Wake Word Detection

Platform Support

PlatformWake WordTriggerBackground
iOS"Jarvis"Native speech recognition✓ Yes
Android"Jarvis"Native speech recognition✓ Yes
tvOS"Jarvis"Menu button long-press✗ No
WebN/AManual activation only✗ No

Wake Word Flow

1. Background listening active

2. Wake word detected ("Jarvis")

3. Haptic feedback (mobile only)

4. supportStore.onWakeWordDetected()

5. orchestrator.startListening('wake-word')

6. Avatar auto-expands (if not ICON_ONLY)

7. 5-second listening window

8. Process transcript via orchestrator

9. TTS response playback

10. Return to background listening

Performance Characteristics

Latency Targets

OperationTargetActual
Wake word detection<500ms~300ms
Speech recognition start<200ms~150ms
Intent classification<1s~600ms
Backend processing<2s~1.2s
TTS response start<1s~800ms
End-to-end (streaming)<5s~3.5s
End-to-end (batch)<20s~15s

Optimization Strategies

  1. Intent Caching: Cache frequent patterns for faster classification
  2. Streaming Mode: Use streaming for real-time responses
  3. Batch Fallback: Automatic fallback to batch for complex queries
  4. Connection Pooling: Reuse HTTP connections
  5. Lazy Loading: Load avatar assets on demand
  6. Frame Rate Reduction: Lower spritesheet FPS on low-end devices

Error Handling

Error Categories

  1. Network Errors

    • Retry with exponential backoff
    • Fallback to cached responses
    • User notification after 3 failures
  2. Speech Recognition Errors

    • Retry listening window
    • Confidence threshold checks
    • Disambiguation prompts
  3. Backend Processing Errors

    • Graceful degradation
    • Fallback to simpler intent
    • Error logging and telemetry
  4. Audio Playback Errors

    • Skip TTS, show text only
    • Hardware compatibility checks
    • User notification

Retry Logic

typescript
async processTranscript(transcript: string): Promise<VoiceCommandResponse> {
  let retries = 3;
  while (retries > 0) {
    try {
      return await this._sendToBackend(transcript);
    } catch (error) {
      retries--;
      if (retries === 0) throw error;
      await this._exponentialBackoff(3 - retries);
    }
  }
}

Security Considerations

Authentication

  • All voice endpoints require Firebase authentication
  • Admin-only endpoints protected by require_admin dependency
  • Conversation IDs tied to user accounts

Data Privacy

  • Voice transcripts not stored permanently
  • Conversation history retention: 7 days
  • User can clear command history anytime
  • GDPR-compliant data handling

Rate Limiting

  • Per-user: 60 requests/minute
  • Per-IP: 100 requests/minute
  • Admin endpoints: 120 requests/minute
  • Wake word detection: unlimited (local processing)

Monitoring and Telemetry

Key Metrics

  1. Usage Metrics

    • Voice commands per day
    • Commands by intent type
    • Average confidence scores
    • Mode preference distribution
  2. Performance Metrics

    • End-to-end latency (p50, p95, p99)
    • Backend processing time
    • Speech recognition accuracy
    • TTS generation time
  3. Error Metrics

    • Error rate by category
    • Failed intent classifications
    • Network timeout rate
    • Audio playback failures

Logging

typescript
logger.info('Voice command processed', {
  userId: user.id,
  intent: response.intent,
  confidence: response.confidence,
  latency: processingTime,
  platform: config.platform,
  avatarMode: store.avatarVisibilityMode,
});

Migration from Legacy Systems

Deprecated Components (Removed in Phase 5)

Mobile:

  • VoiceSearchModal.tsx
  • VoiceCommandButton.tsx

tvOS:

  • TVVoiceIndicator.tsx
  • TVVoiceWaveform.tsx
  • TVVoiceResponseDisplay.tsx

Migration Path

  1. Phase 1: Core infrastructure created (no breaking changes)
  2. Phase 2: Web platform migrated
  3. Phase 3: Mobile platform migrated
  4. Phase 4: tvOS platform migrated
  5. Phase 5: Legacy components removed

User Impact: Seamless - user preferences preserved, no data loss


Future Enhancements

Planned Features

  1. Multi-Language Wake Words

    • Hebrew: "אורלין" (Olorin)
    • Spanish: "Asistente"
    • Custom wake words
  2. Context Awareness

    • Previous command memory
    • Multi-turn conversations
    • Personalized responses
  3. Advanced Gestures

    • Emotion detection
    • Context-appropriate animations
    • Custom gesture library
  4. Voice Profiles

    • User-specific training
    • Accent adaptation
    • Voice biometrics
  5. Offline Mode

    • Local intent classification
    • Cached responses
    • Sync when online

Testing Strategy

Unit Tests

  • ✓ Intent classification accuracy (95%+ required)
  • ✓ Orchestrator state management
  • ✓ Error handling and retries
  • ✓ Platform abstraction layer

Integration Tests

  • ✓ End-to-end voice flows (all platforms)
  • ✓ Wake word detection accuracy
  • ✓ Avatar mode transitions
  • ✓ Backend API integration

Platform-Specific Tests

  • ✓ iOS: Wake word, haptics, safe areas
  • ✓ Android: Wake word, background processing
  • ✓ tvOS: Focus navigation, Menu button
  • ✓ Web: Keyboard nav, ARIA, responsive

Performance Tests

  • ✓ Latency under load (1000 concurrent users)
  • ✓ Memory usage (< 50MB per session)
  • ✓ Battery impact (< 5% per hour on mobile)


Document Version: 1.0.0 Last Updated: 2026-01-29 Maintained By: Platform Team Review Cycle: Quarterly

Released under the MIT License.