# Architecture

## Overview
Hermes is a native Swift/SwiftUI macOS app that runs as a menu bar agent (LSUIElement). It captures two independent audio streams, transcribes them locally, and displays results in a floating overlay.
```
┌─────────────────┐   ┌──────────────────┐
│  System Audio   │   │   Microphone     │
│  (Zoom, Meet,   │   │  (Your voice)    │
│  Teams, etc.)   │   │                  │
└────────┬────────┘   └────────┬─────────┘
         │                     │
  CATap + IOProc         AVAudioEngine
         │                     │
         ▼                     ▼
┌────────────────────────────────────────────┐
│             AudioCaptureManager            │
│       Resamples to 16kHz mono Float32      │
│     Labels: .them          Labels: .me     │
└────────────────────┬───────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────┐
│          TranscriptionCoordinator          │
│          10-second buffer window           │
│   Silence gate (.them only, RMS < 0.001)   │
│            Speaker turn merging            │
└────────────────────┬───────────────────────┘
                     │
                     ▼
┌────────────────────────────────────────────┐
│             TranscriptionEngine            │
│ WhisperKit (large-v3, Apple Neural Engine) │
│       Batch transcribe(audioArray:)        │
└────────────────────┬───────────────────────┘
                     │
          ┌──────────┴──────────┐
          ▼                     ▼
   ┌──────────────┐      ┌─────────────────┐
   │  Overlay UI  │      │   SwiftData/    │
   │  (NSPanel)   │      │     SQLite      │
   └──────────────┘      └─────────────────┘
```
## Audio Capture

### System Audio — CATap
Hermes uses Core Audio Taps (CATap, macOS 14.2+) to capture system audio. This is the same mechanism professional audio tools use to tap into the system audio graph.
The setup follows a two-phase pattern:
1. Create a `CATapDescription` with `isExclusive = true` and an empty process list. This tells Core Audio to capture all system audio output (the semantics are inverted — "exclusive" means the process list is an exclusion list; an empty exclusion list captures everything).
2. Create an aggregate device with an empty sub-device list, then attach the tap post-creation via `kAudioAggregateDevicePropertyTapList`. This two-phase approach avoids the connection errors that occur with a single-dictionary setup.
3. Register a C-function-pointer IOProc via `AudioDeviceCreateIOProcID`. Block-based IOProcs with dispatch queues do not fire for aggregate devices with CATap — this is a Core Audio limitation.
4. Read the stream format from the tap via `kAudioTapPropertyFormat`, not from the output device.
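The steps above can be compressed into a sketch like the following. It assumes macOS 14.2+; error handling, the tap-format read, and Hermes's real symbol names are elided or invented, so treat it as an outline of the call sequence rather than the actual implementation.

```swift
import Foundation
import CoreAudio

// Phase 1: a tap over all system output. isExclusive inverts the semantics:
// the (empty) process list becomes an exclusion list, so everything is captured.
let tapDescription = CATapDescription()
tapDescription.isExclusive = true
tapDescription.processes = []

var tapID = AudioObjectID(kAudioObjectUnknown)
var status = AudioHardwareCreateProcessTap(tapDescription, &tapID)

// Phase 2: an aggregate device with an empty sub-device list...
let description: [String: Any] = [
    kAudioAggregateDeviceNameKey: "Hermes System Tap",   // illustrative name
    kAudioAggregateDeviceUIDKey: UUID().uuidString,
    kAudioAggregateDeviceIsPrivateKey: true,
    kAudioAggregateDeviceSubDeviceListKey: [Any](),
]
var aggregateID = AudioObjectID(kAudioObjectUnknown)
status = AudioHardwareCreateAggregateDevice(description as CFDictionary, &aggregateID)

// ...then attach the tap post-creation via the tap-list property.
var address = AudioObjectPropertyAddress(
    mSelector: kAudioAggregateDevicePropertyTapList,
    mScope: kAudioObjectPropertyScopeGlobal,
    mElement: kAudioObjectPropertyElementMain)
var tapList = [tapDescription.uuid.uuidString] as CFArray
status = AudioObjectSetPropertyData(
    aggregateID, &address, 0, nil,
    UInt32(MemoryLayout<CFArray>.stride), &tapList)

// A C-function-pointer IOProc; a capture-free Swift closure bridges to one.
let ioProc: AudioDeviceIOProc = { _, _, inInputData, _, _, _, _ in
    // inInputData holds the captured system-audio frames.
    return noErr
}
var ioProcID: AudioDeviceIOProcID?
status = AudioDeviceCreateIOProcID(aggregateID, ioProc, nil, &ioProcID)
status = AudioDeviceStart(aggregateID, ioProcID)
```

Note the closure passed as the IOProc must capture nothing, which is exactly why per-chunk state has to travel through the client-data pointer or globals rather than Swift captures.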
**Why not AVAudioEngine for system audio?**
`AVAudioEngine`'s `installTap()` is capped at ~100ms callback intervals, which is too coarse for this pipeline. CATap with a raw IOProc gives sample-accurate callbacks. AVAudioEngine works fine for mic capture, though.
### Microphone — AVAudioEngine
The microphone is captured via a standard `AVAudioEngine` with `installTap()` on the input node. The specific input device is set via `kAudioOutputUnitProperty_CurrentDevice` on the engine's audio unit.
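A minimal sketch of that mic path, under the assumption that the device ID is chosen elsewhere (`selectedDeviceID` below is a placeholder, not a Hermes symbol):

```swift
import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode

// Pin the input node to a specific capture device.
var selectedDeviceID: AudioDeviceID = 0 // placeholder; a real device ID in practice
if let audioUnit = input.audioUnit {
    AudioUnitSetProperty(
        audioUnit,
        kAudioOutputUnitProperty_CurrentDevice,
        kAudioUnitScope_Global, 0,
        &selectedDeviceID,
        UInt32(MemoryLayout<AudioDeviceID>.size))
}

// Standard tap on the input node; buffers are later resampled to 16kHz mono.
let format = input.outputFormat(forBus: 0)
input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
    // Hand buffer to AudioCaptureManager, labeled .me
}
do { try engine.start() } catch { print("engine start failed: \(error)") }
```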
### Cleanup Order
CATap resources must be cleaned up in a specific order to avoid crashes:
1. Stop the aggregate device
2. Destroy the IOProc
3. Destroy the aggregate device
4. Destroy the process tap
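Assuming the IDs produced during setup (`aggregateID`, `ioProcID`, `tapID` are illustrative names), the teardown reads roughly as:

```swift
// Order matters: stop I/O first, tear the audio graph down last.
AudioDeviceStop(aggregateID, ioProcID)                  // 1. stop the aggregate device
if let ioProcID {
    AudioDeviceDestroyIOProcID(aggregateID, ioProcID)   // 2. destroy the IOProc
}
AudioHardwareDestroyAggregateDevice(aggregateID)        // 3. destroy the aggregate device
AudioHardwareDestroyProcessTap(tapID)                   // 4. destroy the process tap
```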
## Transcription

### WhisperKit
Hermes uses WhisperKit (MIT license) with the large-v3-v20240930_626MB model running on Apple Neural Engine.
Key details:
- Batch API only — WhisperKit's streaming API is a Pro SDK feature. Hermes buffers 10 seconds of audio, then calls `transcribe(audioArray:)`.
- Input format — 16kHz mono Float32 (resampled via `AudioBufferConverter`).
- Special token stripping — WhisperKit output contains tokens like `<|startoftranscript|>`, `<|en|>`, and `<|endoftext|>`. These are stripped via regex post-processing.
- No VAD chunking — `chunkingStrategy: .vad` was removed because it caused special-token leakage into the output text.
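The token stripping described above reduces to a single regex pass. The `<|…|>` pattern below is an assumption based on the tokens listed, and the function name is illustrative:

```swift
import Foundation

/// Remove Whisper special tokens such as <|startoftranscript|>, <|en|>,
/// and <|endoftext|> from raw WhisperKit output, then trim whitespace.
func stripSpecialTokens(_ raw: String) -> String {
    raw.replacingOccurrences(of: "<\\|[^|]*\\|>", with: "", options: .regularExpression)
        .trimmingCharacters(in: .whitespacesAndNewlines)
}
```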
### Silence Gate
When no audio is playing on the system channel, Whisper hallucinates common phrases ("Thank you", "Bye", etc.) from zero-filled buffers. Hermes applies an RMS-based silence gate (threshold 0.001) to the `.them` channel only.
The gate is not applied to the `.me` (microphone) channel because mic audio has very low RMS (0.00006–0.0005) even with real speech.
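As a sketch, the gate is just an RMS check against the 0.001 threshold before a chunk is handed to Whisper (function names here are illustrative):

```swift
/// Root-mean-square level of a 16kHz mono Float32 chunk.
func rms(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(Float(0)) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

/// Applied to the .them channel only: below the threshold the chunk is
/// dropped instead of transcribed, avoiding hallucinated filler phrases.
func passesSilenceGate(_ samples: [Float], threshold: Float = 0.001) -> Bool {
    rms(of: samples) >= threshold
}
```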
### Speaker Turn Merging
Consecutive transcript segments from the same speaker are merged into a single growing line in the UI, rather than creating a new line for each 10-second chunk. This produces natural-looking paragraphs instead of fragmented output.
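The merging step can be sketched as a single pass over segments; `Segment` here is a hypothetical stand-in for Hermes's transcript model, not its real type:

```swift
/// Hypothetical segment shape for illustration.
struct Segment {
    var speaker: String   // "me" or "them"
    var text: String
}

/// Merge consecutive segments from the same speaker into one growing line,
/// instead of emitting a new line per 10-second chunk.
func mergeTurns(_ segments: [Segment]) -> [Segment] {
    var merged: [Segment] = []
    for segment in segments {
        if var last = merged.last, last.speaker == segment.speaker {
            last.text += " " + segment.text
            merged[merged.count - 1] = last   // write the grown turn back
        } else {
            merged.append(segment)            // speaker changed: new turn
        }
    }
    return merged
}
```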
## Persistence
Meeting sessions and transcript segments are stored via SwiftData (backed by SQLite) at:
The MeetingSession model stores session metadata (start time, end time, title) and its associated TranscriptSegment entries (speaker, text, timestamp).
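A hedged sketch of what those two models might look like, using only the properties named above; the actual relationships, attributes, and initializers in Hermes may differ:

```swift
import Foundation
import SwiftData

@Model
final class MeetingSession {
    var title: String
    var startTime: Date
    var endTime: Date?
    // Assumed cascade delete: removing a session removes its segments.
    @Relationship(deleteRule: .cascade) var segments: [TranscriptSegment] = []

    init(title: String, startTime: Date) {
        self.title = title
        self.startTime = startTime
    }
}

@Model
final class TranscriptSegment {
    var speaker: String   // "me" or "them"
    var text: String
    var timestamp: Date

    init(speaker: String, text: String, timestamp: Date) {
        self.speaker = speaker
        self.text = text
        self.timestamp = timestamp
    }
}
```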
## UI
| Component | Implementation |
|---|---|
| Menu bar icon | NSStatusItem with custom HermesIcon image asset |
| Floating overlay | NSPanel (.nonactivatingPanel, .floating) with SwiftUI hosted content |
| Collapse/expand | OverlayState ObservableObject bridges SwiftUI ↔ NSPanel, animated resize |
| Session history | Standard NSWindow with NavigationSplitView |
## Technology Stack
| Layer | Technology |
|---|---|
| Language | Swift 6 |
| UI framework | SwiftUI + AppKit (NSPanel) |
| Audio (system) | CATap, Core Audio C API |
| Audio (mic) | AVAudioEngine |
| Transcription | WhisperKit (Apple Neural Engine) |
| Persistence | SwiftData / SQLite |
| Build system | XcodeGen → Xcode |
| CI/CD | GitHub Actions |
| Distribution | DMG (signed & notarized) |