How we built a voice dictation app that types into any window
How we built a real-time, offline-capable speech-to-text desktop app in Python that injects keystrokes directly into whatever window you're working in.
Most dictation tools dump text into their own window. You transcribe, then copy-paste it somewhere else. That breaks your flow. Tacet listens to your mic and types directly into whatever app is focused. VS Code, Slack, your browser, a terminal. No clipboard, no plugins, no switching windows.
Existing options either lock you into one ecosystem (Apple Dictation, Google Voice Typing) or require app-specific integrations. We wanted something that works everywhere, runs offline so your voice never leaves your machine, and doesn’t need an API key to function. Nothing on the market did all three.
Who actually uses this
People who type a lot and want to talk instead. Writers drafting long-form content. Developers who want to dictate comments or documentation without leaving their editor. Anyone with RSI or accessibility needs who can’t keyboard all day. People who care about privacy and don’t want their voice data hitting a server.
The tech stack
Python 3.9+ with CustomTkinter for the desktop GUI. faster-whisper for local transcription (runs on CTranslate2). sounddevice for real-time mic capture. pynput for global hotkeys and keystroke injection. numpy for audio signal processing. pystray for the system tray icon. OpenAI API and Deepgram API as optional cloud providers.
Nothing fancy. We picked Python because the AI/ML ecosystem is there. faster-whisper is a Python wrapper around CTranslate2, a C++ inference engine, so transcription speed is not a Python problem. The GUI is the only part that’s “slow Python,” and it doesn’t need to be fast.
The hardest part: live preview
When you’re speaking, Tacet shows a real-time preview of what it thinks you’re saying. Whisper re-transcribes the same audio window every 900ms. The naive approach is: backspace the old preview, type the new one. That causes visible flickering, because every update deletes and retypes 40+ characters.
We solved it with stable-prefix diffing. On each update, we find the longest common prefix between the old text and the new text. Then we only backspace the suffix that changed and type the new suffix. If Whisper goes from “I want to” to “I want to talk about”, we don’t touch the nine characters that are already correct. We just append “ talk about”.
# Instead of backspacing the entire preview and retyping,
# we find what's already correct and only fix the tail.
common_len = 0
limit = min(len(old_text), len(new_text))
while common_len < limit and old_text[common_len] == new_text[common_len]:
    common_len += 1
# Only delete what changed
old_suffix_len = len(old_text) - common_len
backspace(self.kb, old_suffix_len)
# Only type what's new
new_suffix = new_text[common_len:]
if new_suffix:
    self._safe_type(new_suffix)
This cut visible keystroke churn by about 80% and made the preview feel smooth instead of jittery.
The other headache was voice activity detection. A fixed energy threshold doesn’t work because every mic has different background noise. We ended up tracking a noise floor using an exponential moving average, then setting speech/silence thresholds as multipliers above that floor. A quiet room and a noisy coffee shop both work without the user touching any settings.
# Fixed thresholds break in noisy rooms. Instead, we track
# the ambient noise level and set thresholds relative to it.
if noise_floor < 1e-6:
    noise_floor = rms
else:
    noise_floor = noise_alpha * rms + (1 - noise_alpha) * noise_floor
speech_threshold = max(self.energy_threshold, noise_floor * noise_speech_mult)
silence_threshold = max(self.energy_threshold, noise_floor * noise_silence_mult)
if rms >= speech_threshold:
    self._speaking = True
How the data flows
Four threads. The main thread runs the GUI. A background thread captures audio from the mic and runs voice activity detection, chunking audio by silence gaps. When a chunk is ready, it goes into a queue. Two transcription worker threads pull from that queue and run the audio through the selected engine (local Whisper, OpenAI API, or Deepgram API).
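If you squint, the hand-off between those threads is a plain producer/consumer queue. Here is a minimal sketch of the worker side; transcribe and handle_final_text are placeholders for the selected engine and the processor pipeline, not the actual function names in Tacet.

import queue
import threading

chunk_queue = queue.Queue()

def transcription_worker():
    # Consumer: block until an audio chunk arrives, transcribe it,
    # then hand the text to the processor pipeline.
    while True:
        chunk = chunk_queue.get()
        text = transcribe(chunk)        # placeholder: local Whisper or a cloud API
        handle_final_text(text)         # placeholder: the five-stage pipeline
        chunk_queue.task_done()

# Two workers, so one long utterance doesn't block the next one.
for _ in range(2):
    threading.Thread(target=transcription_worker, daemon=True).start()

# The audio thread is the producer: when silence closes a chunk,
# it calls chunk_queue.put(chunk) and keeps recording.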
Transcribed text flows through a five-stage processor pipeline: voice commands (“period” becomes “.”), timestamps, template expansion, auto-capitalization, and word replacements. The final text gets injected as keystrokes via pynput into whatever window is active.
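The pipeline itself is nothing more than str-to-str functions applied in order. The stages below are simplified stand-ins for the real processors, just to show the shape:

def expand_voice_commands(text: str) -> str:
    # Stage 1 (simplified): spoken punctuation becomes symbols.
    return text.replace(" period", ".")

def auto_capitalize(text: str) -> str:
    # Stage 4 (simplified): capitalize the first letter.
    return text[:1].upper() + text[1:]

def apply_word_replacements(text: str) -> str:
    # Stage 5 (simplified): user-defined substitutions.
    return text.replace("btw", "by the way")

def process(text: str) -> str:
    # The real pipeline has five stages (commands, timestamps, templates,
    # capitalization, replacements); each is just a str -> str function.
    for stage in (expand_voice_commands, auto_capitalize, apply_word_replacements):
        text = stage(text)
    return text

print(process("hello world period btw this works"))
# -> "Hello world. by the way this works"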
The engine and GUI are decoupled through callbacks. The engine doesn’t import anything from the GUI. It just calls on_status_change, on_preview_update, on_final_text. The GUI connects those to its update queue. You could run the engine headless without changing a line.
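Concretely, the engine exposes a handful of callbacks that default to no-ops, and a frontend assigns whatever it wants to them. A stripped-down sketch; the callback names match the ones above, everything else here is illustrative:

class DictationEngine:
    def __init__(self):
        # A frontend (GUI, CLI, tests) assigns these after construction.
        self.on_status_change = lambda status: None
        self.on_preview_update = lambda text: None
        self.on_final_text = lambda text: None

    def _finish_utterance(self, text: str) -> None:
        # The engine never imports or touches the GUI; it only fires callbacks.
        self.on_final_text(text)

# Headless use: print transcriptions instead of feeding a GUI update queue.
engine = DictationEngine()
engine.on_final_text = print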
What it can actually do
Works with any application
Tacet types directly into whatever window is focused. It uses pynput to inject keystrokes at the OS level, so it works with any app that accepts keyboard input. No browser extensions, no API integrations, no plugins. Open your email, start dictating, and the words appear. Switch to Slack, keep talking. It handles special characters like newlines and tabs by pressing the actual keys (Enter, Tab) instead of trying to type escape sequences.
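In pynput terms, that means routing ordinary characters through Controller.type() but tapping the real key for the handful of characters apps treat specially. A simplified version of the idea, not Tacet’s exact injection code:

from pynput.keyboard import Controller, Key

kb = Controller()

def inject(text: str) -> None:
    # Type printable characters normally, but press the actual key
    # for newlines and tabs so every app interprets them correctly.
    for ch in text:
        if ch == "\n":
            kb.tap(Key.enter)
        elif ch == "\t":
            kb.tap(Key.tab)
        else:
            kb.type(ch)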
Adaptive voice activity detection
The app automatically detects when you start and stop speaking. It tracks ambient noise levels using an exponential moving average, then sets speech detection thresholds as multipliers above that noise floor. It works in a quiet room or a noisy environment without manual calibration. It also buffers 300ms of audio before speech is detected (preroll), so you never lose the first syllable of a sentence.
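The preroll is just a bounded deque of recent audio blocks that gets stitched onto the front of a chunk the moment speech is detected. A sketch with an assumed 10ms block size (the real block size may differ):

from collections import deque

BLOCK_MS = 10                          # assumed block size
PREROLL_BLOCKS = 300 // BLOCK_MS       # keep ~300 ms of recent audio around

preroll = deque(maxlen=PREROLL_BLOCKS) # old blocks fall off automatically
current_chunk = []

def handle_block(block, speaking: bool) -> None:
    # Called for every audio block from the capture callback (simplified).
    if speaking:
        if not current_chunk:
            # Speech just started: prepend the buffered preroll so the
            # first syllable isn't lost to detection latency.
            current_chunk.extend(preroll)
            preroll.clear()
        current_chunk.append(block)
    else:
        preroll.append(block)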
Offline-first with cloud fallback
The default engine is faster-whisper running locally on your CPU. No API keys, no internet, no data leaving your machine. On first run, Tacet speed-tests the selected model. If it takes more than 5 seconds to transcribe 1 second of audio, it automatically downgrades to the “tiny” model and saves that preference. If you want higher accuracy and don’t mind cloud, you can switch to OpenAI or Deepgram with one config change.
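The speed test is just timing one transcription of a known-length clip. A sketch using faster-whisper; the starting model name and the use of silence as the test clip are assumptions here, only the 5x-real-time threshold comes from the behavior described above:

import time
import numpy as np
from faster_whisper import WhisperModel

def pick_model(preferred: str = "base") -> str:
    # Time a transcription of one second of audio (silence, for simplicity).
    model = WhisperModel(preferred, device="cpu", compute_type="int8")
    one_second = np.zeros(16_000, dtype=np.float32)   # Whisper expects 16 kHz mono

    start = time.perf_counter()
    segments, _info = model.transcribe(one_second)
    list(segments)                    # transcription is lazy; force it to run
    elapsed = time.perf_counter() - start

    # Slower than 5x real time? Fall back to "tiny" and save that preference.
    return preferred if elapsed <= 5.0 else "tiny"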
Customizable voice commands and text processing
Say “period” and it types “.”. Say “new line” and it presses Enter. Say “delete that” and it backspaces the last chunk. All of these are configurable. You can add your own commands, disable built-in ones, or set up word replacements (“btw” becomes “by the way”). Text flows through a five-stage pipeline: voice commands, timestamps, templates, auto-capitalization, and replacements. Each stage is independent and can be toggled on or off.
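Under the hood, all of that comes down to a few user-editable mappings and per-stage toggles. The structure below is illustrative, not Tacet’s actual config schema:

# Illustrative config structure, not the actual schema.
voice_commands = {
    "period": ".",
    "new line": "\n",
    "delete that": "__DELETE_LAST__",   # handled as an action, not literal text
}
word_replacements = {
    "btw": "by the way",
}
pipeline_stages = {
    "voice_commands": True,
    "timestamps": False,
    "templates": True,
    "auto_capitalization": True,
    "replacements": True,
}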
Trade-offs we made
We chose CustomTkinter over Electron. Electron would have given us a prettier UI with less effort, but it ships a whole Chromium browser. For an app that sits quietly in your system tray and uses minimal resources, that felt wrong. CustomTkinter is lighter, starts faster, and keeps the install size small. The trade-off is fewer UI components and less visual polish.
We went with rule-based text processing instead of using an LLM for post-processing. An LLM could fix grammar, add smarter punctuation, maybe even restructure sentences. But it would add latency, require either a cloud call or a second local model, and make the output unpredictable. When someone dictates into a chat window, they want their words typed now, not corrected two seconds later. Rules are fast, predictable, and free.
We skipped real-time streaming transcription for the local engine. Whisper works on complete audio chunks, not streams. We could have built a streaming pipeline with overlapping windows, but the complexity wasn’t worth it. Instead, we tuned the chunking (silence detection, preroll buffering, minimum chunk duration) to feel responsive enough. Most utterances process in under a second on a decent CPU.
We also chose to inject keystrokes instead of using the clipboard. Clipboard injection would be simpler and more reliable across platforms, but it overwrites whatever the user copied last. That’s a dealbreaker for most workflows.
What we’d do differently
The config system started as a flat JSON file and grew organically. We ended up with legacy format migration code for word replacements because the GUI needed a different structure than the original config. If we started over, we’d define a proper schema with versioning from day one instead of bolting on migration logic after the fact.
The threading model works, but it’s hand-rolled with locks, events, and utterance IDs for stale-detection. A library like concurrent.futures or an async approach with asyncio would have been cleaner. We’d also consider moving the engine into its own process to sidestep the GIL, since transcription is CPU-heavy and shares the interpreter with audio capture and the GUI.
We’d also invest in automated testing earlier. The callback-based architecture makes the engine testable in theory, but we don’t have a test suite that exercises the full pipeline. That’s technical debt we’re carrying.