local-AI accessibility os clang
"If a machine can think, it might think more intelligently than we do, and then where should we be?"
I have a dark secret to confess: I never learned to touch type properly, despite computer classes in grade school that required it and a long career spent working with magic boxes (computers).
My fingers just can’t seem to work with the keyboard in the “proper” home-row way of typing. I have a Frankenstein-like system wherein most letters are typed by my left hand, leaving my right to handle punctuation, return, and a few letters. It’s just enough muscle memory to be dangerous: I still need to glance at the keyboard periodically, and I’m fast enough that I’ve never felt compelled to fix it, but slow enough that extended writing sessions leave me frustrated and fatigued. Error-prone typing turns what should be quick note-taking into a slog of backspacing and retyping.
My interest was piqued when I read a great post last week by Hamilton Greene, a fellow Recurse Center alum, who was trying to run speech-to-text (STT) on Fedora Linux. They wanted a better, faster way to get words from their brain into their computer. It made sense: most of us can talk faster than we can type. Their solution worked by leveraging Google Docs: enable dictation with a shortcut, speak their thoughts, then copy-paste the results wherever they actually needed them. They concluded by saying they’d love a proper local-first solution that could output text directly at their cursor, but hadn’t found one yet and weren’t quite ready to build it themselves. They had articulated exactly what I needed, but unfortunately their workaround would never work for me, since I don’t have a Google account. Surely someone had solved this for Linux?!
After a bit of digging around I found WhisperTux. It was soooo dang close to what I wanted and needed. Transcription happened locally, with a Whisper model running on whisper.cpp as the STT engine. This meant:
- my data was private (I didn’t want my random thoughts on OpenAI’s or any other server)
- no internet dependency (I pay for bandwidth usage where I live and try to minimize use)
I installed it and tested it. Setup was simple: create a pyenv environment to install everything in, then download a local AI model. I ended up settling on the Whisper Small model after trying a few other sizes; it seemed like a good balance between transcription speed and accuracy. For a brief moment I thought my search was over.
Cracks appeared once I stopped working with the GUI. The Sway Wayland clipboard integration was broken, so I had to copy/paste to wherever I actually needed the text. Not a total deal-breaker, but super annoying for someone who lives and works in terminal and CLI interfaces. It did prove that local STT works on Linux, though, which was huge!
Whisper.cpp, the engine used by WhisperTux, seemed pretty straightforward. As I read through its documentation I couldn’t help but wonder: what if I stripped away the Python scaffolding and built something simpler that talked directly to it, and to Wayland? A small, focused C program that tied directly into whisper.cpp without the baggage. 🤔 Something that fit into my non-GUI setup, yet could still work for folks who use and love GUI environments (no shade!). That’s how Murmur was born!
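To give a feel for how small that surface is, here’s a rough sketch of the kind of whisper.cpp calls such a program revolves around. This is a simplified illustration, not Murmur’s actual code: audio capture and most error handling are omitted, and the exact function names can vary between whisper.cpp versions.

```c
// Minimal sketch: transcribe a buffer of 16 kHz mono float samples
// using whisper.cpp's C API. Illustration only; not Murmur's real code.
#include <stdio.h>
#include "whisper.h"

int transcribe(const char *model_path, const float *pcm, int n_samples) {
    // Load the local model (e.g. ggml-small.bin) from disk.
    struct whisper_context *ctx = whisper_init_from_file(model_path);
    if (!ctx) return 1;

    // Greedy sampling is the simplest/fastest decoding strategy.
    struct whisper_full_params params =
        whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.n_threads = 4;  // could come from the config file

    // Run the full transcription pipeline over the audio buffer.
    if (whisper_full(ctx, params, pcm, n_samples) != 0) {
        whisper_free(ctx);
        return 1;
    }

    // Print each transcribed segment; Murmur would hand this text
    // to the clipboard instead of stdout.
    int n = whisper_full_n_segments(ctx);
    for (int i = 0; i < n; i++)
        printf("%s", whisper_full_get_segment_text(ctx, i));
    printf("\n");

    whisper_free(ctx);
    return 0;
}
```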
Murmur aims to strip away all the GUI parts and GTK dependencies to deliver as small and lightweight a package as possible. It has a config file where users can supply custom settings: which local AI model to use, how many threads to allow whisper.cpp, and more. Two executables are built as part of the make process. One is a daemon (murmur-daemon) that starts as a user service on login. The other (murmur-toggle) calls the daemon to start/stop transcription recordings. The toggle can be invoked from the terminal manually or, more realistically, whenever a shortcut key is pressed.
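Just to give a flavour of what that looks like, here’s a hypothetical example of such a config. The key names, values, and file location below are made up for illustration; the real option names live in Murmur’s README.

```ini
# Hypothetical Murmur config (key names and path are illustrative only),
# e.g. ~/.config/murmur/config
model = ~/.local/share/murmur/ggml-small.bin  # which local Whisper model to load
threads = 4                                   # how many threads whisper.cpp may use
```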
It tries to be as flexible as possible, leaving keybindings to the user’s OS rather than trying to set them from a config setting and manage them ourselves. As an example, in my setup I use my Sway config file to start murmur-daemon once at login and then use a keybinding to execute the murmur-toggle binary:
########################
### Murmur STT ###
########################
exec murmur-daemon
## Toggle Yap-mode (recording start/stop)
bindsym $mod+Shift+y exec murmur-toggle
Notifications automagically keep me informed as to when to start yapping, when my data’s being crunched, and when it’s done and waiting for me to Ctrl-V it where I need it; all via notify-send. I think working to improve the auto-injection (via something like ydotool and/or dotool) could make it perfect and most widely supported. One last idea for improving and expanding the feature set: automatically clear the clipboard after pasting (for privacy purposes), or, more generally, clear it after N seconds regardless of whether auto-pasting is used.
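If I do go down that road, the glue would likely stay simple: shell out to existing tools rather than pull in libraries. Here’s a rough C sketch of that idea, assuming notify-send for the notifications and wl-clipboard’s `wl-copy --clear` for wiping the clipboard; this is illustrative only, not how Murmur is actually wired today.

```c
// Sketch of the desktop-integration glue: notifications via notify-send,
// and a possible privacy feature that wipes the clipboard after a delay
// by spawning wl-clipboard's `wl-copy --clear`. Illustration only.
#include <unistd.h>

// Pop a desktop notification, e.g. notify("Murmur", "Recording...").
static void notify(const char *summary, const char *body) {
    if (fork() == 0) {
        execlp("notify-send", "notify-send", summary, body, (char *)NULL);
        _exit(127);  // exec failed
    }
}

// Hypothetical privacy helper: clear the Wayland clipboard after
// `seconds` seconds, whether or not auto-pasting was used.
static void clear_clipboard_after(unsigned seconds) {
    if (fork() == 0) {
        sleep(seconds);
        execlp("wl-copy", "wl-copy", "--clear", (char *)NULL);
        _exit(127);
    }
}

int main(void) {
    notify("Murmur", "Listening... start yapping");
    clear_clipboard_after(30);
    return 0;
}
```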
For now, I’m super happy with it!! 💖✨ I have a low-resource footprint, local STT tool that sits quietly in the background for when I need it. It’s helped a ton already with writing this blog post, doing a lot of the typing and leaving me to handle the editing, proofreading and refinement! Feel free to check out Murmur! Let me know if it works for you, if it doesn’t work, if you have ideas to improve it, or if you want to help develop it.