Speech to Text Setup for Linux

In this article, I want to explain the speech-to-text setup I have on my Linux system. With LLMs becoming more integrated into my usage of computers, it is beneficial to be able to input natural language into the computer more quickly and with less effort. I also find myself using this setup to enter queries into search engines. If you read my previous article, you will find that this setup will improve your experience with the plugin described therein even further.

The script lets you start recording from your computer's microphone by holding down the Right Ctrl key. Once you've spoken your message, release the key to stop recording. The script then transcribes the audio and outputs the text as if you were pasting it from the clipboard, so be sure to place your cursor where you want the text inserted. The text also stays in the clipboard in case you need it again later, which I've found surprisingly useful.

Requirements

You will need the following:

bash
a microphone (I use a Zealsound K66)
parec and amixer (for controlling the microphone)
whisper.cpp and a speech-to-text model (for transcribing text)
xsel (for saving to the clipboard)
xdotool (for pasting the text from the clipboard)
sxhkd (for assigning a hotkey)
notify-send (for desktop notifications)

Setup

First, you will need a microphone. I am using a Zealsound K66. I got it for under $40 on Amazon. I don't know if it's good for anything else, but the sound quality is great to my ears, and it gets the job done. It uses USB and sits on my desk directly in front of me. I didn't want to use a headset microphone because I wanted to be able to use it without grabbing the headset if I didn't happen to already have it on.

The key piece of this setup is whisper.cpp. It does the actual work of transcribing the sound file containing your speech into text. You will need to compile the project from source and download a model for it to use. You can find instructions in the project's README under the Quick Start section. There are several models to choose from, but I have found that the 'tiny' model is fast and transcribes well enough. I was surprised at how quickly transcription runs on my old computer.

Once you have the project built, ensure the 'main' executable generated from building whisper.cpp is on your PATH. I set up a symbolic link with this command: ln -s ~/Projects/whisper.cpp/main ~/bin/whisper. I put my model file in ~/.cache/whisper/ggml-tiny.en.bin. Both are used in the following script, which I have named speech-to-text.
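Before wiring up the hotkey, it can help to confirm that everything the script relies on is actually installed. The following is a minimal sketch of such a check, not part of my script; the function names check_commands and check_model are made up for this example.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the speech-to-text script.

# Report every command that is missing from PATH; succeed only if none are.
check_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" > /dev/null 2>&1; then
      echo "Missing command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeed only if the model file exists and is readable.
check_model() {
  if [ ! -r "$1" ]; then
    echo "Model file not found: $1" >&2
    return 1
  fi
}

# Example call, using the names from this article's setup:
# check_commands whisper parec amixer xsel xdotool notify-send &&
#   check_model "$HOME/.cache/whisper/ggml-tiny.en.bin"
```

Running the example call once from a terminal should print nothing if the setup is complete, and one line per missing piece otherwise.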

#!/bin/bash

SPEECH_FILE="/tmp/mic_input.wav"
MODEL_FILE="$HOME/.cache/whisper/ggml-tiny.en.bin"

if [ "$1" = "stop" ]; then
  echo "Stopping recording..."
  set-status --not-recording
  pkill -SIGTERM parec
  exit 0
fi

echo "Starting speech-to-text process..."

echo "Display recording status..."
set-status --recording

echo "Starting audio recording..."
amixer set Mic -D hw:K66 cap > /dev/null # Unmute
timeout 180s parec --rate=16000 --latency-msec=20 --file-format=wav \
  -d alsa_input.usb-K66_K66_20190805V001-00.analog-stereo > "$SPEECH_FILE"
amixer set Mic -D hw:K66 nocap > /dev/null # Mute

echo "Transcribing audio..."
TRANSCRIPTION=$(
  whisper -nt -np -m "$MODEL_FILE" -f "$SPEECH_FILE" | \
    sed '/^$/d;s/^[[:space:]]*//')

if [ -n "$TRANSCRIPTION" ] && [ "$TRANSCRIPTION" != "[BLANK_AUDIO]" ]; then
  echo "Transcription: '$TRANSCRIPTION'"
  # printf avoids echo's option parsing if the text happens to start with a dash
  printf '%s' "$TRANSCRIPTION" | xsel -b && xdotool key shift+Insert
  notify-send "Copied text to clipboard." "$TRANSCRIPTION"
  echo "Done."
else
  echo "No text to output. The transcription may have failed or produced no results."
  notify-send "No text to output." "The transcription may have failed or produced no results."
fi
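
The cleanup and blank-audio check in the script above can be tried in isolation, without a microphone. This is only a sketch: the function names clean_transcription and has_text are hypothetical, and the sample input merely imitates the shape of whisper's output (a blank line followed by an indented sentence), which is an assumption for illustration.

```shell
#!/bin/bash
# Same filter as the script: delete empty lines, then strip leading whitespace.
clean_transcription() {
  sed '/^$/d;s/^[[:space:]]*//'
}

# Treat empty output and the [BLANK_AUDIO] marker as "nothing to paste".
has_text() {
  [ -n "$1" ] && [ "$1" != "[BLANK_AUDIO]" ]
}

# Made-up sample input; real whisper output may differ.
cleaned=$(printf '\n Hello from the microphone.\n' | clean_transcription)

if has_text "$cleaned"; then
  echo "Would paste: '$cleaned'"
fi
```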

The script is called with a start or stop parameter. (Strictly, any argument other than stop starts a recording.) With start, it unmutes the mic and records audio to /tmp/mic_input.wav for at most 180 seconds. When you call the script again with stop, it sends SIGTERM to parec, which ends the recording. The original run of the script then continues: it mutes the mic and transcribes the recording using whisper. The transcribed text is copied to the clipboard and immediately pasted wherever your cursor is located. The set-status command is a separate helper, not shown here, that displays the recording status.
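
One thing to be aware of is that pkill signals every process named parec, not only the one this script started. A possible alternative, sketched below under stated assumptions, is to remember the recorder's PID in a file and signal only that process. The PID_FILE path and the function names start_recording and stop_recording are hypothetical and not part of the script above.

```shell
#!/bin/bash
# Hypothetical PID-file location; override by setting PID_FILE beforehand.
PID_FILE="${PID_FILE:-/tmp/speech-to-text.pid}"

# Run the given recorder command in the background, save its PID,
# and block until it exits (for example, because stop_recording signalled it).
start_recording() {
  local pid status=0
  "$@" &
  pid=$!
  echo "$pid" > "$PID_FILE"
  wait "$pid" || status=$?
  rm -f "$PID_FILE"
  return "$status"
}

# Send SIGTERM only to the recorder that start_recording launched.
stop_recording() {
  [ -f "$PID_FILE" ] || return 1
  kill -TERM "$(cat "$PID_FILE")"
}
```

In the script, the recording line would then be prefixed with start_recording, keeping the redirect to "$SPEECH_FILE", and the stop branch would call stop_recording instead of pkill. GNU timeout forwards SIGTERM to the command it runs, so signalling timeout's PID should still end parec.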

To execute the script, I use the hotkey daemon, sxhkd, with the following configuration. This executes the script with the start parameter when I press the Right Ctrl key down and with the stop parameter when I release it.


{_, @}Control_R
        speech-to-text {start, stop}

I wanted a single key press and release for the hotkey, and luckily, sxhkd allows this with modifier keys: the @ prefix runs a command on key release instead of key press. The only time I use Right Ctrl is to backspace a word, so using it doesn't interfere too much with normal work. Another option for the hotkey would be Right Alt, which I don't think I use at all. Regardless, this is configurable in the hotkey daemon.

Issues

A slight problem I ran into was that the microphone would fail to record the first time I used it after logging in. I fixed this by adding the following command to my .xinitrc. It records from my mic for a tenth of a second on startup, so the next recording works properly.

#!/bin/bash
timeout 0.1s parec -d alsa_input.usb-K66_K66_20190805V001-00.analog-stereo > /dev/null &

Another issue with this setup is that, currently, it could interfere with other uses of the microphone. Since the script mutes and unmutes the mic unconditionally, if you are using the mic for another purpose, like voice chat for example, the script will mute the mic, interrupting your chat. This would need to be fixed in the script somehow.
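
One possible direction, sketched here and untested against other sound cards, is to check whether capture was already enabled before recording and to mute afterwards only if it was not. The function names capture_is_on and with_mic_unmuted are hypothetical. The sketch also assumes that amixer reports an enabled capture switch as "[on]" on a line containing "Capture", which may not hold for every device.

```shell
#!/bin/bash
# Read amixer's report for a control on stdin; succeed if capture is enabled.
# Assumption: an enabled capture switch appears as "[on]" on a "Capture" line.
capture_is_on() {
  grep -q 'Capture.*\[on\]'
}

# Run a command with the mic unmuted. Afterwards, mute the mic again only
# if it was muted before, so another program using the mic is left alone.
with_mic_unmuted() {
  local was_on=no status=0
  if amixer -D hw:K66 get Mic | capture_is_on; then
    was_on=yes
  fi
  amixer -D hw:K66 set Mic cap > /dev/null
  "$@" || status=$?
  if [ "$was_on" = no ]; then
    amixer -D hw:K66 set Mic nocap > /dev/null
  fi
  return "$status"
}
```

The recording command from the script would be passed to with_mic_unmuted in place of the two unconditional amixer calls. This only covers the mute state; it does not stop two programs from recording at the same time.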

Conclusion

Even though it requires a good bit of setup, it is worth it for how easy it makes conversing with LLMs. This is now my preferred way to input natural language into the computer. When I first started using this setup, it felt like I was in Star Trek, but it feels pretty comfortable now.

If you want to find out how this setup can be used to ask an LLM for help on the command line, check out my previous article.

The speech-to-text script shown in this post is a simplified version. You can view the current version I use in my dotfiles repository on my GitHub. Feel free to modify it for your use.

My Dumb Site