If anyone is interested, these are the super-short zsh/bash functions I keep in my .zshrc for doing the same thing with plain whisper.cpp, ffmpeg and yt-dlp (`brew install whisper-cpp yt-dlp` on a Mac). They output VTT subtitles, but it's easy enough to switch to txt.
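
# Download a video's audio with yt-dlp and transcribe it to "$output_base.vtt".
# Usage: yt_to_srt URL OUTPUT_BASE [LANGUAGE]; $WHISPER_MODEL must point at a ggml model file.
yt_to_srt() {
  local url="$1"
  local output_base="$2"
  local language="${3:-en}"
  # Extract the audio as 16 kHz WAV; stop here if the download fails.
  yt-dlp -x --audio-format wav --postprocessor-args "-ar 16000" -o "$output_base.wav" "$url" || return 1
  whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$output_base.wav"
  rm "$output_base.wav"
}

# Transcribe a local audio/video file to "<name>.vtt" in the current directory.
# Usage: file_to_srt FILE [LANGUAGE]
file_to_srt() {
  local filepath="$1"
  local language="${2:-en}"
  local filename="$(basename "$filepath")"
  local filename_no_ext="${filename%.*}"
  local output_base="$filename_no_ext"
  # Distinct temp name, so a .wav input in the current directory is never overwritten or deleted.
  local temp_wav="$output_base.tmp16k.wav"
  # Drop video, convert to 16 kHz mono 16-bit PCM.
  ffmpeg -i "$filepath" -vn -acodec pcm_s16le -ar 16000 -ac 1 "$temp_wav" || return 1
  whisper-cli --language "$language" --model "$WHISPER_MODEL" --split-on-word --max-len 65 --output-vtt --output-file "$output_base" --file "$temp_wav"
  rm "$temp_wav"
}
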
There is also a bootstrap script for the large-v3-turbo model, from my chezmoi dotfiles:

#!/bin/bash
# Download whisper.cpp models from Hugging Face (runs once per machine).
set -euo pipefail

MODELS_DIR="$HOME/whisper-models"
BASE_URL="https://huggingface.co/ggerganov/whisper.cpp/resolve/main"
MODELS=("ggml-large-v3-turbo.bin" "ggml-tiny.bin")

mkdir -p "$MODELS_DIR"
for model in "${MODELS[@]}"; do
  if [ ! -f "$MODELS_DIR/$model" ]; then
    echo "Downloading $model..."
    # --fail keeps an HTTP error page from being saved as a model; the .part
    # file keeps an interrupted download from being skipped on the next run.
    curl -L --fail --progress-bar -o "$MODELS_DIR/$model.part" "$BASE_URL/$model"
    mv "$MODELS_DIR/$model.part" "$MODELS_DIR/$model"
  else
    echo "$model already exists, skipping."
  fi
done
echo "Whisper models ready at $MODELS_DIR"

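On the "easy enough to change it to txt" point: as far as I know, whisper-cli also accepts `--output-txt` in place of `--output-vtt`. If you already have VTT files lying around, a rough sketch like the one below strips them down to plain text. The function name `vtt_to_txt` is made up for illustration, and it assumes the simple cue layout whisper.cpp writes (a `WEBVTT` header, timing lines containing `-->`, then text), not every feature of the WebVTT format.

```shell
# Hypothetical helper: print only the spoken text from a simple VTT file.
vtt_to_txt() {
  awk '
    { sub(/\r$/, "") }                  # tolerate CRLF line endings
    NR == 1 && /^WEBVTT/ { next }       # skip the file header
    /-->/ { next }                      # skip cue timing lines
    /^[[:space:]]*$/ { next }           # skip blank separator lines
    { sub(/^[[:space:]]+/, ""); print } # trim leading whitespace, keep the text
  ' "$1"
}
```

Usage would be something like `vtt_to_txt talk.vtt > talk.txt`.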
Wouldn't it still be more efficient to do the transcription on a GPU anyway? Or is this something where we could actually put the effectively useless NPUs in modern laptops to use?
So, this project consists of a ~175 line README and a ~500 line Python program that glues yt-dlp and Kroko together. Neat.
I guess if it encourages you to install and figure out how to use ffmpeg, yt-dlp, kroko, numpy, and onnx that's a good thing. Sometimes just knowing a thing is possible is a huge benefit.
Thank you. You nailed the actual value, that's right. The real win is just knowing you can do this on a laptop CPU, offline, with no GPU or cloud bill. There are tiny done-for-you details, like rescaling token timestamps back to real time after the atempo speedup so --timestamps doesn't lie to you, but they are minor.
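For anyone wondering what the timestamp rescaling amounts to, here is the arithmetic as a minimal sketch. It is not yapsnap's actual code; the function name `rescale_ts` is hypothetical. The assumption is that the audio was sped up with ffmpeg's `atempo=SPEED` filter, so the sped-up clip is SPEED times shorter and a timestamp measured on it maps back to the original by multiplying by SPEED.

```shell
# Hypothetical helper: map a timestamp (in seconds) measured on sped-up audio
# back to the original timeline. original = t * speed for atempo=speed.
rescale_ts() {
  local t="$1" speed="$2"
  awk -v t="$t" -v s="$speed" 'BEGIN { printf "%.3f\n", t * s }'
}
```

For example, a token reported at 10 s in audio sped up 1.5x would sit at `rescale_ts 10 1.5`, i.e. 15 s, in the original recording.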
Do you have a link to results confirming this? Kroko does not seem to be on the Open ASR Leaderboard. Parakeet has an average WER of 6.32 across several common datasets.
I see the value as a centralized anti-content-blocker.
This repo is now a good way to centralize hacks around the sure-to-come blockers those platforms will add to prevent download.
Just like uBlockOrigin was a way to centralize all the "just run this greasemonkey script" comments, I can see this getting a huge following for people who really value transcriptions.
Had Claude test it out on 3 videos. Worked at 5-8x realtime. The beauty of it is that it works on all videos, not just the ones with transcripts. Combine it with YouTube search and LLM takeaways from transcripts, and you have super-efficient content consumption. There are SaaS products that charge 1 cent per video for those with transcripts. There is a viable product in here somewhere, methinks.
Just download the model for your preferred language, all hosted on the Kroko-ASR collection here: https://huggingface.co/Banafo/Kroko-ASR/tree/main
Right now you have Dutch, French, Portuguese, Spanish, German, Italian, Swedish, Swiss German, Hebrew, and Turkish. Grab the one that matches your audio, point yapsnap at it with --model (or set KROKO_MODEL), and you're set!
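A rough sketch of the download step, in case it helps. Everything here is an assumption layered on the link above: the helper names `kroko_model_url` and `fetch_kroko_model` are made up, the `/resolve/main/` path is Hugging Face's usual direct-download pattern, the default directory is arbitrary, and I am assuming KROKO_MODEL holds a path to the model file. The file name has to be copied from the repo listing; none is hard-coded here.

```shell
# Assumed direct-download base for the Kroko-ASR repo linked above.
KROKO_BASE_URL="https://huggingface.co/Banafo/Kroko-ASR/resolve/main"

# Hypothetical helper: build the download URL for a file path in the repo.
kroko_model_url() {
  printf '%s/%s\n' "$KROKO_BASE_URL" "$1"
}

# Hypothetical helper: fetch a model once and export KROKO_MODEL.
# Usage: fetch_kroko_model REMOTE_PATH [TARGET_DIR]
fetch_kroko_model() {
  local remote="$1" dir="${2:-$HOME/kroko-models}"
  local name="${remote##*/}"   # local file name without any repo subfolder
  mkdir -p "$dir" || return 1
  if [ ! -f "$dir/$name" ]; then
    # Download to a .part file first so a failed transfer is not mistaken for a model.
    curl -L --fail --progress-bar -o "$dir/$name.part" "$(kroko_model_url "$remote")" &&
      mv "$dir/$name.part" "$dir/$name" || return 1
  fi
  export KROKO_MODEL="$dir/$name"
}
```

After `fetch_kroko_model <file-from-the-repo-listing>`, yapsnap should pick the model up from KROKO_MODEL, or you can pass the same path with --model.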
YouTube has transcripts on most videos, not all. The others don't expose them. If you mean the "transcript APIs" for TikTok/IG/X, they are all transcribing audio like yapsnap does. If you have a way to pull native ones, let me know, genuinely curious.
NPUs: definitely a good use case for at least part of it. There are ports of whisper that use Core ML/ANE, with lower power draw and 3x the speed of CPU-only.