I've been working on a karaoke app called Nightingale. You point it at your music folder and it turns your songs into karaoke - separates vocals from instrumentals, generates word-level synced lyrics, and lets you sing with highlighted lyrics and pitch scoring. Works with video files too.
Everything runs locally on your machine, nothing gets uploaded. No accounts, no subscriptions, no telemetry.
It ships as a single binary for Linux, macOS, and Windows. On first launch it sets up its own isolated Python environment and downloads the ML models it needs - no manual installation of dependencies required.
My two biggest drivers for the creation of this were:
The lack of karaoke coverage for niche, avant-garde, and local tracks.
Nostalgia for the good old cheesy karaoke backgrounds with flowing rivers, city panoramas, etc.
Some highlights:
Stem separation using the UVR Karaoke model (preserves backing vocals) or Demucs
Automatic lyrics via WhisperX transcription, or fetched from LRCLIB when available
Pitch scoring with player profiles and scoreboards
Gamepad support and TV-friendly UI scaling for party setups
GPU acceleration on NVIDIA (CUDA) and Apple Silicon (CoreML/MPS)
Built with Rust and the Bevy engine
The whole stack is open source. No premium tier, no "open core" - just the app.
Amazing work! I am thrilled someone was motivated to approach this problem and develop a creative solution like this. There are very limited options for Karaoke, especially in the FOSS space. Most Karaoke apps are super limited and that's driven many Karaoke enjoyers I know to YouTube in search of the songs they want to sing. This solution would give them the power to do even more songs, even better than what's out there now!
Questions for you:
1. What CUDA capability level is necessary for Nvidia GPU accelleration to work?
3. Are there any plans to support iGPU/NPU accelleration on AMD and Intel? Asking because those chips are most common in the mini computers sold at low cost these days.
My family members who love Karaoke and will be happy to try this. Looking forward to it!
Just tried it with B.E.D - Walk Away[0], unfortunately it lost track of the lyrics after 30 secs (Model is "large-v3"). Will play around a bit more, as it would be great to have a working karaoke generator.
Some quick feedback:
- Needs a way to skip for-/backwards during playback to validate the result
- Sentences seem to be recognized (first letter has uppercasing), but periods aren't added
- Needs an option to edit results from the track analysis
indeed, I'm running to two problems on the analyzer side:
1. align model sliding off (especially w/ chorus/back vocals present)
2. transcript skipping parts of lyrics in lyrics-heavy tracks (I tried a lot of russian rap, lol)
happy for contributions as I'm not that experienced w/ machine learning side of the project, mostly it was emperical "tweak the parameters and look what is changed"
I haven't tried Mandarin and Cantonese, but tried Japanese. back at that time, it performed poorly. however, I've tweaked a bunch of settings since then, so maybe it has changed. Hanzi is a supported font and can be output, but the transcript/alignment quality might not be the best
I've worked on a small toy project with a similar purpose in the past [1], though it's not nearly as polished as yours, and I've made some questionable decisions here and there.
I have questions about pitch tracking. It seems you do track the pitch for scoring, and there's a line at the top of the screen that seems related but that I can't figure out. For my use case, an important feature of karaoke apps is displaying how "high" the next note should be sung, or at least some hints. Is it something your app can do and I just haven't figured it out? Or would it be a feature request?
Really nice project, I'm looking forward to trying it!
Would it be possible to process songs on one device, and then use the result in another, or even multiple? Or would it be possible to run as separate server / client?
I ask mainly because the device I connect to my TV is definitely not the most powerful one, so it would be nice if I can preprocess the songs elsewhere.
My wife is a huge karaoke fan. I'm especially interested in the pitch scoring since we usually play the karaoke games on older consoles for that exact feature. Nobody really makes games like that anymore without a subscription (and most of these good modern karaoke platforms are exclusive to east asia anyways). If this works well this could make for some really fun social events, looking forward to trying this.
I don’t want to take away from OPs project (which seems really nice) but have you tried Ultrastar Deluxe ( https://usdx.eu/ ) in combination with USDB ( https://usdb.animux.de/ )?
This looks great, but I don't understand what it's supposed to do. I assumed the idea was "remove the lyrics" but of the 5 songs I tried (from Cry Cry Cry, Indigo Girls, and Suzanne Vega), none seemed to have any change from the original at all - it's showing the words on the screen (and the timing is perfect) but it's not removing the singing at all. How do you turn off the singing?
you can use + / - buttons on the keyboard to change the level of guidance according to your preference, generally there is a controls legend in the top right corner
indeed! the stem separation model is not ideal. you can try to change the stem separation model in the settings and reanalyze the song (make sure to click the trash button and then analyze, refresh button does not refresh the stems, only the transcript/alignment)
Yes, the free songs on YouTube aren't customizable: there's a low voice guide that you can't turn off (or turn up...), and you can't change tempo.
You can do this from their huge catalog of songs, using their official app or their web client: https://www.karafun.com/web/
Plus, they have music quizzes you can play with many people using smartphones as remote controls. It's super fun for parties where people don't want to sing all the time.
No joke, feel free to try it! The beauty of the approach is that you don't have any limitations. Just be prepared to degraded experience, sometimes models struggle even with the simplest pop tracks :D
Everything runs locally on your machine, nothing gets uploaded. No accounts, no subscriptions, no telemetry.
It ships as a single binary for Linux, macOS, and Windows. On first launch it sets up its own isolated Python environment and downloads the ML models it needs - no manual installation of dependencies required.
My two biggest drivers for the creation of this were:
Some highlights: The whole stack is open source. No premium tier, no "open core" - just the app.Feedback and contributions welcome.
Questions for you:
1. What CUDA capability level is necessary for Nvidia GPU accelleration to work?
3. Are there any plans to support iGPU/NPU accelleration on AMD and Intel? Asking because those chips are most common in the mini computers sold at low cost these days.
My family members who love Karaoke and will be happy to try this. Looking forward to it!
Some quick feedback:
Thanks for keeping it FOSS![0]: https://www.youtube.com/watch?v=_MFT4H3VoNE
indeed, I'm running to two problems on the analyzer side: 1. align model sliding off (especially w/ chorus/back vocals present) 2. transcript skipping parts of lyrics in lyrics-heavy tracks (I tried a lot of russian rap, lol)
happy for contributions as I'm not that experienced w/ machine learning side of the project, mostly it was emperical "tweak the parameters and look what is changed"
I've worked on a small toy project with a similar purpose in the past [1], though it's not nearly as polished as yours, and I've made some questionable decisions here and there.
I have questions about pitch tracking. It seems you do track the pitch for scoring, and there's a line at the top of the screen that seems related but that I can't figure out. For my use case, an important feature of karaoke apps is displaying how "high" the next note should be sung, or at least some hints. Is it something your app can do and I just haven't figured it out? Or would it be a feature request?
[1] https://github.com/eckter/karaoke_helper
Would it be possible to process songs on one device, and then use the result in another, or even multiple? Or would it be possible to run as separate server / client?
I ask mainly because the device I connect to my TV is definitely not the most powerful one, so it would be nice if I can preprocess the songs elsewhere.
There’s also a program for automatically downloading the songs: https://github.com/bohning/usdb_syncer
both transcript/alignment might not work perfectly, but it really depends on the song
you can use + / - buttons on the keyboard to change the level of guidance according to your preference, generally there is a controls legend in the top right corner
https://nightingale.cafe/docs/controls
[1] https://www.karafun.com/
You can do this from their huge catalog of songs, using their official app or their web client: https://www.karafun.com/web/
Plus, they have music quizzes you can play with many people using smartphones as remote controls. It's super fun for parties where people don't want to sing all the time.
Impressive, very nice. Now let's see my death metal collection.
Just joking! Very nice, thanks for open-sourcing it.
https://www.yarg.in
https://www.enchor.us/?&hasVocals=true
need more pioneers to try it out and spread the word!