The title is a bit confusing as open-source separation of ... reads like source separation, which this is not. Rather, it is a pitch detection algorithm which also classifies the instrument the pitch originated with.
I think it's really neat, but the results look like it could take more time to fix the output than using a manual approach (if really accurate results are required).
Is “source separation” better known as “stem separation” or is that something else? I think the latter term is the one I usually hear from musicians who are interested in taking a single audio file and recovering (something approximating) the original tracks prior to mixing (i.e. the “stems”).
Audio Source Separation I think is the general term used in research. It is often applied to musical audio though, where you want to do stem separation - that's source separation where you want to isolate audio stems, a term referring to audio from related groups of signals, e.g. drums (which can contain multiple individual signals, like one for each drum/cymbal).
Stem separation refers to doing it with audio playback fidelity (or an attempt at that). So it should pull the bass part out at high enough fidelity to be reused as a bass part.
This is a partly solved problem right now. Some tracks and signal types can be unmixed easier than others, it depends on what the sources are and how much post-processing (reverb, side chaining, heavy brick wall limiting and so on)
I'd agree with the partly. I have yet to find one that either isolates an instrument as a separate file or removes one from the rest of the mix that does not negatively impact the sound. The common issues I hear are similar to the early internet low bit rate compression. The new "AI" versions are really bad at this, but even the ones available before the AI craze were still susceptible
I'm far (far) from an expert in this field, but when you think about how audio is quantized into digital form, I'm really not sure how one solves this with the current approaches.
That is: frequencies from one instrument will virtually always overlap with another one (including vocals), especially considering harmonics.
Any kind of separation will require some pretty sophisticated "reconstruction" it seems to me, because the operation is inherently destructive. And then the problem becomes one of how faithful the "reproduction" is.
This feels pretty similar to the inpainting/outpainting stuff being done in generative image editing (a la Photoshop) nowadays, but I don't think anywhere near the investment is being made in this field.
Very interested to hear anyone with expertise weigh in!
I won't say expertise, but what I've done recently:
1) used PixBim AI to extract "stems" (drums, bass, piano, all guitars, vocals). Obviously a lossless source like FLAC works better than MP3 here
2) imported the stems to ProTools.
3) from there, I will usually re-record the bass, guitars, pianos and vocals myself. Occassionally the drums as well.
This is a pretty good way I found to record covers of tracks at home, re-using the original drums if I want to, keeping the tempo of the original track intact etc. I can embellish/replace/modify/simplify parts that I re-record obviously.
It's a bit like drawing using tracing paper, you're creating a copy to the best of your ability, but you have a guide underneath to help you with placement.
It's not really digital quantisation that's the problem, but everything else that happens during mixing - which is a much more complicated process, especially for pop/rock/electronic etc., than just "sum all the signals together".
There's a bunch of other stuff that happens during and after summing which makes it much harder to reliably 100% reverse that process.
I didn't mean to say that quantization was the problem, just that you're basically trying to pick apart a "pixel" (to continue my image-based analogy) that is a composite of multiple sounds (or partially-transparent image layers).
I was sincere when I said:
> I'm really not sure how one solves this with the current approaches.
I was hoping someone would come along and say it is, in fact, possible. :)
I didn't see it referenced directly anywhere in this post. However, for those interested, automatic music transcription (i.e., audio->MIDI) is actually a decently sized subfield of deep learning and music information retrieval.
There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:
He's trying to solve a second (also hard ish) problem as well, deriving an accurate musical score from MIDI data. It's a "sounds easy but isn't" problem, especially when audio to MIDI transcribers are great at pitch and onset times, but rather less reliable at duration and velocity.
I agree that the audio->score and MIDI->score problems are quite hard. There has been research in this area too, however it is far less developed than audio->MIDI.
That's because MIDI doesn't contain all the information that was in a score.
Scores are interpreted by musicians to create a performance, and MIDI is a capture of (some of) the data about that performance. Music engraving is full of implicit and explicit cultural rules, and getting it _right_ has parallels with handwritten kanji script in terms of both the importance of correctness to the reader, and the amount of traps for the unwary or uncultured.
All of which can be taken to mean "classical musicians are incredibly picky and anal about this stuff", or, "well-formed music notation conveys all sorts of useful contextual information beyond simply 'what note to play when'".
A lot of modern scores are written with MIDI in mind (whether or not the composer knows it - that's how they hear it the first 50 or so times). That should make it somewhat easier to go MIDI -> score for similar pieces. Current attempts I have seen still make a lot of stupid errors like making note durations too precise and spelling accidentals badly. There's probably still a lot of low-hanging fruit.
This is absolutely not easy, though, given all the cultural context. Things like picking up a "legato" or "cantabile" marking and choosing an accent vs a dagger or a marcato mark are going to be very difficult no matter what.
I ported their colab to runtime so I could use it more easily.
The MIDI output is... puzzling?
I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.
Multi-track transcription has a long way to go before it seriously useful for real-world applications. Ultimately I think that converting audio into MIDI makes a lot more sense for piano/guitar transcription than it does for complex multi-instrument works with sound effects ect...
Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.
I developed https://pyaar.ai, it uses MT3 under the hood. I realized that continuous string instruments (guitar) that have things like slides, bends are quite difficult to capture in MIDI. Piano works much better because it's more discrete (the keys abstract away the strings) and so the MIDI file has better representation
> I realized that continuous string instruments (guitar) that have things like slides, bends are quite difficult to capture in MIDI.
It's just pitch bend?
I think trying to transcribe as MIDI is just a fundamentally flawed approach that has too many (well known) pitfalls to be useful.
A trained human can listen to a piece and transcribe it in seconds, but programming it as MIDI could take minutes/hours. If you're not trying to replicate how humans learn by ear, you're probably approaching this wrong.
Essentially, the leading way to do automatic music transcription is to train a neural network on supervised data, i.e., paired audio-MIDI data. In the case of piano recordings, there is a very good dataset for this task which was released by Google in 2018:
Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:
It can even export the separated tracks as midi files. It still has some problems but works very well. Stem separation is now standard in the musical software and almost every DAW provides it.
RipX can do stem separation and allows repitching notes in the mix. If that is what you want to do it is great.
I find moises (https://moises.ai/) to be easy to use for the tasks I need to do. It allows transposing or time scaling the entire song. It does stem separation and has a simple interface for muting and changing the volume on a per-track basis. It auto-detects the beat and chords.
I'm not affiliated, just a happy nearly-daily user for learning and practicing songs. I boost the original bass part and put everything else at < 10% volume to hear the bass part clearly clearly (which often shows how bad online transcriptions are, even paid ones). Once once I know the part, I mute the bass part and play along with the original song as if I was the bass player.
This is really cool, but there's real-world instrument physics that might not be captured by simple Fourier transform templates, like a trumpet playing softly can have a significantly different harmonic spectrum than the same trumpet playing loudly, even at the same pitch
Trumpets produce a rich harmonic series with strong overtones, meaning their Fourier transform would show prominent peaks at integer multiples of the fundamental frequency. Instruments like flutes have more pure tones, but brass instruments typically have stronger higher harmonics, which would lead to more complex partial derivatives in the matrix equation shown in the article
So this script uses bandpass filtering and cross-correlation of attack/release envelopes to identify note timing. Given that brass instruments can exhibit non-linear behavior where the harmonic content changes significantly with playing intensity (think of the brightness difference between pp and ff passages), not sure how would this algorithm could handle intensity-dependent timbral variations. I'd consider adding intensity-dependent Fourier templates for each instrument to improve accuracy
As someone who uses source separation twice a week for mixing purposes the number of other instruments that can produce sounds of "vocal" quality is high. These models all stop functiining well when you have bands where the instruments don't sound typical and aren't played and/or mixed in a way that achieves maximum separation between them — e.g. an electrical guitar with a distorted harmonic hitting the same note as your singer while the drummer plays only shrieking noises on their cymbals and the bass player simulates a punching kick drum on their instrument.
In these situations (experimental music) source separation will produce completely unpredictable results, thst may or may not be useful for musical rebalancing.
What tool do you use for the source separation? Everything I've used so far is great for learning or transcribing to MIDI but the separated tracks always have a strange phasing sound to them. Are you doing something to clean that up before mixing back in or are the results already good enough?
Looks like this may be the work of Joshua Bird's little brother (?). Joshua bird did some impressive projects already, that were featured on HN before: https://www.youtube.com/@joshuabird333
Someone created something. Its value greatly exceeds the perceived "degradation of the environment" of a spelling mistake. Not acknowledging that says more about the pedant than the creator.
I think decomposition is the word, source separation in this case (misleadingly) referes to the fact that the decomposed notes can be separated into different sources.
I'm a long-time fan of Ultrastar Deluxe, which is an open-source clone of Singstar. This is a karaoke game where people compete by singing along to the tune. It recognizes the notes you are singing and compares them to a vocals-timings mapping file for that particular song. The better you sing to the tune (getting the words correct doesn't matter), the higher your score.
While there are extensive libraries of fan-made song mappings, it's never enough, and there are very few mapped songs in languages other than English or Spanish (if you or your friends prefer your native language). Doing the entire mapping manually is time-consuming, not to mention that I am almost tone-deaf myself, which would make it even more difficult. I have been wondering for a long time what software I could use to make this process easier to automate. This seems like a great tool for capturing vocal timings and notes from original songs.
I have it on my bucket list to create a Singstar playlist in my native language and host a singing party with friends.
Does anyone have suggestions for other similar tools?
He's in high school and pulls of a project like this. I thought I was slick convincing the 7-11 guy to give me my Twist-a-Pepper soda without charging me bottle deposit or tax.
I think it's really neat, but the results look like it could take more time to fix the output than using a manual approach (if really accurate results are required).
In fairness to the author, he is still at high school: https://matthew-bird.com/about.html
Amazing work for that age.
This is a partly solved problem right now. Some tracks and signal types can be unmixed easier than others, it depends on what the sources are and how much post-processing (reverb, side chaining, heavy brick wall limiting and so on)
I'd agree with the partly. I have yet to find one that either isolates an instrument as a separate file or removes one from the rest of the mix that does not negatively impact the sound. The common issues I hear are similar to the early internet low bit rate compression. The new "AI" versions are really bad at this, but even the ones available before the AI craze were still susceptible
That is: frequencies from one instrument will virtually always overlap with another one (including vocals), especially considering harmonics.
Any kind of separation will require some pretty sophisticated "reconstruction" it seems to me, because the operation is inherently destructive. And then the problem becomes one of how faithful the "reproduction" is.
This feels pretty similar to the inpainting/outpainting stuff being done in generative image editing (a la Photoshop) nowadays, but I don't think anywhere near the investment is being made in this field.
Very interested to hear anyone with expertise weigh in!
1) used PixBim AI to extract "stems" (drums, bass, piano, all guitars, vocals). Obviously a lossless source like FLAC works better than MP3 here
2) imported the stems to ProTools.
3) from there, I will usually re-record the bass, guitars, pianos and vocals myself. Occassionally the drums as well.
This is a pretty good way I found to record covers of tracks at home, re-using the original drums if I want to, keeping the tempo of the original track intact etc. I can embellish/replace/modify/simplify parts that I re-record obviously.
It's a bit like drawing using tracing paper, you're creating a copy to the best of your ability, but you have a guide underneath to help you with placement.
There's a bunch of other stuff that happens during and after summing which makes it much harder to reliably 100% reverse that process.
I was sincere when I said:
> I'm really not sure how one solves this with the current approaches.
I was hoping someone would come along and say it is, in fact, possible. :)
Audio Decomposition [Blind Source Seperation]
>Open source seperation of music into constituent instruments.
> The title is a bit confusing as open-source separation of ... reads like source separation, which this is not.
There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:
https://github.com/EleutherAI/aria-amt
Full disclaimer: I am the author of the above repo.
Scores are interpreted by musicians to create a performance, and MIDI is a capture of (some of) the data about that performance. Music engraving is full of implicit and explicit cultural rules, and getting it _right_ has parallels with handwritten kanji script in terms of both the importance of correctness to the reader, and the amount of traps for the unwary or uncultured.
All of which can be taken to mean "classical musicians are incredibly picky and anal about this stuff", or, "well-formed music notation conveys all sorts of useful contextual information beyond simply 'what note to play when'".
This is absolutely not easy, though, given all the cultural context. Things like picking up a "legato" or "cantabile" marking and choosing an accent vs a dagger or a marcato mark are going to be very difficult no matter what.
https://replicate.com/turian/multi-task-music-transcription
I ported their colab to runtime so I could use it more easily.
The MIDI output is... puzzling?
I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.
Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.
It's just pitch bend?
I think trying to transcribe as MIDI is just a fundamentally flawed approach that has too many (well known) pitfalls to be useful.
A trained human can listen to a piece and transcribe it in seconds, but programming it as MIDI could take minutes/hours. If you're not trying to replicate how humans learn by ear, you're probably approaching this wrong.
https://magenta.tensorflow.org/datasets/maestro
Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:
https://x.com/loubbrad/status/1794747652191777049
https://hitnmix.com/ripx-daw-pro/
It can even export the separated tracks as midi files. It still has some problems but works very well. Stem separation is now standard in the musical software and almost every DAW provides it.
I find moises (https://moises.ai/) to be easy to use for the tasks I need to do. It allows transposing or time scaling the entire song. It does stem separation and has a simple interface for muting and changing the volume on a per-track basis. It auto-detects the beat and chords.
I'm not affiliated, just a happy nearly-daily user for learning and practicing songs. I boost the original bass part and put everything else at < 10% volume to hear the bass part clearly clearly (which often shows how bad online transcriptions are, even paid ones). Once once I know the part, I mute the bass part and play along with the original song as if I was the bass player.
I wonder why pricing information is so hard to find these days. Would like to get an idea of the same.
0: https://www.stemroller.com/
It's an up and coming feature that nearly every DAW should have, but most don't yet.
Ableton Live - No
Bigwig - No
Cubase - No
FL - Yes
Logic - Yes
Pro Tools - No
Reason - No
Reaper - No
Studio One - Yes
Mixcraft - Yes
Maschine3 - Yes
https://github.com/samim23/polymath
Polymath is effective at isolating and extracting individual instrument tracks from MP3s. It works very well.
Trumpets produce a rich harmonic series with strong overtones, meaning their Fourier transform would show prominent peaks at integer multiples of the fundamental frequency. Instruments like flutes have more pure tones, but brass instruments typically have stronger higher harmonics, which would lead to more complex partial derivatives in the matrix equation shown in the article
So this script uses bandpass filtering and cross-correlation of attack/release envelopes to identify note timing. Given that brass instruments can exhibit non-linear behavior where the harmonic content changes significantly with playing intensity (think of the brightness difference between pp and ff passages), not sure how would this algorithm could handle intensity-dependent timbral variations. I'd consider adding intensity-dependent Fourier templates for each instrument to improve accuracy
In these situations (experimental music) source separation will produce completely unpredictable results, thst may or may not be useful for musical rebalancing.
https://en.wikipedia.org/wiki/Audiosurf
Edit: to clarify, source separation in audio research means separating out the audio into separate clips.
While there are extensive libraries of fan-made song mappings, it's never enough, and there are very few mapped songs in languages other than English or Spanish (if you or your friends prefer your native language). Doing the entire mapping manually is time-consuming, not to mention that I am almost tone-deaf myself, which would make it even more difficult. I have been wondering for a long time what software I could use to make this process easier to automate. This seems like a great tool for capturing vocal timings and notes from original songs.
I have it on my bucket list to create a Singstar playlist in my native language and host a singing party with friends.
Does anyone have suggestions for other similar tools?
Sounds like the text file needs vocals and pitches along with time stamps. AI is getting there to allow automating it's creation.
For myself: Adding a link I just found for reading further.
https://www.reddit.com/r/karaoke/comments/x61kzy/modern_equi...