Removing 'um' from a recording is harder than it sounds

(doug.sh)

17 points | by dougcalobrisi 2 hours ago

7 comments

cadamsdotcom 7 minutes ago
What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.
Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
rindalir 43 minutes ago
This is fascinating! I'm going to try this on a certain clip from Jurassic Park.
sciencesama 15 minutes ago
There is an "ah counter" role at Toastmasters! This is the software that helps!
cryptoz 7 minutes ago
Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.
sublinear 2 minutes ago
Disfluencies are not necessarily "filler". They can convey mood or hesitation. Cutting them can change the meaning.
A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!
dougcalobrisi 1 hour ago
This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Stripping ums isn't a find-and-replace job, because Whisper's timestamps can be off by up to a few hundred ms, and cutting on them chops syllables or leaves stutters. So I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.
https://github.com/dougcalobrisi/erm
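To make the "snap the cuts to silence" idea concrete, here is a rough Python sketch. It is not taken from erm and may differ from what the tool actually does: it just takes a rough (start, end) pair, such as a Whisper word timestamp, and moves each boundary to the quietest short frame nearby, measured by RMS energy. The function names, the 0.3 s search window, and the 10 ms frame size are all assumptions made for illustration.

```python
import math


def frame_rms(samples, start, size):
    """RMS energy of samples[start:start + size] (0.0 for an empty frame)."""
    frame = samples[start:start + size]
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def snap_to_quietest(samples, rate, t, window_s=0.3, frame_s=0.01):
    """Return the centre time of the quietest frame within +/- window_s of t.

    samples: mono audio as a list of floats; rate: samples per second.
    The window and frame sizes are illustrative guesses, not tuned values.
    """
    size = max(1, int(rate * frame_s))
    lo = max(0, int((t - window_s) * rate))
    hi = min(len(samples) - size, int((t + window_s) * rate))
    if hi < lo:
        return t  # nothing to search (clip too short); keep the rough time
    best = min(range(lo, hi + 1, size),
               key=lambda i: frame_rms(samples, i, size))
    return (best + size / 2) / rate


def snap_cut(samples, rate, start, end, window_s=0.3):
    """Snap both boundaries of a rough cut to nearby low-energy points."""
    return (snap_to_quietest(samples, rate, start, window_s),
            snap_to_quietest(samples, rate, end, window_s))
```

A real version would also need to make sure the snapped boundary doesn't swallow part of a neighbouring word, and the splice itself (ffmpeg in the post) is left out here.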
bagvader 50 minutes ago
[flagged]