What an awesome tool and idea. I’d be keen to see if it can integrate with video editing tools.
Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
Really cool stuff and definitely going to try it; I’m also finding it wild that Google put effort into adding ums and erms into their text to speech model a while back. AI puts it in, AI helps take it out.
This post is mostly about how surprisingly hard it is to cut filler words out of speech cleanly. Apparently, stripping ums isn't a find and replace type thing, because Whisper's timestamps are off by up to a few hundred ms and cutting on them chops syllables or leaves stutters. So, I built a tool, erm, that starts from Whisper's guess, finds where each word actually starts and stops in the audio, and snaps the cuts to silence so there's no click, with ffmpeg doing the splicing.
Ideally it would slice the video in the timeline without actually removing anything, so you can scrub through your video and try with and without each disfluency (thank you - awesome word) & decide case by case which to keep!
A trivial example is "umm... well... (sigh) okay" versus just "okay". Not okay!
https://github.com/dougcalobrisi/erm