FFmpeg 8.0 adds Whisper support

(code.ffmpeg.org)

1033 points | by rilawa 274 days ago

56 comments

  • kmfrk 274 days ago
    Whisper is genuinely amazing - with the right nudging. It's the one AI thing that has genuinely turned my life upside-down in an unambiguously good way.

    People should check out Subtitle Edit (and throw the dev some money) which is a great interface for experimenting with Whisper transcription. It's basically Aegisub 2.0, if you're old, like me.

    HOWTO:

    Drop a video or audio file onto the right-hand window, then go to Video > Audio to text (Whisper). I get the best results with Faster-Whisper-XXL. Use large-v2 if you can (v3 has some regressions), and you've got an easy transcription and translation workflow. The results aren't perfect, but Subtitle Edit is built for cleaning up imperfect transcripts, with features like Tools > Fix common errors.

    EDIT: Oh, and if you're on the current gen of Nvidia cards, you might have to add "--compute_type float32" to make the transcription run correctly. I think the error mentions an empty file or empty output, something like that.

    EDIT2: And if you get another error, possibly about whisper.exe, iirc I had to reinstall the Torch libs from a specific index with something along these lines (depending on whether you use pip or uv):

        pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
        uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
    If you get the errors and the above fixes work, please type your error message in a reply with what worked to help those who come after. Or at least the web crawlers for those searching for help.

    https://www.nikse.dk/subtitleedit

    https://www.nikse.dk/donate

    https://github.com/SubtitleEdit/subtitleedit/releases

    • notatallshaw 274 days ago
      > uv pip install --system torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

      uv has a feature to get the correct version of torch based on your available cuda (and some non-cuda) drivers (though I suggest using a venv not the system Python):

      > uv pip install torch torchvision torchaudio --torch-backend=auto

      More details: https://docs.astral.sh/uv/guides/integration/pytorch/#automa...

      This also means you can safely mix torch requirements with non-torch requirements as it will only pull the torch related things from the torch index and everything else from PyPI.
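
      In a fresh venv that looks something like this (the torch trio is just the example from above):

          uv venv
          source .venv/bin/activate
          uv pip install torch torchvision torchaudio --torch-backend=auto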

      • xrd 273 days ago
        I love uv and really feel like I only need to know "uv add" and "uv sync" to be effective using it with Python. That's an incredible feat.

        But, when I hear about these kinds of extras, it makes me even more excited. Getting CUDA and torch to work together is something I have struggled with countless times.

        The team at Astral should be nominated for a Nobel Peace Prize.

        • danudey 273 days ago
          > "uv add"

          One life-changing thing I've been using `uv` for:

          System python version is 3.12:

              $ python3 --version
              Python 3.12.3
          
          A script that requires a library we don't have, and won't work on our local python:

              $ cat test.py
              #!/usr/bin/env python3
          
              import sys
              from rich import print
          
              if sys.version_info < (3, 13):
                  print("This script will not work on Python 3.12")
              else:
                  print(f"Hello world, this is python {sys.version}")
          
          It fails:

              $ python3 test.py
              Traceback (most recent call last):
              File "/tmp/tmp/test.py", line 10, in <module>
                  from rich import print
              ModuleNotFoundError: No module named 'rich'
          
          Tell `uv` what our requirements are:

              $ uv add --script=test.py --python '3.13' rich
              Updated `test.py`
          
          `uv` updates the script:

              $ cat test.py
              #!/usr/bin/env python3
              # /// script
              # requires-python = ">=3.13"
              # dependencies = [
              #     "rich",
              # ]
              # ///
          
              import sys
              from rich import print
          
              if sys.version_info < (3, 13):
                  print("This script will not work on Python 3.12")
              else:
                  print(f"Hello world, this is python {sys.version}")
          
          `uv` runs the script, after installing packages and fetching Python 3.13:

              $ uv run test.py
              Downloading cpython-3.13.5-linux-x86_64-gnu (download) (33.8MiB)
              Downloading cpython-3.13.5-linux-x86_64-gnu (download)
              Installed 4 packages in 7ms
              Hello world, this is python 3.13.5 (main, Jun 12 2025, 12:40:22) [Clang 20.1.4 ]
          
          And if we run it with Python 3.12, we can see that it complains:

              $ uv run --python 3.12 test.py
              warning: The requested interpreter resolved to Python 3.12.3, which is incompatible with the script's Python requirement: `>=3.13`
              Installed 4 packages in 7ms
              This script will not work on Python 3.12
          
          Works for any Python you're likely to want:

              $ uv python list
              cpython-3.14.0b2-linux-x86_64-gnu                 <download available>
              cpython-3.14.0b2+freethreaded-linux-x86_64-gnu    <download available>
              cpython-3.13.5-linux-x86_64-gnu                   /home/dan/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/bin/python3.13
              cpython-3.13.5+freethreaded-linux-x86_64-gnu      <download available>
              cpython-3.12.11-linux-x86_64-gnu                  <download available>
              cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3.12
              cpython-3.12.3-linux-x86_64-gnu                   /usr/bin/python3 -> python3.12
              cpython-3.11.13-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.11.13-linux-x86_64-gnu/bin/python3.11
              cpython-3.10.18-linux-x86_64-gnu                  /home/dan/.local/share/uv/python/cpython-3.10.18-linux-x86_64-gnu/bin/python3.10
              cpython-3.9.23-linux-x86_64-gnu                   <download available>
              cpython-3.8.20-linux-x86_64-gnu                   <download available>
              pypy-3.11.11-linux-x86_64-gnu                     <download available>
              pypy-3.10.16-linux-x86_64-gnu                     <download available>
              pypy-3.9.19-linux-x86_64-gnu                      <download available>
              pypy-3.8.16-linux-x86_64-gnu                      <download available>
              graalpy-3.11.0-linux-x86_64-gnu                   <download available>
              graalpy-3.10.0-linux-x86_64-gnu                   <download available>
              graalpy-3.8.5-linux-x86_64-gnu                    <download available>
        • eigenvalue 273 days ago
          They’ve definitely saved me many hours of wasted time between uv and ruff.
        • j45 273 days ago
          Agreed, making the virtual environment management and so much else disappear lets so much more focus go to python itself.
      • spagettnet 272 days ago
        Of all the great things people say about UV, this is the one that sold me on it when I found this option in the docs. Such a nice feature.
    • tossit444 274 days ago
      Aegisub is still actively developed (as a fork), and imo the two can't really be compared to one another. They can complement each other, but SE is much better for actual transcription. Aegisub still does the heavy lifting for typesetting and the like.
    • jokethrowaway 273 days ago
      whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great - but Nemo Parakeet (pretty much whisper by nvidia) completely changed how I interact with the computer.

      It enables dictation that actually works, and it's as fast as you can think. I also have a set of scripts which just wait for voice commands and do things. I can pipe the results to an LLM, run commands, and synthesize a voice response with F5-TTS; it's like having a local Jarvis.

      The main limitation is that it's English-only.

      • threecheese 273 days ago
        Would you share the scripts?
        • ec109685 273 days ago
          Or at least more details. Very cool!
      • forgingahead 273 days ago
        Yeah, mind sharing any of the scripts? I looked at the docs briefly, and it looks like we need to install ALL of NeMo to get access to Parakeet? Seems ultra heavy.
        • rhdunn 273 days ago
          You only need the ASR bits -- this is where I got to when I previously looked into running Parakeet:

              # NeMo does not run on 3.13+
              python3.12 -m venv .venv
              source .venv/bin/activate
          
              git clone https://github.com/NVIDIA/NeMo.git nemo
              cd nemo
          
              pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
              pip install .[asr]
          
              deactivate
          
          Then run a transcribe.py script in that venv:

              import sys
              import nemo.collections.asr as nemo_asr

              model_path = sys.argv[1]
              audio_path = sys.argv[2]

              # Load from a local path...
              asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)

              # ...or download from huggingface ('org/model'):
              # asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)

              output = asr_model.transcribe([audio_path])
              print(output[0])
          
          With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

          You'll need to modify the Python script to process the response and output it in a format you can use.
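
          For example, something along these lines for SRT output. Note that timestamps=True and the hypothesis fields are how I remember NeMo's ASR API, so verify them against your NeMo version:

              # Sketch: write the transcription out as an .srt file.
              # Assumes transcribe(..., timestamps=True) returns hypotheses whose
              # .timestamp["segment"] is a list of {"start", "end", "segment"} dicts;
              # check this against your NeMo version.

              def to_srt_time(seconds):
                  h, rem = divmod(seconds, 3600)
                  m, s = divmod(rem, 60)
                  return f"{int(h):02}:{int(m):02}:{int(s):02},{int((s % 1) * 1000):03}"

              hyps = asr_model.transcribe([audio_path], timestamps=True)
              with open("out.srt", "w") as srt:
                  for i, seg in enumerate(hyps[0].timestamp["segment"], start=1):
                      srt.write(f"{i}\n")
                      srt.write(f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
                      srt.write(seg["segment"] + "\n\n")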

    • pawelduda 274 days ago
      Can you give an example why it made your life that much better?
      • 3036e4 273 days ago
        I used it like the sibling commenter to get subtitles for downloaded videos. My hearing is bad. Whisper seems much better than YouTube's built-in auto-subtitles, so sometimes it is worth the extra trouble for me to download a video just to generate good subtitles and then watch it offline.

        I also used whisper.cpp to transcribe all my hoarded podcast episodes. Took days of my poor old CPU working at 100% on all cores (and then a few shorter runs to transcribe new episodes I have downloaded since). Worked as well as I could possibly have hoped. Of course it gets the spelling of names wrong, but I don't expect anything (or anyone) to do much better. It is great to be able to run ripgrep to find old episodes on some topic, and sometimes now I read an episode instead of listening, or listen to it with mpv with subtitles.
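
        In case it saves anyone some fiddling, my runs looked roughly like this (paths and model are whatever you have locally; whisper.cpp wants 16 kHz mono WAV, and the binary is called main in older checkouts):

            ffmpeg -i episode.mp3 -ar 16000 -ac 1 episode.wav
            ./build/bin/whisper-cli -m models/ggml-large-v2.bin -f episode.wav -otxt -osrt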

        • peterleiser 273 days ago
          You'll probably like Whisper Live and its browser extensions: https://github.com/collabora/WhisperLive?tab=readme-ov-file#...

          Start playing a YouTube video in the browser, select "start capture" in the extension, and it starts writing subtitles in white text on a black background below the video. When you stop capturing you can download the subtitles as a standard .srt file.

        • theshrike79 273 days ago
          This, but I want a summary of the 3-hour video before spending the time on it.

          Download -> generate subtitles -> feed to AI for summary works pretty well
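
          Roughly, assuming yt-dlp and the reference whisper CLI (the last step is whatever LLM interface you use):

              yt-dlp -x --audio-format mp3 -o "talk.%(ext)s" "$URL"
              whisper talk.mp3 --model large-v2 --output_format srt
              # then feed talk.srt to your LLM of choice for the summary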

      • kmfrk 273 days ago
        Aside from accessibility as mentioned, you can catch up on videos that are hours long. Orders of magnitude faster than watching on 3-4x playback speed. If you catch up through something like Subtitle Edit, you can also click on relevant parts of the transcript and replay it.

        But transcribing and passably translating everything goes a long way too. Even if you can hear what's being said, it's still less straining when there are captions for it.

        Obviously one important factor in the convenience is how fast your computer is at transcription or translation. I don't currently use the features in real time myself, although I'd like to if a great UX comes along in other software.

        There's also a great podcast app opportunity here I hope someone seizes.

      • shrx 274 days ago
        As a hard of hearing person, I can now download any video from the internet (e.g. youtube) and generate subtitles on the fly, not having to struggle to understand badly recorded or unintelligible speech.
        • dylan604 273 days ago
          If the dialog is badly recorded or the speech unintelligible, how would a transcription process get it correct?
          • gregoryl 273 days ago
            Because it can use the full set of information in the audio - people with hearing difficulties cannot. Also interesting: people with perfectly functional hearing but who have "software" bugs (e.g. I find it extremely hard to process voices with significant background noise) can also benefit :)
            • spauldo 273 days ago
              I have that issue as well - I can hear faint noises OK but if there's background noise I can't understand what people say. But I'm pretty sure there's a physical issue at the root of it in my case. The problem showed up after several practice sessions with a band whose guitarist insisted on always playing at full volume.
              • gregoryl 273 days ago
                I'd love your thoughts on why it might be hardware. I reason that my hearing is generally fine - there's no issue picking apart loud complex music (I love breakcore!).

                But play two songs at the same time, or try talking to me with significant background noise, and I seem to be distinctly impaired vs. most others.

                If I concentrate, I can sometimes work through it.

                My uninformed model is a pipeline of sorts, and some sort of pre-processing isn't turned on. So the stuff after it has a much harder job.

                • spauldo 273 days ago
                  I don't have much beyond what I said. It happened to me after repeated exposure to dangerously loud sounds in a small room. I can hear faint sounds, but I have trouble with strong accents and I can't understand words if there's a lot of background noise. I noticed it shortly after I left that band, and I left because the last practice was so loud it felt like a drill boring into my ears.

                  I don't think I have any harder time appreciating complex music than I did before, but I'm more of a 60s-70s rock kinda guy and a former bass player, so I tend to focus more on the low end. Bass tends to be less complex because you can't fit as much signal into the waveform without getting unpleasant muddling.

                  And of course, just because we have similar symptoms doesn't mean the underlying causes are the same. My grandfather was hard of hearing so for all I know it's genetic and the timing was a coincidence. Who knows?

                  • ddingus 273 days ago
                    It seems to me your ability to discriminate has been impacted.

                    I have always pictured it working this way:

                    In the Cochlea, we have all the fine hair-like sensors. The spread of them determines our range of frequencies, and this declines with age. Usually not too much, but it could be as much as half, down to 10 to 12 kHz.

                    The good news in that is all the good stuff we crave is below 10 kHz. Don't sweat age-related hearing loss too much.

                    The number of these sensors determines our ability to hear concurrent sounds, or complexity.

                    The shape of them impacts how loud sounds need to be to be heard.

                    Chances are, your loud exposure had harmonics that impacted many of these sensing hairs, but not in one place. The result is a loss of discrimination of concurrent sounds.

                    There are plenty to cover the frequency range, so things do not seem muffled or low. Their shape is good, not worn so you hear faint sounds well.

                    The lower number of them is the issue. Or, they are still there, just bent-- something prevents them from contributing.

                    Another way to think of this is in reverse:

                    Say you had 30 oscillators you could start at any frequency and time. How complex of a sound could you make? Now cut that in half.

                    What is lost?

                    The most complex, concurrent sound cases.

              • dylan604 273 days ago
                > I have that issue as well

                You say issue, I say feature. It's a great way to just ignore boring babbling at parties or other social engagements where you're just not that engaged. Sort of like selective hearing in relationships, but used on a wider audience

                • enneff 273 days ago
                  I don’t mean to speak for OP, but it strikes me as rude to make light of someone’s disability in this way. I’d guess it has caused them a lot of frustration.
                  • dylan604 273 days ago
                    Your assumption leads you to believe that I do not also suffer from the same issue. Ever since I was in a t-bone accident and the side airbag went off right next to my head, I have a definite issue hearing voices in crowded and noisy rooms with poor sound insulation. Some rooms are much worse than others.

                    So when I say I call it a feature, it's something I actually deal with unlike your uncharitable assumption.

                    • jhy 273 days ago
                      Sometimes, late at night when I'm trying to sleep, and I hear the grumble of a Harley, or my neighbors staggering to their door, I wonder: why do we not have earflaps, like we do eyelids?
                • spauldo 273 days ago
                  It's not so great when I'm standing right next to my technician in a pumphouse and I can't understand what he's trying to say to me.
          • mschuster91 273 days ago
            The definition of "unintelligible" varies by person, especially by accent. Like, I've got no problem understanding the average person from Germany... but someone from the deep backwaters of Saxony, forget about that.
        • 3036e4 273 days ago
          I did this as recently as today, for that reason, using ffmpeg and whisper.cpp. But not on the fly. I ran it on a few videos to generate VTT files.
      • joshvm 273 days ago
        I don't know about much better, but I like Whisper's ability to subtitle foreign language content on YouTube that (somehow) doesn't have auto-generated subs. For example some relatively obscure comedy sketches from Germany where I'm not quite fluent enough to go by ear.

        10 years ago you'd be searching through random databases to see if someone had synchronized subtitles for the exact copy of the video that you had. Or older lecture videos that don't have transcripts. Many courses had to provide them to comply with federal funding, but not all. And lots of international courses don't have this requirement at all (for example some great introductory CS/maths courses from German + Swiss institutions). Also think about taking this auto-generated output and then generating summaries for lecture notes, reading recommendations - this sort of stuff is what LLMs are great at.

        You can do some clever things like take the foreign sub, have Whisper also transcribe it, and then ask a big model like Gemini to go line by line and check the translation to English. This can include accounting for common transcription errors or idiomatic differences between languages. I do it in Cursor to keep track of what the model has changed and for easy rollback. It's often good enough to correct mis-heard words that would be garbled through a cheaper model. And you can even query the model to ask why a particular translation was made and what would be a more natural way to say the same thing. Sometimes it even figures out jokes. It's not a fast or fully automatic process, but the quality can be extremely good if you put some time into reviewing.

        Having 90% of this be possible offline/open access is also very impressive. I've not tried newer OSS models like Qwen3 but I imagine it'd do a decent job of the cleanup.

        • randomflyer20 273 days ago
          this is similar to what you are saying: https://x.com/thekrishdesai/status/1955390536422134109
          • joshvm 270 days ago
            I forget which package I used, but it runs in Docker and can output a sub file directly (and it can auto-translate). Usually I generate the native language + English to compare, since the native generally has better transcription, but it helps the models if they have a decent translation to start from.
    • taminka 273 days ago
      whisper is great, i wonder why youtube's auto-generated subs are still so bad? even the smallest whisper is way better than google's solution. is it a licensing issue? harder to deploy at scale?
      • briansm 273 days ago
        I believe YouTube still uses 40 mel-scale vectors as feature data; Whisper uses 80, which provides finer spectral detail but is naturally more computationally intensive to process. Modern hardware allows for that.
      • ec109685 273 days ago
        You’d think they’d use the better model at least for videos with large view counts (they already do that when deciding compression optimizations).
    • BrunoJo 273 days ago
      Subtitle Edit is great if you have the hardware to run it. If you don't have GPUs available or don't want to manage the servers, I built a simple-to-use and affordable API that you can use instead: https://lemonfox.ai/
    • codedokode 273 days ago
      Kdenlive also supports auto-generating subtitles, which need some editing, but it is faster than creating them from scratch. Actually I would be happy even with a simple voice detector so that I don't have to set the timings manually.
    • kanemcgrath 273 days ago
      Subtitle Edit is great, and their subtitle library libse was exactly what I needed for a project I did.
    • throwoutway 273 days ago
      I found this online demo of it: https://www.nikse.dk/subtitleedit/online
    • Morizero 273 days ago
      You don't happen to know a whisper solution that combines diarization with live audio transcription, do you?
      • peterleiser 273 days ago
        Check out https://github.com/jhj0517/Whisper-WebUI

        I ran it last night using docker and it worked extremely well. You need a HuggingFace read-only API token for the Diarization. I found that the web UI ignored the token, but worked fine when I added it to docker compose as an environment variable.

      • jduckles 273 days ago
        WhipserX's diarization is great imo:

            whisperx input.mp3 --language en --diarize --output_format vtt --model large-v2
        
        Works a treat for Zoom interviews. Diarization is sometimes a bit off, but generally it's correct.
        • Morizero 273 days ago
          > input.mp3

          Thanks but I'm looking for live diarization.

      • kmfrk 273 days ago
        Proper diarization remains a white whale for me, unfortunately.

        Last I looked into it, the main options required API access to external services, which put me off. I think it was pyannote.audio[1].

        [1]: https://github.com/pyannote/pyannote-audio

    • hart_russell 273 days ago
      Is there a way to use it to generate a srt subtitle file given a video file?
      • prurigro 273 days ago
        It generates a few formats by default including srt
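
        With the new ffmpeg filter specifically, something along these lines should write an .srt directly (option names per the ffmpeg-filters documentation; the model path is whatever ggml model you've downloaded for whisper.cpp):

            ffmpeg -i video.mp4 -vn -af "whisper=model=ggml-base.en.bin:language=en:destination=subs.srt:format=srt" -f null -
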
    • guluarte 273 days ago
      you can install using winget or chocolatey

          winget install --id=Nikse.SubtitleEdit  -e
  • Lio 274 days ago
    Once local transcription is in more places, hopefully we can persuade content creators not to burn bouncing subtitles into their videos.

    I've seen professionally produced recordings on dry and technical subjects, with good sound quality, where they've decided to use distracting subtitles with no way to disable them.

    It seems so unnecessary if you're not making novelty videos about cats.

    Also, local transcription allows for automatic translation, and again, overlaying subtitles on top of an existing burnt-in set is a really poor reading experience.

    • ambicapter 274 days ago
      They do that because it increases “engagement”, not because they care about the user’s experience with the subtitles.
      • iAMkenough 273 days ago
        Also some social media platforms don't offer subtitle functionality, so burned-in is the only way if you want to serve your content to people that require subtitles or refuse to unmute their phones while they watch from their toilet.
      • anchpop 273 days ago
        I did that (distracting subtitles) on one of my videos and it had a very negative response. I won't do it again, but I was puzzled because I find it much nicer than the traditional subtitle format personally. It's easier for my brain to focus on. (And no one in my test audience minded.)
        • appease7727 270 days ago
          Subtitles are very explicitly not something you're meant to engage with or focus on, which is why people hate it when you make the subtitles more "engaging" than the content of the video. If you want people to focus on your subtitles, you should write a blog instead of making a video.

          Subtitles are an accessibility feature. They are meant to stay out of the way and add to, not detract from, the video content. They are meant to be subtle and only visible if you need to look at them.

        • TsiCClawOfLight 273 days ago
          Do you happen to have ADHD? That might explain the discrepancy :)
    • jiehong 273 days ago
      Those burned in subtitles still aren’t as cool as theme-matched anime subtitles during intro music sequences from fansubs 15 years ago.

      Those are still cool IMO

      • trenchpilgrim 273 days ago
        Or how the fansubbers will create masks to translate diegetic text like signage and written notes
        • mattxxx 273 days ago
          also love when a fansubber will just outright give you an asterisk explaining a joke that relies on nuance or wordplay
      • freddie_mercury 273 days ago
        I recently discovered that the Internet Archive has the Tomodachi fansubs of Fushigi Yugi which, at least in my experience, were the most famous example of that technique.

        https://archive.org/details/tomodachi-fushigi-yugi-vhsrip

    • whywhywhywhy 274 days ago
      The algorithm boosts it; that’s why they do it. Even if every device had real-time, 100% accurate subtitling built in, they’d still do it if the video performs better with it.
    • HPsquared 274 days ago
      The other problem with burned-in subtitles is you can't change the language.
      • LorenDB 274 days ago
        The other other problem with burned-in subtitles is that they normally have horrible formatting. Who wants to try to read single words that only flash on-screen while they are being spoken?
      • rkomorn 274 days ago
      True, but (as someone who not infrequently has to rewind content on just about all streaming apps because it decided one particular subtitle only needed to be displayed for less than 200ms this time around) sometimes burned-in seems like a good idea.

        I don't understand why the problem seems so pervasive (I've seen it on Netflix, Viki, and Apple TV, at least) and so transient.

        • t-3 273 days ago
          It's a newer problem IME, so I'd guess it's caused by people using auto-transcription/translation tools to generate subtitles. With Chinese content, for example, I'll see stuff on Viki where the OG Mandarin subs are formatted sanely and the English is piecemeal follow-the-audio style. I can't imagine this happening in any other way than use of a transcription+translation tool without review.
          • rkomorn 273 days ago
            I don't think it's an automation-related thing. It happens even on big name shows on big apps.

            I think it's a toolkit thing where some sort of event or timer goes off at the wrong time and the subtitles get cleared when they shouldn't. And then if you rewind and replay, it doesn't happen again (because spurious event/timer issue).

            • t-3 273 days ago
              At least with vtt and srt, the chunk of text displayed is explicitly associated with a chunk of time, so something like that really shouldn't be happening. Maybe there is some sort of subtitle-writing on the fly like what is sometimes done with transcoding video, but that would be really strange for a plaintext format that is so light compared to the video and audio coming with it.
              • rkomorn 273 days ago
                > so something like that really shouldn't be happening

                I don't disagree, yet here we are. It's got race condition vibes.

                I don't know if it's related to the TV OS (LG WebOS in our case) but I guess that would be the common factor since it happens across multiple apps and languages.

                Anyway, it's quirky and occasionally annoying, but that's about it. :)

    • absoflutely 273 days ago
      I think this trend is partially driven by the silent autoplay that happens on YouTube. Baked-in subtitles help draw people into the video.
    • preisschild 274 days ago
      They could also just upload those transcriptions as normal closed-captioning srt subtitles...
      • jimkleiber 274 days ago
        not all social media will show subtitles/captions tho, which is the challenge. YouTube Shorts, TikTok videos, IG reels, FB reels, Whatsapp statuses, and more. I think some allow cc but some don't, and if someone reshares to another platform, it may not be there, so some of us burn them in begrudgingly :-)
    • dzhiurgis 274 days ago
      It's just so annoying how someone like Netflix offers like 3-4 languages for most of its content when you can basically get it for free via browser extensions (if you watch in a browser).

      Must be a union thing.

      • dewey 274 days ago
        That Netflix, which would need to pay to license more subtitles, can't compete with pirated or unlicensed auto-generated subtitles shouldn't really be a surprise.

        It's also annoying that you have to pay for Netflix when you can get the same movies for free with less restrictions on a pirate site.

        • sam_lowry_ 273 days ago
          You mean, a sharing site? That is a site where someone benevolently shared a movie with me?
  • londons_explore 274 days ago
    Does this have the ability to edit historic words as more info becomes available?

    Eg. If I say "I scream", it sounds phonetically identical to "Ice cream".

    Yet the transcription of "I scream is the best dessert" makes a lot less sense than "Ice cream is the best dessert".

    Doing this seems necessary to have both low latency and high accuracy, and things like transcription on Android do it: you can see the guesses adjusting as you talk.

    • yvdriess 274 days ago
      A good opportunity to point people to the paper with my favorite title of all time:

      "How to wreck a nice beach you sing calm incense"

      https://dl.acm.org/doi/10.1145/1040830.1040898

      • abound 274 days ago
        For folks like me puzzling over what the correct transcription of the title should be, I think it's "How to recognize speech using common sense"
        • strken 274 days ago
          Thank you! "Calm incense" makes very little sense when said in an accent where calm isn't pronounced like com.
          • solardev 273 days ago
            How is calm pronounced in those accents?
            • strken 273 days ago
              In Australian English, calm rhymes with farm and uses a long vowel, while com uses a short vowel and would rhyme with prom. (I know this doesn't help much because some American accents also rhyme prom with farm).

              Consider the way "Commonwealth Bank" is pronounced in this news story: https://youtube.com/watch?v=MhkuHGRAAbg. An Australian English speaker would consider (most) Americans to be saying something like "Carmenwealth" rather than "Commonwealth". See also the pronunciation of dog vs father in https://www.goalsenglish.com/lessons/2020/5/4/australian-eng....

              It really ruins some poetry.

            • troad 272 days ago
              It's not 'calm' that differs, it's 'common'. Calm like palm, in all major accents.

              Traditionally, calm and com- have different vowels in English, but most North American accents merge com- into calm. All other major English accents retain the distinction.

              If you're American, try saying 'com' while rounding your lips. Or just listen to a recording of 'common' in an online dictionary from Britain or Australia. (Or lot, pot, spot, etc.)

              TLDR (simplified):

              US/Ca: (lot = palm) ≠ start

              UK/Au: lot ≠ (palm = start)

            • drited 273 days ago
              Cahm
              • solardev 273 days ago
                Like the "cam" in "camera"?
                • yokljo 273 days ago
                  I've been thinking about this for a minute, and I think if an American were to say "why", and take only the most open vowel sound from that word and put it between "k" and "m", you get a pretty decent Australian pronunciation. I am an Australian so I could be entirely wrong about how one pronounces "why".
                • appease7727 270 days ago
                  No, with a long vowel sound. Caaahm. The L is blended into the M so much that it's almost silent.

                  Unless you're specifically enunciating it. The common usage lacks the L sound, but it is acceptable to intentionally add it back in for disambiguation

            • Macha 273 days ago
              call-mm
        • wdaher 273 days ago
          This is the correct parsing of it. (I can't take credit for coming up with the title, but I worked on the project.)
        • codedokode 273 days ago
          I only got the "How to recognize" part. Also I think "using" should sound more like "you zinc" than "you sing".
        • efilife 274 days ago
          Thanks. Now I know that I'm not that stupid and this actually makes no sense
          • chipsrafferty 273 days ago
            It actually does make sense. Not saying you're stupid, but in standard English, if you say it quickly, the two sentences are nearly identical.
            • mjw_byrne 273 days ago
              They're pretty different in British English; I struggled to figure it out until I started thinking about how it would sound with an American accent.
            • codedokode 273 days ago
              But in "you sing", "s" is pronounced as "s", not as "z" from "using", right?
              • squeaky-clean 273 days ago
                I pronounce using with an S unless I'm saying it very slowly
        • fiatjaf 274 days ago
          Thank you very much!
      • fmx 274 days ago
        The paper: https://sci-hub.st/https://dl.acm.org/doi/10.1145/1040830.10...

        (Agree that the title is awesome, by the way!)

      • Sophira 272 days ago
        Fun fact: I just could not work out what this was supposed to be, so I used Whisper (indirectly, via the FUTO Voice Input app on my phone) and repeated the sentence into it, and it came out with the 'correct' transcription of "How to recognize speech using common sense." on the first try.

        Of course, this is nothing like what I actually said, so... make your own mind up whether that is actually a correct transcription or not!

        I have a British accent, for the record.

      • xyse53 273 days ago
        My favorite is:

        "Threesomes, with and without blame"

        https://dl.acm.org/doi/10.1145/1570506.1570511

        (From a professor I worked with a bit in grad school)

      • ThinkingGuy 273 days ago
        Also relevant: The Two Ronnies - "Four Candles"

        https://www.youtube.com/watch?v=gi_6SaqVQSw

      • brcmthrowaway 273 days ago
        Does AI voice recognition still use Markov models for this?
        • sva_ 273 days ago
          Whisper uses an encoder-decoder transformer.
    • DiogenesKynikos 274 days ago
      This is what your brain does when it processes language.

      I find that in languages I don't speak well, my ability to understand degrades much more quickly as the audio quality goes down. But in my native language, even with piss poor audio quality, my brain fills in the garbled words with its prior expectation of what those words should be, based on context.

      • mockingloris 274 days ago
        A slight segue to this: I was made aware of the phenomenon that the language you think in sets the constraints on how expansively the brain can think and parse information.

        Fortunately I think in English, an ever-evolving language that expands as the world does. That is compared to the majority of people where I'm from, for whom English was a second language they had to learn, and the people who taught them weren't well equipped with the resources to do a good job.

        └── Dey well; Be well

        • cyphar 273 days ago
          This is called linguistic relativity (née the Sapir-Whorf hypothesis), and the strong form you describe has fallen out of favour in modern linguistics.

          A surprising number of monolingual people think their own language is the most adaptable and modern language, but this is obviously untrue. All languages evolve to fit the needs of speakers.

          Also, the idea that people "think in language X" is heavily disputed. One obvious counterargument is that most people have experienced the feeling of being unable to express what they are thinking in words -- if you truly did think in the language you speak, how could this situation happen? My personal experience is that I do not actively hear any language in my head unless I actively try to think in it (at least, not since I was a teenager).

          (This is all ignoring the comments about ESL speakers that I struggle to read as anything but racism. As someone who speaks multiple languages, it astounds me how many people seem to think that struggling to express something in your non-native language means that you're struggling to think and are therefore stupid.)

          • sigbottle 273 days ago
            I think it's more like, you have a thought X, that has so many dimensions to it, but the way you serialize it to something that's actually discussable and comparable to other thoughts is language. And sometimes that language naturally loves slicing one part of that thought one way or the other.

            (then there's also a feedback loop type of argument, that always happens when discussing any sort of perception-reality distinction, but let's ignore that for now)

            At least for me, my brain is so bad and it's hard for me to truly hold a single thought in my head for a long time. Maybe it eventually settles into my subconscious but I don't really have a way to verify that.

          • codedokode 273 days ago
            My experience is that sometimes, for example when I watch a lecture in a foreign language, there are terms for which I don't know the correct translation, so I cannot think about or mention them in my native language even though I understand what they mean.
            • cyphar 273 days ago
              I was more focused on the experience of monolinguals (where this kind of explanation is impossible), but yes I also experience this fairly often as someone who speaks more than one language.
          • numpad0 273 days ago
            > if you truly did think in the language you speak, how could this situation happen?

            As far as how it happens to me is concerned, either something closer to speech than to raw thought reports back that the data in shared memory is invalid for the selected language, or I find there's no text representation for what I am trying to say.

            The "raw" thoughts work in the currently active language for me, so at least for me, the strong Sapir-Whorf hypothesis is not even a hypothesis, just a reasonable verbalization closely matching my own observations.

            I don't get why people can't take it, even in the age of LLMs. It is what it is, and that old guy is just never correct, even once.

          • _puk 271 days ago
            The way a French friend put it to me when they were learning English through immersion...

            I dreamt in English last night! Now I know I can speak the language!

            Being monolingual, and then trying to pick up another language later in life, a big struggle is not trying to "map" sentences and structure to what one already knows.

            This is why idioms often don't translate well.

    • Fluorescence 274 days ago
      It makes me curious about how human subtitlers or even scriptwriters choose to transcribe intentionally ambiguous speech, puns and narratively important mishearings. It's like you need to subtitle what is heard not what is said.

      Do those born profoundly deaf specifically study word sounds in order to understand/create puns, rhymes and such so they don't need assistance understanding narrative mishearings?

      It must feel like a form of abstract mathematics without the experiential component... but then I suspect mathematicians manufacture an experiential phenomenon with their abstractions, given their claims of a beauty like music... hmm!

      • 0cf8612b2e1e 273 days ago
        The quality of subtitles implies that almost no effort is being put into their creation. Watch even a high budget movie/TV show and be aghast at how frequently they diverge.
        • smallpipe 273 days ago
          A good subtitle isn't a perfect copy of what was said.
          • kstrauser 273 days ago
            Hard disagree. When I'm reading a transcript, I want word-for-word what the people said, not a creative edit. I want the speakers' voice, not the transcriptionist's.

            And when I'm watching subtitles in my own language (say because I want the volume low so I'm not disturbing others), I hate when the words I see don't match the words I hear. It's the quickest way I can imagine to get sucked out of the content and into awareness of the delivery of the content.

            • crazygringo 273 days ago
              I mean, subtitles are mostly the same.

              Sometimes they're edited down simply for space, because there wouldn't be time to easily read all the dialog otherwise. And sometimes repetition of words or phrases is removed, because it's clearer, and the emphasis is obvious from watching the moving image. And filler words like "uh" or "um" generally aren't included unless they were in the original script.

              Most interestingly, swearing is sometimes toned down, just by skipping it -- removing an f-word in a sentence or similar. Not out of any kind of puritanism, but because swear words genuinely come across as more powerful in print than they do in speech. What sounds right when spoken can sometimes look like too much in print.

              Subtitles are an art. Determining when to best time them, how to split up long sentences, how to handle different speakers, how to handle repetition, how to handle limited space. I used to want subtitles that were perfectly faithful to what was spoken. Then I actually got involved in making subtitles at one point, and was very surprised to discover that perfectly faithful subtitles didn't actually do the best job of communicating meaning.

              Fictional subtitles aren't court transcripts. They serve the purpose of storytelling, which is the combination of a visible moving image full of emotion and action, and the subtitles. Their interplay is complex.

              • nomdep 273 days ago
                Hard and vehemently disagree. Subtitles are not commentary tracks.

                The artists are the writers, voice actors, and everyone else involved in creating the original media. Never, ever should a random stranger contaminate it with his/her opinions or points of view.

                Subtitles should be perfect transcriptions or the most accurate translations, never reinterpretations.

                • crazygringo 273 days ago
                  Nobody said subtitles are commentary tracks.

                  And official subtitles aren't made by random strangers. They're made by people who do it professionally.

                  It's not "contamination" or "opinions", like somebody is injecting political views! And certainly not "reinterpretation". Goodness. It's about clarity, that's all.

                  Also there's no such thing as the "most accurate" translations. Translations themselves are an art, hugely.

            • creesch 273 days ago
              > When I'm reading a transcript

              That's the thing though, subtitles aren't intended as full transcripts. They are intended to allow a wide variety of people to follow the content.

              A lot of people read slower than they would hear speech. So subtitles often need to condense or rephrase speech to keep pace with the video. The goal is usually to convey meaning clearly within the time available on screen. Not to capture every single word.

              If they tried to be fully verbatim, you'd either have subtitles disappearing before most viewers could finish reading them or large blocks of text covering the screen. Subtitlers also have to account for things like overlapping dialogue, filler words, and false starts, which can make exact transcriptions harder to read and more distracting in a visual medium.

              I mean, yeah in your own native language I agree it sort of sucks if you can still hear the spoken words as well. But, to be frank, you are also the minority group here as far as subtitle target audiences go.

              And to be honest, if they were fully verbatim, I'd wager you quickly would be annoyed as well. Simply because you will notice how much attention they then draw, making you less able to actually view the content.

              • iczero 273 days ago
                I regularly enable YouTube subtitles. Almost always, they are a 100% verbatim transcription, excluding errors from auto-transcription. I am not annoyed in the slightest, and in fact I very much prefer that they are verbatim.

                If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

                • ben_w 273 days ago
                  > If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

                  And what are deaf people supposed to do in a cinema, or with broadcast TV?

                  (And I'm ignoring other uses, e.g. learning a foreign language; for that, sometimes you want the exact words, sometimes the gist, but it's highly situational; but even once you've learned the language itself, regional accents even without vocabulary changes can be tough).

                • creesch 273 days ago
                  > If you are too slow at reading subtitles, you can either slow down the video or train yourself to read faster. Or you can just disable the subtitles.

                  That's just tone deaf, plain and simple. I was not talking about myself, or just YouTube. You are not everyone else; your use case is not everyone else's use case. It really isn't that difficult.

                  • cwmoore 273 days ago
                    You made a bet and lost. Things are difficult.
                    • creesch 271 days ago
                      What even is this random ass reply? Are you a bot, or just confused?
            • stavros 273 days ago
              But then what about deliberate mishearings and ambiguous speech, like the GP said?
          • numpad0 273 days ago
            Aren't same-language subtitles supposed to be perfect literal transcripts, while cross-language subtitling is supposed to be compressed creative interpretations?
          • herbcso 273 days ago
            Tom Scott would agree with you. https://m.youtube.com/watch?v=pU9sHwNKc2c
      • dylan604 273 days ago
        I had similar thoughts when reading Huck Finn. It's not just phonetically spelled, it's much different. Almost like Twain came up with a list of words, and then had a bunch of 2nd graders tell him the spelling of words they had seen. I guess at some point, you just get good at bad spelling?
        • spauldo 273 days ago
          Writing in the vernacular, I believe it's called. I do something like that if I'm texting.

          The book "Feersum Endjinn" by Iain M. Banks uses something like this for one of its characters to quite good effect.

          • dylan604 273 days ago
            Except it forces me to slow down to "decypher" the text and makes the reading labored. I understand the point as it is part of the character, but it is easier to understand someone speaking in that vernacular vs reading the forced misspellings. I definitely don't want to get to the point of being good at reading it though. I wonder if this is how second grade teachers feel reading the class' schoolwork?
            • spauldo 273 days ago
              That's true. I'm sure Twain and Banks were aware of this, though. Apparently they considered the immersion to be worth a little extra work on the part of the reader. Whether the reader agrees is a different story.

              I try to limit my use of it to just enough for my accent and way of talking to bleed through. I don't go for full-on phonetics, but I'm often "droppin' my g's and usin' lotsa regional sayin's." It probably helps that the people I text have the same accent I do, though.

    • ph4evers 274 days ago
      Whisper works on 30-second chunks. So yes, it can do that, and that's also why it can hallucinate quite a bit.
      • jeroenhd 274 days ago
        The ffmpeg code seems to default to three second chunks (https://ffmpeg.org/ffmpeg-filters.html#whisper-1):

            queue
            
                 The maximum size that will be queued into the filter before processing the audio with whisper. Using a small value the audio stream will be processed more often, but the transcription quality will be lower and the required processing power will be higher. Using a large value (e.g. 10-20s) will produce more accurate results using less CPU (as using the whisper-cli tool), but the transcription latency will be higher, thus not useful to process real-time streams. Consider using the vad_model option associated with a large queue value. Default value: "3"
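
        So for offline (non-realtime) use, the docs' suggestion of a large queue plus VAD would look something like this (a sketch; paths are placeholders, and the silero file name is the VAD model shipped for whisper.cpp):

            ffmpeg -i input.mp4 -vn -af "whisper=model=ggml-base.en.bin:queue=20:vad_model=ggml-silero-v5.1.2.bin:destination=out.srt:format=srt" -f null -
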
        • londons_explore 274 days ago
          so if "I scream" is in one chunk, and "is the best dessert" is in the next, then there is no way to edit the first chunk to correct the mistake? That seems... suboptimal!

          I don't think other streaming transcription services have this issue since, whilst they do chunk up the input, past chunks can still be edited. They tend to use "best of N" decoding, so there are always N possible outputs, each with a probability assigned, and as soon as one word is the same in all N outputs then it becomes fixed.

          The internal state of the decoder needs to be duplicated N times, but that typically isn't more than a few kilobytes of state so N can be hundreds to cover many combinations of ambiguities many words back.

          • miki123211 274 days ago
            The right way to do this would be to use longer, overlapping chunks.

            E.g. do transcription every 3 seconds, but transcribe the most recent 15s of audio (or less if it's the beginning of the recording).

            This would increase processing requirements significantly, though. You could probably get around some of that with clever use of caching, but I don't think any (open) implementation actually does that.
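
            A sketch of the idea, using faster-whisper as a stand-in backend (buffer handling and the merge step are heavily simplified):

                import numpy as np
                from faster_whisper import WhisperModel  # stand-in backend for this sketch

                SR = 16000        # 16 kHz mono float32 input
                WINDOW = 15 * SR  # re-transcribe the last 15 s of audio
                model = WhisperModel("base.en")
                buffer = np.zeros(0, dtype=np.float32)

                def on_chunk(chunk):
                    """Feed ~3 s of new samples; re-run whisper over the last 15 s."""
                    global buffer
                    buffer = np.concatenate([buffer, chunk])[-WINDOW:]
                    segments, _ = model.transcribe(buffer)
                    # Naive: emit the whole window each time. A real implementation
                    # would diff successive windows and only commit the stable prefix.
                    print(" ".join(s.text.strip() for s in segments))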

            • superluserdo 274 days ago
              I basically implemented exactly this on top of whisper since I couldn't find any implementation that allowed for live transcription.

              https://tomwh.uk/git/whisper-chunk.git/

              I need to get around to cleaning it up but you can essentially alter the number of simultaneous overlapping whisper processes, the chunk length, and the chunk overlap fraction. I found that the `tiny.en` model is good enough with multiple simultaneous listeners to be able to have highly accurate live English transcription with 2-3s latency on a mid-range modern consumer CPU.

            • dylan604 273 days ago
              If real-time transcription is so bad, why force it to be real-time? What happens if you give it a 2-3 second delay? That's pretty standard in live captioning. I get real-time being the ultimate goal, but we're not there yet. So, working within the current limitations: is piss-poor transcription in real time really more desirable than better transcription with a 2-3 second delay?
          • jeroenhd 273 days ago
            I don't know an LLM that does context-based rewriting of interpreted text.

            That said, I haven't run into the icecream problem with Whisper. Plenty of other systems fail but Whisper just seems to get lucky and guess the right words more than anything else.

            The Google Meet/Android speech recognition is cool but terribly slow in my experience. It also has a tendency to over-correct for some reason, probably because of the "best of N" system you mention.

          • llarsson 274 days ago
            Attention is all you need, as the transformative paper (pun definitely intended) put it.

            Unfortunately, you're only getting attention in 3 second chunks.

          • abdullahkhalids 273 days ago
            Which other streaming transcription services are you referring to?
          • no_wizard 274 days ago
            That’s because at the end of the day this technology doesn't "think". It simply holds context until the next thing, without regard for the previous information.
      • anonymousiam 274 days ago
        Whisper is excellent, but not perfect.

        I used Whisper last week to transcribe a phone call. In the transcript, the name of the person I was speaking with (Gem) was alternately transcribed as either "Jim" or "Jem", but never "Gem."

        • JohnKemeny 274 days ago
          Whisper supports adding context, and if you're transcribing a phone call, you should probably add "Transcribe this phone call with Gem", in which case it would likely get the name right.
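
          With the reference CLI that's the --initial_prompt flag (whisper.cpp calls it --prompt); e.g.:

              whisper call.mp3 --model large-v2 --initial_prompt "Phone call between me and Gem."
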
          • ctxc 274 days ago
            Thanks John Key Many!
        • t-3 273 days ago
          That's at least as good as a human, though. Getting to "better-than-human" in that situation would probably require lots of potentially-invasive integration to allow the software to make correct inferences about who the speakers are in order to spell their names correctly, or manually supplying context as another respondent mentioned.
          • anonymousiam 273 days ago
            When she told me her name, I didn't ask her to repeat it, and I got it right through the rest of the call. Whisper didn't, so how is this "at least as good as a human"?
            • t-3 273 days ago
              I wouldn't expect any transcriber to know that the correct spelling in your case used a G rather than a J - the J is far more common in my experience. "Jim" would be an aberration that could be improved, but substituting "Jem" for "Gem" without any context to suggest the latter would be just fine IMO.
      • 0points 274 days ago
        So, yes, and also no.
    • lgessler 274 days ago
      I recommend having a look at 16.3 onward here if you're curious about this: https://web.stanford.edu/~jurafsky/slp3/16.pdf

      I'm not familiar with Whisper in particular, but typically what happens in an ASR model is that the decoder, speaking loosely, sees "the future" (i.e. the audio after the chunk it's trying to decode) in a sentence like this, and also has the benefit of a language model guiding its decoding so that grammatical productions like "I like ice cream" are favored over "I like I scream".

    • ec109685 273 days ago
      The "I" is emphasized more in "I scream" than in "ice cream", I think.

      But it's a great point that you need context to be sure.

  • JohnKemeny 274 days ago
    Related, a blog article by the author of the patch:

    Run Whisper audio transcriptions with one FFmpeg command

    https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

    Posted here, with 0 comments: https://news.ycombinator.com/item?id=44869254

  • hbn 273 days ago
    I wonder if Apple's upcoming speech APIs can be added too. Would be cool to have it just work out of the box on Macs, without needing to source a model.

    https://developer.apple.com/documentation/speech/speechtrans...

    https://developer.apple.com/documentation/speech/speechanaly...

    https://www.macstories.net/stories/hands-on-how-apples-new-s...

  • voxadam 274 days ago
    Am I correct in understanding that Whisper is a speech recognition AI model originally created by OpenAI?

    https://en.wikipedia.org/wiki/Whisper_(speech_recognition_sy...

    • Maxious 274 days ago
      yep, there's a c++ implementation to run it https://github.com/ggml-org/whisper.cpp
      • oezi 274 days ago
        Isn't WhisperX the canonical choice for running Whisper?
        • 0points 274 days ago
            While whisper and whisperX are Python implementations, whisper.cpp wins the benchmarks.
        • sampullman 274 days ago
          Maybe for running locally? whisper.cpp is nice because you can embed it pretty easily in apps for various targets like iOS, OSX, Android, wasm, etc.
    • johnisgood 274 days ago
      Yes.

      From the documentation:

      > It runs automatic speech recognition using the OpenAI's Whisper model.

      • voxadam 274 days ago
        Thanks, I was being tripped up by the DDoS protection on code.ffmpeg.org for a minute and couldn't read the patch. The combo of Firefox and the fact that Quantum/Lumen/CenturyLink seems to get off on rotating my dynamic IP for no reason occasionally triggers various DDoS protection schemes.
        • johnisgood 273 days ago
          No problem. :) Yeah, it took me 8 seconds to get through. It seems your issue was worse.
    • cess11 274 days ago
      Kind of, it's a family of audio transcription models.

      https://huggingface.co/search/full-text?q=whisper

    • AlienRobot 274 days ago
      I think so, if I remember correctly PotPlayer also supports it for automatic subtitling.
    • acidburnNSA 274 days ago
      Yes, according to the comments in the patch, you are correct.
    • kwar13 274 days ago
      yes.
  • sorenjan 273 days ago
    I hope this is the start of more ML filters in ffmpeg. They added the sr (super resolution) filter years ago, but it's old and it's difficult to get the weights so you can run it, since they're not included. They have added support for multiple inference libraries like libtorch, but again, it's difficult to even get started. Hopefully they can get behind a consistent ML strategy, ideally with a "models" directory with ready to use models for upscaling, temporal upscaling, noise cancelling, etc. A lot of audio and video filter research use ML now, new codecs will probably also use it soon.
  • porridgeraisin 274 days ago
    I had a small bash pipeline for doing this until now.

      ffmpeg -f pulse -i "$(pactl get-default-source)" -t 5 -f wav -ar 16000 -ac 1 -c:a pcm_s16le - |  # record 5 s of mic audio as 16 kHz mono PCM
      ./main - |            # transcribe it with whisper.cpp
      head -2 | tail -1 |   # keep only the transcript line
      cut -d] -f2 |         # drop the "[t0 --> t1]" timestamp prefix
      awk '{$1=$1};1'       # trim leading/trailing whitespace
    
    The reading-from-mic part (-f pulse, pactl...) is Linux-specific; the rest should be cross-platform. The `main` executable is the whisper.cpp binary (see the whisper.cpp GitHub README; it's just the output of `make base.en` from there).

    Edit: -t 5 controls recording duration.

    Oh and add 2>/dev/null to silence the debug output. I copied this from a pipe that further sends it into an LLM that then looks at the meaning and turns it into a variety of structured data (reminders, todo items, etc) which I then....

    • dotancohen 273 days ago

        > which I then....
      
      Yes, please, go on...
      • porridgeraisin 273 days ago
        The LLM turns my unstructured command into a structured one (from a limited set of commands hardcoded in the prompt), and a script takes that and executes it. I have it do stuff like interact with Google Keep/Google Calendar using the CLI. Those are the most used actions, but there are a few others. Of course, all actions can be scheduled.

        The LLM can screw up now and then and output absolute garbage, but I've got a knack now for figuring out which prompts it's going to be hopeless on, and I enter those manually.

        Example: saying

            Remove makhana from shopping list

        ends up running the command

            gkeep items edit shopping_list --check makhana

        There is a direct text interface too that skips the voice transcription.

        The main thing is that it runs in a background window without interrupting my screen or making me wait for some slow webpage to load. I had it do a few things on GitHub, like reminding me when checks pass on PRs. You could potentially connect it to various things, like your Amazon account to check on an order, etc. As I write this, I realise that what I built basically amounts to what folks do with MCP today. Maybe I should update it to use the protocol.

        These days I have a little more idle time as a grad student than I did at a tech company, and I don't really need to manage home/cooking/... so I don't really use some of the more complicated features. I mostly just use it to schedule 1-on-1s with my guide and add reminders about assignments, TA work, talks, and my music class.
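
        The dispatch step is basically just this (a rough sketch; `listen.sh` and `llm` are stand-ins for my wrapper around the ffmpeg|whisper pipe and whatever LLM CLI you use, and the real prompt spells out every allowed command):

            text=$(./listen.sh)   # mic -> ffmpeg -> whisper.cpp, as in my other comment
            cmd=$(printf '%s' "$text" \
              | llm "Map this request to exactly one allowed command, e.g. 'gkeep items edit <list> --check <item>'. Output only the command.")
            case "$cmd" in
              gkeep\ *|gcalcli\ *) eval "$cmd" ;;       # only run commands matching the allowlist
              *) printf 'refused: %s\n' "$cmd" >&2 ;;   # garbage output goes nowhere
            esac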

        • dotancohen 273 days ago
          That is fascinating, thank you very much for sharing. Good luck with the grad work.
  • donatj 274 days ago
    I know nothing about Whisper, is this usable for automated translation?

    I own a couple very old and as far as I'm aware never translated Japanese movies. I don't speak Japanese but I'd love to watch them.

    A couple years ago I had been negotiating with a guy on Fiver to translate them. At his usual rate-per-minute of footage it would have cost thousands of dollars but I'd negotiated him down to a couple hundred before he presumably got sick of me and ghosted me.

    • ethan_smith 274 days ago
      Whisper can indeed transcribe Japanese and translate it to English, though quality varies by dialect and audio clarity. You'll want the large-v3 model for best results. With ffmpeg's new integration it's something along the lines of `ffmpeg -i movie.mp4 -af "whisper=model=ggml-large-v3.bin:..." ...`; note that the filter's `model` option takes the path of a downloaded whisper.cpp ggml file rather than a bare model name, and `ffmpeg -h filter=whisper` lists the exact options and whether translation is exposed.
      • waltbosz 274 days ago
        I wonder how the results of AI Japanese-audio-to-English subtitles would compare to fansubbed anime. I'm guessing it would be a more literal translation vs. a contextual or cultural one.

        I found an interesting article about trollsubs, which I guess are fansubs made with a contemptuous flair. https://neemblog.home.blog/2020/08/19/the-lost-art-of-fan-ma...

        Tangent: I'm one of those people who watch movies with closed captions. Anime is difficult because the subtitle track is often the original Japanese-to-English subtitles and not closed captions, so the text does not match the English audio.

        • chazeon 274 days ago
          I do Japanese transcription + Gemini translations. It's worse than fansubs, but it's much, much better than nothing. The first thing that can struggle is actually the VAD, then special names and places; prompting can help, but not always. Finally there's uniformity (or style): I still feel that I can't control the punctuation well.
        • numpad0 273 days ago
          I was recently playing around with Google Cloud ASR as well as smaller Whisper models, and I can say it hasn't gotten to that point: Japanese ASRs/STTs all generate final kanji-kana mixed text, and since kanji:pronunciation is an n:n mapping, it's non-trivial enough that it currently needs human native speakers to fix misheard text in a lot of cases. LLMs should in theory be good at this type of task, but they're somehow clueless about how Japanese pronunciation works, and they just rubber-stamp inputs as written.

          The conversion from pronunciation to intended text isn't deterministic either, so it probably can't be solved by "simply" generating all-pronunciation outputs. Maybe a multimodal LLM as the ASR/STT, or a novel dual-input as-spoken+estimated-text validation model could be made? I wouldn't know, though. It seems like a semi-open question.

    • neckro23 273 days ago
      In my experience it works ok. The "English" model actually knows a lot of languages and will translate directly to English.

      You can also transcribe it to Japanese and use a translator to convert to English. This can sometimes help for more semantically complex dialogue.

      For example, using faster-whisper-xxl [1]:

      Direct translation:

          faster-whisper-xxl.exe --language English --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
      
      Use Japanese, then translate:

          faster-whisper-xxl.exe --language Japanese --task translate --model large-v2 --ff_vocal_extract mdx_kim2 --vad_method pyannote_v3 --standard <input>
      
      1. https://github.com/Purfview/whisper-standalone-win
    • prmoustache 274 days ago
      My personal experience trying to transcribe (not translate) was a complete failure. The thing would invent stuff. It would also get completely lost when more than one language was used.

      It also doesn't understand context, so it makes a lot of the errors you see in automatic captions on YouTube videos, for example.

      • okdood64 274 days ago
        It's curious how YouTube's is still so bad given the current state of the art, though it has gotten a lot better in the last 6 months.
    • trenchpilgrim 274 days ago
      Whisper has quite bad issues with hallucination. It will inject sentences that were never said in the audio.

      It's decent for classification but poor at transcription.

      • neckro23 273 days ago
        Pre-processing with a vocal extraction model (bs-rofomer or similar) helps a lot with the hallucinations, especially with poor quality sources.
        • trenchpilgrim 273 days ago
          I'm working with fairly "clean" audio (voice only) and still see ridiculous hallucinations.
    • BetterWhisper 273 days ago
      Hey, indeed Whisper can transcribe Japanese and even translate it (but only to English). For the best results you need the largest model, which, depending on your hardware, might be slow or fast.

      Another option is to use something like VideoToTextAI, which lets you transcribe quickly and then translate into 100+ languages, from which you can export a subtitle (SRT) file.

    • poglet 274 days ago
      Yep, whisper can do that. You can also try whisperx (https://github.com/m-bain/whisperX) for a possibly better experience with aligning of subtitles to spoken words.
    • _def 274 days ago
      May I ask which movies? I'm just curious
  • jhatemyjob 273 days ago
    I wish they had worked with the mpv folks instead of shoehorning this in. Based on the docs, getting live transcription for a video will involve running the demuxer/decoder on one thread and this whisper filter on another, using ffmpeg's AVIO (or, shudder, a REST API [1]) to synchronize the two parallel jobs. It could have been way simpler.

    Other than the "live transcription" use case (which they made unnecessarily complicated), I don't see how this is any better than running whisper.cpp directly. Other people in this thread are basically saying "ffmpeg's interface is better understood" [2], but LLMs make that point moot, since you can just ask them to do the drudgery for you.

    [1] https://medium.com/@vpalmisano/run-whisper-audio-transcripti...

    [2] https://news.ycombinator.com/item?id=44890067

  • webinar 274 days ago
    I've been using FFmpeg and Whisper to record and transcribe live police scanner audio for my city, and update it in real-time to a live website. It works great, with the expected transcription errors and hallucinations.
    • Xunjin 274 days ago
      Is this website open? Would love to see your work :P
      • webinar 274 days ago
        somerville.votolab.com
        • jaster 273 days ago
          All the "Thanks for watching!" gave me a good chuckle.

          Remind me of one of my own experiences with one of the Whisper model, where some random noise in the middle of the conversation was translated into "Don't forget to like and subscribe".

          Really illustrate where the training data is coming from.

        • mkayokay 274 days ago
          Looks like this is a nice case where the LLM thinks silence is "thanks for watching", which was discussed on here a few days ago.
    • waltbosz 274 days ago
      I wanted to do this for my local county council meetings. I think in this context speaker recognition would be important.
  • superkuh 273 days ago
    "Making sure you're not a bot!" with no way to get to the actual document that is supposed to be at the URL. Anubis can be configured to be accessible for people without the latest computers by using the meta-refresh proof of work but very few people take any time to configure it and just deploy the defaults. Just like with cloudflare.

    That said, I suppose I'm glad they're concentrating on making the ffmpeg code better rather than fixing bugs in the web interface for the development tracker. Having whisper integrated will be really useful. I'm already imagining automatic subtitle generation... imagining because I can't read the page or the code to know what it is.

  • manca 273 days ago
    The only problem with this PR/diff is that it creates just an avfilter wrapper around the whisper.cpp library and requires users to manage the dependencies on their own. This is not helpful for novice users, who will first need to:

    1. git clone whisper.cpp

    2. Make sure they have all dependencies for `that` library

    3. Hope the build passes

    4. Download the actual model

    AND only then will they be able to use the `-af "whisper=model...` filter.

    If they try to use the filter without all the prereqs they'll fail and it'll create frustration.

    It'd be better to create a native Whisper avfilter and only require the user to download the model -- I feel like this would streamline the whole process and actually get many more people to use it.

    • slhck 273 days ago
      While that would be nicer from an end-user perspective, it would be hard for FFmpeg itself to maintain. Consider the velocity of the whisper.cpp project. I'm sure that – just like with filters such as vmaf, which also require building a dependency and downloading a model – precompiled versions will become available for novice users to download directly. Especially considering whisper.cpp is MIT-licensed.
  • instagraham 274 days ago
    Does this mean that any software which uses ffmpeg can now add a transcription option? Audacity, Chrome, OBS etc
    • ks2048 274 days ago
      If they want to support it out-of-the box, they'll still have to embed a model file (roughly 500 MB - 3GB, varying size and quality)
      • einpoklum 274 days ago
        Can't you point ffmpeg to a model file using some preferences dialog?
  • boutell 274 days ago
    Shut off the broken bot filter so we can read it please
    • majewsky 274 days ago
      From experience, these bot filters are usually installed because the site would be down entirely without rejecting AI scrapers, so the argument to shut it off to improve usability is rather silly.
    • superkuh 273 days ago
      They don't need to shut off Anubis, they just need to configure it beyond the defaults. If they turned on the meta-refresh based challenge then all browsers could access it while still keeping most of the bots away. But few people ever configure these things and just accept the broken defaults.

      With the current broken default config my browser can't even run the JS challenge due to it using unsupported bleeding edge JS features.

    • QuantumNomad_ 274 days ago
      Archived snapshots of the linked page:

      https://web.archive.org/web/20250813104007/https://code.ffmp...

      https://archive.is/dmj17

      You can read it on one of these without having to pass that specific bot check

    • jeroenhd 274 days ago
      Check out commit 13ce36fef98a3f4e6d8360c24d6b8434cbb8869b from https://git.ffmpeg.org/ffmpeg.git if your web browser doesn't support Javascript. The linked page is just a git viewer for that specific commit.
      • yorwba 274 days ago
        Or read the documentation for the new whisper filter: https://ffmpeg.org/ffmpeg-filters.html#whisper-1
        • jeroenhd 274 days ago
          That also works, I assumed the ffmpeg website would also be behind Anubis if the git server is, but it doesn't actually seem to be.
          • majewsky 274 days ago
            Anubis is not all that useful for static websites since serving them does not generate high load (unlike when a bot traverses a Git server UI).
    • diggan 274 days ago
      Took my iPhone 12 Mini a whole of 0.1 seconds to pass it. What hardware/OS are you using?
      • politelemon 274 days ago
        Took me zero seconds to be blocked with invalid response
        • miloignis 274 days ago
          It also instantly blocks me on GrapheneOS, both Firefox and Vanadium. Very odd, as I've never had an issue with Anubis before.
          • shaky-carrousel 274 days ago
            GrapheneOS here, with Vanadium in incognito, it doesn't block me, both in wifi and in mobile. Maybe it was a temporary hiccup.
            • miloignis 273 days ago
              Thanks for checking! Incognito blocks me too, no idea whats up. Maybe I'm getting tripped up by IP reputation or something (though I shouldn't, normal residential connection).
      • londons_explore 274 days ago
        Took about 30 secs for me (5 yr old intel cpu). Looked like there was a progress bar, but it didn't progress. Maybe the difficulty varies depending on IP address?
        • jeroenhd 274 days ago
          Anubis has config for that: https://anubis.techaro.lol/docs/admin/policies#request-weigh...

          It's up to the site admin to configure it that way, but it's possible some IP ranges/user agents are more often used by bots and therefore have an increased weight.

          For old browsers there's also an option to use meta refresh instead of JS (https://anubis.techaro.lol/docs/admin/configuration/challeng...) but that's quite a recent addition and not enabled by default.

        • ta1243 274 days ago
          My i5-6200U with Firefox/Linux is about 10 years old. I use a variety of ad-blocking and fingerprint-blocking techniques. Cloudflare often complains and blocks me.

          This page loaded pretty much instantly (certainly in the time it took me to switch to the background tab I'd loaded it in). But then ffmpeg is written by old-school engineers with old-school ways of working. Their social media accounts are a hilarity of trolling worthy of Slashdot at its peak.

        • diggan 274 days ago
          > Maybe the difficulty varies depending on IP address?

          I'm currently roaming in Finland with a Spanish SIM so would have expected the opposite in that case.

      • blahyawnblah 274 days ago
        The stock chrome browser Google news uses
      • johnisgood 274 days ago
        Took me 8 seconds on my shitty desktop.
  • realxrobau 274 days ago
    Annoyingly, something is broken with their anti-bot stuff, as it keeps refusing to let me see the page.
  • lawik 274 days ago
    I wonder if they'll stop there or add a bunch of others now that they've started. Parakeet is supposed to be good?

    Should they add Voice Activity Detection? Would these be separate filters, or would that just make the whisper filter fancier?

    • shrx 274 days ago
      Voice Activity Detection support is already included.
    • adi_kurian 273 days ago
      Parakeet is indeed really awesome.
  • zoobab 274 days ago
    Not sure it will be packaged in Debian, what with an external binary model that god knows how was produced...
    • majewsky 274 days ago
      It looks like the model file needs to be supplied at invocation time, so the binary blob would not be required for packaging.
      • zoobab 274 days ago
        so 'apt install ffmpeg' won't be enough to have the feature?
        • SahAssar 274 days ago
          You'd have the feature, but you also need to supply the model. The feature seems to just be that ffmpeg has the ability to run the model, it does not include the model.
  • miladyincontrol 273 days ago
    As an aside, my favorite whisper 'hack' is that you can just speed up audio 10x to process it 10x faster, then adjust the timings afterwards.
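
    Something like the following, assuming an ffmpeg new enough that atempo accepts factors above 2 (older builds need a chain like atempo=2,atempo=2,atempo=2.5):

        # 10x speed-up; transcribe fast.wav, then multiply every output timestamp by 10
        ffmpeg -i in.mp3 -af atempo=10 -ar 16000 -ac 1 fast.wav
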
    • password4321 272 days ago
      This would be really cool to use to implement live transcripts
  • WanderPanda 273 days ago
    Is Whisper still SOTA 3 years later? There doesn't seem to be a clearly better open model. Alec Radford really is a genius!
  • mkbkn 273 days ago
    How can I run Whisper or this software in Linux or Android as a non-technical user?

    Basically a simple audio-to-text for personal use?

    • 3036e4 273 days ago
      I don't think installing (i.e. compiling) whisper.cpp and using it to do audio-to-text is very difficult. If the documentation is too technical I am sure you can ask some LLM to walk you through it. I have used it on Android in termux and on my FreeBSD desktop computer. Would not expect any difficulties on any modern Linux.
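
      The short version is something like this (the exact steps change now and then; the whisper.cpp README is authoritative):

          git clone https://github.com/ggml-org/whisper.cpp
          cd whisper.cpp
          sh ./models/download-ggml-model.sh base.en    # fetch a small English model
          cmake -B build && cmake --build build -j --config Release
          ./build/bin/whisper-cli -m models/ggml-base.en.bin -f audio.wav
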
    • password4321 270 days ago
      https://handy.computer "the free and open source app for speech to text" is available for Linux (and Windows/Mac).
  • bondarchuk 274 days ago
    Can whisper do multilingual yet? Last time I tried it on some mixed dutch/english text it would spit out english translations for some of the dutch text. Strange bug/feature since from all appearances it had understood the dutch text perfectly fine.
    • clarionbell 274 days ago
      I think the Dutch/English is probably the worst combination for this. Languages are rather close.
      • bondarchuk 274 days ago
        I don't understand how this would happen, though. It's not like it will mishear a dutch sentence as if it's english; it will correctly pick up the dutch sentence, but (since the language is auto-detected as english at the start of the segment), seemingly auto-translate that (correct and correctly heard) dutch text to english. All we need is a way to get the dutch text that's surely somewhere in there, before the translation happens.

        Unless it was trained end-to-end on dutch-subtitled english text?? Which might make the translation a somewhat inextricable part of the model..? Does anyone know?

        • busup 273 days ago
          Maybe try the turbo model, which is transcription-only. The other models were trained on X-to-English translations, and they seem to emphasise the output language over the task token. You can get them to translate to any language even though they were never trained for that; nl-to-en translation, by comparison, is in the dataset, so I'm not surprised it's doing that.
          • bondarchuk 273 days ago
            Hey, good tip, thanks a lot!
    • numpad0 274 days ago
      Isn't that a bit much to ask of ASR models? Humans can't handle simultaneous multilingual dictation either; I have to stop and reinitialize my ears before switching between English and my primary language.
      • abdullahkhalids 273 days ago
        In South Asia, it's quite common for people to speak a combination of their local language and English. Not just alternating sentences between the two languages, but in fact, constructing sentences using compound phrases from the two languages.

        "Madam, please believe me, maine homework kiya ha" [I did my homework].

        • okwhateverdude 273 days ago
          This is common in the southwestern part of the US too. My partner and her friends she grew up with will have conversations that fluidly pick phrases and vocab from either Spanish or English depending on what words happen to be the easiest to pull from their brain. It's wild to listen to.
          • numpad0 272 days ago
            Aren't those limited to specific words or phrases in specific forms? I doubt it works for arbitrary half-sentences.
      • bondarchuk 274 days ago
        Seems like it already has the capability somewhere in the model though - see my reply to clarionbell.
      • cenamus 274 days ago
        Isn't that exactly what intepreters do?
        • numpad0 273 days ago
          If they're like me, they seem to coordinate constant staggered resets of the sub-systems of the language-processing pipeline, keeping internal representations of the input in a half-text state so that it comes back out through the pipeline in the other configuration.

          That's how I anecdotally feel and interpret how my own brain appears to work, so it could be different from how interpreters work or how actual human brains work, but as far as I can see, professional simultaneous interpreters don't seem to be agnostic to the relevant pair of languages at all.

    • jeroenhd 274 days ago
      I found that it works quite well for Dutch+English as long as you use one of the larger models. But that may just be luck, I imagine mixing Italian and Swedish will have very different results.
    • kwar13 274 days ago
      Best for English, but I've found it pretty decent for Spanish.
    • guilamu 274 days ago
      Whisper has been multilingual for 5 years at least.
      • bondarchuk 274 days ago
        I know it's ostensibly multilingual (it's less than a year since I tried), but it does this thing where it translates everything (or only some things) into a single language regardless, with no way to turn it off.
        • guilamu 274 days ago
          Sorry, I've been using it for French audio files for 5 years and have never had this issue.
      • woodson 273 days ago
        Except it was only released in September 2022 (not even 3 years ago).
    • ph4evers 274 days ago
      Whisper-v3 works well for multi-lingual. I tried it with Dutch, German and English
  • zzsshh 274 days ago
    Does this finally enable dynamically generating subtitles for movies with AI?
    • jeroenhd 274 days ago
      Docs say:

          If set, the transcription output will be sent to the specified file or URL
          (use one of the FFmpeg AVIO protocols); otherwise, the output will be logged as info messages.
          The output will also be set in the "lavfi.whisper.text" frame metadata.
          If the destination is a file and it already exists, it will be overwritten.
      
          @item format
          The destination format string; it could be "text" (only the transcribed text will be sent to the destination), "srt" (subtitle format) or "json".
          Default value: @code{"text"}
      
      I don't know if this can embed the subtitles, but it does support generating accompanying srt files.

      Of course, you could already do that by just manually calling whisper on files, but now you don't need to export parts or transformed media files to feed into whisper.
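
      Going by those options, something along these lines should write an srt while discarding the decoded frames (untested; the model option takes the path of a downloaded whisper.cpp ggml file):

          ffmpeg -i input.mp4 -vn \
            -af "whisper=model=ggml-base.en.bin:destination=subs.srt:format=srt" \
            -f null -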

    • regularfry 274 days ago
      If you have enough processing power. Without a GPU it's going to lag.
      • jeroenhd 273 days ago
        In my experience, a small/tiny whisper model has pretty okay English decoding speed on something relatively modern even without GPU support. There's a bunch of latency in the process (because of technological limitations) but the optimised C++ version shouldn't pose too much of a problem unless you're running in power saving mode. Battery life may be a problem on older laptops, though.
      • KeplerBoy 274 days ago
        Whisper is pretty fast.
    • diggan 274 days ago
      Finally? I think VLC demo'd this a while ago at some conference where they had a table, if I remember correctly.
      • SSLy 274 days ago
        VLC and ffmpeg are unrelated projects
        • demurgos 274 days ago
          I'm not very familiar with them, but I always assumed that there is a lot of overlap between the maintainers of both projects.
          • SSLy 273 days ago
            Well, they are just unrelated. VLC has a plugin to access ffmpeg codecs via libav*, that's about it.
            • guipsp 273 days ago
              They are not completely unrelated. There is significant overlap. FFmpeg also uses libs from VLC.
      • mmmpetrichor 273 days ago
        I've been waiting a while now for automatically translated subtitles in VLC. I thought they would be here by now. I'm probably underestimating the difficulty, but I'm surprised no video player has done it yet (as far as I know).
        • jeroenhd 273 days ago
          A lot of subtitles from commercial media use a subtitle format that's essentially a bitmap that the video player overlays on top of the video. There are tools to decode this using OCR, but it's not something I'd enable by default.

          For text/srt subtitles, translation would probably be easier. There's a plugin for that already if you're okay with online translation services: https://github.com/nopium/vlc-trans-lua

  • re 274 days ago
    I've been playing with whisper to do local transcription of long videos, but one issue I've found is that long (>15 second) spans without any speech tend to send it into hallucination loops that it often can't recover from. I wonder if, with direct integration into ffmpeg, they'll be able to configure it in a way that improves that situation.
    • franga2000 274 days ago
      Whisper is supposed to be used with voice activity detection, and all production implementations I've seen do that. The raw model is known to make up nonsense for silence because, as I understand it, it was never trained not to, on the assumption that everyone would use VAD.
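
      For what it's worth, the new ffmpeg filter exposes this as a vad_model option; presumably something like the following works (the Silero VAD file name is whisper.cpp's, and may differ):

          ffmpeg -i input.mp4 -vn \
            -af "whisper=model=ggml-base.en.bin:vad_model=ggml-silero-v5.1.2.bin:destination=subs.srt:format=srt" \
            -f null -
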
    • BoredPositron 274 days ago
      You usually delete silence before using something like whisper.
      • re 274 days ago
        I've heard that, but that doesn't sound like a useful approach for videos where (1) non-speech segments can have plenty of other sound (music, noise) and (2) you want timestamps to match up with the original video, like for subtitles. But maybe there are known mitigations for both of those issues that I'm not aware of. And if they do exist maybe they can be included in the ffmpeg whisper integration.
        • miki123211 274 days ago
          By "delete", people mostly mean "detect", so that you can avoid processing such segments through Whisper. There's no reason to actually cut the silence out from the original audio file.
      • hnlmorg 274 days ago
        This is designed for real time use too. And in such cases, you couldn’t delete the silence before use.
        • BoredPositron 274 days ago
          The ffmpeg implementation might be; the example was not.
  • yewenjie 274 days ago
    I have recently found that Parakeet from NVIDIA is way faster and pretty much as accurate as Whisper, but it only works with English.
  • jd3 273 days ago
    took me longer than i'd care to admit to figure out how to install whisper as a user/system package on macOS w/o brew (which pulls in all of llvm@16 during install)

        brew install uv
        uv tool install openai-whisper
        then add ~/.local/bin/ to $PATH
  • MaxikCZ 274 days ago
    I tried to use whisper to generate non-English subs from English audio but wasn't able to figure it out. I know it can do English subs from non-English audio, and that earlier (less precise) versions could do any-language audio -> any-language subs, but the latest whisper only does English subs.

    Has anyone found a way?

    • abdusco 274 days ago
      I solved it by generating English subtitles, then passing those to an LLM in chunks of ~20 entries. Include the preceding and following subtitles as context for better translation. Make sure to replace the timestamps with simple integer ids, because LLMs like to mangle them, no matter how hard you prompt.

      I could share a python script that is working pretty reliably for me.

  • radiator 273 days ago
    May I ask: if there is a movie where English people speak English, French people speak French, and German people speak German, is there software that can generate subtitles in English, French, and German without translating anything? I mean, just record what it hears.
  • cheerioty 273 days ago
    OH: "New changelog entries go to the bottom, @vpalmisano .. Didn't I tell you this once?"
  • kwar13 274 days ago
    Fantastic! I am working on a speech-to-text GNOME extension that would immensely benefit from this.

    https://github.com/kavehtehrani/gnome-speech2text

    • dotancohen 273 days ago
      Why is this a Gnome extension? I would love to use this in KDE.
      • guipsp 273 days ago
        Likely because they are a GNOME user and the APIs are DE specific.
      • kwar13 273 days ago
        I use Ubuntu 24.04, which comes with GNOME Shell.
  • shmerl 273 days ago
    Did ffmpeg move their bug tracker to Forgejo?

    https://code.ffmpeg.org/FFmpeg/FFmpeg/issues

    I still see the old one too, but the Forgejo one is nice.

  • dncornholio 274 days ago
    I was expecting a lot more comments on whether this is a necessary feature and whether it even belongs in a library like ffmpeg. I think this is bloat, especially when the feature doesn't work flawlessly; whisper is very limited.
    • MrGilbert 274 days ago
      The only item that was discussed was that the subtitle workflow does not seem to be that good, afaict:

      https://code.ffmpeg.org/FFmpeg/FFmpeg/pulls/20022#issuecomme...

    • baxter001 273 days ago
      You'd be surprised what's in there; a few forms of NNs are already supported for denoising and speech detection.

      I think having this flow out to all of the dependents of libav is a greater good than notions of lib purity.

  • atum47 273 days ago
    It failed to identify me as a human twice before letting me access the page.
  • iambvk 273 days ago
    Is anyone able to get streaming audio-to-text conversion working with whisper.cpp?

    I have tried several times to get this into reasonable shape, but every attempt has been a failure. If anyone has pointers, I'd really appreciate it.

  • varenc 273 days ago
    Anyone got this to compile on macOS yet? The homebrew binary doesn't yet (and probably won't ever) include the --enable-whisper compile option.
  • almaight 273 days ago
    "multi-modal feature extraction → semantic translation → cross-modal feature transfer → precise temporal alignment," is all we need
  • mockingloris 274 days ago
    How could one, in theory, use this to train on a new language? Say, for a hobby project: I have recordings of some old folks' stories in my local dialect.

    └── Dey well; Be well

  • baxter001 273 days ago
    More precisely libavfilter, so it will also soon be in mpv and other dependent players.

    This is going to be great for real-time audio translation.

  • martzoukos 274 days ago
    I guess that there is no streaming option for sending generated tokens to, say, an LLM service to process the text in real-time.
  • maxlin 270 days ago
    As someone who has a live application using whisper and ffmpeg, this does seem like feature creep. ffmpeg and whisper are both otherwise well-bounded CLI tools adhering to the Unix philosophy; this... idk.
  • XCSme 273 days ago
    Unrelated, but can I use Whisper in DaVinci Resolve to automatically transcribe my videos and add subs?
    • cadamsdotcom 273 days ago
      Unrelated, but why isn’t Europe a country already. It’s been ages!
  • igorguerrero 273 days ago
    Aww, I literally just implemented this using whisper.cpp and the ffmpeg lib; the code is even similar...
  • yieldcrv 273 days ago
    Labeling multiple people talking is something I found lacking with whisper. Is it better now?
  • dotancohen 273 days ago
    Why would one use FFmpeg with Whisper support, instead of using Whisper directly?
    • 3036e4 273 days ago
      At least whisper.cpp only supports a few input formats, like WAV and MP3. To get subtitles for videos I always have to first run ffmpeg to get an audio file and then run whisper.cpp. I guess this new feature means I can do it in just one step, so slightly more convenient?
      • dotancohen 273 days ago
        I see, thanks. I actually do almost all my Whisper work with ogg files, and recently hit a snag with m4a files. Transcoding to an equivalent-size ogg or mp3 killed the quality, and wav is too big. Maybe FFmpeg could be of service here.
    • lbrito 273 days ago
      I run a service that does transcription as part of the pipeline, and I use ffmpeg for other parts (such as speeding up audio). Having it all in a single command might make sense for some people if the costs work out.
  • BiteCode_dev 273 days ago
    What's the benefit vs. using whisper as a separate tool?
  • de6u99er 273 days ago
    That's great. How does Whisper compare to Google Gemini's transcription capabilities?
  • burnt-resistor 272 days ago
    Can't view site. Some sort of misconfigured CAPTCHA bullshit.

        Oh noes!
        Sad Anubis
        invalid response.
    
        Go home
    
        Protected by Anubis from Techaro.
    
        Mascot design by CELPHASE.
  • thedangler 274 days ago
    Does this whisper also do text-to-speech?
  • pmarreck 274 days ago
    Now if only it did separate speaker identification (diarization)...
    • harryf 273 days ago
      It's fairly easy to get diarization working with pyannote.audio and https://huggingface.co/pyannote/speaker-diarization-3.1, with ffmpeg first converting the audio to a 16kHz mono WAV file. But it depends a lot on the audio: a two-person podcast where the speakers give each other space works well; lots of people with overlapping voices, not so great.
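
      The ffmpeg preprocessing step is just:

          ffmpeg -i input.mp4 -vn -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
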
  • correa_brian 274 days ago
    hell yeah
  • hacker_88 274 days ago
    [dead]
  • bestspharma 272 days ago
    [dead]
  • ggap 274 days ago
    Very interesting to see this!