12 comments

  • racecar789 17 hours ago
    Imagine being able to tell an app to call the IRS during the day, endure the on-hold wait times, then ask the question to the IRS rep and log the answer. Then deliver the answer when you get home.

    Or, have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill which gets very annoying.

    So many use cases for this.

    • TZubiri 17 hours ago
      As the cost of human-like communication drops, expect Sybil attacks and spam to get cheaper and more common along with it.

      The IRS is notorious for resisting tech change; don't be surprised if they unplug the phones and force you to walk in to ask your question.

      What is the value add here? Saving some time for technocrats and techno-adjacents for all of 3 years before the victims of spam adapt?

      Also, this has already been solved: just mail your question like the rest of us mortals.

      • ensignavenger 17 hours ago
        It would be really nice if the IRS would ALLOW you to walk in and ask a question!
  • throw14082020 19 hours ago
    This is really helpful, thanks!

    OpenAI hired the former fractional CTO of LiveKit, who created Pion, a popular WebRTC library/tool.

    I'd expect OpenAI to migrate off of LiveKit within 6 months. LiveKit is too expensive. Also, WebRTC is hard, and OpenAI now being a less open company will want to keep improvements to itself.

    Not affiliated with any competitors, but I did work at a PaaS company similar to LiveKit that used WebSockets instead.

  • pj_mukh 1 day ago
    Super cool! Didn't realize OpenAI is just using LiveKit.

    Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!

    It would theoretically be cheaper to avoid keeping the Advanced Voice socket open the whole time: use the GPT-4o streaming service [1] only when inference is needed (pay per token) and use LiveKit's other components to do the rest (TTS, VAD, etc.), roughly along the lines of the sketch below.
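
    Roughly what I mean, as a sketch only (it uses OpenAI's hosted STT/TTS endpoints as stand-ins for whichever components you'd actually plug in; the names and flow are illustrative, not how LiveKit wires it up):

      from openai import OpenAI

      client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

      def cascade_turn(wav_path: str) -> bytes:
          """One user turn: STT -> streaming GPT-4o text -> TTS, paying per token."""
          # 1. Transcribe the user's speech (only run this when VAD says someone spoke).
          with open(wav_path, "rb") as f:
              text = client.audio.transcriptions.create(model="whisper-1", file=f).text

          # 2. Stream a text reply from GPT-4o.
          reply = ""
          stream = client.chat.completions.create(
              model="gpt-4o",
              messages=[{"role": "user", "content": text}],
              stream=True,
          )
          for chunk in stream:
              reply += chunk.choices[0].delta.content or ""

          # 3. Synthesize audio for the reply.
          return client.audio.speech.create(model="tts-1", voice="alloy", input=reply).content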

    What's the trade off here?

    [1]: https://platform.openai.com/docs/api-reference/streaming

    • davidz 1 day ago
      Currently it does: all audio is sent to the model.

      However, we are working on turn detection within the framework, so you won't have to send silence to the model when the user isn't talking. It's a fairly straightforward path to cutting the cost by ~50%.
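
      As a rough illustration only (not our actual implementation), you could gate frames with an off-the-shelf VAD such as webrtcvad and only forward the ones that contain speech; mic_frames() and forward_to_model() below are hypothetical stand-ins:

        import webrtcvad  # pip install webrtcvad

        SAMPLE_RATE = 16000          # 16-bit mono PCM
        FRAME_MS = 20                # webrtcvad accepts 10/20/30 ms frames
        FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 640 bytes per frame

        vad = webrtcvad.Vad(2)       # aggressiveness 0 (lenient) .. 3 (strict)

        def speech_frames(frames):
            """Yield only the frames the VAD classifies as speech."""
            for frame in frames:
                if len(frame) == FRAME_BYTES and vad.is_speech(frame, SAMPLE_RATE):
                    yield frame

        # for frame in speech_frames(mic_frames()):
        #     forward_to_model(frame)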

      • rukuu001 1 day ago
        Working on this for an internal tool - detecting no speech has been a PITA so far. Interested to see how you go with this.
    • npace12 1 day ago
      You don't get charged per hour with the OpenAI Realtime API, only for tokens from detected speech and the response.
  • solarkraft 1 day ago
    That's some crazy marketing for an "our library happened to support this relatively simple use case" situation. Impressive!

    By the way: The cerebras voice demo also uses LiveKit for this: https://cerebras.vercel.app/

    • russ 1 day ago
      There’s a ton of complexity under the “relatively simple use case” when you get to a global, 200M+ user scale.
  • 0x1ceb00da 18 hours ago
    This suggests that the AI "brain" receives the user input as a text prompt (the agent relays the speech prompt to GPT-4o) and generates audio as output (GPT-4o streams speech packets back to the agent).

    But when I asked Advanced Voice Mode, it said the exact opposite: that it receives input as audio and generates text as output.

    • mbrock 17 hours ago
      Both input and output are audio. This post is about bridging WebRTC audio I/O with an API that itself operates on simple TCP socket streams of raw PCM. For reliability and efficiency you want end users to connect with compressed loss-tolerant Zoom-style streams, and that goes through a middleman which relays to the model API.
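
      A minimal sketch of the relay direction, assuming you already have decoded 24 kHz PCM16 frames on the server side of the WebRTC connection (the event shape is from the Realtime API beta; check the current docs before copying):

        import base64, json, os
        import websockets  # pip install websockets

        URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
        HEADERS = {
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        }

        async def relay(pcm_frames):
            """Forward decoded PCM16 frames from the WebRTC side to the model."""
            # keyword is extra_headers on websockets<=12, additional_headers on newer releases
            async with websockets.connect(URL, extra_headers=HEADERS) as ws:
                async for frame in pcm_frames:  # raw bytes out of the Opus decoder
                    await ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": base64.b64encode(frame).decode(),
                    }))
                # Model audio comes back as response.audio.delta events, which get
                # re-encoded and played out over the user's WebRTC track (omitted).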
    • meiraleal 18 hours ago
      Who did you ask? ChatGPT? Not sure if you understand LLMs, but its knowledge is based on its training data and it can't reason about itself. It can only hallucinate in this case, sometimes correctly, most times incorrectly.
      • hshshshsvsv 18 hours ago
        This is also true for pretty much all humans, and bypassing this limitation is called enlightenment/self-realization.

        LLMs don't even have a self, so it can never be realized. Just the ego alone exists.

        • TZubiri 17 hours ago
          No, humans can self-inspect just fine
          • mbrock 16 hours ago
            A lot of psychologists would quibble with that...
          • tempaccount420 17 hours ago
            How do you know that?
  • spuz 21 hours ago
    Is there anyone besides OpenAI working on a speech-to-speech model? I find it incredibly useful and it's the sole reason I pay for their service, but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.
    • Ey7NFZ3P0nzAe 21 hours ago
      Yes. Kyutai released an open model called Moshi: https://github.com/kyutai-labs/moshi

      There's also LLaMA-Omni and a few others. None of them are even close to 4o from an LLM standpoint. But Moshi is called a "foundational" model and I'm hopeful it will be enhanced. Also, there's not yet support for those on most backends like llama.cpp / Ollama etc. So I'd say we're in a trough, but we'll get there.

    • 0x1ceb00da 18 hours ago
      When I asked Advanced Voice Mode, it said that it receives input as audio and generates text as output.
      • mbrock 17 hours ago
        It is mistaken because it has no particular insight into its own implementation. In fact the whole point is that it directly consumes and produces audio tokens with no text. That's why it's able to sing, make noises, do accents, and so on.
  • FanaHOVA 1 day ago
    Olivier, Michelle, and Romain gave you guys a shoutout like 3 times in our DevDay recap podcast if you need more testimonial quotes :) https://www.latent.space/p/devday-2024
    • russ 1 day ago
      I had no idea! <3 Thank you for sharing this, made my weekend.
    • shayps 1 day ago
      You guys are honestly the best
  • mycall 1 day ago
    I wonder when Azure OpenAI will get this.
    • davidz 1 day ago
      I'm working on a PR now :)
  • gastonmorixe 1 day ago
    Nice they have many partners on this. I see Azure as well.

    There seems to be a consensus that the new Realtime API is not actually using the same Advanced Voice model/engine (or however it works), since at least the TTS part doesn't seem to be as capable as the one shipped with the official OpenAI app.

    Any idea on this?

    Source: https://github.com/openai/openai-realtime-api-beta/issues/2

    • russ 1 day ago
      It's using the same model/engine. I don't have knowledge of the internals, but there's a different subsystem/set of dedicated resources for API traffic versus first-party apps.

      One thing to note: there is no separate TTS phase here; it happens internally within GPT-4o, in both the Realtime API and Advanced Voice.
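
      You can see that in the Realtime API event stream: the reply arrives directly as audio deltas plus a transcript, with no TTS call to make. A sketch of the consuming side (event names are from the beta docs; play() is a stand-in for your audio output):

        import base64, json

        def handle_event(raw: str, play):
            """Route one server event from the Realtime API websocket."""
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                play(base64.b64decode(event["delta"]))   # audio straight from GPT-4o
            elif event["type"] == "response.audio_transcript.delta":
                print(event["delta"], end="")            # transcript of that same audio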

  • lolpanda 23 hours ago
    So WebRTC helps with the unreliable network between mobile clients and the server side. If the application is backend-only, would it make sense to use WebRTC, or should I go directly to the Realtime API?
  • willsmith72 1 day ago
    That was cool, but it got up to $1 of usage real quick
    • russ 1 day ago
      We had our playground (https://playground.livekit.io) up for a few days using our key. Def racked up a $$$$ bill!
      • wordpad25 1 day ago
        How much is it per minute of talking?
        • russ 1 day ago
          50% human speaking at $0.06/minute of tokens

          50% AI speaking at $0.24/minute of tokens

          we (LiveKit Cloud) charge ~$0.0005/minute for each participant (in this case there would be 2)

          So blended is $0.151/minute
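
          (Working: 0.5 × $0.06 + 0.5 × $0.24 + 2 × $0.0005 = $0.03 + $0.12 + $0.001 = $0.151 per conversation-minute.)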

        • shayps 1 day ago
          It shakes out to around $0.15 per minute for an average conversation. If history is a guide though, this will get a lot cheaper pretty quickly.
          • cdolan 1 day ago
            This is cheaper than old cellular calls, inflation adjusted