A2UI: A Protocol for Agent-Driven Interfaces

(a2ui.org)

132 points | by makeramen 9 hours ago

28 comments

  • awei 3 hours ago
    I see how useful a universal UI language working across platforms would be, but when I look at some examples from this protocol, I get the feeling it will eventually converge to what we already have: HTML. Instead of making all platforms support this new universal markup language, why not make them support HTML, which some already do, and which LLMs are already trained on?

    Some examples from the documentation:

      {
        "id": "settings-tabs",
        "component": {
          "Tabs": {
            "tabItems": [
              {"title": {"literalString": "General"}, "child": "general-settings"},
              {"title": {"literalString": "Privacy"}, "child": "privacy-settings"},
              {"title": {"literalString": "Advanced"}, "child": "advanced-settings"}
            ]
          }
        }
      }

    { "id": "email-input", "component": { "TextField": { "label": {"literalString": "Email Address"}, "text": {"path": "/user/email"}, "textFieldType": "shortText" } } }

    • epec254 3 hours ago
      A key challenge with HTML is client-side trust. How do I enable an agent platform (say Gemini, Claude, OpenAI) to render UI from an untrusted third-party agent that's integrated with the platform? This is a common scenario in the enterprise versions of these apps - e.g. I want to use the agent from (insert SaaS vendor) alongside my company's home-grown agents and data.

      Most HTML is actually HTML+CSS+JS - IMO, accepting this is a code injection attack waiting to happen. By abstracting to JSON, a client can safely render UI without this concern.

      • lunar_mycroft 2 hours ago
        If the JSON protocol in question supports arbitrary behaviors and styles, then you still have an injection problem even over JSON. If it doesn't support them, you don't need to support them in an HTML-based protocol either, and you can solve the injection problem the way we already do: by sanitizing the HTML to remove all or some (depending on your specific requirements) script tags, event listeners, etc.
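
        As a rough sketch of that allowlist-style sanitization (assuming DOMPurify; the tag and attribute lists are illustrative, not anything from the A2UI spec):

          import DOMPurify from "dompurify";

          // Strip scripts, styles, frames and event handlers from agent-supplied
          // HTML, keeping only a small allowlist of structural and form elements.
          function sanitizeAgentHtml(dirty: string): string {
            return DOMPurify.sanitize(dirty, {
              ALLOWED_TAGS: ["div", "span", "p", "ul", "li", "label", "input", "button"],
              ALLOWED_ATTR: ["type", "value", "placeholder", "name", "disabled"],
              ALLOW_DATA_ATTR: false, // no data-* attributes either
            });
          }
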
      • epicurean 2 hours ago
        Perhaps the protocol is then HTML/CSS/JS in a strict sandbox: the component has no access to anything outside its own bounds (no network, no DOM/object access, no draw access, etc.).
        • awei 1 hour ago
          I think you can do that with an iframe, but it always makes me nervous
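
          A rough sketch of that approach, assuming the component markup arrives as a string (the sandbox flags are just the standard HTML ones, nothing A2UI-specific):

            // Render untrusted component markup in a locked-down iframe: an empty
            // sandbox attribute disables scripts, forms, popups and same-origin access.
            // Fully blocking subresource loads would additionally need a restrictive CSP.
            function renderInSandbox(componentHtml: string): HTMLIFrameElement {
              const frame = document.createElement("iframe");
              frame.setAttribute("sandbox", "");   // grant no capabilities at all
              frame.setAttribute("referrerpolicy", "no-referrer");
              frame.srcdoc = componentHtml;        // inline document, no navigation
              return frame;
            }
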
      • awei 3 hours ago
        Right, this makes sense. I wonder if it would then be a good idea to abstract HTML into JSON, making it impossible to include CSS and JS in it.
        • epec254 3 hours ago
          Curious to learn more about what you are thinking?

          One challenge is that you likely do want JS to process/capture the data - for example, taking the data from a form and turning it into JSON to send back to the agent.
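
          Concretely, that capture step might look something like this on the client side (sendToAgent is a hypothetical transport call, not a real A2UI API):

            // Client-owned handler: serialize a rendered form and hand the values
            // back to the agent as JSON, so no agent-supplied script needs to run.
            declare function sendToAgent(message: unknown): void; // hypothetical transport

            function captureForm(form: HTMLFormElement): Record<string, string> {
              const data: Record<string, string> = {};
              new FormData(form).forEach((value, key) => { data[key] = String(value); });
              return data;
            }

            const form = document.querySelector<HTMLFormElement>("#email-form")!; // id is illustrative
            form.addEventListener("submit", (event) => {
              event.preventDefault();
              sendToAgent({ type: "userAction", values: captureForm(form) });
            });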

        • oooyay 2 hours ago
          If you play with A2UI's generator, that's effectively what it does, just a layer of abstraction or two above what you're describing.
          • awei 2 hours ago
            That's what I thought too, skimming through the documentation. My thinking is that since it does that anyway, which makes sense to avoid script injection, why not do it with "jsonized" HTML?
  • codethief 8 hours ago
    > A2UI lets agents send declarative component descriptions that clients render using their own native widgets. It's like having agents speak a universal UI language.

    (emphasis mine)

    Sounds like agents are suddenly able to do what developers have failed at for decades: Writing platform-independent UIs. Maybe this works for simple use cases but beyond that I'm skeptical.

    • observationist 13 minutes ago
      Nope, it's just a repackaging of the same problem, except in this case the problem is solved with APIs and CLIs, not by jumping through hoops to get the AI to do what humans do.

      It's about accomplishing a task, not making a bot accomplish a task using the same tools and embodiment context as a human. There's no upside unless the bot actually has a humanoid embodiment, and even then, using a CLI and service API is going to be preferable to working through a UI in nearly every possible case, except where you want to limit the bot to human-ish capabilities, like with gaming, or you want to deceive any monitors into thinking that a human is operating.

      It's going to be infinitely easier to wrap a JSON get/push layer around existing APIs or automation interfaces than to universalize some sort of GUI interaction, because LLMs don't have the realtime memory you need to adapt to all the edge cases on the fly. It's incredibly difficult even for humans: hundreds of billions of dollars have been spent trying to make software universally accessible and dumbed down for users, and it still ends up being either stupidly limited or fractally complex in the tail, and no developer can ever account for all the possible ways users interact with a feature in any moderately complex piece of software.

      Just use existing automation patterns. This is one case where if an AI picks up this capability alongside other advances, then awesome, but any sort of middleware is going to be a huge hack that immediately gets obsoleted by frontier models as a matter of course.

    • giancarlostoro 44 minutes ago
      I've thought about how to write a platform independent UI framework that doesn't care what language you write it in, and every time I find myself reinventing X.org or at least my gut tells me I'm just reinventing a cross-platform X server implementation.
    • rockwotj 8 hours ago
      This isn't the right way to look at it. It's really server-side rendering where the LLM is doing the markup generation instead of a template. The custom UI is usually higher level. Airbnb has been doing this for years: https://medium.com/airbnb-engineering/a-deep-dive-into-airbn...
    • hurturue 3 hours ago
      Platform-independent UIs exist - HTML and Electron.
      • kridsdale3 1 hour ago
        Sure. HTML is a markup language (it's in the acronym). Markdown is also a markup language. LLMs are super good at Markdown, and just about every chatbot frontend now has a renderer built in.

        A2UI is a superset, expanding into more element types. If we're going to have the origin of all our data streams be string-output generators, this seems like an OK way to go.

        I've joined an effort inside Google to work in this exact space. What we're doing has no plan to become open source, but other groups are working on stuff like A2UI and we collaborate with them.

        My career previous to this was nearly 20 years of native platform UI programming and things like Flutter, React Native, etc have always really annoyed me. But I've come around this year to accept that as long as LLMs on servers are going to be where the applications of the future live, we need a client-OS agnostic framework like this.

    • mentalgear 6 hours ago
    It still needs language-specific renderer libraries [1] (and no SvelteKit support has even been announced yet :( ).

      [1] https://a2ui.org/renderers/

      • ddrdrck_ 4 hours ago
        Well, it is open source and they expect the community to add more renderers. So if you are a SvelteKit specialist, this could actually be an opportunity.
        • epec254 3 hours ago
          Plus 1! We’d love community contributions here!
  • verdverm 2 minutes ago
    Am I reading step (7) of the data flow correctly?

    1. Establish SSE connection

    ... user event

    7. send updates over origin SSE connection

    So the client is required to maintain an SSE capable connection for the entire chat session? What if my network drops or I switch to another agent?

    Seems like an onerous requirement to maintain a connection for the lifetime of a session, which can span days (as some people have told us they have done with agents).
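
    For reference: a plain EventSource client reconnects automatically after a drop and resends the last event ID, though the server still has to hold per-session state to replay missed updates. A sketch of the client side, with endpoint and handler names made up rather than taken from the spec:

      declare function applyUpdate(update: unknown): void; // hypothetical renderer hook

      // The browser reconnects EventSource automatically and sends a
      // Last-Event-ID header, so a server that keeps session state can
      // replay whatever UI updates were missed during the outage.
      const source = new EventSource("/a2ui/session/abc123/events"); // hypothetical endpoint

      source.onmessage = (event) => {
        const update = JSON.parse(event.data); // JSON component update streamed by the agent
        applyUpdate(update);
      };

      source.onerror = () => {
        // Fires while the browser retries; a multi-day gap may still require
        // re-establishing the session at the application level.
        console.warn("SSE connection dropped, retrying");
      };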

  • mbossie 8 hours ago
    So there's MCP-UI, OpenAI's ChatKit widgets and now Google's A2UI, that I know of. And probably some more...

    How many more variants are we going to introduce to solve the same problem? Sounds like a lot of wasted man-hours to me.

    • MrOrelliOReilly 8 hours ago
      I agree that it's annoying to have competing standards, but when dealing with a lot of unknowns it's better to allow divergence and exploration. It's a worse use of time to quibble over the best way to do things when we have no meaningful data yet to justify any decision. Companies need freedom to experiment on the best approach for all these new AI use cases. We'll then learn what is great/terrible in each approach. Over time, we should expect and encourage consolidation around a single set of standards.
      • pscanf 8 hours ago
        > when dealing with a lot of unknowns it's better to allow divergence and exploration

        I completely agree, though I'm personally sitting out all of these protocols/frameworks/libraries. In 6 months' time, half of them will have been abandoned, and the other half will have morphed into something very different and incompatible.

        For the time being, I just build things from scratch, which–as others have noted¹–is actually not that difficult, gives you understanding of what goes on under the hood, and doesn't tie you to someone else's innovation pace (whether it's higher or lower).

        ¹ https://fly.io/blog/everyone-write-an-agent/

        • kridsdale3 1 hour ago
          I recently heard that when automobiles were new, the USA quickly ended up with around 80 competing manufacturers. Within a couple of decades, the market figured out what customers actually wanted and which styles and features mattered, and the competitive ecosystem consolidated down to 5 brands.

          The same happened with GPUs in the 90s. When Jensen formed Nvidia there were 70 other companies selling Graphics Cards that you could put in a PCI slot. Now there are 2.

    • mystifyingpoi 6 hours ago
      > Sounds like a lot of wasted manhours to me

      Sounds like a lot of people got paid because of it. That's a win for them. It wasn't their decision; it was the company's decision to take part in the race. Most likely there will be more than one winner anyway.

      • kridsdale3 1 hour ago
        I'm one of these people. We have to start working on the problem many months before the competition announces that they exist, so we are all just doing parallel evolution here. Everyone agrees that sitting and waiting for a standard means you wouldn't waste energy, but you'd also have no influence.

        Like you mentioned, it's a good time to be employed.

    • hobofan 5 hours ago
      MCP-UI and OpenAI Apps are converging into the MCP Apps extension specification: https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-ap...
    • p_v_doom 5 hours ago
      We should make one new standard for everyone to use ...
  • pedrozieg 7 hours ago
    We’ve had variations of “JSON describes the screen, clients render it” for years; the hard parts weren’t the wire format, they were versioning components, debugging state when something breaks on a specific client, and not painting yourself into a corner with a too-clever layout DSL.

    The genuinely interesting bit here is the security boundary: agents can only speak in terms of a vetted component catalog, and the client owns execution. If you get that right, you can swap the agent for a rules engine or a human operator and keep the same protocol. My guess is the spec that wins won’t be the one with the coolest demos, but the one boring enough that a product team can live with it for 5-10 years.
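
    A rough sketch of what that boundary can look like on the client (the catalog and message shape are illustrative, loosely modelled on the JSON examples upthread rather than on the actual spec):

      // The client only ever instantiates components from its own vetted catalog;
      // any component name the agent sends that isn't in the catalog is rejected,
      // never rendered.
      type ComponentMessage = { id: string; component: Record<string, unknown> };

      const CATALOG = new Set(["Text", "TextField", "Button", "Tabs", "Card"]); // client-owned allowlist

      function vetComponent(message: ComponentMessage): string {
        const names = Object.keys(message.component);
        if (names.length !== 1 || !CATALOG.has(names[0])) {
          throw new Error(`Rejected unknown or malformed component in ${message.id}`);
        }
        return names[0]; // safe to dispatch to the client's own renderer for this type
      }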

  • wongarsu 7 hours ago
    I wouldn't want this anywhere near production, but for rapid prototyping this seems great. People famously can't articulate what they want until they get to play around with it. This lets you skip right to the part where you realize they want something completely different from what was first described, without having to build the first iteration by hand.
    • turnsout 2 hours ago
      Honestly the point of this is not to help app developers—it's to replace the need for apps altogether.

      The vision here is that you can chat with Gemini, and it can generate an app on the fly to solve your problem. For the visualized landscaping app, it could just connect to landscapers via their Google Business Profile.

      As an app developer, I'm actually not even against this. The amount of human effort that goes into creating and maintaining thousands of duplicative apps is wasteful.

      • verdverm 12 minutes ago
        This sounds like the creators think that even more duplicative apps, where no one knows how they work or what the code even looks like... is a better idea?

        How many times are users going to spin GPUs to create the same app?

  • jadelcastillo 1 hour ago
    I think this is a good and pragmatic way to approach the use of LLM systems: translate to an intermediate language, then process it further symbolically. But you can probably still be prompt-injected if you expose sensitive "tools" to the LLM.
  • alexgotoi 2 hours ago
    So we're reinventing SOAP but for AI agents. Not saying that's bad - sometimes you need to remake old mistakes before you figure out what actually works.

    The real question: do UIs even make sense for agents? Like the whole point of a UI is to expose functionality to humans with constraints (screens, mice, attention). Agents don't have those constraints. They can read JSON, call APIs directly, parse docs. Why are we building them middleware to click buttons?

    I think this makes sense as a transition layer while we figure out what agent-native architecture looks like. But long-term it's probably training wheels.

    Will include this in my https://hackernewsai.com/ newsletter.

    • kridsdale3 1 hour ago
      The need here is that at some point an agent has to produce output that is consumed by a human with eyes. A pixel grid on a screen is a far higher-bandwidth way to send information to a human than a linear string of text.
  • tasoeur 8 hours ago
    In an ideal world, people would have implemented UI/UX accessibility from the start, and a lot of these problems would already be solved. But one can also hope that the motivation to get agents running on these things actually brings a lot of accessibility features to newer apps.
  • uptownhr 3 hours ago
    My approach/prototype using XState with websockets from an MCP server https://github.com/uptownhr/mcp-agentic-ui
  • oddrationale 3 hours ago
    Seems similar to [Adaptive Cards](https://adaptivecards.io/). Both have a JSON-based UI builder system.
  • ceuk 5 hours ago
    A few days ago I was predicting to some colleagues a revival of ideas around "server-driven UI" (which never really seemed to catch on) in order to facilitate agentic UIs.

    Feels good to have been on the money, but I'm also glad I didn't start a project only to be harpooned by Google straight away

    • kridsdale3 1 hour ago
      Server-driven UI has absolutely caught on. Not even counting all the Electron apps out there, things like Instagram's native mobile apps have about half of their screens rendered via SDUI at this point, because Meta needs to be able to change them instantly, not on a 3-week release cycle.
  • qsort 8 hours ago
    This is very interesting if used judiciously. I can see many use cases where I'd want interfaces to be drawn dynamically (e.g. charts for business intelligence).

    What scares me is that even without arbitrary code generation, there's the potential for hallucinations and prompt injection to hit hard if a solution like this isn't sandboxed properly. An automatically generated "confirm purchase" button like in the example shown is... probably not something I'd leave entirely unsupervised just yet.
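
    One mitigation a client could layer on top, sketched here (the action names are made up, and nothing in the protocol mandates this):

      // Client-side policy: agent-generated buttons may request actions, but
      // anything on the sensitive list gets a native confirmation dialog that
      // the agent can neither draw nor suppress.
      const SENSITIVE_ACTIONS = new Set(["confirmPurchase", "sendPayment", "deleteAccount"]);

      function onAgentAction(action: string, execute: () => void): void {
        if (SENSITIVE_ACTIONS.has(action) &&
            !window.confirm(`The agent wants to run "${action}". Allow?`)) {
          return; // user declined; window.confirm stands in for a trusted, client-owned dialog
        }
        execute();
      }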

  • barbazoo 3 hours ago
    This sounds like a way to have the LLM client render dynamic UI. Is this for use during the chat session or yet another way to build actual applications?
    • epec254 3 hours ago
      Google PM here. Right now, it's designed for rendering UI widgets inline with a chat conversation - it's an extension to A2A that lets you stream JSON defining UI components in addition to chat messages.
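
      Roughly, that means one stream interleaving two kinds of payloads, something like this (field names are illustrative, not the actual A2A/A2UI wire format):

        // Ordinary chat text and declarative UI component definitions arrive on
        // the same stream; the client renders the latter with its own native widgets.
        type ChatMessage = { kind: "chat"; text: string };
        type UiMessage = { kind: "ui"; id: string; component: Record<string, unknown> };
        type StreamMessage = ChatMessage | UiMessage;

        declare function appendToTranscript(text: string): void;    // hypothetical chat renderer
        declare function renderComponent(message: UiMessage): void; // hypothetical widget renderer

        function handle(message: StreamMessage): void {
          switch (message.kind) {
            case "chat":
              appendToTranscript(message.text);
              break;
            case "ui":
              renderComponent(message);
              break;
          }
        }
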
      • kridsdale3 1 hour ago
        Google SWE working in this space here. Look up my username (minus the digit) on Moma, let's talk. I can't ID you from your HN handle.
  • iristenteije 6 hours ago
    I think GenUI can ultimately be integrated into apps more seamlessly, but even if today it lives mostly in the context of chat interfaces and prompts, I think it's clear that a wall of text isn't always the best UX or output, and that alone is already a win.
  • jy14898 8 hours ago
    I never want to unknowingly use an app that's driven this way.

    However, I'm happy it's happening because you don't need an LLM to use the protocol.

  • zwarag 3 hours ago
    Could this be the link that allows designers to design a UI in Figma and let an agent build it via A2UI?
  • _pdp_ 7 hours ago
    I am a fan of using Markdown to describe the UI.

    It is simple, effective, and feels more native to me than some rigid data structure designed for very specific use cases that may not fit your own problem well.

    Honestly, we should think of Emacs when working with LLMs and try to apply the same philosophy. I am not a fan of Emacs per se, but the parallels are there: everything is a file, and everything is text in a buffer. The text can be rendered in various ways depending on the consumer.

    This is also the philosophy we use in our own product, and it works remarkably well for a diverse set of customers. I have not encountered anything that cannot be modelled this way. It is simple, effective, and it allows for a great degree of flexibility when things are not going as well as planned. It works well with streaming too (streaming parsers are not so difficult to write for simple text structures, and we have been doing this for ages), and LLMs are trained very well to produce this type of output - versus anything custom that has not been seen or adopted yet by anyone.
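
    As a toy illustration of the streaming point, a line-buffered parser over a simple text structure is only a few lines (a generic sketch, not our product's actual format):

      // Accumulate streamed chunks and emit complete lines as they arrive;
      // whatever text convention sits on top (Markdown-ish blocks, key: value
      // lines, etc.) can then be handled line by line without waiting for the end.
      function makeLineParser(onLine: (line: string) => void): (chunk: string) => void {
        let buffer = "";
        return (chunk: string) => {
          buffer += chunk;
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? ""; // keep the trailing partial line for the next chunk
          for (const line of lines) onLine(line);
        };
      }

      const feed = makeLineParser((line) => console.log("render:", line));
      feed("## Order su");
      feed("mmary\n- 2 x coffee\n- 1 x ba");
      feed("gel\n");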

    Besides, given that LLMs are getting good at coding and the browser can render iframes in seamless mode, a better and more flexible approach would be to use HTML, CSS and JavaScript, instead of what Slack has been doing for ages with their Block Kit API, which we know is very rigid and frustrating to work with. I get why you might want a data structure for UI in order to cover CLI tools as well, but at the end of the day browsers and CLIs are completely different things, and I do not believe you can meaningfully make it work for both unless you are also prepared to dumb it down and target only the lowest common denominator.

  • evalstate 8 hours ago
    I quite like the look of this one - seems to fit somewhere between the rigid structure of MCP Elicitations and the freeform nature of MCP-UI/Skybridge.
  • raybb 8 hours ago
    Is there a standard protocol for the way things like Cline sometimes give you multiple choice buttons to click on? Or how does that compare to something like this?
  • mentalgear 6 hours ago
    The way to do this would be to come together and design a common W3C-like standard.
  • lowsong 7 hours ago
    > A2UI lets agents send declarative component descriptions that clients render using their own native widgets. It's like having agents speak a universal UI language.

    Why the hell would anyone want this? Why on earth would you trust an LLM to output a UI? You're just asking for security bugs, UI impersonation attacks, terrible usability, and more. This is a nightmare.

    • vidarh 7 hours ago
      If done in chat, it's just an alternative to talking to you freeform. Consider Claude Code's multiple-choice questions, which you can trigger by asking it to invoke the right tool, for example.
      • DannyBee 6 hours ago
        None of the issues go away just because it's in chat?

        Freeform looks and acts like text, except for a set of things that someone vetted and made work.

        If the interactive diagram or UI you click on now owns you, it doesn't matter if it was inside the chat window or outside the chat window.

        Now, in this case, it's not arbitrary UI. But if you believe that the parsing/validation/rendering/two-way data binding/incremental composition (the spec requires that you be able to build up UI incrementally) of these components: https://a2ui.org/specification/v0.9-a2ui/#standard-component...

        as transported/rendered/etc. by NxM combinations of implementations (there are 4 renderers and a bunch of transports right now) is not going to have security issues, I've got a bridge to sell you.

        Here, I'll sell it to you in Gemini - just click a few times on the "totally safe text box" for me before you sign your name.

        My friend once called something a babydoggle - something you know will be a boondoggle, but is still in its small formative stages.

        This feels like a babydoggle to me.

        • vidarh 4 hours ago
          > None of the issues go away just because it's in chat?

          There is a vast difference in risk between me clicking a button provided by Claude in my Claude chat, on the basis of conversations I have had with Claude, and clicking a random button on a random website. Both can be malicious; one is substantially higher risk. Separately, linking a UI constructed this way up to an agent and letting third parties interact with it is much riskier to you than to them.

          > If the interactive diagram or UI you click on now owns you, it doesn't matter if it was inside the chat window or outside the chat window.

          In that scenario, the UI elements are irrelevant barring a buggy implementation (yes, I've read the rest, see below), since you can achieve the same thing by just presenting the user with a basic link and telling them to press it.

          > as transported/rendered/etc. by NxM combinations of implementations (there are 4 renderers and a bunch of transports right now) is not going to have security issues, I've got a bridge to sell you.

          I very much doubt we'll see many implementations that won't just use a web view for this, and I very much doubt these issues will even fall in the top 10 security issues people will run into with AI tooling. Sure, there will be bugs. You can use this argument against anything that requires changes to client software.

          But if you're concerned about the security of clients, MCP and hooks are a far bigger rat's nest of things that are inherently risky due to the way they are designed.

  • empath75 4 hours ago
    I couldn't get this to work with the default model because it's overloaded, but I tried flash-lite, which at least gave me a response. Even then, it only presented an actual UI about a third of the times I tried the suggested questions in the demo; otherwise it attempted to ask me a question that didn't present a UI at all or even do anything in the app - I had to look at the logs to see what it was trying to do.
  • nsonha 6 hours ago
    What's agent/AI-specific about this? It seems like just backend-driven UI.
  • mannanj 2 hours ago
    Instead of being told "here's what I think you want to see, now look at it", I want to be asked "what do you want to see?" and then be shown that.

    Yes, yes, we claim the user doesn't know what they want. I think that's largely used as an excuse to avoid rethinking how things should meet the user's needs, and to keep the status quo where people are made to rely on systems and walled gardens. The goal of this article is that UIs should work better for the user. What better way than to let them imagine what they want (or even nudge them with example actions, buttons, and text to click to render specific views) in the UI! I've been wanting to build something where I just ask in English for options I know I have, or otherwise play around and hit edges to discover what's possible and what's not.

    Anyone else thinking along this direction or think I’m missing something obvious here?