Launch HN: Canary (YC W26) – AI QA that understands your code

Hey HN! We're Aakash and Viswesh, and we're building Canary (https://www.runcanary.ai). We build AI agents that read your codebase, figure out what a pull request actually changed, and generate and execute tests for every affected user workflow.

Aakash and I previously built AI coding tools at Windsurf, Cognition, and Google. AI tools were making every team faster at shipping, but nobody was testing real user behavior before merge. PRs got bigger, reviews still happened in file diffs, and changes that looked clean broke checkout, auth, and billing in production. We saw it firsthand. We started Canary to close that gap. Here's how it works:

Canary starts by connecting to your codebase to understand how your app is built: routes, controllers, validation logic. When you push a PR, Canary reads the diff, infers the intent behind the changes, then generates and runs tests against your preview app, checking real user flows end to end. It comments directly on the PR with test results and recordings that show what changed and flag anything that doesn't behave as expected. You can also trigger specific user workflow tests via a PR comment.

Beyond PR testing, tests generated from the PR can be moved into regression suites. You can also create tests by just prompting what you want tested in plain English. Canary generates a full test suite from your codebase, schedules it, and runs it continuously. One of our construction tech customers had an invoicing flow where the amount due drifted from the original proposal total by ~$1,600. Canary caught the regression in their invoice flow before release.
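As a rough illustration of the kind of invariant behind that invoicing catch (not the customer's actual test; the function names, tolerance, and figures below are made up for the example), a regression check could compare the amount due against the original proposal total:

```python
from decimal import Decimal


def invoice_drift(proposal_total: Decimal, amount_due: Decimal) -> Decimal:
    """Signed difference between what the invoice asks for and what was proposed."""
    return amount_due - proposal_total


def invoice_matches_proposal(
    proposal_total: Decimal,
    amount_due: Decimal,
    tolerance: Decimal = Decimal("0.00"),
) -> bool:
    """True when the amount due stays within `tolerance` of the proposal total."""
    return abs(invoice_drift(proposal_total, amount_due)) <= tolerance


# Hypothetical figures: a 10,000.00 proposal whose invoice shows 11,600.00 due.
proposal = Decimal("10000.00")
invoice = Decimal("11600.00")
if not invoice_matches_proposal(proposal, invoice):
    print(f"Regression: amount due drifted by {invoice_drift(proposal, invoice)}")
```

Using `Decimal` rather than floats keeps currency comparisons exact, so a zero tolerance is meaningful.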

This isn't something a single family of foundation models can do on its own. QA spans too many modalities (source code, DOM/ARIA, device emulators, visual verification, screen recordings, network/console logs, live browser state, etc.) for any single model to specialize in all of them. You also need custom browser fleets, user sessions, ephemeral environments, on-device farms, and data seeding to run the tests reliably. On top of that, catching second-order effects of code changes requires a specialized harness that tries to break the application in multiple ways across different types of users, which a normal happy-path testing flow wouldn't do.

To measure how well our purpose-built QA agent works, we published QA-Bench v0, the first benchmark for code verification. Given a real PR, can an AI model identify every affected user workflow and produce relevant tests? We tested our purpose-built QA agent against GPT 5.4, Claude Code (Opus 4.6), and Sonnet 4.6 across 35 real PRs on Grafana, Mattermost, Cal.com, and Apache Superset on three dimensions: Relevance, Coverage, and Coherence. Coverage is where the gap was largest: Canary leads by 11 points over GPT 5.4, 18 over Claude Code, and 26 over Sonnet 4.6. For the full methodology and per-repo breakdowns, give our benchmark report a read: https://www.runcanary.ai/blog/qa-bench-v0

You can check out the product demo here: https://youtu.be/NeD9g1do_BU

We'd love feedback from anyone working on code verification or thinking about how to measure this differently.

56 points | by Visweshyc 18 hours ago

10 comments

  • blintz 17 hours ago
    I really want automated QA to work better! It's a great thing to work on.

    Some feedback:

    - I definitely don't want three long new messages on every PR. Max 1, ideally none? Codex does a great job just using emoji.

    - The replay is cool. I don't make a website, so maybe I'm not the target market, but I'd like QA for our backend.

    - Honestly, I'd rather just run a massive QA run every day, and then have any failures bisected, rather than per-PR.

    - I am worried that there's not a lot of value beyond the intelligence of the foundation models here.

    • monkpit 4 hours ago
      Isn’t the last point the case with every AI startup? Nobody has a moat and it’s tough to build one because the playing field is so level.
    • Bnjoroge 16 hours ago
      Agree on your last point and it's going to be a very bitter lesson. In any case, you probably wanna shift a lot of the code verification as far left as possible, so doing review at PR time isn't the right strat imo. And Claude/Codex are well positioned to do the local review.
    • Visweshyc 16 hours ago
      Thanks for the feedback!

      - Agreed that the form factor can be condensed with a link to detailed information

      - With the codebase understanding, backend is where we are looking to expand and provide value

      - The intelligence of the models does lay out the foundation but combining the strength of these models unlocks a system of specialized agents that each reason about the codebase differently to catch the unknown unknowns
  • pastescreenshot 5 hours ago
    The interesting question to me is not whether the system can generate a plausible PR-time test, but whether the useful ones survive after the PR is gone. If Canary catches a real regression, how often can that check be promoted into a stable long-lived regression test without turning into a flaky, environment-coupled browser script? That conversion rate feels closer to the real moat than the generation demo.
    • Visweshyc 3 hours ago
      Good point. To keep the regression tests reliable as the app evolves, we run a reliability cascade. First, we generate and execute deterministic Playwright tests from the codebase. If execution fails, we fall back to the DOM and ARIA tree. If that still fails, we fall back to vision agents that verify what the user actually sees before flagging a drift in the application behavior.
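The fallback order described above can be sketched roughly as follows. This is a minimal illustration rather than Canary's implementation: `StepResult`, `run_cascade`, and the strategy names are hypothetical, and each strategy is stubbed as a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    passed: bool
    detail: str = ""


# A strategy is a (name, callable) pair; the callable runs one verification
# attempt and returns a StepResult. The names used below are hypothetical.
Strategy = Tuple[str, Callable[[], StepResult]]


def run_cascade(strategies: List[Strategy]) -> Tuple[Optional[str], List[str]]:
    """Try strategies in order and stop at the first one that passes.

    Returns the winning strategy name (None if every strategy failed, which
    is the point where a drift would be flagged) and the names attempted.
    """
    attempted: List[str] = []
    for name, check in strategies:
        attempted.append(name)
        try:
            result = check()
        except Exception:
            # A crashed strategy is treated like a failed one: fall through.
            continue
        if result.passed:
            return name, attempted
    return None, attempted


def playwright_script() -> StepResult:
    # Stub: pretend a selector changed and the deterministic script broke.
    raise RuntimeError("selector not found")


def dom_aria_lookup() -> StepResult:
    # Stub: pretend the element was found via its ARIA role instead.
    return StepResult(True, "matched by ARIA role")


def vision_check() -> StepResult:
    # Stub: not reached in this example.
    return StepResult(True, "matched visually")


winner, tried = run_cascade([
    ("playwright", playwright_script),
    ("dom_aria", dom_aria_lookup),
    ("vision", vision_check),
])
print(winner, tried)
```

Ordering the strategies from cheapest and most deterministic to most expensive means the vision step only runs when the earlier checks cannot resolve the element.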
  • recsv-heredoc 14 hours ago
    The market timing on this is perfect - it fills a major current gap I've seen emerging.

    I've heard a few stories of QA departments being near-burnout due to the increased rate developers are shipping at these days. Even we're looking for any available QA resources we can pull in here.

    No harm meant with the question - but what's the advantage over Claude Code + the GitHub integrations?

    • Visweshyc 13 hours ago
      We evaluated test generation using Claude Code and our purpose-built harness and measured the quality of tests in catching the unknown unknowns. We noticed Claude Code misses the second-order effects that actually break applications. You also need infrastructure to execute the tests: browser fleets, ephemeral environments, and data seeding all need to be handled.
  • warmcat 17 hours ago
    Good work. But what makes this different than just another feature in Gemini Code Assist or GitHub Copilot?
    • Visweshyc 15 hours ago
      Thanks! To execute these tests reliably you would need custom browser fleets, ephemeral environments, data seeding and device farms
      • mikestorrent 4 hours ago
        If that's what you guys are bringing, you should put that more up front; focus on making it clear you're providing ingredients that Claude et al will not be providing on their own without Real Actual Software to do it.
        • Visweshyc 2 hours ago
          Fair feedback. Will make that clearer. Appreciate it
  • solfox 16 hours ago
    Not a direct competitor but another YC company I use and enjoy for PR reviews is cubic.dev. I like your focus on automated tests.
    • Visweshyc 16 hours ago
      Thanks! We believe executing the scenarios and showing what actually broke closes the loop
  • Bnjoroge 16 hours ago
    what kinds of tests does it generate and how's this different from the tens of code review startups out there?
    • Visweshyc 15 hours ago
      The system focuses on going beyond the happy path and generating edge case tests that try to break the application. For example, a Grafana PR added visual drag feedback to query cards. The system came up with an edge case like: does drag feedback still work when there's only one card in the list, with nothing to reorder against?
  • solfox 17 hours ago
    Looks interesting! Looks like perhaps no support for Flutter apps yet?
    • Visweshyc 16 hours ago
      Yes, we currently support web apps but plan to extend the foundation to test mobile applications on device emulators.
  • vivzkestrel 6 hours ago
    - there are at least 10 dozen code review startups at this point and I see a new one on YC every week

    - what is your differentiator?

    • Visweshyc 2 hours ago
      We see this as different from review. The system generates tests to catch second-order effects and executes them against the live application to expose bugs