6 comments

  • stared 27 minutes ago
    A bare model may lack a lot.

    Yet a week ago I used Claude Code for my personal finances (not taxes) - I downloaded over a year’s worth of my bank account data. Since I pay for most things by card, if I buy lunch, it’s there.

    With a single prompt (and about 10 minutes), it produced an analysis. It solved all the technical issues by itself (e.g., realizing it wasn’t CSV but TSV) and ran quite a few different explorations with Pandas, roughly along the lines of the sketch at the end of this comment. It was able to write an overview, find items that were likely misclassified, etc.

    Everything I checked by hand was correct.

    So, instead of pursuing a project to write an AI tool for personal finance, I ended up concluding: “just use Claude Code.” As a side note, I used 14 months of data by mistake - I had only wanted to analyze 2 months, since I didn’t believe it would handle a larger set, but I misclicked the year. The file was 350 KB.
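
    To give a flavour of those explorations, they amounted to something like this minimal sketch - assuming a tab-separated export with date, description, amount, and category columns (the column names are placeholders; real bank exports vary):

      import pandas as pd

      # Assumed export format: tab-separated, one row per card transaction.
      # Column names are illustrative; real bank exports differ.
      df = pd.read_csv("transactions.tsv", sep="\t", parse_dates=["date"])

      # Monthly spend per category.
      df["month"] = df["date"].dt.to_period("M")
      monthly = df.pivot_table(index="month", columns="category",
                               values="amount", aggfunc="sum").round(2)
      print(monthly)

      # Crude check for likely misclassifications: the same merchant
      # description mapped to more than one category.
      cats_per_desc = df.groupby("description")["category"].nunique()
      suspect = df[df["description"].isin(cats_per_desc[cats_per_desc > 1].index)]
      print(suspect.sort_values("description")[["date", "description", "amount", "category"]])

    Nothing clever - the point is that Claude Code wrote and iterated on this kind of thing on its own.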

  • ofrzeta 3 hours ago
    "Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. ... Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set."

    Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.
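
    For context, the bracket arithmetic at the core of the quoted task is tiny; it is the “vast amounts of English text” that determines the inputs. A minimal sketch of the marginal-rate calculation, using what I believe are the 2023 single-filer thresholds (from memory - verify against the IRS tables):

      # Progressive federal tax on taxable income for a single filer.
      # Thresholds are 2023 figures from memory; treat them as illustrative.
      BRACKETS = [            # (upper bound of bracket, marginal rate)
          (11_000, 0.10),
          (44_725, 0.12),
          (95_375, 0.22),
          (182_100, 0.24),
          (231_250, 0.32),
          (578_125, 0.35),
          (float("inf"), 0.37),
      ]

      def federal_tax(taxable_income: float) -> float:
          tax, lower = 0.0, 0.0
          for upper, rate in BRACKETS:
              if taxable_income <= lower:
                  break
              tax += (min(taxable_income, upper) - lower) * rate
              lower = upper
          return round(tax, 2)

      print(federal_tax(60_000))   # 8507.50 with these thresholds

    The hard part the article points at is everything before this step: working out from the rules what counts as taxable income in the first place.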

  • Rudybega 1 hour ago
    I wonder if you could dramatically improve these results with some relatively simple scaffolding and tool access.

    If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
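
    The plumbing for that is small. A minimal sketch of what the tool side could look like - the schema mirrors the JSON-Schema style the major tool-use APIs accept, but the field names and the evaluator are assumptions, not any particular vendor's exact API:

      import ast
      import operator as op

      # Tool description the harness would advertise to the model.
      CALCULATOR_TOOL = {
          "name": "calculator",
          "description": "Evaluate a basic arithmetic expression and return the result.",
          "input_schema": {
              "type": "object",
              "properties": {
                  "expression": {"type": "string",
                                 "description": "e.g. '(95375 - 44725) * 0.22'"},
              },
              "required": ["expression"],
          },
      }

      _OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
              ast.Div: op.truediv, ast.USub: op.neg}

      def run_calculator(expression: str) -> float:
          """Safely evaluate the +, -, *, / expressions the model sends back."""
          def ev(node):
              if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                  return node.value
              if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
                  return _OPS[type(node.op)](ev(node.left), ev(node.right))
              if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
                  return _OPS[type(node.op)](ev(node.operand))
              raise ValueError("unsupported expression")
          return float(ev(ast.parse(expression, mode="eval").body))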

    • sails 1 hour ago
      I feel like we are already there. I would imagine that if you set Claude Code or Codex this task, running in the CLI, you would see a huge improvement, and that is before you start adding task-specific guardrails.

      I’m surprised they haven’t tried this; I’m running my own return this way, in parallel with my accountant.

    • Lionga 1 hour ago
      The problem is that they do not understand what or how to calculate, not the actual act of adding or multiplying. I tried asking ChatGPT to calculate taxes for three countries, in two of which I have already been filing taxes. For the two I know, ChatGPT gave wildly wrong numbers (not even in the right ballpark), so I knew I could not trust its numbers for the third, which was the one I was mostly interested in.
  • hodgehog11 1 hour ago
    Am I missing something, or did they only assess this on Google and Anthropic models? If so, all I can ascertain is that the latest Gemini models outperformed Claude on this particular task, which should surprise no one. What about GPT-5? Open-weight models?
  • anticensor 2 hours ago
    Meanwhile, almost every other country tries to make filing taxes easier, even when the underlying tax schedule is complex.