Yet a week ago I used Claude Code for my personal finances (not taxes) - I downloaded over a year’s worth of my bank account data. Since I pay for most things by card, if I buy lunch, it’s there.
With a single prompt (and about 10 minutes), it produced an analysis. It solved all the technical issues by itself (e.g., realizing it wasn’t CSV but TSV) and ran quite a few different explorations with Pandas. It was able to write an overview, find items that were likely misclassified, etc.
Everything I checked by hand was correct.
So, instead of pursuing a project to write an AI tool for personal finance, I ended up concluding: “just use Claude Code.” As a side note, I used 14 months of data by mistake - I wanted to analyze only 2 months, since I didn’t believe it could handle a larger set, but I misclicked the year. The file was 350 KB.
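The "realizing it wasn't CSV but TSV" step is easy to reproduce by hand. A minimal sketch (with made-up sample rows standing in for the bank export) that sniffs the delimiter instead of assuming a comma:

```python
import csv
import io

# Hypothetical sample standing in for the bank export:
# it looked like CSV but was actually tab-separated (TSV).
raw = (
    "date\tdescription\tamount\n"
    "2024-03-01\tLunch\t-12.50\n"
    "2024-03-02\tGroceries\t-43.10\n"
)

# Detect the delimiter from the data rather than assuming ','.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t")
rows = list(csv.reader(io.StringIO(raw), dialect))
header, records = rows[0], rows[1:]

# Quick sanity check: sum the amount column.
total = sum(float(r[2]) for r in records)
print(dialect.delimiter == "\t", round(total, 2))  # True -55.6
```

The same detected delimiter could then be passed to `pandas.read_csv(..., sep=dialect.delimiter)` for the kind of exploratory analysis described above.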
"Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. ... Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set."
Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.
I feel like we are already there. I would imagine that if you set Claude Code or Codex on this task, running in the CLI, you would see a huge improvement - and that is before you start adding task-specific guardrails.
I’m surprised they haven’t tried this; I’m running my own return in parallel against my accountant in exactly this way.
The problem is that they do not understand what or how to calculate, not the actual act of adding or multiplying. I tried asking ChatGPT to calculate taxes for three countries, in two of which I already file taxes. For those two, ChatGPT gave wildly wrong numbers (not even the right ballpark), so I knew I could not trust its numbers for the third, which was the one I was actually interested in.
Am I missing something or did they only assess this on Google and Anthropic models? If so, all I can ascertain from this is that latest Gemini models outperformed Claude on this particular task, which should be surprising to no-one. What about GPT-5? Open weight models?
If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
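For what a calculator tool could look like: a minimal, hedged sketch of a safe arithmetic evaluator the model could call instead of doing the math in-context (the function name and the sample expression are illustrative, not from the article):

```python
import ast
import operator

# Map AST operator nodes to their arithmetic implementations.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval"))

# Illustrative marginal-bracket style arithmetic (made-up numbers).
print(calculate("0.22 * (95000 - 44725) + 5147"))
```

Exposing something like this as a tool call would at least take raw arithmetic slips off the table, leaving the harder question of whether the model picked the right numbers to plug in.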