Yet a week ago I used Claude Code for my personal finances (not taxes) - I downloaded over a year’s worth of my bank account data. Since I pay for most things by card, if I buy lunch, it’s there.
With a single prompt (and about 10 minutes), it produced an analysis. It solved all the technical issues by itself (e.g., realizing it wasn’t CSV but TSV) and ran quite a few different explorations with Pandas. It was able to write an overview, find items that were likely misclassified, etc.
Everything I checked by hand was correct.
So, instead of pursuing a project to write an AI tool for personal finance, I ended up concluding: “just use Claude Code.” As a side note, I used 14 months of data by mistake - I wanted to analyze only 2 months, since I didn’t believe it could handle a larger set, but I misclicked the year. The file was 350 KB.
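The "realizing it wasn't CSV but TSV" step is easy to reproduce by hand. A minimal sketch (with made-up sample rows standing in for the bank export) that sniffs the delimiter instead of assuming a comma:

```python
import csv
import io

# Hypothetical sample standing in for the bank export:
# it looked like CSV but was actually tab-separated (TSV).
raw = (
    "date\tdescription\tamount\n"
    "2024-03-01\tLunch\t-12.50\n"
    "2024-03-02\tGroceries\t-43.10\n"
)

# Detect the delimiter from the data rather than assuming ','.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t")
rows = list(csv.reader(io.StringIO(raw), dialect))
header, records = rows[0], rows[1:]

# Quick sanity check: sum the amount column.
total = sum(float(r[2]) for r in records)
print(dialect.delimiter == "\t", round(total, 2))  # True -55.6
```

The same detected delimiter could then be passed to `pandas.read_csv(..., sep=dialect.delimiter)` for the kind of exploratory analysis described above.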
"Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. ... Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set."
Unsurprisingly. Sometimes I feel like I am in a madhouse. Or in an alchemist's laboratory.
I feel like we are already there. I would imagine that if you set Claude Code or Codex on this task, running in the CLI, you would see a huge improvement - and that is before you start adding task-specific guardrails.
I’m surprised they haven’t tried this; I’m running my own return in parallel against my accountant in exactly this way.
The problem is that they do not understand what or how to calculate, not the actual act of adding or multiplying. I tried asking ChatGPT to calculate taxes for three countries, in two of which I already file taxes. For those two, ChatGPT gave wildly wrong numbers (not even the right ballpark), so I knew I could not trust its numbers for the third, which was the one I was actually interested in.
Am I missing something or did they only assess this on Google and Anthropic models? If so, all I can ascertain from this is that latest Gemini models outperformed Claude on this particular task, which should be surprising to no-one. What about GPT-5? Open weight models?
If a ton of these mistakes are genuinely simple calculation errors, it seems like giving the models access to a calculator tool would help a fair bit.
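For what a calculator tool could look like: a minimal, hedged sketch of a safe arithmetic evaluator the model could call instead of doing the math in-context (the function name and the sample expression are illustrative, not from the article):

```python
import ast
import operator

# Map AST operator nodes to their arithmetic implementations.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval"))

# Illustrative marginal-bracket style arithmetic (made-up numbers).
print(calculate("0.22 * (95000 - 44725) + 5147"))
```

Exposing something like this as a tool call would at least take raw arithmetic slips off the table, leaving the harder question of whether the model picked the right numbers to plug in.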