This article’s framing and title are odd. The author, in fact, found no bugs or errors in the proven code. She says so at the end of the article:
> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.
Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.
It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.
Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.
Second, there was a bug in the code. Maybe not a functional correctness bug, but I, along with many and most end users, would consider a crashing program buggy. Maybe we just have different tastes or different standards on what we consider an acceptable level of software quality.
W.r.t people running Lean in production, you'd be surprised...
We're not speaking about bugs in a verified system so much as writing articles making specific claims about that. Surely if we're at the level of precision of formal verification, it's incumbent upon us to be precise about the nature of a problem with it, no? "Lean proved this program correct and then I found a bug" heavily implies a flaw in the proof, not a flaw in the runtime (which to my mind would also be a compelling statement, for the reasons you describe).
Or it implies a bug in the specification. The spec differing from the intent is a frequent source of bugs, it doesn't matter what language the spec is written in. Most people have experience with (or have seen news stories about) specification bugs in natural-language specifications: legal loopholes!
This is the biggest risk with the rejuvenated interest in formal proof. That LLMs can generate proofs is useful. Proof assistants that can check them (Lean/FStar/Isabelle/...) similarly so.
But it just moves the question to whether the theorems covered in the proof are sufficient. Underlying it all is a simple question:
Does the system meet its intended purpose?
To which the next question is:
What is the intended purpose?
Describing that is the holy grail of requirements specification. Natural language, behaviour-driven development, test-driven development and a host of other approaches attempt to bridge the gap between implicit purpose and explicit specification. Proof assistants are another tool in that box.
It's also one of the key motivators for iterative development: putting software in front of users (or their proxies) is still the primary means of validation for a large class of systems.
None of which is implied criticism of any of those approaches. Equally, none completely solves the problem. There is a risk that formal proofs, combined with proof assistants, are trumpeted as "the way" to mitigate the risk that LLM-developed apps don't perform as intended.
They might help. They can show that code is correct with respect to some specification, and that the specification is self-consistent. They cannot prove that the specification is complete with regards its intended purpose.
Sorry, I'm not sure I follow. We are talking about bugs in a verified system, that is, in this case, a verified implementation of a zlib-based compression tool. Did it have bugs? yes. Several in fact. I'd recommend reading the article for a detailed listing of the bugs in the tool.
> The most substantial finding was a heap buffer overflow! but, not in lean-zip's code, but in the Lean runtime itself. (emphasis mine)
> The OOM denial-of-service is straightforward: the archive parser was never verified. (me again)
"Lean proved this program correct" vs "I found bugs in the parts that Lean didn't prove was correct". The failure was not in Lean's proof (which as I said is heavily implied by the title), but in ancillary or unverified code.
Do I misunderstand how Lean works? I am by no means an expert (or even an amateur) on Lean or formal systems in general. Surely the first class of bug could be found in any code that uses the framework, and the second class of bug could happen to any system that isn't proven? Does that make sense? Otherwise where's the boundary? If you find a bug in the OS, does that mean Lean failed somehow? How about the hardware? If your definition of a 'formally verified system' goes beyond the code being verified and the most important thing is whether or not you can make it crash, then the OS and the hardware are also part of the system.
Of course bugs are important to users regardless of the cause, but the audience for your article seems to be software engineers and as a software engineer, your article was interesting and you found something useful, but the title was misleading; that's all I'm saying.
Further to my earlier reply - a more succinct way to look at it might be:
- When they fix the run time, bug A goes away. So the proof still holds and the zlib code is still correct.
- When they add a system of proofs for the parser and modify that, then bug B goes away. So the proof still holds and the zlib code is still correct; and now more of the library is proven correct.
The formulation of the title is "I was told X but that's not true"... but that's not true. You were told X, and X is true, but you found Y and Z which are also important.
There are two different answers to this question, and which one is "correct" depends entirely on the context of who is asking it.
1. It's the code that is specific to this program that sits above the run-time layer (internal view, that most programmers would take).
2. It's the code in the binary that is executed (external view, that most users would take).
The key question does not seem to be "was the proof correct", rather "did the proof cover everything in the program". The answer depends on whether you are looking at it from the perspective of a programmer, or a user. Given the overly strong framing that the article is responding to - highlighting the difference in this way does seem to be useful. The title is correct from the perspective that most users would take.
Yes but, without wishing to be snarky, did you read the article? There is no program as such, in either sense - the announcement from Lean only mentions "a C compression library" (zlib). Not only that, but since we're talking about formal verification, a programmer would likely understand that that is about proving a bounded, specific codebase at source code level, and not operating on a binary along with its associated dependencies (again caveat my limited understanding of these things).
My feeling is that if you told the average non-technical user that some person/organisation had produced a formally verified version of a C compression library, you would likely get a blank look, so I think it's reasonable to assume that both Lean's intended audience, and the audience of the blog post linked here, correspond with number 1. in your list.
I am pretty sure you could tell a teenager "there's a ZIP compression program that's scientifically proven to have no bugs" and they'd understand you. People don't have to be CS experts to understand that. (Technically it's Gzip but that's mostly irrelevant to understanding the claim here)
The article describes fuzzing the library, this execution requires a program to be compiled. Typically fuzzing involves a minimal harness around the payload (a single call into the library in this case). There is clearly a bug in this program, and it does not exist in the minimal harness. It must be in the library code, which was covered by the proof.
The bounded, specific codebase that you refer to is typically the library *and all of its dependencies*, which in this case includes the Lean runtime. This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations. In this case the original gushing claim that everything was verified is incorrect and premature. The article seems like a good exposition of why.
Thank you, I understand what fuzzing is; that test harness was presumably provided either by the blog post author or generated by Claude somehow, and therefore would not have been part of the proven code, nor part of the original claim by the Lean devs. That's what I meant by saying there is no program as such.
> The bounded, specific codebase that you refer to is typically the library and all of its dependencies, which in this case includes the Lean runtime.
How does that work? I thought the main idea is to write code in the Lean language which has some specific shape conducive to mathematical analysis, along with mathematical proofs that operate on that code. How then does a system like this handle a third party dependency? I've searched around and I can't find any information about how it works. I assumed that the boundary of the proof was the source code - surely they can't also be proving things like, say, DirectX?
> This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations.
The difficulty is not in explosions of computational complexity due to problems of incompleteness, decidability, the halting problem, those kinds of things? As I said this is not something I know much about but it's surprising to me if 'analysing all the code' is really the difficult bit.
There are standard convex assumptions to handle incompleteness, decidability etc, i.e. the results are an over-approximation that terminates. Picking a approximation that is precise enough in the properties that you care about is part of the challenge, but it is an in-band problem. There are no hard edges between the theory and reality.
As with most engineering problems the out-of-band issues tend to be the hardest to solve. Models of the part underneath the interesting part need to be complete/accurate enough to make the results useful. Compare it to crypto where people do not usually try to break the scheme - they try to break the specific implementation of the scheme because the weakest points will be at the interface between the theoretical construction and the actual concrete instantiation of the device that it will run on.
Gentle reminder about this excerpt from HN Guidelines:
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
Say you study some piece of software. And it happens that it has an automated suite of tests. And it happens that some files aren't covered by the test suite. And you happen to find a bug in one of those files that were not covered.
Would you publish a blog post titled "the XXX test suite proved there was no bug. And then I found one"?
These were my thoughts as well and it's nothing new, I think, regarding testing altogether:
- testing libraries (and in this case - language itself) can have bugs
- what is not covered by tests can have bugs
Additionally would add that tests verify the assumptions of coder, not expectations of the business.
To give benefit to the author - I'd read the article as: having tests for given thing ensures that it does the thing that you built the tests for. This doesn't mean that your application is free of bugs (unless you have 100% coverage, can control entire state of the system, etc) nor that it does the right thing (or that it does the thing the right way)
I like to differentiate between coverage by lines and semantic coverage: sometimes you need to exercise a single line multiple times to get full semantic coverage, and better semantic coverage usually beats larger line coverage for detecting problems and preventing regressions.
Yes, mutation testing and similar techniques like fuzzing can help with it, but sometimes you want to be more deterministic: there are usually a lot of hidden side effects that are not obvious, and probably a source of majority of software bugs today.
Eg. something as simple as
function foo(url) {
data = fetch(url);
return data;
}
has a bunch of exceptions that can happen in fetch() not covered with your tests, yet you can get 100% line coverage with a single deterministic happy path test.
Basically, any non-functional side-effect behavior is a source of semantic "hiding" when measuring test coverage by lines being executed, and there is usually a lot of it (logging is another common source of side-effects not tested well). Some languages can handle this better with their typing approaches, but ultimately, there will be things that can behave differently depending on external circumstances (even bit flips, OOMs, full disk...) that were not planned for and do not flow from the code itself.
You're technically right, but what things are versus how they're promoted or understood by most people (rightfully or not) often diverges, and therefore such "grounding" articles are useful, even if the wording addresses the perceived rather than the factual reality.
By way of analogy, if there was an article saying "I bought a 1Tb drive and it only came with 0.91 terabits", I think if you started explaining that technically the problem is the confusion between SI vs binary units and the title should really be "I bought a 0.91 terabit drive and was disappointed it didn't have more capacity", technically you'd be right, but people would rightfully eyeroll at you.
To be clear, I think the article was fine and the author did some useful work (finding a bug in the runtime of a supposedly provably correct system is indeed a valuable contribution!). I don't agree that it's pedantic to explain why the title feels like a bait-and-switch (and thus does a disservice to the article itself). It's just a bit of feedback for future reference.
I take some comfort from being technically correct though; it's widely accepted by all of us pedants that that is the best kind of correct ;-)
> Repeating myself, when we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
Yes, and that would be relevant if this was a verified software system. But it wasn't: the system consisted of a verified X and unverified Y, and there were issues in the unverified Y.
The article explicitly acknowledges this: "The two bugs that were found both sat outside the boundary of what the proofs cover."
1/ lean-zip is open source so it's much easier to have more Claude's eyes looking at it
2/ I don't think Claude could prove anything substantial about the zip algorithm. That's what lean is for. On the other side, lean could not prove much about what's around the zip algorithm but Claude can be useful there.
> I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.
Reminds of what some people in the Rust community do: they fight how safe this is or not. I always challenge that the code is composed of layers, from which unsafe is going to be one. So yes, you are righ to say that. Unsafe means unsafe, safe means safe and we should respect the meaning of those words instead of twisting the meaning for marketing (though I must say I heard this from people in the community, not from the authors themselves, ever).
Totally agreed. For instance that's why sel4 just throws the whole binary at the proof engine. That takes any runtime (minimal in sel4's case) and compiler (not nearly as minimal) out of the TCB.
Hi Kiran, thanks for following up. FWIW, I enjoy your blog and your work. And I do think it was a valuable bug you found; also nice to see how quickly Henrik fixed it.
Say more about people running Lean in production. I haven’t run into any. I know of examples of people using Lean to help verify other code (Cedar and Aeneas being the most prominent examples), but not the actual runtime being employed.
I took a quick scan of lean-lang.org just now, and, other than the two examples I mentioned, didn’t see a single reference to anything other than proving math.
I’m sure you’re in the Lean Zulup, based on what you’ve been up to. Are you seeing people talk about anything other than math? I’m not, but maybe I’m missing it.
Yes, here's a concrete example: https://github.com/leanprover/SampCert This is an implementation of a verified sampler, in lean. Not an embedding in some other language. The implementation itself is in lean, and a python ffi is used to call into the verified implementation. I don't know if AWS is big enough for your standards, but here is at least one example. Besides that, I'm more reporting on the general vibe I have observed from numerous talks at AI4maths workshops at Neurips, at the DARPA AI4Math ExpMath kickoff, etc. People are considering Lean as a serious programming language. Maybe that's surprising to the mathematicians, but as a PL person, I find the language really nicely designed and I can understand why people want to write in it.
You’re right, I should have been more careful in my reference to AWS. No need to be snarky about it.
Let me rephrase: aside from that one example from a couple years ago, I haven’t seen any examples of production code written in Lean. I’d be very interested in being proven wrong, this isn’t something I desire, just what I’ve observed. Have you seen any others?
More generally, you implicitly make a good point: writing important libraries in Lean and calling in from another language is probably the most likely use case. So not programs / apps / binaries written in Lean, but small critical components.
> if the software you were running was advertised as formally verified as free of bugs.
Nobody should be advertising that. Even ignoring the possibility of bugs in the runtime, there could also be bugs in the verifier and bugs or omissions in the specification. Formally verified never means guaranteed to be free of bugs.
Yeah, the title made me think the author found a bug in the Lean kernel, thus making an invalid proof pass Lean's checks. The article instead uncovers bugs in the Lean runtime and lean-zip, but these bugs are less damning than e.g. the kernel, which must be trusted to be correct, or else you can't trust any proof in Lean.
When the Lean runtime has bugs, all Lean applications using the Lean runtime also have those bugs. I can’t understand people trying to make a distinction here. Is your intent to have a bug free application or to just show the Lean proof kernel is solid?? The latter is only useful to Lean developers, end users should only care about the former!
The intent is to have a proof of some proposition. The Lean runtime crashing doesn't stop the lean-zip developers from formally modelling zlib and proving certain correctness statements under this model. On the other hand, the Lean kernel having a bug would mean we may discover entire classes of proofs that were just wrong; if those statements were used as corollaries/lemmas/etc. for other proofs, then we'd be in a lot of trouble.
When I see a title transitioning from "Lean said this proof is okay" to "I found a bug in Lean", I'm intuitively going to think the author just found a soundness (or consistency) issue in Lean.
There are no Lean applications other than Lean. This is an important point most of the comments are missing. Lean is for proving math. Yes, you can use it for other things; but no, no one is.
Still good to have found, but drawing conclusions past “someone could cheat at proving the continuum hypothesis” isn’t really warranted.
A missing specification in the proof of lean-zip, a lean component, is a real problem to the philosophy and practice of software verification.
To illustrate, let's say you want to verify a "Hello world" program. You'd think a verification involves checking that it outputs "Hello, world!".
However, if a contractor or AI hands you a binary, what do you need to verify? You will need to verify that it does exactly print "Hello, world!", no more, no less. It should write to stdout not stderr. It shouldn't somehow hold a lock on a system resource that it can't clean up. It cannot secretly install a root-kit. It cannot try to read your credentials and send it somewhere. So you will need to specify the proof to a sufficient level of detail to capture those potential deviations.
Broadly, with both bugs, you need to ask a question: does this bug actually invalidate my belief that the program is "good"? And here you are pulling up a fact that the bug isn't found in the Lean kernel, which makes an assumption that there's no side-effect that bleeds over the abstraction boundary between the runtime and the kernel that affects the correctness of the proof; that safety guarantee is probably true 99.99% of the time - but if the bug causes a memory corruption, you'd be much less confident in that guarantee.
If you're really serious about verifying an unknown program, you will really think hard "what is missing from my spec"? And the answer will depend on things that are fuzzier than the Lean proof.
Now, pragmatically, there many ways a proof of correctness adds a lot of value. If you have the source code of a program, and you control the compiler, you can check the source code doesn't have weird imports ("why do I need kernel networking headers in this dumb calculator program?"), so the scope of the proof will be smaller, and you can write a small specification to prove it and the proof will be pretty convincing.
All in all, this is a toy problem that tells you : you can't verify what you don't know you should verify, and what you need to verify depends on the prior distribution of what the program is that you need to verify, so that conditional on the proof you have, the probability of correctness is sufficiently close to 1. There's a lesson to learn here, even if we deem Lean is still a good thing to use.
> A missing specification in the proof of lean-zip, a lean component, is a real problem to the philosophy and practice of software verification.
Every time someone makes this point, I feel obliged to point out that all alternatives to software verification have this exact same problem, AND many, many more.
To me, saying that there is a bug in the lean runtime means lean-zip has a bug is like saying a bug in JRE means a Java app that uses the runtime has a bug, even though the Java app code itself does have a bug. It seems like the author is being intentionally misleading about his findings.
No. It would be like finding a memory unsafe caused bug in a Java application that is due to a bug in the JRE. That would absolutely warrant a title like “I found memory unsafe bug in my Java code” when everyone expects Java code to be memory safe, which is analogous to the article in question.
I do not think you are completely grasping what you are talking about (what is a 'memory unsafe bug'?). Even in the example you give, that title would be literally wrong, as there will be no bug in your Java code; there would be a bug in the execution due to a deviation in the runtime executing your program.
>I think it's ambiguous and fair game for the idea of answering the question "if we write programs in this manner, will there be exploitable bugs?
You're strawmanning the original authors' argument. The creator of lean-zip said that they proved there are no implementation bugs in the lean-zip program. A bug in lean-runtime does not contradict this claim.
> obviously no one’s running any compiled Lean code in any kind of production hot path
Ignorant question: why not? Is there an unacceptable performance penalty? And what's the recommended way in that case to make use of proven Lean code in production that keeps the same guarantees?
You’re right, there is that one example. Feels like we’re in exception that proves the rule territory. But I’d be very interested in being proven wrong! This isn’t a desire of mine, just what I’ve seen. Do you have other examples?
Also, part of my confidence comes from both having been a professional programmer for decades, across many languages, and also having programmed in Lean. It’s a great language for math, perhaps the best choice right now. But as a general purpose language it’s incredibly quirky.
> It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to
Well, Lean is written in Lean, so I am pretty sure a runtime bug like this could be exploited to prove `False`. Yes, since the kernel is written in C++, technically it's not the part affected by this bug, but since you cannot use the Lean kernel without the Lean compiler, this difference does not matter.
Sure, but are you worried about someone cheating on their arXiv submission by exploiting a buffer overflow? It’s a real bug, it’s just not very important.
I’ve had similar experiences with code I’ve proven correct, although my issues were of the more common variety than the overflow issue - subtle spec bugs. (I think the post mentions the denial of service issue as related to this: a spec gap)
If you have a spec that isn’t correct, you can certainly write code that conforms to that spec and write proofs to support it. It just means you have verified a program that does something other than what you intended. This is one of the harder parts of verification: clearly expressing your intention as a human. As programs get more complex these get harder to write, which means it isn’t uncommon to have lean or rocq proofs for everything only to later find “nope, it has a bug that ultimately traces back to a subtle specification defect.” Once you’ve gone through this a few times you quickly realize that tools like lean and rocq are tricky to use effectively.
I kinda worry that the “proof assistants will fix ai correctness” will lead to a false sense of assurance if the specs that capture human intention don’t get scrutinized closely. Otherwise we’ll likely have lots of proofs for code that isn’t the code the humans actually intended due to spec flaws.
But that's not saying the proofs are an issue - usually the spec you can reasonably prove in lean or another prover, say TLA+ or Z3 depending on your kind of program - has to be overly simplified and have a lot of assumptions.
However, that is powerful.
It doesn't mean your program doesn't have bugs.
It means this big scary complicated algorithm you think works but are skeptical doesn't have bugs - so when you encounter one, you know the bug is elsewhere, and you start really looking at the boundaries of what could be misspecified, if the assumptions given to the prover are actually true, etc.
It eliminates the big scary thing everyone will think is the cause of the bug as the actual cause.
This has been insanely valuable to me lately. It is also something I never really was able to do before the help of AI - vibe coding proofs about my programs is IMO one of the killer apps of AI, since there aren't a ton of great resources yet about how to do it well since it is rarely done.
This surprises me. Formal verification so far has been a very niche thing apart from conventional type systems. I didn't think lack of vibe coding was much of a bottleneck in the past. Where do you use it?
The problem is implementing anything approximately twice is a hard sell… this is no longer true, though - TLA+ models are cheap now. You should be using them when writing any sort of distributed systems, which is basically everything nowadays.
Roughly anything that, say, has the complexity of a leetcode medium level problem that isn't already an extremely well known algorithm.
Any moderately complex thread safety thing with a few moving parts (e.g. there are multiple mutexes involved in various parts of the system, verify no deadlocks).
The lack of vibe coding has been a bottleneck for literally everything before.
When I see people say the hate vibe coding, I think "why do you hate formal verification? Because you could be spending your time on formal verification instead of removing "code smells" that don't hurt anything from vibe code."
Whenever I read an article about formal verification systems there is always that nagging thought in the back of my head. Why can you trust your formal verification system to be bug free but you can't trust the program. should not the chance of bugs be about equal in both of them?
You have a program that does something and you write another program to prove it. What assurance do you have that one program has fewer bugs then the other? Why can one program have bugs but the other can't? How do you prove that you are proving the right thing? It all sort of ties into Heisenberg's uncertainty theorem. A system cannot be fully described from within that system.
Don't get me wrong, I think these are great systems doing great work. But I always feel there is something missing in the narrative.
I think a more practical view is that a program is already a sort of proof. there is a something to be solved and the program provides a mechanism to prove it. but this proof may be and probably is incorrect, as bugs are fixed it gets more and more correct. A powerful but time consuming tool to try and force correctness is to build the machine twice using different mechanisms. Then mismatched output indicates something is wrong with one of them. and your job as an engineer is to figure out which one. This is what formal verification brings to the table. The second mechanism.
> Whenever I read an article about formal verification systems there is always that nagging thought in the back of my head. Why can you trust your formal verification system to be bug free but you can't trust the program. should not the chance of bugs be about equal in both of them?
A bug in the formal verification tool could be potentially noticed by any user of that formal verification tool. (And indirectly by any of their users noticing a bug about which they say "huh, I thought the tool told me that was impossible.")
A bug in your program can only be potentially noticed by you and your users.
There are also entirely categories of bugs that may not be relevant. For instance, if I'm trying to prove correctness of a distributed concurrent system and I use a model+verifier that verifies things in a sequential, non-concurrent way, then I don't have to worry about the prover having all the same sort of race conditions as my actual code.
But yeah, if you try to write your own prover to prove your own software, you could screw up either. But that's not what is being discussed here.
> It all sort of ties into Heisenberg's uncertainty theorem. A system cannot be fully described from within that system.
Surely you are talking about Godel incompleteness, not Heisenberg's uncertainty principle; in which case they're actually not the same system - the verification/proof language is more like a metalanguage taking the implementation language as its object.
(Godel's observation for mathematics was just that for formal number systems of sufficient power, you can embed that metalanguage into the formal number system itself.)
I think formal verification brings a bit more to the table. The logical properties are not just a second implementation. They can be radically simpler. I think quantifiers are doing a lot of work here (forall/exists). They are not usable directly in regular code. For example, you can specify that a shortest path algorithm must satisfy:
forall paths P from A to B:
len(shortest(A,B)) <= len(P)
That's much simpler than any actual shortest path algorithm.
AWS has said that formal verification enables their engineers to implement aggressive performance optimizations on complex algorithms without the fear of introducing subtle bugs or breaking system correctness. It helped double the performance of the IAM ACL evaluation code
Formal verification is just an extra step up from the static types that you might have in a language such as Rust.
Common static types prove many of the important properties of a program. If I declare a variable of type String then the type checker ensures that it is indeed a String. That's a proof. Formal verification takes this further and proves other properties such as the string is never empty.
Common static types are very effective. Many users of Rust or Haskell will claim that if a program compiles then it usually works correctly when they go to run it.
However there is a non-linear relationship between probability of program correctness and the amount of types required to achieve it. Being almost certain requires vastly more types than just being confident.
That's the real issue with formal verification, being 75% sure and having less code is better than being 99% sure in most situations, though if I were programming a radiotherapy machine I might think differently.
The chances of significant bugs in lean which lead to false answers to real problems are extremely small (this bug still just caused a crash, but is still bad). Many, many people try very hard to break Lean, and think about how proofs work, and fail. Is it foolproof? No. It might have flaws, it might be logic itself is inconsistent.
I often think of the ‘news level’ of a bug. A bug in most code wouldn’t be news. A bug which caused lean to claim a real proof someone cared about was true, when it wasn’t, would in the proof community the biggest news in a decade.
> should not the chance of bugs be about equal in both of them?
Even if it is, the verification is still very useful. The verifier is going to run for a few minutes and probably not hit many edge cases. The chance it actually hits a bug is low, and the chance a bug makes it wrongly accept your program is a lot lower. Especially if it has to output a proof at the end. Meanwhile it's scrutinizing every single edge case your program has.
The tension between spec bugs vs. implementation bugs is real. But i will take a bug in a situation where the implementation has been verified any day.
Working over what we really want is problem solving in the problem domain.
As apposed to going into the never ending implementation not-what-we-were trying to solve weeds.
I think we need a way to verify the specs. A combo of formal logic and adversarial thinking (probably from LLMs) that will produce an exhaustive list of everything the program will do, and everything it won’t do, and everything that is underspecified.
Still not quite sure what it looks like, but if you stipulate that program generation will be provable, it pushes the correctness challenge up to the spec (and once we solve that, it’ll be pushed up to the requirements…)
I agree. It’s kind of like secure boot, in reverse: the high level stuff has to be complete and correct enough that the next level down has a chance to be complete and correct.
Clickbait title, the proved part of the program had no bugs?
As an aside, why can't people just write factually? This isn't a news site gamed for ad revenue. It's also less effort. I felt this post was mostly an insulting waste of time. I come to HN to read interesting stuff.
Is it fair when it comes to formally-verified software to only consider bugs that violate a proof, and ignore everything else?
Formally-verified software is usually advertised "look ma no bugs!" Not "look ma no bugs*" *As long as this complicated chain of lemmas appropriately represents the correctness of everything we care about.
In boating theres often debate of right of way rules in certain situations, and some people are quick to point out that giant tanker ships should be giving way to tiny sailboats and get all worked up about it*. The best answer I've heard: they're dead right! that is to say as right as they are dead (if they didnt yield) lol. In the same vein, I think someone who assumed that a formally-verified software was perfect and got hacked or whatever is going to be a bit wiggly about the whole thing.
* = Technically the rules prioritize the tankers if they are "restricted in ability to maneuver" but everyone loves to argue about that.
Nobody with experience in the field advertises formally-verified software like that, and it is understood that the spec may as well be wrong. It is also understood that the non-verified parts may have bugs (surprise). There is no news here.
Unless "with experience in the field" == academia, disagree. In particular I remember the early discourse & hype around Wireguard, it was discussed as if perfection was an achieved outcome.
Yeah, extremely misleading title even if it is technically true semantically. The phrasing gives the impression that a bug was found in `lean-zip` as part of the proof boundary when it was part of the unverified archive-handling code.
The archive-handling code was in lean-zip, it just seems the verifiers forgot to write proofs for it (still a bug).
Thats not the main finding of the article however. The main bug found was actually in the lean runtime, affecting all proofs using scalar arrays where the size of the array is not bounded.
Also, he even just created the second bug out of thin air. There is no code reference, and the reason why he downplays it is because he knows that if someone looks into it, they will realize he misrepresented the actual code.
LLM's are capable of producing code that passes formal verification.
The writing is on the wall: in the future more and more software on the abstract or platonic side of our computing base will be hermetically sealed against bugs and exploits. This quenching of bugs in the assured side will shift the mean location of bugs closer to the hardware side: at some point bugs and exploits will rely more and more on hardware quirks, and simply unspecified hardware.
Afterwards we can expect a long exponential decay of preventable safety violations: people mistakenly or surreptitiously disengaging the formal verification steps and shipping malicious or unverified code. Each such event will be its own big or small scandal, at some point there will be no deniability left: something must be on purpouse, either a malicious vulnerability or intentional disengagement of safety measures.
As the attack surface recedes towards the lower level hardware stack, it will open the debate that the community needs proper formal hardware descriptions (at least at the interface initially, not necessarily how the hardware has implemented it). As interface bugs get formalized 3 things can happen:
either vulnerabilities go extinct, and full formal hardware descriptions are not released
or vulnerabilities remain in each new generation of hardware, and malicious intent or negligence on behalf of the manufacturer can only be presumed, this will set up the community against manufacturers, as they demand full hardware descriptions (verilog, VHDL,...).
or vulnerabilities are tactically applied (vulnerabilities appear extinct to the bulk of the population, but only because manufactured vulnerabilities are sparingly exploited by the manufacturing block)
It is hard to predict what is more valuable: embedding HW vulnerabilities for the status quo and being able to exploit it for while before the public demands full hardware IP descriptions (verilog, VHDL) etc. or facing the end of vulnerabilities a little sooner but keeping hardware IP private (if the bugs stop with full interface descriptions).
The spec-completeness problem here is the same one that bites distributed systems verification: the proof holds inside an operating envelope (no adversarial inputs, trusted runtime, bounded sizes), and the interesting failures live at the boundary. TLA+ has the same property - you can prove liveness under a fairness assumption the deployment silently violates, and nothing in the proof tells you when reality drifted outside.
What I'd actually want from the tooling is a machine-checkable statement of the envelope itself, propagated as a runtime guard rather than a compile-time comment. Then "proof holds" and "we are still inside the proof's domain" are two separate, observable properties, and the unverified-parser / unverified-runtime cases stop being invisible.
As someone who has discovered a bug in a CPU that was previously unknown to our chip vendor, I would like to point out that the rabbit hole is deep.
On the other hand, I've discovered thousands of bugs that weren't hardware bugs, and dozens of bugs due to people not having read hardware errata documents, so just formally modeling what we can model will absurdly reduce the bug quantity.
Am I reading the article wrong? It appears that the author did not test the claims of the proof. Wouldn't a "bug" in this case mean she found an input that did not survive a round trip through the compression algorithm?
Update: Actually, I guess this may have been her point: "The two bugs that were found both sat outside the boundary of what the proofs cover." So then I guess the title might be a bit click baity.
Hi! Author here. When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.
Secondly, I did find a bug in the algorithm. in Archive.lean, in the parsing of the compressed archive headers. That was the crashing input.
> I think it's fair to consider the entire binary a fair target.
Yes, it's still very much a bug. But it has nothing to do with your program being formally verified or not. Formal verification can do nothing about any unverified code you rely on. You would really need a formal verification of every piece of hardware, the operating system, the runtime, and your application code. Short of that, nobody should expect formal verification to ensure there are no bugs.
I read it as that’s also the point. Adding formal verification is not a strict defense against bugs. It is in a way similar to having 100% test coverage and finding bugs in your untested edge cases.
I don’t think the author is attempting to decry formal verification, but I think it a good message in the article everyone should keep in mind that safety is a larger, whole system process and bugs live in the cracks and interfaces.
You're right. It just seems as though it should be self-evident. Especially to those sophisticated enough to understand and employ formal verification.
It does seem that way doesn't it? But as software bugs are becoming easier to find and exploit, I'm expecting more and more people, including those not "sophisticated enough" to understand and employ formal verification to start using it
Then it would help to not introduce any confusion into the ecosystem by using a click-baity title that implies you found a bug which violated the formal specification.
> When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
Yeah, I would actually agree. We wouldn't want to advertise that a system is formally verified in some way if that creates a false sense of security. I was just pointing out that, by my reading, the title appears to suggest that the core mechanism of the Lean proof is somehow flawed. When I read the title, I immediately thought, "Oooh. Looks like someone demonstrated a flaw in the proof. Neat." But that's not what is shown in the article. Just feels a bit misleading is all.
sorry to hijack the thread. Really cool post. How long did the whole exercise including porting zlib to lean take?
i have a hard real time system that i would love to try this on, but that's a lot of tools to learn and unclear how to model distributed systems in lean.
also, please add rss so i could subscribe to your blog
Lean-zip was not my project but one by others in the lean community. I'm not sure about the methodological details of their process - you might want to check with the original lean-zip authors (https://github.com/kim-em/lean-zip)
Compression/decompression is a good problem for proof of correctness. The specification is very simple (you must get back what you put in), while the implementation is complex.
What seems to have happened here is that the storage allocator underneath is unverified. That, too, has a relatively simple spec - all buffers disjoint, no lost buffers, no crashes.
I'll probably get a lot of hate mail for this but here goes nothing... Despite what many people like to claim, you cannot prove that a program has no bugs. That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.
Before everyone starts blabbing about formal verification, etc., consider this: how do you know that you didn't make a mistake in your formal verification? IOW, how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
You can get asymptotically close to zero-bug proof, but you can never get there 100% of the way.
> That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.
You can conclusively (up my next point) prove a specific bug or class of bugs aren't there. But "entirely free of (all) bugs" is indeed a big misconception for what formal methods does.
> how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
It's another misconception of formal methods to say that any result is established conclusively, without any caveats whatsoever. But then again neither is mathematics, or any other intellectual discipline. What formal methods does is reduce the surface area where mistakes could reasonably be expected to reside. Trusting the Rocq kernel, or a highly scrutinized model of computation and language semantics, is much easier than trusting the totality of random unannotated code residing in the foggiest depths of your average C compiler, for instance.
What is up with people saying you cannot prove a negative? Of course you can! (At least in formal settings)
For example it's extremely easy to prove there is no square with diagonals of different lengths. I'm the hard end, Andrew Wiles proved Fermat's Last Theorem which expresses a negative.
That's just a nit though, you're right about the infinite regress problem.
I believe it is rooted in legal proceedings where you are usually not required to in principle because it can be hard and/or impossible.
Eg. it is extremely hard to prove you "weren't there" (eg. at a crime site) if you cannot easily prove you were somewhere else (an affirmative): we do not keep court-admissible record of our whereabouts in case we get suspected of being in a place we were not in.
So it does hold in a number of cases where keeping evidence is required for proof. In software, that evidence would be formal specs and test reports which prove that cases covered with those are indeed working as specced, but provide no proof outside those "specs" (loosely considering an automated test a spec too).
Also, what even is "a negative"? The following statements are equivalent:
"There are no squares with diagonals of different lengths"
"All squares have diagonals of equal lengths"
Similarly, I can rephrase the statement about the absence of bugs. These are equivalent:
"This program has no bugs"
"This program always does exactly what it is supposed to do"
If you think you can't prove the first statement, then go ahead and prove the second one.
Are people thinking of falsification when talking about "proving negatives"? I.e. you can only falsify statements about the physical world, never prove them.
"This program has no bugs"
"This program always does exactly what it is supposed to do"
I believe these are not the same in software: bugs are not just wrongly implemented requirements, but also missed requirements or constraints (one can claim these are new features, but the fact that ID looped around at 65536 is going to be called a bug by users).
> That means proving the absence of bugs, and you cannot prove a negative.
You can prove that the program implements a specification correctly. That doesn't require proving a negative, but it does prove the absence of bugs. (I think you know this argument is weak, since your next paragraph is a complete non-sequitur)
> Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
All proofs ultimately rest on axioms. That's normal and not really an argument unless you want to doubt everything, in which case what's the point in ever saying anything?
This comment conflates a number of ideas that obscures its own epistemological point. For example, yes, you can prove a negative (for instance, you can prove there is no largest integer).
The question of who validates the validator is real, but the abstraction of "formal verification" does serve a purpose, like any good mathematical or programming abstraction. Whole classes of bugs are removed; what's left to verify is usually much shorter.
This feels like a thought exercise rather than an argument.
By your logic, it's impossible to prove that a car is driving at 60mph. There could be an error in the speedometer which makes it impossible to verify that said car is going at the speed. You can get asymptomatically close to being sure that you're driving at 60 mph but you can never be 100% sure.
That's not correct. You can prove a car is driving 60mph as soon as you measure it doing that. "proving a negative" is for statements like "There are no purple zebra". You can never prove this because there is always the possibility the purple zebra is somewhere you haven't looked yet. As soon as you find one the statement becomes falsified, but until then it always remains unresolved even if almost certainly true.
Linking back to the parent statement, it's hard to prove a program has no bugs when there is always the possibility the bug just hasn't been found yet. On the flip side it's easy to prove something does have bugs as soon as you find one.
How do you know the one purple zebra wasn't just walking around in a way that meant they were always not where you were looking?
You can probabilistic say "it's extremely unlikely purple zebras exist" but you can never prove 100% they don't exist. And back to the real example, how can you prove there isn't a bug you just haven't found yet?
Even if all your claims are true (which I'm not 100% sold on but bear with me)... who cares? Sure, it won't get literally 100% of the way, but if we can get to a place where the only bugs in software are in the spec, in the proof assistants, and in weird hardware bugs, that would be an enormous win!
Tools like this (formal verification, sparse, etc.) and built-in mechanism (types, generics, RAII, rust's borrow checker) can only verify issues within the scope of that framework. There are also trade-offs and limitations with each type and each implementation.
Type checking allows you to (outside of type casting such as in languages like C/C++ and casting to object for generic containers in Java) verify that an object is of a given type. That allows you to be sure that a well-formed program isn't doing things like putting a random object in a list.
Languages like C#, Scala, and Kotlin improve Java generics by making the generic type of a container or other interface/type part of the type system. This allows generic types of a generic type to preserve the inner type. This makes it possible to implement things like monads and mapping functions to preserve the generic type.
A similar thing is possible with union types, sealed interfaces/traits, etc. that allow you to check and verify the return type instead of defaulting it to a generic object/any type.
Likewise with other features like nullable/non-null annotations (or corresponding nullable type annotations like in Kotlin and recent C# versions).
All of these can be abused/circumvented, but if you keep your code within that framework the compiler will stop that code compiling. Likewise, these solve a limited set of bugs. For example, nullable types can't verify memory management and related bugs.
Unfortunately, the discussion focused on the somewhat click baity title "proved this program correct". It's unclear what "this program" is. If it refers to the core algorithm with a proof, then there's no bug. If it includes the runtime and the header parser, then Lean didn't prove it correct.
That being said, using a coding agent to direct fuzzying and find bugs in the Lean kernel implementation is the big news here. (After all the kernel's implementation is not proved.)
The moral of the story is to push for more verified code not less and try AI bug hunting.
This is analogous to the fundamental problem of better automation in programming - eventually, the complexity and correctness of of the spec takes over, and if we don't manage that well, creating the spec is not that much less work than the programming part. If your program was for the wrong thing, a proof of it is also wrong.
everybody also ignores that even hello world isn't deterministic anymore. It just doesn't matter to execution if something broke unless it kicks back an error.
although, this is the best example of how quickly a trivality can knock so called "correct" programs over.
Fully agree. I started hitting this bottleneck when I combined a low-code backend I built with Claude Code to generate web applications.
I can build applications rapidly but the requirements and UX are the bottleneck. So much so that I often like to sit on a concept for multiple days to give myself the time to fully absorb the goal and refine the requirements. Then once I know what to build, it snaps together in like 4 hours.
There are a lot of ambiguities which need to be resolved ahead of time. Software engineering becomes a kind of detailed business strategy role.
I'm curious about your learning experience, but what was the nature of your bottleneck, exactly? Was the backend perfectly fine as a backend, but Claude struggled to wire it to a frontend gracefully?
This is very cool work but the author is labouring under a false premise about how axiomatic systems work:
> Every Lean proof assumes the runtime is correct.
No. Every valid Lean proof assumes that if the runtime/mathlib etc is correct, then it too is correct.
Tangentially also, most lean proofs are not dependent on whether or not the runtime has things like buffer overflows or denial of service against lean itself at all, because if I prove some result in Lean (without attacking the runtime) then a bug in the runtime doesn’t affect the validity of the result in general. It does mean however that it’s not ok to blindly trust a proof just because it only relies on standard axioms and has no “sorry”s. You also need to check that the proof doesn’t exploit lean itself.
So that's just one more win for formal verification despite the title it seems
I'm genuinely excited about AI agents and formal verification languages. To me it's obviously the way forward instead of moonshots trying to make agents that program in their own AI blackbox binary, or agents that code in current programming languages.
If we are heading in the direction of "huge codebases that nobody has written", or, "code is an artifact for the machine", I don't see a way out without making it proved.
If humans can review and edit the spec, then verify that the implementation matches it, suddenly leaving the implementation be an artifact for the machines seems okay
The downside of provers also being that they are a massive pain in the ass that very few want to use, this is also a complete win.
Nice work. Amusing that Lean's own standard library has a buffer overflow bug, which "leaked out" due to being exempted from the verification.
Regarding the DoS in the lean-zip application itself: I think this is a really neat example of the difficult problem of spec completeness, which is a subcase of the general problem (mentioned by porcoda in a sibling comment) of being sure that the spec is checking the right things. For a compression program, the natural, and I would also say satisfyingly beautiful thing to prove is that decomp(comp(x)) = x for all possible inputs x. It's tempting to think that at that point, "It's proven!" But of course the real world can call decomp() on something that has never been through comp() at all, and this simple, beautiful spec is completely silent on what will happen then.
This is a great reminder that ‘proved correct’ always has an invisible suffix: ‘with respect to the thing you actually specified.’ The hard part was never just proving things, it was pinning reality down tightly enough that the proof is about the right world.
Formal proofs can only ever work to validate against requirements... But a major issue with a lot of modern software is that the requirements themselves can be incorrect and some requirements can logically conflict with other requirements... So long as we have non-technical people formulating requirements, we can never have provably correct software.
I've come across issues in the past which weren't actually bugs. For example, the software was behaving exactly as intended but it looked like a bug to a user who didn't understand the nuances and complexities of what was happening.
I also cannot count how many times a non-technical person asked me to implement conflicting functionality or functionality which would have made the UX incredibly confusing for the user.
Alan Turing already proved with the Halting Problem that reasoning about program correctness is not possible. But we still try.
Wikipedia: [1] Turing proved no algorithm exists that always correctly decides whether, for a given arbitrary program and input, the program halts when run with that input. The essence of Turing's proof is that any such algorithm can be made to produce contradictory output and therefore cannot be correct.
> Alan Turing already proved with the Halting Problem that reasoning about program correctness is not possible.
This is so reductive a framing as to be essentially useless [0]. I think maybe you want to learn more about program correctness, formal verification, and programming language semantics before you make such statements in the future.
And Alonzo Church proved in 1940 that you can avoid this problem by using a typed language in which all programs halt. But sadly some programmers still resist this.
Some correct programs are supposed to run forever.
When is an OS supposed to halt? When you shut it down, or when you power down the hardware, and no other times. So if you don't do either of those things, then the OS is supposed to run forever. Does that, by itself, mean that the program is incorrect, or that the language is inadequate? No, it means that the definition is worthless (or at least worthless for programs like OSes).
Two different meanings of "forever" there. An OS runs for an arbitrarily large finite time, which is different from an infinite time.
Same way you can count to any finite integer with enough time, but you can never count to infinity.
Those kinds of interactive programs take in a stream of input events, which can be arbitrarily long, but eventually ends when the computer is shut down.
Termination checkers don't stop you from writing these interactive loops, they only stop non-interactive loops
you can still verify arbitrarily long running programs - there are instances of such software, such as sel4 (https://sel4.systems/) and certikos (https://flint.cs.yale.edu/certikos/), you simply model them as finite programs that run on an infinite stream of events.
> finite programs that run on an infinite stream of events
This requires coinduction, right? (That's my understanding of the formal representation of infinite streams.) If so, that does limit your options, since most of the proof assistants don't handle coinductive data, as I understand it.
> reasoning about program correctness is not possible
Not possible for all problems. We cannot decide correctness (ie adherence to a specification) for all programs, but we can definitely recognize a good chunk of cases (both positive and negative) that are useful.
The Halting Problem itself is recognizable. The surprising result of Turing’s work was that we can’t decide it.
We care only about a very small and narrow subset of possible programs, not any arbitrary one. It's possible to solve this kind of problem in large enough classes of program to be useful.
> The two bugs that were found both sat outside the boundary of what the proofs cover. The denial-of-service was a missing specification. The heap overflow was a deeper issue in the trusted computing base, the C++ runtime that the entire proof edifice assumes is correct.
Still an interesting and useful result to find a bug in the Lean runtime, but I’d argue that doesn’t justify the title. Or the claim that “the entire proof edifice” is somehow shaky.
It’s important to note that this is the Lean runtime that has a bug, not the Lean kernel, which is the part that actually does the verification (aka proving). [1] So it’s not even immediately clear what this bug would really apply to, since obviously no one’s running any compiled Lean code in any kind of production hot path.
[1] https://lean-lang.org/doc/reference/latest/Elaboration-and-C...
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.
Second, there was a bug in the code. Maybe not a functional correctness bug, but I, along with many and most end users, would consider a crashing program buggy. Maybe we just have different tastes or different standards on what we consider an acceptable level of software quality.
W.r.t people running Lean in production, you'd be surprised...
But it just moves the question to whether the theorems covered in the proof are sufficient. Underlying it all is a simple question:
Does the system meet its intended purpose?
To which the next question is:
What is the intended purpose?
Describing that is the holy grail of requirements specification. Natural language, behaviour-driven development, test-driven development and a host of other approaches attempt to bridge the gap between implicit purpose and explicit specification. Proof assistants are another tool in that box.
It's also one of the key motivators for iterative development: putting software in front of users (or their proxies) is still the primary means of validation for a large class of systems.
None of which is implied criticism of any of those approaches. Equally, none completely solves the problem. There is a risk that formal proofs, combined with proof assistants, are trumpeted as "the way" to mitigate the risk that LLM-developed apps don't perform as intended.
They might help. They can show that code is correct with respect to some specification, and that the specification is self-consistent. They cannot prove that the specification is complete with regards its intended purpose.
> The most substantial finding was a heap buffer overflow! but, not in lean-zip's code, but in the Lean runtime itself. (emphasis mine)
> The OOM denial-of-service is straightforward: the archive parser was never verified. (me again)
"Lean proved this program correct" vs "I found bugs in the parts that Lean didn't prove was correct". The failure was not in Lean's proof (which as I said is heavily implied by the title), but in ancillary or unverified code.
Do I misunderstand how Lean works? I am by no means an expert (or even an amateur) on Lean or formal systems in general. Surely the first class of bug could be found in any code that uses the framework, and the second class of bug could happen to any system that isn't proven? Does that make sense? Otherwise where's the boundary? If you find a bug in the OS, does that mean Lean failed somehow? How about the hardware? If your definition of a 'formally verified system' goes beyond the code being verified and the most important thing is whether or not you can make it crash, then the OS and the hardware are also part of the system.
Of course bugs are important to users regardless of the cause, but the audience for your article seems to be software engineers and as a software engineer, your article was interesting and you found something useful, but the title was misleading; that's all I'm saying.
- When they fix the run time, bug A goes away. So the proof still holds and the zlib code is still correct.
- When they add a system of proofs for the parser and modify that, then bug B goes away. So the proof still holds and the zlib code is still correct; and now more of the library is proven correct.
The formulation of the title is "I was told X but that's not true"... but that's not true. You were told X, and X is true, but you found Y and Z which are also important.
There are two different answers to this question, and which one is "correct" depends entirely on the context of who is asking it.
1. It's the code that is specific to this program that sits above the run-time layer (internal view, that most programmers would take).
2. It's the code in the binary that is executed (external view, that most users would take).
The key question does not seem to be "was the proof correct", rather "did the proof cover everything in the program". The answer depends on whether you are looking at it from the perspective of a programmer, or a user. Given the overly strong framing that the article is responding to - highlighting the difference in this way does seem to be useful. The title is correct from the perspective that most users would take.
My feeling is that if you told the average non-technical user that some person/organisation had produced a formally verified version of a C compression library, you would likely get a blank look, so I think it's reasonable to assume that both Lean's intended audience, and the audience of the blog post linked here, correspond with number 1. in your list.
The bounded, specific codebase that you refer to is typically the library *and all of its dependencies*, which in this case includes the Lean runtime. This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations. In this case the original gushing claim that everything was verified is incorrect and premature. The article seems like a good exposition of why.
> The bounded, specific codebase that you refer to is typically the library and all of its dependencies, which in this case includes the Lean runtime.
How does that work? I thought the main idea is to write code in the Lean language which has some specific shape conducive to mathematical analysis, along with mathematical proofs that operate on that code. How then does a system like this handle a third party dependency? I've searched around and I can't find any information about how it works. I assumed that the boundary of the proof was the source code - surely they can't also be proving things like, say, DirectX?
> This is why formal verification is difficult: the proof chain needs to extend all the way down to the foundations.
The difficulty is not in explosions of computational complexity due to problems of incompleteness, decidability, the halting problem, those kinds of things? As I said this is not something I know much about but it's surprising to me if 'analysing all the code' is really the difficult bit.
As with most engineering problems the out-of-band issues tend to be the hardest to solve. Models of the part underneath the interesting part need to be complete/accurate enough to make the results useful. Compare it to crypto where people do not usually try to break the scheme - they try to break the specific implementation of the scheme because the weakest points will be at the interface between the theoretical construction and the actual concrete instantiation of the device that it will run on.
> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
Would you publish a blog post titled "the XXX test suite proved there was no bug. And then I found one"?
It would be a bit silly, right?
But yes, it would be equivalent to stating "No tests failed but I found a bug" and one would correctly deduce that test coverage is insufficient.
- testing libraries (and in this case - language itself) can have bugs
- what is not covered by tests can have bugs
Additionally would add that tests verify the assumptions of coder, not expectations of the business.
To give benefit to the author - I'd read the article as: having tests for given thing ensures that it does the thing that you built the tests for. This doesn't mean that your application is free of bugs (unless you have 100% coverage, can control entire state of the system, etc) nor that it does the right thing (or that it does the thing the right way)
Eg. something as simple as
has a bunch of exceptions that can happen in fetch() not covered with your tests, yet you can get 100% line coverage with a single deterministic happy path test.Basically, any non-functional side-effect behavior is a source of semantic "hiding" when measuring test coverage by lines being executed, and there is usually a lot of it (logging is another common source of side-effects not tested well). Some languages can handle this better with their typing approaches, but ultimately, there will be things that can behave differently depending on external circumstances (even bit flips, OOMs, full disk...) that were not planned for and do not flow from the code itself.
By way of analogy, if there was an article saying "I bought a 1Tb drive and it only came with 0.91 terabits", I think if you started explaining that technically the problem is the confusion between SI vs binary units and the title should really be "I bought a 0.91 terabit drive and was disappointed it didn't have more capacity", technically you'd be right, but people would rightfully eyeroll at you.
I take some comfort from being technically correct though; it's widely accepted by all of us pedants that that is the best kind of correct ;-)
I agree, but it’s not fair to imply that the verification was incorrect if the problem lies elsewhere.
This is a nice example of how careful you have to be to build a truly verified system.
Yes, and that would be relevant if this was a verified software system. But it wasn't: the system consisted of a verified X and unverified Y, and there were issues in the unverified Y.
The article explicitly acknowledges this: "The two bugs that were found both sat outside the boundary of what the proofs cover."
1/ lean-zip is open source so it's much easier to have more Claude's eyes looking at it
2/ I don't think Claude could prove anything substantial about the zip algorithm. That's what lean is for. On the other side, lean could not prove much about what's around the zip algorithm but Claude can be useful there.
So in the end lean-zip is now stronger!
Reminds of what some people in the Rust community do: they fight how safe this is or not. I always challenge that the code is composed of layers, from which unsafe is going to be one. So yes, you are righ to say that. Unsafe means unsafe, safe means safe and we should respect the meaning of those words instead of twisting the meaning for marketing (though I must say I heard this from people in the community, not from the authors themselves, ever).
Say more about people running Lean in production. I haven’t run into any. I know of examples of people using Lean to help verify other code (Cedar and Aeneas being the most prominent examples), but not the actual runtime being employed.
I took a quick scan of lean-lang.org just now, and, other than the two examples I mentioned, didn’t see a single reference to anything other than proving math.
I’m sure you’re in the Lean Zulup, based on what you’ve been up to. Are you seeing people talk about anything other than math? I’m not, but maybe I’m missing it.
Let me rephrase: aside from that one example from a couple years ago, I haven’t seen any examples of production code written in Lean. I’d be very interested in being proven wrong, this isn’t something I desire, just what I’ve observed. Have you seen any others?
More generally, you implicitly make a good point: writing important libraries in Lean and calling in from another language is probably the most likely use case. So not programs / apps / binaries written in Lean, but small critical components.
Nobody should be advertising that. Even ignoring the possibility of bugs in the runtime, there could also be bugs in the verifier and bugs or omissions in the specification. Formally verified never means guaranteed to be free of bugs.
> ... converted zlib (a C compression library) to Lean, passed the test suite, and then proved that the code is correct.
> Not tested. Proved. For every possible input. lean-zip
When I see a title transitioning from "Lean said this proof is okay" to "I found a bug in Lean", I'm intuitively going to think the author just found a soundness (or consistency) issue in Lean.
Still good to have found, but drawing conclusions past “someone could cheat at proving the continuum hypothesis” isn’t really warranted.
To illustrate, let's say you want to verify a "Hello world" program. You'd think a verification involves checking that it outputs "Hello, world!".
However, if a contractor or AI hands you a binary, what do you need to verify? You will need to verify that it does exactly print "Hello, world!", no more, no less. It should write to stdout not stderr. It shouldn't somehow hold a lock on a system resource that it can't clean up. It cannot secretly install a root-kit. It cannot try to read your credentials and send it somewhere. So you will need to specify the proof to a sufficient level of detail to capture those potential deviations.
Broadly, with both bugs, you need to ask a question: does this bug actually invalidate my belief that the program is "good"? And here you are pulling up a fact that the bug isn't found in the Lean kernel, which makes an assumption that there's no side-effect that bleeds over the abstraction boundary between the runtime and the kernel that affects the correctness of the proof; that safety guarantee is probably true 99.99% of the time - but if the bug causes a memory corruption, you'd be much less confident in that guarantee.
If you're really serious about verifying an unknown program, you will really think hard "what is missing from my spec"? And the answer will depend on things that are fuzzier than the Lean proof.
Now, pragmatically, there many ways a proof of correctness adds a lot of value. If you have the source code of a program, and you control the compiler, you can check the source code doesn't have weird imports ("why do I need kernel networking headers in this dumb calculator program?"), so the scope of the proof will be smaller, and you can write a small specification to prove it and the proof will be pretty convincing.
All in all, this is a toy problem that tells you : you can't verify what you don't know you should verify, and what you need to verify depends on the prior distribution of what the program is that you need to verify, so that conditional on the proof you have, the probability of correctness is sufficiently close to 1. There's a lesson to learn here, even if we deem Lean is still a good thing to use.
Every time someone makes this point, I feel obliged to point out that all alternatives to software verification have this exact same problem, AND many, many more.
https://kirancodes.me/posts/log-who-watches-the-watchers.htm...
You can see the clickbaitiness increases over time.
You're strawmanning the original authors' argument. The creator of lean-zip said that they proved there are no implementation bugs in the lean-zip program. A bug in lean-runtime does not contradict this claim.
If they e.g. specially crafted a .zip file that caused it to flip bits via row-hammer like memory accesses, the same would be true.
Ignorant question: why not? Is there an unacceptable performance penalty? And what's the recommended way in that case to make use of proven Lean code in production that keeps the same guarantees?
Also, part of my confidence comes from both having been a professional programmer for decades, across many languages, and also having programmed in Lean. It’s a great language for math, perhaps the best choice right now. But as a general purpose language it’s incredibly quirky.
Well, Lean is written in Lean, so I am pretty sure a runtime bug like this could be exploited to prove `False`. Yes, since the kernel is written in C++, technically it's not the part affected by this bug, but since you cannot use the Lean kernel without the Lean compiler, this difference does not matter.
It's called clickbait.
If you have a spec that isn’t correct, you can certainly write code that conforms to that spec and write proofs to support it. It just means you have verified a program that does something other than what you intended. This is one of the harder parts of verification: clearly expressing your intention as a human. As programs get more complex these get harder to write, which means it isn’t uncommon to have lean or rocq proofs for everything only to later find “nope, it has a bug that ultimately traces back to a subtle specification defect.” Once you’ve gone through this a few times you quickly realize that tools like lean and rocq are tricky to use effectively.
I kinda worry that the “proof assistants will fix ai correctness” will lead to a false sense of assurance if the specs that capture human intention don’t get scrutinized closely. Otherwise we’ll likely have lots of proofs for code that isn’t the code the humans actually intended due to spec flaws.
But that's not saying the proofs are an issue - usually the spec you can reasonably prove in lean or another prover, say TLA+ or Z3 depending on your kind of program - has to be overly simplified and have a lot of assumptions.
However, that is powerful.
It doesn't mean your program doesn't have bugs.
It means this big scary complicated algorithm you think works but are skeptical doesn't have bugs - so when you encounter one, you know the bug is elsewhere, and you start really looking at the boundaries of what could be misspecified, if the assumptions given to the prover are actually true, etc.
It eliminates the big scary thing everyone will think is the cause of the bug as the actual cause.
This has been insanely valuable to me lately. It is also something I never really was able to do before the help of AI - vibe coding proofs about my programs is IMO one of the killer apps of AI, since there aren't a ton of great resources yet about how to do it well since it is rarely done.
This surprises me. Formal verification so far has been a very niche thing apart from conventional type systems. I didn't think lack of vibe coding was much of a bottleneck in the past. Where do you use it?
Any moderately complex thread safety thing with a few moving parts (e.g. there are multiple mutexes involved in various parts of the system, verify no deadlocks).
The lack of vibe coding has been a bottleneck for literally everything before.
When I see people say the hate vibe coding, I think "why do you hate formal verification? Because you could be spending your time on formal verification instead of removing "code smells" that don't hurt anything from vibe code."
You have a program that does something and you write another program to prove it. What assurance do you have that one program has fewer bugs then the other? Why can one program have bugs but the other can't? How do you prove that you are proving the right thing? It all sort of ties into Heisenberg's uncertainty theorem. A system cannot be fully described from within that system.
Don't get me wrong, I think these are great systems doing great work. But I always feel there is something missing in the narrative.
I think a more practical view is that a program is already a sort of proof. there is a something to be solved and the program provides a mechanism to prove it. but this proof may be and probably is incorrect, as bugs are fixed it gets more and more correct. A powerful but time consuming tool to try and force correctness is to build the machine twice using different mechanisms. Then mismatched output indicates something is wrong with one of them. and your job as an engineer is to figure out which one. This is what formal verification brings to the table. The second mechanism.
A bug in the formal verification tool could be potentially noticed by any user of that formal verification tool. (And indirectly by any of their users noticing a bug about which they say "huh, I thought the tool told me that was impossible.")
A bug in your program can only be potentially noticed by you and your users.
There are also entirely categories of bugs that may not be relevant. For instance, if I'm trying to prove correctness of a distributed concurrent system and I use a model+verifier that verifies things in a sequential, non-concurrent way, then I don't have to worry about the prover having all the same sort of race conditions as my actual code.
But yeah, if you try to write your own prover to prove your own software, you could screw up either. But that's not what is being discussed here.
Surely you are talking about Godel incompleteness, not Heisenberg's uncertainty principle; in which case they're actually not the same system - the verification/proof language is more like a metalanguage taking the implementation language as its object.
(Godel's observation for mathematics was just that for formal number systems of sufficient power, you can embed that metalanguage into the formal number system itself.)
Common static types prove many of the important properties of a program. If I declare a variable of type String then the type checker ensures that it is indeed a String. That's a proof. Formal verification takes this further and proves other properties such as the string is never empty.
Common static types are very effective. Many users of Rust or Haskell will claim that if a program compiles then it usually works correctly when they go to run it.
However there is a non-linear relationship between probability of program correctness and the amount of types required to achieve it. Being almost certain requires vastly more types than just being confident.
That's the real issue with formal verification, being 75% sure and having less code is better than being 99% sure in most situations, though if I were programming a radiotherapy machine I might think differently.
Why?
Are you saying that all the programs ever written have the exact same chance of bugs? A hello world is as buggy as a vibe-coded Chromium clone?
If you accept the premise that different programs have different chances to have bugs, then I'd say:
1. Simpler programs are likely less buggy.
2. Programs used by more people are likely less buggy.
3. Programs maintained by experts who care about correctness are likely less buggy.
4. Programs where the stakes are higher are likely less buggy.
All things considered, I think it's fair to say Lean is likely less buggy then a random program written by me at weekend.
> Heisenberg's uncertainty theorem
It has nothing to do with the uncertainty principle. If you think otherwise, it means your understanding of uncertainty principle comes from sci-fi :)
I often think of the ‘news level’ of a bug. A bug in most code wouldn’t be news. A bug which caused lean to claim a real proof someone cared about was true, when it wasn’t, would in the proof community the biggest news in a decade.
Even if it is, the verification is still very useful. The verifier is going to run for a few minutes and probably not hit many edge cases. The chance it actually hits a bug is low, and the chance a bug makes it wrongly accept your program is a lot lower. Especially if it has to output a proof at the end. Meanwhile it's scrutinizing every single edge case your program has.
It runs (maybe crashes), therefore … it exists.
The tension between spec bugs vs. implementation bugs is real. But i will take a bug in a situation where the implementation has been verified any day.
Working over what we really want is problem solving in the problem domain.
As apposed to going into the never ending implementation not-what-we-were trying to solve weeds.
I notice two classes of bugs in my own programs:
- I meant the code to do X, but it does Y
- I meant the code to do X, and it does X, but X causes problems I didn't foresee
A proof assistant can help you prove the code does X, but it can't help you prove doing X doesn't cause problems you didn't foresee.
In other words, it can prove the soundness of the implementation, but not the design?
Is that right? Though I imagine if there are internal contradictions in the design, lean would catch those too.
So the issue would be "internally consistent yet incorrect designs"?
In the case of TFA, the issue was exhaustiveness, right? "What happens if..."
That sounds like a pretty important quality as far as security goes. Is there a way to make sure everything is actually verified?
I heard actually verifying everything is prohibitively expensive though, like it took 10-20 years to verify seL4 (10K LoC).
I think we need a way to verify the specs. A combo of formal logic and adversarial thinking (probably from LLMs) that will produce an exhaustive list of everything the program will do, and everything it won’t do, and everything that is underspecified.
Still not quite sure what it looks like, but if you stipulate that program generation will be provable, it pushes the correctness challenge up to the spec (and once we solve that, it’ll be pushed up to the requirements…)
As an aside, why can't people just write factually? This isn't a news site gamed for ad revenue. It's also less effort. I felt this post was mostly an insulting waste of time. I come to HN to read interesting stuff.
Formally-verified software is usually advertised "look ma no bugs!" Not "look ma no bugs*" *As long as this complicated chain of lemmas appropriately represents the correctness of everything we care about.
In boating theres often debate of right of way rules in certain situations, and some people are quick to point out that giant tanker ships should be giving way to tiny sailboats and get all worked up about it*. The best answer I've heard: they're dead right! that is to say as right as they are dead (if they didnt yield) lol. In the same vein, I think someone who assumed that a formally-verified software was perfect and got hacked or whatever is going to be a bit wiggly about the whole thing.
* = Technically the rules prioritize the tankers if they are "restricted in ability to maneuver" but everyone loves to argue about that.
Thats not the main finding of the article however. The main bug found was actually in the lean runtime, affecting all proofs using scalar arrays where the size of the array is not bounded.
> Not tested. Proved. For every possible input.
Finding inputs that crashed and then saying, "be clear about what is in the scope of what you proved" is interesting and factual.
LLM's are capable of producing code that passes formal verification.
The writing is on the wall: in the future more and more software on the abstract or platonic side of our computing base will be hermetically sealed against bugs and exploits. This quenching of bugs in the assured side will shift the mean location of bugs closer to the hardware side: at some point bugs and exploits will rely more and more on hardware quirks, and simply unspecified hardware.
Afterwards we can expect a long exponential decay of preventable safety violations: people mistakenly or surreptitiously disengaging the formal verification steps and shipping malicious or unverified code. Each such event will be its own big or small scandal, at some point there will be no deniability left: something must be on purpouse, either a malicious vulnerability or intentional disengagement of safety measures.
As the attack surface recedes towards the lower level hardware stack, it will open the debate that the community needs proper formal hardware descriptions (at least at the interface initially, not necessarily how the hardware has implemented it). As interface bugs get formalized 3 things can happen:
either vulnerabilities go extinct, and full formal hardware descriptions are not released
or vulnerabilities remain in each new generation of hardware, and malicious intent or negligence on behalf of the manufacturer can only be presumed, this will set up the community against manufacturers, as they demand full hardware descriptions (verilog, VHDL,...).
or vulnerabilities are tactically applied (vulnerabilities appear extinct to the bulk of the population, but only because manufactured vulnerabilities are sparingly exploited by the manufacturing block)
It is hard to predict what is more valuable: embedding HW vulnerabilities for the status quo and being able to exploit it for while before the public demands full hardware IP descriptions (verilog, VHDL) etc. or facing the end of vulnerabilities a little sooner but keeping hardware IP private (if the bugs stop with full interface descriptions).
What I'd actually want from the tooling is a machine-checkable statement of the envelope itself, propagated as a runtime guard rather than a compile-time comment. Then "proof holds" and "we are still inside the proof's domain" are two separate, observable properties, and the unverified-parser / unverified-runtime cases stop being invisible.
On the other hand, I've discovered thousands of bugs that weren't hardware bugs, and dozens of bugs due to people not having read hardware errata documents, so just formally modeling what we can model will absurdly reduce the bug quantity.
Update: Actually, I guess this may have been her point: "The two bugs that were found both sat outside the boundary of what the proofs cover." So then I guess the title might be a bit click baity.
If a buffer overflow causes the system to be exploited and all your bitcoins to be stolen, I don't think the fact that the bug being in the language runtime is going to be much consolation. Especially if the software you were running was advertised as formally verified as free of bugs.
Secondly, I did find a bug in the algorithm. in Archive.lean, in the parsing of the compressed archive headers. That was the crashing input.
Yes, it's still very much a bug. But it has nothing to do with your program being formally verified or not. Formal verification can do nothing about any unverified code you rely on. You would really need a formal verification of every piece of hardware, the operating system, the runtime, and your application code. Short of that, nobody should expect formal verification to ensure there are no bugs.
I don’t think the author is attempting to decry formal verification, but I think it a good message in the article everyone should keep in mind that safety is a larger, whole system process and bugs live in the cracks and interfaces.
Then it would help to not introduce any confusion into the ecosystem by using a click-baity title that implies you found a bug which violated the formal specification.
> When we speak of bugs in a verified software system, I think it's fair to consider the entire binary a fair target.
Yeah, I would actually agree. We wouldn't want to advertise that a system is formally verified in some way if that creates a false sense of security. I was just pointing out that, by my reading, the title appears to suggest that the core mechanism of the Lean proof is somehow flawed. When I read the title, I immediately thought, "Oooh. Looks like someone demonstrated a flaw in the proof. Neat." But that's not what is shown in the article. Just feels a bit misleading is all.
i have a hard real time system that i would love to try this on, but that's a lot of tools to learn and unclear how to model distributed systems in lean.
also, please add rss so i could subscribe to your blog
They used an AI agent sending ideas to a fuzzer and discovered a heap buffer overflow in Lean. This is big.
What seems to have happened here is that the storage allocator underneath is unverified. That, too, has a relatively simple spec - all buffers disjoint, no lost buffers, no crashes.
"This is genuinely one of the most memory-safe codebases I've analyzed."
Before everyone starts blabbing about formal verification, etc., consider this: how do you know that you didn't make a mistake in your formal verification? IOW, how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
You can get asymptotically close to zero-bug proof, but you can never get there 100% of the way.
> That means proving the absence of bugs, and you cannot prove a negative. The best thing you can do is fail to find a bug, but that doesn't mean it isn't there.
You can conclusively (up my next point) prove a specific bug or class of bugs aren't there. But "entirely free of (all) bugs" is indeed a big misconception for what formal methods does.
> how do you know your formal verification is bug-free? Answer: you don't. Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
It's another misconception of formal methods to say that any result is established conclusively, without any caveats whatsoever. But then again neither is mathematics, or any other intellectual discipline. What formal methods does is reduce the surface area where mistakes could reasonably be expected to reside. Trusting the Rocq kernel, or a highly scrutinized model of computation and language semantics, is much easier than trusting the totality of random unannotated code residing in the foggiest depths of your average C compiler, for instance.
For example it's extremely easy to prove there is no square with diagonals of different lengths. I'm the hard end, Andrew Wiles proved Fermat's Last Theorem which expresses a negative.
That's just a nit though, you're right about the infinite regress problem.
Eg. it is extremely hard to prove you "weren't there" (eg. at a crime site) if you cannot easily prove you were somewhere else (an affirmative): we do not keep court-admissible record of our whereabouts in case we get suspected of being in a place we were not in.
So it does hold in a number of cases where keeping evidence is required for proof. In software, that evidence would be formal specs and test reports which prove that cases covered with those are indeed working as specced, but provide no proof outside those "specs" (loosely considering an automated test a spec too).
"There are no squares with diagonals of different lengths"
"All squares have diagonals of equal lengths"
Similarly, I can rephrase the statement about the absence of bugs. These are equivalent:
"This program has no bugs"
"This program always does exactly what it is supposed to do"
If you think you can't prove the first statement, then go ahead and prove the second one.
Are people thinking of falsification when talking about "proving negatives"? I.e. you can only falsify statements about the physical world, never prove them.
You can prove that the program implements a specification correctly. That doesn't require proving a negative, but it does prove the absence of bugs. (I think you know this argument is weak, since your next paragraph is a complete non-sequitur)
> Or if you try to formally verify your formal verification then you're just translating the problem to a new layer. It's just a chain of proofs that is always ultimately based on an unproven one, which invalidates the whole chain.
All proofs ultimately rest on axioms. That's normal and not really an argument unless you want to doubt everything, in which case what's the point in ever saying anything?
The question of who validates the validator is real, but the abstraction of "formal verification" does serve a purpose, like any good mathematical or programming abstraction. Whole classes of bugs are removed; what's left to verify is usually much shorter.
Even with the addition of two numbers the execution of a program can be wrong if the CPU has a fault or if the runtime of the program has a bug.
I think you just need to look at why formal verification exists.
By your logic, it's impossible to prove that a car is driving at 60mph. There could be an error in the speedometer which makes it impossible to verify that said car is going at the speed. You can get asymptomatically close to being sure that you're driving at 60 mph but you can never be 100% sure.
This is useless and serves no purpose.
Linking back to the parent statement, it's hard to prove a program has no bugs when there is always the possibility the bug just hasn't been found yet. On the flip side it's easy to prove something does have bugs as soon as you find one.
You can probabilistic say "it's extremely unlikely purple zebras exist" but you can never prove 100% they don't exist. And back to the real example, how can you prove there isn't a bug you just haven't found yet?
Type checking allows you to (outside of type casting such as in languages like C/C++ and casting to object for generic containers in Java) verify that an object is of a given type. That allows you to be sure that a well-formed program isn't doing things like putting a random object in a list.
Languages like C#, Scala, and Kotlin improve Java generics by making the generic type of a container or other interface/type part of the type system. This allows generic types of a generic type to preserve the inner type. This makes it possible to implement things like monads and mapping functions to preserve the generic type.
A similar thing is possible with union types, sealed interfaces/traits, etc. that allow you to check and verify the return type instead of defaulting it to a generic object/any type.
Likewise with other features like nullable/non-null annotations (or corresponding nullable type annotations like in Kotlin and recent C# versions).
All of these can be abused/circumvented, but if you keep your code within that framework the compiler will stop that code compiling. Likewise, these solve a limited set of bugs. For example, nullable types can't verify memory management and related bugs.
That being said, using a coding agent to direct fuzzying and find bugs in the Lean kernel implementation is the big news here. (After all the kernel's implementation is not proved.)
The moral of the story is to push for more verified code not less and try AI bug hunting.
although, this is the best example of how quickly a trivality can knock so called "correct" programs over.
I can build applications rapidly but the requirements and UX are the bottleneck. So much so that I often like to sit on a concept for multiple days to give myself the time to fully absorb the goal and refine the requirements. Then once I know what to build, it snaps together in like 4 hours.
There are a lot of ambiguities which need to be resolved ahead of time. Software engineering becomes a kind of detailed business strategy role.
Are we baiting people with headlines now?
That sounds like quite the monthly bill. o_O
> Every Lean proof assumes the runtime is correct.
No. Every valid Lean proof assumes that if the runtime/mathlib etc is correct, then it too is correct.
Tangentially also, most lean proofs are not dependent on whether or not the runtime has things like buffer overflows or denial of service against lean itself at all, because if I prove some result in Lean (without attacking the runtime) then a bug in the runtime doesn’t affect the validity of the result in general. It does mean however that it’s not ok to blindly trust a proof just because it only relies on standard axioms and has no “sorry”s. You also need to check that the proof doesn’t exploit lean itself.
I'm genuinely excited about AI agents and formal verification languages. To me it's obviously the way forward instead of moonshots trying to make agents that program in their own AI blackbox binary, or agents that code in current programming languages.
If we are heading in the direction of "huge codebases that nobody has written", or, "code is an artifact for the machine", I don't see a way out without making it proved.
If humans can review and edit the spec, then verify that the implementation matches it, suddenly leaving the implementation be an artifact for the machines seems okay
The downside of provers also being that they are a massive pain in the ass that very few want to use, this is also a complete win.
"Lean proves other program correct but not itself"
Regarding the DoS in the lean-zip application itself: I think this is a really neat example of the difficult problem of spec completeness, which is a subcase of the general problem (mentioned by porcoda in a sibling comment) of being sure that the spec is checking the right things. For a compression program, the natural, and I would also say satisfyingly beautiful thing to prove is that decomp(comp(x)) = x for all possible inputs x. It's tempting to think that at that point, "It's proven!" But of course the real world can call decomp() on something that has never been through comp() at all, and this simple, beautiful spec is completely silent on what will happen then.
Tsk, tsk.
I've come across issues in the past which weren't actually bugs. For example, the software was behaving exactly as intended but it looked like a bug to a user who didn't understand the nuances and complexities of what was happening.
I also cannot count how many times a non-technical person asked me to implement conflicting functionality or functionality which would have made the UX incredibly confusing for the user.
https://news.ycombinator.com/item?id=12761986 (being this link more than 10yrs old is not surprising)
Wikipedia: [1] Turing proved no algorithm exists that always correctly decides whether, for a given arbitrary program and input, the program halts when run with that input. The essence of Turing's proof is that any such algorithm can be made to produce contradictory output and therefore cannot be correct.
[1] https://en.wikipedia.org/wiki/Halting_problem
This is so reductive a framing as to be essentially useless [0]. I think maybe you want to learn more about program correctness, formal verification, and programming language semantics before you make such statements in the future.
[0] See, e.g., type-checking.
When is an OS supposed to halt? When you shut it down, or when you power down the hardware, and no other times. So if you don't do either of those things, then the OS is supposed to run forever. Does that, by itself, mean that the program is incorrect, or that the language is inadequate? No, it means that the definition is worthless (or at least worthless for programs like OSes).
Same way you can count to any finite integer with enough time, but you can never count to infinity.
Those kinds of interactive programs take in a stream of input events, which can be arbitrarily long, but eventually ends when the computer is shut down.
Termination checkers don't stop you from writing these interactive loops, they only stop non-interactive loops
This requires coinduction, right? (That's my understanding of the formal representation of infinite streams.) If so, that does limit your options, since most of the proof assistants don't handle coinductive data, as I understand it.
Not possible for all problems. We cannot decide correctness (ie adherence to a specification) for all programs, but we can definitely recognize a good chunk of cases (both positive and negative) that are useful.
The Halting Problem itself is recognizable. The surprising result of Turing’s work was that we can’t decide it.
Keyword emphasis mine.