6 comments

  • ironbound 2 hours ago
    Use one of these structured output libraries:

    https://github.com/outlines-dev/outlines

    https://github.com/jxnl/instructor

    https://github.com/guardrails-ai/guardrails

    https://www.askmarvin.ai/docs/text/transformation/

    Some of them allow a JSON schema, others a Pydantic model (which you can transform to/from JSON).
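
    A minimal sketch of that to/from JSON round-trip (assuming Pydantic v2; the Ticket model and its fields are invented for illustration):

      from pydantic import BaseModel

      class Ticket(BaseModel):
          category: str
          priority: int

      # A Pydantic model can emit a JSON schema for the libraries that want one...
      schema = Ticket.model_json_schema()

      # ...and parse/validate the JSON text an LLM returns back into a typed object.
      ticket = Ticket.model_validate_json('{"category": "billing", "priority": 2}')
      print(ticket.priority)  # 2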

  • an0malous 2 hours ago
    Aren’t transformers intrinsically deterministic? I thought the randomness was intentional, to make chatbots seem more natural, and OpenAI used to have a seed parameter you could set for deterministic output. I don’t know why that feature isn’t more popular, given the reasons this article outlines.
    • jkaptur 1 hour ago
      (I'm not an expert. I'd love to be corrected by someone who actually knows.)

      Floating-point arithmetic is not associative: (A+B)+C does not necessarily equal A+(B+C). But you can get a performance improvement by computing A, B, and C in parallel and adding together whichever two finish first, which means the summation order can differ between runs. So, in theory, transformers can be deterministic, but in a real system they almost always aren't.
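
      A quick illustration in plain Python floats (nothing model-specific, just the associativity point):

        a, b, c = 0.1, 0.2, 0.3

        # Summation order changes the low-order bits of the result.
        print((a + b) + c)  # 0.6000000000000001
        print(a + (b + c))  # 0.6

        # A parallel reduction that adds partial sums in whatever order they
        # finish can therefore return slightly different results run to run.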

      • 10000truths 48 minutes ago
        Not an expert either, but my understanding is that large models use quantized weights and tensor inputs for inference. Multiplication and addition of fixed-point values are associative, so unless there's an intermediate "convert to/from IEEE float" step (activation functions, maybe?), you can still build determinism into a performant model.
        • kimixa 35 minutes ago
          Fixed-point arithmetic isn't truly associative unless you have infinite precision. The second you hit a limit or saturate/clamp a value, the result very much depends on the order of operations.
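
          A toy example, assuming saturating signed 8-bit addition (sums clamped to [-128, 127]):

            def sat_add(x, y):
                # Saturating signed 8-bit add: clamp the sum into [-128, 127].
                return max(-128, min(127, x + y))

            a, b, c = 100, 50, -60
            print(sat_add(sat_add(a, b), c))  # (100+50) saturates to 127; 127-60 = 67
            print(sat_add(a, sat_add(b, c)))  # 50-60 = -10; 100-10 = 90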
    • janalsncm 1 hour ago
      Transformers are just a special kind of binary run by inference code. Where the rubber meets the road is whether the inference setup is deterministic. There’s some literature on this: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...

      I don’t think the issue is determinism per se but chaotic predictions that are difficult to rely on.

      • an0malous 52 minutes ago
        I agree they could be chaotic, but I think that’s an important distinction.
    • solsane 1 hour ago
      Well, you could say that about computers in general. I'm assuming you're referring to temperature (or something similar), which can be set to always pick the most probable token. Floats aside, this should be deterministic. But practically I don't think that changes much, since adjusting the input slightly can lead to very different output. Also, back in the day the temperature helped it avoid cyclic loops.
      • an0malous 53 minutes ago
        Yes, but chaotic is very different from non-deterministic, and not just in an academic way: e.g., I can write tests against chaotic outputs, but not really against non-deterministic outputs.
    • esafak 1 hour ago
      The models generate a token distribution. Which one to pick is a choice. One can sample from the distribution, hence the randomness.
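
      Roughly what that choice looks like in code (the logits here are invented, not any real model's output):

        import numpy as np

        logits = np.array([2.0, 1.0, 0.5])   # hypothetical scores for three candidate tokens
        temperature = 0.8

        # Convert logits to a probability distribution (softmax with temperature).
        probs = np.exp(logits / temperature)
        probs /= probs.sum()

        greedy = int(np.argmax(probs))                         # same token every run
        sampled = int(np.random.choice(len(probs), p=probs))   # varies run to run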
    • bpodgursky 1 hour ago
      Strictly deterministic output for a given prompt prevents the use of RAG, which increasingly limits the relative utility of an LLM within an organization.
    • ares623 2 hours ago
      Maybe it allowed spitting out copyrighted works verbatim
  • galaxyLogic 2 hours ago
    I wonder if the same problem exists with AI-based code development more specifically.

    To produce an application you should have some good unit tests. So the AI produces some unit tests for us. Then we ask it again, and the unit tests will be different. Where does that leave us? Can we be confident that the generated unit tests are in some sense "correct"? How can they be correct if they are different every time we ask?

  • ramity 3 hours ago
    Let me first start off by saying that I, and many others, have stepped into this pitfall. This is not an attack, but a good-faith attempt to share painfully acquired knowledge. I'm actively using AI tooling, and this comment isn't a slight on the tooling but rather on how we're all seemingly putting the circle in the square hole and it fits.

    Asking an LLM to output its confidence in its own output is a misguided pattern, despite being commonly applied. LLMs are not good at classification tasks, as the author states. They can "do" it, yes, perhaps better than random sampling can, but random sampling can "do" it as well. Don't get too tied to that example. The idea here is that if you are okay with something getting the answer wrong every so often, LLMs might be your solution, but this is a post about conforming non-deterministic AI to classical systems. Are you okay if your agent picks the red tool instead of the blue tool 1%, 10%, etc. of the time? If so, you're always going to be wrangling, and that's the reality often left unspoken when integrating these tools.

    While tangential to this article, I believe it's worth stating that when interacting with an LLM in any capacity, remember your own cognitive biases. You often want the response to work, and while generated responses may look good and fit your mental model, it requires increasingly obscene levels of critical evaluation to see through the fluff.

    For some, there will be inevitable dissonance reading this, but consider that these experiments are local examples. Their lack of robustness will become apparent with large-scale testing. The data spaces these models have been trained on are unfathomably large in both quantity and depth, but under-/over-sampling bias (just to name one issue) will be ever present.

    Consider the following thought experiment: you are an applicant for a job, submitting your resume knowing it will be fed into an LLM. Let's confine your goal to something very simple: make it say something. Let's oversimplify for the sake of the example and say complete words are tokens. Consider "collocations": [bated] breath, [batten] down, [diametrically] opposed, [inclement] weather, [hermetically] sealed. Extend this to contexts: [oligarchy] government, [chromosome] biology, [paradigm] technology, [decimate] to kill. With this in mind, consider how each word of your resume "steers" the model's subsequent response, and consider how the data each model is trained on can subtly influence its response.

    Now let's bring it home and tie the thought experiment into confidence scoring in responses. Let's say it's reasonable to assume that the results of low-accuracy/low-confidence models are less commonly found on the internet than those of higher-performing ones. If that can be entertained, extend the argument to confidence responses: maybe the term "JSON", or any other term used in the model input, is associated with high confidences.

    Alright, wrapping it up. The end point here is that the confidence value the model provides in its output is not the likelihood that the answer in the response is correct, but rather the most likely value to follow the stream of tokens in the combined input and output. The real sampled confidence values exist closer to the code, but they are limited to individual tokens, not to series of tokens.
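
    A rough sketch of that distinction (the per-token log-probabilities are invented for illustration):

      import math

      # Hypothetical log-probabilities of each sampled token in an answer,
      # as an inference stack could report alongside the generation.
      token_logprobs = [-0.05, -0.30, -0.02, -1.10]

      # The closest thing to a "confidence" in the whole answer is the product
      # of the per-token probabilities (sum the logprobs, then exponentiate)...
      sequence_prob = math.exp(sum(token_logprobs))
      print(round(sequence_prob, 3))  # ~0.23

      # ...which measures how likely the model was to emit this token sequence,
      # not whether a "confidence: 0.92" it writes in its reply means anything.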

    • arduanika 2 hours ago
      "when interacting with an LLM in any capacity, remember your own cognitive biases. You often want the response to work, and while generated responses may look good and fit your mental model, it requires increasingly obscene levels of critical evaluation to see through the fluff."

      100% this.

      Idk about the far-out takes where "AI is an alien lifeform that has arrived in our present", but the first thing we know about how humans relate to extraterrestrials is: "I want to believe".

  • ironbound 16 hours ago
    [flagged]