One thing very much worth noting that the article does not mention:
The reason "temperature" is called that is because softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution over energy states of an ensemble of particles in equilibrium. In terminology more familiar to ML folks, the particles' energies are distributed as the softmax of their negative energies divided by the temperature (in kelvin), with units scaled by the Boltzmann constant (k_B).
Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced to their lowest energy state, in LLMs, the model is forced to deterministically predict the single most likely logit/token.
Now to draw another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is that the expectation value (mean) of energy under this softmax distribution goes up with temperature, and when the energy gets high enough, the particles start shedding energy in the form of photons energetic enough to be in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element, and glow white because at still higher temperature T, the softmax distribution's mean energy moves higher and the distribution flattens out, covering the whole visible spectrum somewhat more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out; that's white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.
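The correspondence is easy to check numerically. A quick sketch, assuming numpy (the energy levels here are made up purely for illustration):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; shift-invariance
    # makes this a mathematical no-op.
    e = np.exp(x - np.max(x))
    return e / e.sum()

k_B = 1.380649e-23  # Boltzmann constant, J/K

# Hypothetical energy levels of a small system, in joules.
E = np.array([0.0, 1.0, 2.0, 3.0]) * 1e-21

for T in (50.0, 300.0, 3000.0):
    p = softmax(-E / (k_B * T))  # Boltzmann distribution, written as a softmax
    mean_E = (p * E).sum()       # expected energy rises with temperature
    print(T, p.round(3), mean_E)
```

At low T the mass concentrates on the ground state; as T rises the distribution flattens and the mean energy climbs, which is the "glowing hotter" part of the analogy.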
What happens if you use an integer like 2 or 3 instead of e in the softmax equation? Is e what makes it so they end up summing to 1? (I have not done real math in yearssss.)
It works the same way:
softmax is essentially just applying the normalization to the vector exp(x).
From an "engineering" POV this effectively ensures that the vector you normalize has strictly positive entries, so the result ends up being a proper distribution.
From a theory POV you get softmax-like distributions (Gibbs distributions) by trying to balance following some energy E(x) against the entropy of the distribution.
In essence the softmax is the answer to "I try to follow the maximum of a function E(x) but I need to maintain some level of uncertainty".
The balancing coefficient between entropy and picking the maximum of the function is called "temperature" (following the behavior of particles in a physical system: The colder the system, the lower the chance of having particles randomly walk away from the minimal energy state).
Specifically, the temperature enters as
softmax(x/temp)
If you take the limit temp->0, the softmax gradually becomes an argmax (with temp=0 being a literal argmax). If you increase the temperature, you get closer to the "random fluctuations" regime, leaving more room for sampling x values that are not the maximum of x. (This is why e.g. LLMs become deterministic as you decrease temp->0.)
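A quick numerical illustration of both limits, assuming numpy (the logit values are arbitrary):

```python
import numpy as np

def softmax_with_temp(x, temp):
    # softmax(x / temp); subtracting the max keeps exp() from overflowing.
    z = x / temp
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])

for temp in (0.1, 1.0, 10.0):
    print(temp, softmax_with_temp(logits, temp).round(3))
# As temp -> 0 the mass concentrates on argmax(logits);
# as temp grows the distribution flattens toward uniform.
```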
Using a different base other than e implicitly changes the temperature:
N^x = exp(ln(N) x)
The normalization works the same since you are still dividing a positive value N^x by the sum of all alternatives sum(N^x_i), which is a normalization by design
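This equivalence is easy to verify numerically; `softmax_base` below is a made-up helper name for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def softmax_base(x, N):
    # Normalize N^x instead of e^x (N > 0).
    p = N ** (x - x.max())
    return p / p.sum()

x = np.array([0.5, 1.0, 2.0])
N = 3.0
# Base N is the same as scaling the inputs by ln(N),
# i.e. an ordinary softmax at temperature 1/ln(N).
print(np.allclose(softmax_base(x, N), softmax(np.log(N) * x)))  # True
```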
It's equivalent to multiplying all inputs by ln(b). And multiplying all inputs by a value changes how much the probabilities are extremized. This is easy to see because adding a constant to every input doesn't change the output, so the biggest input can be assumed to be 0 and the others negative. Multiplying by 0 then makes all outputs equal, while as the multiplier tends to infinity, all other inputs tend to -infinity, so the biggest output tends to 1 and the others to 0. Multiplying by negative numbers makes the lowest input the highest output.
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Mathematically, it is literally a probability distribution, because it fits the definition of a measure whose total mass is one, so I think the language is just imprecise. What they may be trying to say is that semantically it doesn't arise in a principled way from an uncertainty model, such as from Bayesian or frequentist statistics.
Hogwash. If you get into deriving maximum entropy distributions via the calculus of variations, the multinomial is the maximum entropy distribution among categorical distributions.
This is exactly the sense that it comes up for old school LMs and why it appears in thermodynamics.
Of course it is entirely possible that newfangled ML people use it without understanding that it is derived from first principles - i.e. see article.
The comment in parentheses mentions "they're not derived from a probability space" [1]. I don't know enough about probability spaces or softmax to know what part of a probability space this is missing compared to other probability distributions, or how other probability distributions satisfy probability spaces.
Sounds like they're saying that since the distribution doesn't come from measuring or calculating the probability of something, it has the form of a probability distribution but isn't really one. Like saying 5 feet is a height that a person can have, but since I just made up that number it's not actually a person's height.
The softmax is the probability of the next token being whatever it is in the training data, conditioned on the inputs. The author just doesn't know that apparently and thinks it was an arbitrary choice.
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
Softmax isn't a loss function. It is used to transform model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities, and then those numbers are passed into (typically) the cross entropy loss function. I think you mean, which models are trained using some function other than softmax to transform the model outputs. There are a number of alternatives to softmax, such as the ones described here https://www.emergentmind.com/topics/sparsemax
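The sparsemax alternative mentioned can be sketched in a few lines, following the Martins & Astudillo (2016) formulation (a minimal illustration, not production code):

```python
import numpy as np

def sparsemax(z):
    # Sparsemax: the Euclidean projection of z onto the probability
    # simplex. Unlike softmax, it can assign exactly zero probability
    # to low-scoring entries.
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    cssv = np.cumsum(z_sorted)           # cumulative sums of sorted scores
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv    # which entries stay in the support
    k_max = k[support][-1]
    tau = (cssv[k_max - 1] - 1) / k_max  # threshold subtracted from scores
    return np.maximum(z - tau, 0.0)

p = sparsemax(np.array([1.0, 0.8, -1.0]))
print(p)  # sums to 1, with the lowest score clipped to exactly 0
```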
They’re not. Cross entropy loss is E[-log q] where q is a probability. You could convert the model outputs x into probabilities using some other function like q = 1/Z x^2, and compute cross entropy loss just fine.
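A sketch of that point: the q = x^2/Z squash below, and the helper names, are purely illustrative, not a standard API.

```python
import numpy as np

def squared_norm_probs(x):
    # Alternative to softmax: squash with x^2 instead of exp(x).
    # Note it is not shift-invariant and maps x and -x to the same
    # probability, unlike softmax.
    sq = x ** 2
    return sq / sq.sum()

def cross_entropy(q, target):
    # Standard cross-entropy loss E[-log q] for a one-hot target class.
    return -np.log(q[target])

x = np.array([2.0, -1.0, 0.5])
q = squared_norm_probs(x)
print(q, cross_entropy(q, target=0))
```

Cross entropy only needs a valid probability vector q; how you produce q from the raw model outputs is a separate design choice.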
iirc, there is a bunch of formal machinery you need to define probability distributions for situations such as infinite outcomes (e.g. what is the probability that a random real number between 0 and 10 is less than 3?)
> The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates
Actually I believe that most of the time, even after softmax, sampling is way too permissive, occasionally accepting low-quality candidates. We all have the experience of seeing frontier LLMs sometimes put a word in a different language that is really off-putting and almost impossible to explain, or make other odd errors in just a single word of the output: most of the time this is not what the model wanted to say, but sampling that casually selected a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
Yeah, softmax may have useful applications, but anytime you find yourself using the same hammer for everything that looks like a nail it's a bit of a red flag.
If you take the instances of softmax you find in training / inference (there turn out to be a few) and swap in alternatives like entmax or sparsemax, you see across-the-board improvements. And top-1 often is just the best answer too; there's a reason why temp=0 is the way to go when you're doing tool calls. Do you really want creative Unicode tokens when writing bash commands? From what I can tell, most of the time softmax is the worst answer that works.
Regret analysis in bandit and similar algorithms shows how inference is connected to the loss function. If your loss function is good, greedy inference is as good as joint inference.
Training on cost-to-go loss is good enough.
Perfect cost-to-go eliminates the need for global algorithms and allows local decision making. Given “natural” datasets it is probably the best thing to attempt to learn.
The fact that probabilistic graphical models never really worked proves it somewhat.
On a tangential note, I keep noticing "why x matters" and "it's crucial here" phrasings that remind me of Claude. Recently Claude has been gaslighting me on complex problems with such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems, where it sometimes gets the answer right but completely misses the point and introduces huge complex blocks of code and logic with precisely "why it matters", "this is crucial here".
I've seen many posts on Reddit about this AI-induced 'psychosis', where people end up believing the words that get generated for them without applying sufficient critical thought.
This sycophancy is a serious problem and exploits a weakness in the human psyche (flattery) which may be easier for the RLHF to find reward in than genuinely correct responses.
This problem is super pervasive in companies where the less technical individuals (that also happen to be decision makers) are using AI to fight/challenge the technical knowledge of their SMEs. It's super annoying. SMEs have some real gold in the form of niche/tribal knowledge that, by the grace of html Jesus, is not always sufficiently documented for an AI to absorb it into its pseudo-aggregate data sphere.
It maps (-inf, inf) to (0, inf) in about as nice a way as you could expect (addition turns into multiplication). When you want to constrain a value to be positive, parameterizing it with exp is usually a good option.
Something that really helped me grasp the foundational relevance of the softmax is to justify from first principles why e^x shows up in the preferred mapping function in the numerator (1). The stated problem of mapping raw inputs/scores/logits to a probability distribution can be solved by a bunch of arbitrary functions, and the usual justification given for a softmax is "it has nice derivatives" which is empirically useful but not satisfying.
The sketch of the justification is something like this. We first need a function that maps from (-inf, inf) to a unique positive value, and then we need to normalize the resulting values. Setting aside the normalizing step, we imagine an f(x) that needs to fit the following properties:
1. It should be strictly positive, so that we can normalize it into a (0, 1) probability.
2. It should preserve the relative ordering of the logits to allow them to be interpreted as scores. Thus $f(x)$ should be monotonically increasing.
3. It should be continuous and differentiable everywhere, since we are interested in learning through this function via backpropagation.
4. It should have shift-invariance with respect to the input, as we don't want the model to have to learn some preferred logit-space where there is a stronger learning signal. For example, applying softmax on the values `(-1, 1, 3, 5)` would yield the same result as applying it to `(9, 11, 13, 15)`. This property can also be restated as a "scale invariance of probability ratios", where the ratio between $f(x)$ and $f(x+c)$ for a given $c$ is a constant. One useful interpretation of this property is that the learning domain or "gradient-learning surface" is stable, and high-magnitude initializations won't impede the learning process.
Taken at face value, these properties uniquely define e^x. The last property is actually pretty debatable, because in the context of machine learning, we actually do have a "preferred logit-space", namely closer to zero, for numerical stability. But there are other ways to enforce this in a post-hoc manner (e.g. weight initialization, normalization layers, etc.)
Another property that uniquely justifies e^x and thus softmax is IIA (independence of irrelevant alternatives), which states that the odds for two classes, p_i / p_j, only depend on the logits/inputs for i and j, and an irrelevant class k has no impact. For example, for Softmax([5, 7, 1]) and Softmax([5, 7, 10]), the resulting odds for the first two values (p_i/p_j) should be the same from both distributions, regardless of the third value.
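The IIA property is easy to check numerically on those example values (assuming numpy):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

a = softmax(np.array([5.0, 7.0, 1.0]))
b = softmax(np.array([5.0, 7.0, 10.0]))

# Odds between the first two classes ignore the third entirely:
# both ratios equal e^(5-7) = e^-2, whatever the third logit is.
print(a[0] / a[1], b[0] / b[1])
```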
Finally, if the "desired properties" approach is not satisfying, a more theoretical route for justifying the form of the softmax uses the framework of maximum entropy (E. T. Jaynes published this in 1957 to justify the Boltzmann distribution).
TL;DR: softmax is not the only solution for mapping unnormalized values to a probability distribution, but it can be justified through axiomatic properties.
(1) one could say that the exponential shows up from the Boltzmann distribution, but then the same question applies.
The reason for exp(x) is that its derivative is exp(x), which makes it possible to express the gradient of s(x) in terms of s(x), or both in terms of exp(x). This simplifies the computation of backward pass.
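That identity is easy to verify numerically; a small sketch comparing the analytic Jacobian (written entirely in terms of s) against finite differences:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([0.3, -0.7, 1.2])
s = softmax(x)

# Analytic Jacobian: d s_i / d x_j = s_i * (delta_ij - s_j),
# expressible entirely in terms of the softmax output s itself.
J = np.diag(s) - np.outer(s, s)

# Compare against central finite differences.
eps = 1e-6
J_num = np.empty((3, 3))
for j in range(3):
    d = np.zeros(3)
    d[j] = eps
    J_num[:, j] = (softmax(x + d) - softmax(x - d)) / (2 * eps)

print(np.allclose(J, J_num, atol=1e-6))  # True
```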
> The stated problem of mapping raw inputs/scores/logits to a probability distribution can be solved by a bunch of arbitrary functions, and the usual justification given for a softmax is "it has nice derivatives" which is empirically useful but not satisfying.
Often there isn't any more to it than that. For example, the entire justification for least-squares error measurement is that it has convenient derivatives.
The central limit theorem is an extremely powerful justification. That doesn't mean it's considered whenever it's used, but it absolutely can be strongly justified (to the degree that other error measurements are only needed in relatively small samples of the feature space where errors will not yet converge to Gaussian)
Softmax is defined over an arbitrary vector of raw real numbers. Stating that those inputs are "logits" is applying post-hoc semantics to what the model is learning. One of the key properties of a softmax is shift invariance (e.g. softmax([-1, 1, 3, 5]) == softmax([9, 11, 13, 15])), and so it is easiest to just think of it as operating on a vector of unnormalized raw scores, which is the more colloquial definition of logit.
(also, log(p) is not the formal definition of a logit)
It's still true that softmax transforms arbitrary vectors into probability vectors.
In your example you'll also get the original `p` with just `exp(logits)`. Softmax normalizes the output to sum to one, so it can output a probability vector even if the input is _not_ simply `log(p)`.
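A quick numerical check of both points, assuming numpy:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p = np.array([0.6, 0.3, 0.1])
logits = np.log(p)

print(np.allclose(np.exp(logits), p))         # True: exp alone recovers p here
print(np.allclose(softmax(logits), p))        # True: softmax also recovers p
print(np.allclose(softmax(logits + 7.0), p))  # True: and it absorbs any shift
```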
[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
https://cavendishlabs.org/blog/negative-temperature/
Also, so long as the function is non-negative for all inputs and positive for at least one you'll always get a valid probability distribution.
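For instance (a minimal sketch, assuming numpy; the choice of ReLU and squaring is arbitrary):

```python
import numpy as np

x = np.array([-2.0, 0.0, 1.5, 3.0])

# Any map that is non-negative everywhere (and positive somewhere)
# normalizes to a valid distribution -- e.g. ReLU or squaring:
for f in (lambda v: np.maximum(v, 0.0), lambda v: v ** 2):
    y = f(x)
    p = y / y.sum()
    print(p, p.sum())  # entries in [0, 1], summing to 1
```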
"We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1. Technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine."
Why is this a "pseudo-probability distribution?"
[1] https://en.wikipedia.org/wiki/Probability_space
It's true that the PyTorch API conflates cross entropy and softmax, but they are separate concepts.
How do you identify what the model wanted to say?
Not really, softmax transforms logits (logarithms of probabilities) into probabilities.
Probabilities → logits → back again.
Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.
NNs prefer to output logits because they are linear and range from -inf to +inf.