
When you ask a Large Language Model (LLM) a question, the model outputs a probability for every possible token in its vocabulary.
After sampling a token from this probability distribution, we can append the selected token to our input prompt so that the LLM can output the probabilities for the next token.
This sampling process can be controlled by parameters such as the famous temperature and top_p.
In this article, I will explain and visualize the sampling strategies that define the output behavior of LLMs. By understanding what these parameters do and setting them according to our use case, we can improve the output generated by LLMs.
For this article, I'll use vLLM as the inference engine and Microsoft's new Phi-3.5-mini-instruct model with AWQ quantization. To run this model locally, I'm using my laptop's NVIDIA GeForce RTX 2060 GPU.
Table Of Contents
· Understanding Sampling With Logprobs
∘ LLM Decoding Theory
∘ Retrieving Logprobs With the OpenAI Python SDK
· Greedy Decoding
· Temperature
· Top-k Sampling
· Top-p Sampling
· Combining Top-p and Temperature
· Min-p Sampling
· Conclusion
· References
Understanding Sampling With Logprobs
LLM Decoding Theory
LLMs are trained on a finite vocabulary V, which contains all possible tokens x that the model can see and output. A token can be a word, a character, or anything in between.
LLMs take a sequence of tokens x = (x_1, x_2, x_3, ..., x_n) as an input prompt, where each x is an element of V.
Then, the LLM outputs a probability distribution P for the next token, given the input prompt. Mathematically, this can be described as P(x_t | x_1, x_2, x_3, ..., x_{t-1}).
Which token we choose from this distribution is up to us to decide. Here we can choose between different sampling strategies and sampling parameters.
After selecting the next token, we append the selected token to the input prompt and repeat the cycle.
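To make this loop concrete, here is a toy Python sketch of autoregressive decoding. The vocabulary and the "model" below are entirely made up; a real LLM would compute P(x_t | x_1, ..., x_{t-1}) at that step.

import numpy as np

# A made-up vocabulary and a fake "model" that returns a random distribution
vocab = ["the", "fox", "jumps", "over", "lazy", "dog"]

def next_token_distribution(tokens):
    # A real LLM would return P(x_t | x_1, ..., x_{t-1}) here
    rng = np.random.default_rng(len(tokens))
    logits = rng.normal(size=len(vocab))
    probs = np.exp(logits)
    return probs / probs.sum()

tokens = ["the", "fox"]
for _ in range(4):
    probs = next_token_distribution(tokens)                    # distribution over the vocabulary
    next_token = vocab[np.random.choice(len(vocab), p=probs)]  # the sampling strategy goes here
    tokens.append(next_token)                                  # append the token and repeat
print(" ".join(tokens))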
In the world of LLMs, these probability distributions are often described using logarithmic probabilities, called logprobs.
As shown in the figure below, the logarithm is negative for all values between zero and one.

This is why logprobs are negative numbers. A logprob close to zero is a very high probability close to 100%, and a large negative logprob is a probability close to 0%.
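A quick sanity check with numpy (the probabilities here are just examples):

import numpy as np

print(np.log(0.9986))  # ~ -0.0014: a logprob near zero means a probability near 100%
print(np.log(0.0001))  # ~ -9.21: a large negative logprob means a probability near 0%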
Retrieving Logprobs With the OpenAI Python SDK
Here is a simple Python script that uses the official OpenAI Python client to request a text completion. With these parameters, the model will predict only the next token for the given input prompt (the base URL in the snippet below assumes vLLM's OpenAI-compatible server running locally on the default port).
from openai import OpenAI

# The base URL and API key below are placeholders that assume vLLM's
# OpenAI-compatible server is running locally; adjust them to your setup
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="jester6136/Phi-3.5-mini-instruct-awq",
    prompt="The quick brown fox jumps over the",
    logprobs=10,       # return the top 10 logprobs for each generated token
    temperature=1.0,
    top_p=1.0,
    max_tokens=1,      # predict only a single token
)

Below is the token probability distribution for the given sentence. The model chose tokens=[' lazy'] as the next token. We can also see the top 10 logprobs in the list top_logprobs.
logprobs = completion.choices[0].logprobs
print(logprobs)

>> Logprobs(text_offset=[0],
>> token_logprobs=[-0.0013972291490063071],
>> tokens=[' lazy'],
>> top_logprobs=[{' lazy': -0.0013972291490063071,
>> ' sleep': -7.501397132873535,
>> ' l': -9.095147132873535,
>> ' log': -9.126397132873535,
>> ' la': -9.220147132873535,
>> ' le': -9.313897132873535,
>> ' lay': -9.720147132873535,
>> ' Laz': -9.938897132873535,
>> ' dog': -9.970147132873535,
>> ' ': -9.970147132873535}])

We can convert the logprobs to percentages using the following formula:
import numpy as np

logprob = -0.0013972291490063071       # logprob of the token ' lazy' from above
probability = 100 * np.exp(logprob)    # ~ 99.86%

After converting all logprobs, we can see that the token "lazy" has a 99.86% chance of being sampled. This also means that there is a 0.14% chance that something else will be sampled.

Greedy Decoding
The most basic sampling strategy is called greedy decoding. Greedy decoding simply chooses the most likely token each time.
Greedy decoding is a deterministic sampling strategy because there is no randomness involved. Given the same probability distribution, the model will always choose the same next token.
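As a minimal sketch, greedy decoding simply takes the argmax over the next-token distribution (the logprobs below are made up for illustration; in vLLM, setting temperature=0 switches to greedy sampling):

# Made-up logprobs for the next token
top_logprobs = {" lazy": -0.0014, " sleep": -7.50, " dog": -9.97}

# Greedy decoding: always pick the token with the highest (log) probability
next_token = max(top_logprobs, key=top_logprobs.get)
print(next_token)  # ' lazy'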
The downside of greedy decoding is that the resulting text is repetitive and not very creative. Imagine writing a text where you always choose the most common words and standard phrases you can think of.

The opposite of deterministic greedy decoding is called stochastic decoding. This is where we start sampling from our distribution to get more creative texts based on chance. Stochastic decoding is what we usually do with LLM outputs.
Temperature
Most people who use LLMs have heard of the temperature parameter.
OpenAI gives the following definition for the temperature parameter:
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
After our LLM outputs the probability distribution for the next token, we can modify its shape. The temperature T is a parameter that controls how sharp or flat the probability distribution for the next token becomes.
Mathematically, the new probability distribution P_T after temperature scaling can be calculated as shown in the formula below. z_v is the logprob or logit from the LLM for token v, which gets divided by the temperature parameter T.
The temperature scaling formula is essentially a softmax with the additional scaling parameter T.

Instead of sampling from the original probability distribution P, we now sample from the new probability distribution P_T.
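Here is a minimal numpy sketch of temperature scaling, i.e. a softmax over the logits divided by T (the logits are made up):

import numpy as np

def temperature_scale(logits, T):
    # Softmax with temperature: divide the logits by T before normalizing
    scaled = logits / T
    exp = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exp / exp.sum()

logits = np.array([5.0, 2.0, 1.0, 0.5])
print(temperature_scale(logits, T=0.5))  # sharper distribution
print(temperature_scale(logits, T=1.0))  # plain softmax
print(temperature_scale(logits, T=2.0))  # flatter distribution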
In the figure below, I have plotted four different probability distributions for the input prompt "1+1=". Obviously, the next token should be 2.
(These are not the full distributions, because I only plotted a few values).

As you can see, the higher the temperature T, the flatter the distribution becomes, where "flat" means that all tokens become more equally likely.
Interestingly, with the default temperature of T=1.0, the obvious answer 2 has a chance of about 85%, which means it is not that unlikely to get a different answer if you sample the output enough times.
With a temperature of T=2.0, the token "2" is still the most likely, but it has a probability of only about 5%. This means that the sampled token is almost completely random at this point. For this reason, in my opinion, there is no good argument for ever using a temperature of two.
As stated in the OpenAI documentation, the temperature is a way to make the output more random or more focused. However, since we are only shaping the probability distribution, there is always a non-zero chance of sampling a completely nonsensical token, which can lead to factually incorrect statements or grammatical nonsense.
Top-k Sampling
Top-k sampling simply takes the best k tokens and ignores the rest.
If you set k=1, you get greedy decoding.
More formally, top-k sampling truncates the probability distribution after the top-k most likely tokens. This means that we set the probability to zero for all tokens in the vocabulary V that are not part of the top-k tokens.
After setting the non-top-k tokens to zero, the remaining distribution is rescaled to sum to one again:

Here is a visualization of top-k sampling with k=4:

The top figure shows the output distribution of the model. I have drawn a red line to divide the plot into the top-k tokens on the left side and the rest on the right.
The bottom figure shows the new top-k distribution P_K after truncation and rescaling.
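Here is a minimal numpy sketch of the same truncation and rescaling step (the probabilities are made up):

import numpy as np

def top_k_sample(probs, k):
    top_indices = np.argsort(probs)[-k:]         # indices of the k most likely tokens
    truncated = np.zeros_like(probs)
    truncated[top_indices] = probs[top_indices]  # zero out all non-top-k tokens
    truncated /= truncated.sum()                 # rescale to sum to one again
    return np.random.choice(len(probs), p=truncated)

probs = np.array([0.50, 0.20, 0.15, 0.10, 0.04, 0.01])
print(top_k_sample(probs, k=4))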
Top-k sampling is not currently a parameter in the OpenAI API, but other APIs, such as Anthropic's, do have a top_k parameter.
One problem with top-k sampling is setting the parameter k. If we have a very sharp distribution, we would like to have a low k to avoid including a lot of highly unlikely tokens in the truncated vocabulary. But if we have a very flat distribution, we would like to have a high k to include a lot of reasonable token possibilities.
Top-p Sampling
Top-p sampling, also called nucleus sampling, is another stochastic decoding method that truncates the probability distribution by removing unlikely tokens from the vocabulary.
OpenAI gives the following definition for the top_p parameter:
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
In top-p sampling, we define a threshold p for the cumulative probability mass of our tokens. Mathematically, we sort the probabilities in descending order and add them up one by one, including tokens until the cumulative probability exceeds the threshold p.

We then rescale the truncated probabilities into a new probability distribution that sums to one.
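Here is a minimal numpy sketch of top-p truncation and rescaling (the probabilities are made up):

import numpy as np

def top_p_sample(probs, p):
    order = np.argsort(probs)[::-1]              # tokens sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens needed to reach the threshold
    keep = order[:cutoff]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]                # zero out everything outside the nucleus
    truncated /= truncated.sum()                 # rescale to sum to one
    return np.random.choice(len(probs), p=truncated)

probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(top_p_sample(probs, p=0.8))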
Here is a visualization for the prompt "I love to" with top-p:

The top figure shows the output distribution of the model, where I have drawn a red line at the 20% threshold, dividing the plot into the top-p tokens at the bottom and the rest at the top. The bottom figure shows that after rescaling we are left with only four tokens; the rest have been zeroed out.
I think it always makes sense to remove highly unlikely tokens from the long tail of the logprobs. So, top_p should always be less than 1.0.
But top-p sampling is not perfect either. The figure below shows a case where top-p sampling includes a lot of low-probability tokens in order to reach the cumulative threshold.

Combining Top-p and Temperature
While I could not find any official documentation from OpenAI, community testing suggests that top_p is applied before temperature.
While the OpenAI documentation generally recommends against setting both top_p and temperature, I think there are good reasons to do so.
If you change only the temperature parameter, you flatten or sharpen the probability distribution. This makes the output either more deterministic (low temperature) or more creative (high temperature).
However, the model still samples randomly from the probability distribution. Thus, there is always a chance that the model will sample a highly unlikely token.
For example, the model might sample a character from a foreign language. Or a strange Unicode character. Anything in the model's vocabulary (which is now very large and multilingual) can be sampled.
As a solution, top-p sampling can first eliminate these random tokens, so that a higher temperature can then creatively sample from the remaining reasonable pool of tokens.
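The exact order in which an inference engine applies these parameters may differ, but here is a rough numpy sketch of the idea: top-p truncation first, then temperature scaling over the surviving tokens (the logits are made up):

import numpy as np

def top_p_then_temperature(logits, p, T):
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()                                   # softmax to get probabilities for top-p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                                  # the remaining pool of reasonable tokens
    kept = logits[keep]
    scaled = np.exp((kept - np.max(kept)) / T)             # temperature scaling on the survivors only
    scaled /= scaled.sum()
    return int(keep[np.random.choice(len(keep), p=scaled)])

logits = np.array([5.0, 3.5, 3.0, 1.0, -2.0, -5.0])
print(top_p_then_temperature(logits, p=0.9, T=1.5))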
Min-p Sampling
There is a relatively new sampling method called min-p sampling from the paper "Min P Sampling: Balancing Creativity and Coherence at High Temperature" [1].
Min-p is also a stochastic decoding method based on truncation that tries to solve the problems of top-p sampling by having a more dynamic threshold p.

First, we find the maximum probability p_max from the top token in our probability distribution.
This maximum probability is then multiplied by a parameter p_base to give us a minimum threshold p_scaled. We then keep all tokens with a probability greater than or equal to p_scaled as candidates. Finally, we rescale the truncated probabilities into a new probability distribution.
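Here is a minimal numpy sketch of min-p truncation and rescaling (the probabilities are made up):

import numpy as np

def min_p_sample(probs, p_base):
    p_scaled = p_base * probs.max()                  # dynamic threshold based on the top token
    truncated = np.where(probs >= p_scaled, probs, 0.0)
    truncated /= truncated.sum()                     # rescale to sum to one
    return np.random.choice(len(probs), p=truncated)

probs = np.array([0.32, 0.20, 0.15, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01])
print(min_p_sample(probs, p_base=0.1))               # threshold p_scaled = 0.032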

Min-p sampling is already implemented in some backends, such as vLLM and llama.cpp, and I would not be surprised to see it added to the OpenAI API soon.
I have visualized min-p sampling in the figure below with p_base = 0.1 for the input prompt "I love to read a".

Since the token "good" has a probability of 32% and I set p_base = 0.1, we get a minimum probability threshold of p_scaled = 3.2%, which is shown as the red line in the graph.
Conclusion
Now that we know exactly what the sampling parameters do, we can try to find values that work for our LLM use case.
The most important parameters at this point are temperature and top_p.
The temperature parameter flattens or sharpens the model's output probabilities. However, there is always a chance to sample a completely nonsensical token. In the end, we are still rolling the dice when sampling the next token, given our probability distribution.
The top_k and top_p parameters truncate the probability distribution, i.e., they remove unlikely tokens from the set of possible candidates. However, if we are unlucky, top_k and top_p may remove either too many good candidates or not enough bad candidates.
If your LLM framework supports min-p sampling, you should try it.
References
[1] N. N. Minh, A. Baker, A. Kirsch, C. Neo, Min P Sampling: Balancing Creativity and Coherence at High Temperature (2024), arXiv:2407.01082
[2] A. Holtzman, J. Buys, L. Du, M. Forbes, Y. Choi, The Curious Case of Neural Text Degeneration (2019), Proceedings of International Conference on Learning Representations (ICLR) 2020