Why Care About Prompt Caching in LLMs?
Optimizing the cost and latency of your LLM calls with Prompt Caching
In the past, we’ve talked a lot about what an incredible tool RAG is for leveraging the power of AI on custom data. But, whether we are talking about plain LLM API requests, RAG applications, or more complex AI agents, there is one common question that remains the same. How do all these things scale? In particular, what happens with cost and latency as the number of requests in such apps grows? Especially for more advanced AI agents, which may contain multiple calls to an LLM for processing a single user query, these questions become of particular importance.
Fortunately, in reality, when making calls to an LLM, the same input tokens are usually repeated across multiple requests. Users are going to ask some specific questions much more than others, system prompts and instructions integrated in AI-powered applications are repeated in every user query, and even for a single prompt, models perform recursive calculations to generate an entire response (remember how LLMs produce text by predicting words one by one?). Similar to other applications, the use of caching mechanisms can significantly help optimize LLM request costs and latency. For instance, according to OpenAI documentation, Prompt Caching can reduce latency by up to an impressive 80% and input token costs by up to 90%.
What about caching?
In general, caching in computing is no new idea. At its core, a cache is a component that stores data temporarily so that future requests for the same data can be served faster. In this way, we can distinguish between two basic cache states – a cache hit and a cache miss. In particular:
A cache hit occurs when the requested data is found in the cache, allowing for a quick and cheap retrieval.
A cache miss occurs when the data is not in the cache, forcing the application to access the original source, which is more expensive and time-consuming.
One of the most typical implementations of cache is in web browsers. When visiting a website for the first time, the browser checks for the URL in its cache memory, but finds nothing (that will be a cache miss). Since the data we are looking for isn’t locally available, the browser has to perform a more expensive and time-consuming request to the web server across the internet, in order to find the data in the remote server where they originally exist. Once the page finally loads, the browser typically copies that data into its local cache. If we try to reload the same page 5 minutes later, the browser will look for it in its local storage. This time, it will find it (a cache hit) and load it from there, without reaching back to the server. This makes the browser work more quickly and consume fewer resources.
As you may imagine, caching is particularly useful in systems where the same data is requested multiple times. In most systems, data access is rarely uniform, but rather tends to follow a distribution where a small fraction of the data accounts for the vast majority of requests. A large portion of real-life applications follows the Pareto principle, meaning that about of 80% of the requests are about 20% of the data. If not for the Pareto principle, cache memory would need to be as large as the primary memory of the system, rendering it very, very expensive.
Prompt Caching and a Little Bit about LLM Inference
The caching concept – storing frequently used data somewhere and retrieving it from there, instead of obtaining it again from its primary source – is utilized in a similar manner for improving the efficiency of LLM calls, allowing for significantly reduced costs and latency. Caching can be utilised in various elements that may be involved in an AI application, most important of which is Prompt Caching. Nevertheless, caching can also provide great benefits by being applied to other aspects of an AI app, such as, for instance, caching in RAG retrieval or query-response caching. Nonetheless, this post is going to solely focus on Prompt Caching.
To understand how Prompt Caching works, we must first understand a little bit about how LLM inference – using a trained LLM to generate text – functions. LLM inference is not a single continuous process, but is rather divided into two distinct stages. Those are:
Pre-fill, which refers to processing the entire prompt at once to produce the first token. This stage requires heavy computation, and it is thus compute-bound. We may picture a very simplified version of this stage as each token attending to all other tokens, or something like comparing every token with every previous token.
Decoding, which appends the last generated token back into the sequence and generates the next one auto-regressively. This stage is memory-bound, as the system must load the entire context of previous tokens from memory to generate every single new token.
For example, imagine we have the following prompt:
What should I cook for dinner? From which we may then get the first token:
Hereand the following decoding iterations:
Here
Here are
Here are 5
Here are 5 easy
Here are 5 easy dinner
Here are 5 easy dinner ideasThe issue with this is that in order to generate the complete response, the model would have to process the same previous tokens over and over again to produce each next word during the decoding stage, which, as you may imagine, is highly inefficient. In our example, this means that the model would process the tokens again
In practice, with prompt caching, we save the repeated parts of a prompt after the first time it is requested. These repeated parts of a prompt usually have the form of large
Prompt 1
What should I cook for dinner? and then if we input the prompt:
Prompt 2
What should I cook for launch? The shared tokens:
Prompt 1
Dinner time! What should I cook? and then
Prompt 2
Launch time! What should I cook? This would be a cache miss, since the first token of each prompt is different. Since the prompt prefixes are different, we cannot hit cache, even if their semantics are essentially the same.
Getting our hands dirty with the OpenAI API
Nowadays, most of the frontier foundation models, like
In-memory prompt cache retention, where the cached prefixes are maintained for like 5-10 minutes and up to 1 hour, and
Extended prompt cache retention (only available for specific models), allowing for a longer retention of the cached prefix, up to a maximum of 24 hours.
But let’s take a closer look!
from openai import OpenAI
api_key = "your_api_key"
client = OpenAI(api_key=api_key)
prefix = """
You are a helpful cooking assistant.
Your task is to suggest simple, practical dinner ideas for busy people.
Follow these guidelines carefully when generating suggestions:
General cooking rules:
- Meals should take less than 30 minutes to prepare.
- Ingredients should be easy to find in a regular supermarket.
- Recipes should avoid overly complex techniques.
- Prefer balanced meals including vegetables, protein, and carbohydrates.
Formatting rules:
- Always return a numbered list.
- Provide 5 suggestions.
- Each suggestion should include a short explanation.
Ingredient guidelines:
- Prefer seasonal vegetables.
- Avoid exotic ingredients.
- Assume the user has basic pantry staples such as olive oil, salt, pepper, garlic, onions, and pasta.
Cooking philosophy:
- Favor simple home cooking.
- Avoid restaurant-level complexity.
- Focus on meals that people realistically cook on weeknights.
Example meal styles:
- pasta dishes
- rice bowls
- stir fry
- roasted vegetables with protein
- simple soups
- wraps and sandwiches
- sheet pan meals
Diet considerations:
- Default to healthy meals.
- Avoid deep frying.
- Prefer balanced macronutrients.
Additional instructions:
- Keep explanations concise.
- Avoid repeating the same ingredients in every suggestion.
- Provide variety across the meal suggestions.
""" * 80
# huge prefix to make sure i get the 1000 something token threshold for activating prompt caching
prompt1 = prefix + "What should I cook for dinner?"and then for the prompt 2
prompt2 = prefix + "What should I cook for lunch?"
response2 = client.responses.create(
model="gpt-5.2",
input=prompt2
)
print("\nResponse 2:")
print(response2.output_text)
print("\nUsage stats:")
print(response2.usage)So, for prompt 2, we would be only billed full price the remaining, non-identical part of the prompt. That would be the input tokens minus the cached tokens: 20,014 – 19,840 = only 174 tokens, or in other words, 99% less tokens for which we are charged full price. For the remaining cached tokens, we get an extremely discounted price (like with a discount of up to 90%).
On my mind
Prompt Caching is a powerful optimization for LLMs that can significantly improve the efficiency of AI applications both in terms of cost and time. By reusing previous computations for identical prompt prefixes, the model can skip redundant calculations and avoid repeatedly processing the same input tokens. The result is faster responses and lower costs, especially in applications where large parts of prompts—such as system instructions or retrieved context—remain constant across many requests. As AI systems scale and the number of LLM calls increases, these optimizations become increasingly important.
✨Thank you for reading!✨
. . .
If you made it this far, you might find pialgorithms useful — a platform we’ve been building that helps teams securely manage organizational knowledge in one place.
. . .
📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!
All images by the author, except mentioned otherwise.




