4 min read
pip install llama-index llama-index-llms-openaifrom llama_index.llms.openai import OpenAI
llm = OpenAI(
model="ps/qwen3.6-a3b",
api_key="sk-ps-...",
api_base="https://api.pinstripes.io/v1",
)from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
Settings.llm = OpenAI(
model="ps/qwen3.6-a3b",
api_key="sk-ps-...",
api_base="https://api.pinstripes.io/v1",
)
response = Settings.llm.complete("What is retrieval-augmented generation?")
print(response)from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Use pinstripes for generation
Settings.llm = OpenAI(
model="ps/qwen3.6-a3b",
api_key="sk-ps-...",
api_base="https://api.pinstripes.io/v1",
)
# Load documents and build index
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)
# Query
engine = index.as_query_engine()
response = engine.query("Summarise the key findings.")
print(response)In RAG setups, the retrieved context chunks often share a common system prompt prefix. pinstripes automatically caches this prefix after the first call; subsequent queries with the same prefix are billed at **$0.05/1M tokens** instead of the full input rate.
Structure your prompts to front-load the static system instructions:
from llama_index.core import PromptTemplate
qa_prompt = PromptTemplate(
"You are a precise research assistant.\n" # static — gets cached
"Answer questions based only on the context.\n" # static — gets cached
"Context: {context_str}\n" # may vary
"Question: {query_str}\n"
"Answer:"
)For a RAG pipeline handling 1,000 queries a day with a 2,000-token system prompt, prefix caching cuts prompt costs by around 80%, since each call after the first reuses the cached prefix at the reduced rate rather than paying the full input price.
Ready to build?