← Integration guides

4 min read

LlamaIndex with pinstripes

Install

pip install llama-index llama-index-llms-openai

Configure the LLM

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="ps/qwen3.6-a3b",
    api_key="sk-ps-...",
    api_base="https://api.pinstripes.io/v1",
)

Basic query

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(
    model="ps/qwen3.6-a3b",
    api_key="sk-ps-...",
    api_base="https://api.pinstripes.io/v1",
)

response = Settings.llm.complete("What is retrieval-augmented generation?")
print(response)

RAG pipeline

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Use pinstripes for generation
Settings.llm = OpenAI(
    model="ps/qwen3.6-a3b",
    api_key="sk-ps-...",
    api_base="https://api.pinstripes.io/v1",
)

# Load documents and build index
docs = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(docs)

# Query
engine = index.as_query_engine()
response = engine.query("Summarise the key findings.")
print(response)

Reducing RAG costs with prefix caching

In RAG setups, the retrieved context chunks often share a common system prompt prefix. pinstripes automatically caches this prefix after the first call; subsequent queries with the same prefix are billed at **$0.05/1M tokens** instead of the full input rate.

Structure your prompts to front-load the static system instructions:

from llama_index.core import PromptTemplate

qa_prompt = PromptTemplate(
    "You are a precise research assistant.\n"       # static — gets cached
    "Answer questions based only on the context.\n" # static — gets cached
    "Context: {context_str}\n"                      # may vary
    "Question: {query_str}\n"
    "Answer:"
)

For a RAG pipeline handling 1,000 queries a day with a 2,000-token system prompt, prefix caching cuts prompt costs by around 80%, since each call after the first reuses the cached prefix at the reduced rate rather than paying the full input price.