Using pinstripes with LangChain

Install

pip install langchain langchain-openai

Basic chat

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="ps/qwen3.6-a3b",
    api_key="sk-ps-...",
    base_url="https://api.pinstripes.io/v1",
)

response = llm.invoke("Explain prefix caching in one sentence.")
print(response.content)

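That's the entire change from a standard OpenAI setup: a pinstripes model name, a pinstripes API key, and the pinstripes base_url. LangChain features built on ChatOpenAI, such as the streaming, chain, and agent examples in this guide, are used the same way.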
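
Hard-coding the key is fine for a quick test, but in shared code it is usually safer to read it from the environment. A minimal sketch, assuming a hypothetical environment variable named PINSTRIPES_API_KEY (the name and both helper functions are this example's choice, not something pinstripes or LangChain defines):

```python
import os

PINSTRIPES_BASE_URL = "https://api.pinstripes.io/v1"  # same URL as in the example above


def pinstripes_kwargs(model: str, env_var: str = "PINSTRIPES_API_KEY") -> dict:
    """Build ChatOpenAI keyword arguments from the environment.

    env_var is a hypothetical variable name chosen for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} to your pinstripes API key")
    return {"model": model, "api_key": api_key, "base_url": PINSTRIPES_BASE_URL}


def make_llm(model: str):
    """Create a ChatOpenAI client pointed at pinstripes."""
    from langchain_openai import ChatOpenAI  # lazy import: only needed when called

    return ChatOpenAI(**pinstripes_kwargs(model))
```

With the variable set, llm = make_llm("ps/qwen3.6-a3b") gives the same client as the hard-coded version.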

Streaming

for chunk in llm.stream("Write a haiku about distributed inference."):
    print(chunk.content, end="", flush=True)
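
If you also need the full text once streaming finishes (for logging or post-processing), you can accumulate the pieces while printing them. A small sketch; stream_and_collect is a hypothetical helper name, and it assumes each chunk's content is a plain string, as in the loop above:

```python
from typing import Iterable


def stream_and_collect(chunks: Iterable, echo: bool = True) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        text = chunk.content  # assumed to be a str
        if echo:
            print(text, end="", flush=True)
        parts.append(text)
    if echo:
        print()  # finish the line once the stream ends
    return "".join(parts)
```

Usage would look like full_text = stream_and_collect(llm.stream("Write a haiku about distributed inference.")).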

Using with LCEL chains

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="ps/deepseek-v4-flash",
    api_key="sk-ps-...",
    base_url="https://api.pinstripes.io/v1",
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise technical writer."),
    ("user", "{input}"),
])

chain = prompt | llm
result = chain.invoke({"input": "What is mixture-of-experts architecture?"})
print(result.content)

Agents

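from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="ps/qwen3.6-a3b",
    api_key="sk-ps-...",
    base_url="https://api.pinstripes.io/v1",
)

# Example tool; replace with your own. Register at least one,
# since some OpenAI-compatible APIs reject an empty tools list.
@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

tools = [word_count]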

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    ("placeholder", "{chat_history}"),
    ("user", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
result = executor.invoke({"input": "Hello!"})
print(result["output"])

Reducing costs with caching

For long system prompts or repeated context (RAG chunks, persona cards), pinstripes automatically caches repeated prefixes. Structure your prompts with the static portion first:

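from langchain_core.prompts import ChatPromptTemplate

long_static_system_prompt = "..."  # placeholder: your real, unchanging system prompt

prompt = ChatPromptTemplate.from_messages([
    ("system", long_static_system_prompt),  # cached after first call
    ("user", "{query}"),                    # changes each turn
])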
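
The ordering matters because a prefix cache can only reuse content that matches from the very start of the prompt. The sketch below uses plain (role, text) tuples rather than LangChain objects to show the idea; build_messages and shared_prefix_len are hypothetical helpers, and exactly how pinstripes matches prefixes is not specified here.

```python
def build_messages(static_system: str, query: str) -> list[tuple[str, str]]:
    """Put the unchanging system prompt first and the per-call query last."""
    return [("system", static_system), ("user", query)]


def shared_prefix_len(a: list, b: list) -> int:
    """Count how many leading messages two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


system = "You are a support assistant. <long policy text here>"
first = build_messages(system, "How do I reset my password?")
second = build_messages(system, "What is your refund policy?")

# Static-first: the system message is identical in both prompts,
# so it forms a shared prefix that is eligible for caching.
print(shared_prefix_len(first, second))

# Query-first (reversed order): the prompts diverge at the first message,
# so there is no shared prefix to reuse.
print(shared_prefix_len(first[::-1], second[::-1]))
```

The same reasoning applies to RAG chunks or persona cards: anything that varies per call belongs after the parts that stay fixed.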

Cached tokens are billed at $0.05 per 1M tokens, roughly 60–90% cheaper than uncached input tokens, depending on the model.
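
To get a feel for the savings, the arithmetic can be sketched as below. The $0.05 per 1M cached rate comes from the text above; the $0.25 per 1M input rate, the token counts, and the assumption that only the first call pays full price for the static prefix are illustrative, so check your model's actual pricing and cache behaviour.

```python
CACHED_PRICE_PER_M = 0.05  # USD per 1M cached tokens (from the text above)


def input_cost(static_tokens, dynamic_tokens, calls, input_price_per_m, cached=True):
    """Estimate total input cost in USD over `calls` requests.

    Assumes the static prefix is billed at the full input rate on the first
    call and at the cached rate on every later call (a simplification).
    """
    if calls <= 0:
        return 0.0
    if not cached:
        full_tokens = (static_tokens + dynamic_tokens) * calls
        return full_tokens * input_price_per_m / 1_000_000
    # First call pays full price for the prefix; every call pays for its query.
    full_tokens = static_tokens + dynamic_tokens * calls
    cached_tokens = static_tokens * (calls - 1)
    total = full_tokens * input_price_per_m + cached_tokens * CACHED_PRICE_PER_M
    return total / 1_000_000


# Hypothetical workload: 2,000-token system prompt, 100-token queries, 1,000 calls,
# with an assumed input price of $0.25 per 1M tokens.
without = input_cost(2000, 100, 1000, 0.25, cached=False)
with_cache = input_cost(2000, 100, 1000, 0.25)
print(f"without caching: ${without:.4f}, with caching: ${with_cache:.4f}")
```

The longer the static prefix is relative to the per-call query, the closer the overall saving gets to the per-token discount.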