# Day 03: Managing Context and Conversation History
Today’s focus was on building a robust conversational AI loop with Google Gemini and learning how to handle context length and token limits effectively. After a couple of iterations, I now have a stable chat CLI that can manage conversation history without crashing or exceeding model constraints.
## Key Challenges
When working with LLMs, there are a few practical hurdles:
- **Context length exceeded errors** - LLMs have a maximum number of tokens they can process at once. Sending too much conversation history causes errors.
- **Stateless API behavior** - The model does not remember past messages. Every message must be sent in the conversation array for context.
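Concretely, the "conversation array" is just a list the client owns and re-sends on every call. A minimal sketch of the shape I use throughout this post (the example turns are invented):

```python
# Client-side conversation state. The API remembers nothing between calls,
# so this whole list (possibly truncated) rides along with every request.
messages = [
    {"role": "user", "content": "What's a token?"},
    {"role": "assistant", "content": "Roughly a few characters of text..."},
    {"role": "user", "content": "And how many fit in one request?"},
]
```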
## Trimming Old Messages
To prevent context overflow, I implemented history truncation:
```python
EXERCISE_MAX_CONTEXT_TOKENS = 4096   # hard cap on tokens per request
RESERVED_OUTPUT_TOKENS = 500         # headroom kept for the model's reply
TRUNCATE_THRESHOLD_TOKENS = 3500     # start trimming history above this
```
Before sending a request, I:
- Estimate input tokens with `count_tokens(messages, model_name)`.
- If the estimate plus reserved output tokens exceeds the maximum, remove the oldest user + assistant message pair until the conversation fits.
- Handle malformed histories (e.g., a history that starts with an assistant message) by trimming messages one at a time.
Every time truncation occurs, the CLI prints:
```
[context] Truncated oldest messages to fit token budget.
```

This ensures the chatbot never crashes on over-long input, and older conversation is pruned gracefully.
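Putting those steps together, here is a minimal sketch of the trimming helper. The constants are the ones defined above; `count_tokens` is the estimator mentioned earlier, whatever implementation sits behind it:

```python
def truncate_history(messages, model_name):
    """Drop the oldest turns until the history fits the token budget.

    A sketch of the steps described above, not a definitive implementation.
    """
    truncated = False
    while messages and (
        count_tokens(messages, model_name) + RESERVED_OUTPUT_TOKENS
        > EXERCISE_MAX_CONTEXT_TOKENS
    ):
        # Normal case: drop the oldest user + assistant pair together.
        if (
            len(messages) >= 2
            and messages[0]["role"] == "user"
            and messages[1]["role"] == "assistant"
        ):
            del messages[:2]
        else:
            # Malformed history (e.g., starts with an assistant message):
            # trim one message at a time instead.
            del messages[:1]
        truncated = True
    if truncated:
        print("[context] Truncated oldest messages to fit token budget.")
```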
## Conversation Loop
The CLI handles user input with a clean loop:
- Prompts with exactly `You: `
- Ignores empty input
- Exits gracefully on `quit`, `exit`, or `/quit`
- Appends each user message to `messages` before sending
- Appends the assistant reply afterward, preserving context
For example:
```python
# Record the user turn, then replay the full history to the stateless API.
messages.append({"role": "user", "content": user_input})

# Gemini expects roles "user"/"model", so map our "assistant" role to "model".
contents = [
    genai.types.Content(
        role="user" if msg["role"] == "user" else "model",
        parts=[genai.types.Part(text=msg["content"])],
    )
    for msg in messages
]

response = client.models.generate_content(
    model=model_name,
    contents=contents,
)

# Keep the reply in history so the next turn has full context.
assistant_text = response.text.strip()
messages.append({"role": "assistant", "content": assistant_text})
```
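For completeness, here is roughly how the surrounding loop looks, setup included. The API key handling and model name are my own assumptions here, not fixed by the exercise:

```python
import os

from google import genai

# Assumed setup: key from the environment, model name as a placeholder.
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
model_name = "gemini-2.0-flash"

messages = []

while True:
    user_input = input("You: ").strip()
    if not user_input:
        continue  # ignore empty input
    if user_input in ("quit", "exit", "/quit"):
        break  # exit gracefully

    messages.append({"role": "user", "content": user_input})
    truncate_history(messages, model_name)  # stay within the token budget

    contents = [
        genai.types.Content(
            role="user" if msg["role"] == "user" else "model",
            parts=[genai.types.Part(text=msg["content"])],
        )
        for msg in messages
    ]
    response = client.models.generate_content(model=model_name, contents=contents)

    assistant_text = response.text.strip()
    messages.append({"role": "assistant", "content": assistant_text})
    print(assistant_text)
```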
## Lessons Learned
- **Stateless APIs require explicit context management** - Without sending the full conversation, the assistant can't respond coherently.
- **Input truncation is critical** - Implementing a history trimming strategy ensures the chat runs smoothly, even with long conversations.
- **Token estimation is helpful** - Using `count_tokens()` for the messages lets you anticipate and avoid context overflow.
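On that last point, the estimator itself can lean on the SDK rather than a local heuristic. A sketch of how the `count_tokens(messages, model_name)` helper could be backed by the API's server-side counter; the helper name and message format are mine from above, while `client.models.count_tokens` is the SDK call, assuming the `google-genai` client used earlier:

```python
def count_tokens(messages, model_name):
    """Estimate input tokens for the current history via the API.

    A sketch: converts the role/content dicts into Gemini Content objects
    and asks the server-side counter.
    """
    contents = [
        genai.types.Content(
            role="user" if msg["role"] == "user" else "model",
            parts=[genai.types.Part(text=msg["content"])],
        )
        for msg in messages
    ]
    result = client.models.count_tokens(model=model_name, contents=contents)
    return result.total_tokens
```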
Day 3 was a great reminder that engineering LLM applications is not just about sending prompts. The structure, limits, and state management all matter.