Context Summarization for Efficient Memory Management
Overview
Implemented an intelligent context summarization system that balances memory depth with token efficiency. The system now summarizes older interactions while keeping recent ones in full detail.
Strategy: Hierarchical Context Management
Two-Tier Approach
```
All 20 interactions in memory
              ↓
           Split:
├─ Older 12 interactions → SUMMARIZED (token-efficient)
└─ Recent 8 interactions → FULL DETAIL (precision)
```
Smart Transition
- 0-8 interactions: All shown in full detail
- 9+ interactions:
  - Newest 8: Full Q&A pairs
  - Older ones (up to 12): Summarized context
Implementation Details
1. Summarization Logic
File: src/agents/synthesis_agent.py (and Research_AI_Assistant version)
Method: _summarize_interactions()
```python
from typing import Any, Dict, List  # module-level import

def _summarize_interactions(self, interactions: List[Dict[str, Any]]) -> str:
    """Summarize older interactions to save tokens while maintaining context."""
    if not interactions:
        return ""

    # Extract key topics and questions from older interactions
    topics = []
    key_points = []
    for interaction in interactions:
        user_msg = interaction.get('user_input', '')
        response = interaction.get('response', '')
        if user_msg:
            topics.append(user_msg[:100])  # first 100 chars of the question
        if response:
            # Extract key sentences (first 2 sentences of the response)
            sentences = response.split('.')[:2]
            key_points.append('. '.join(sentences).strip()[:100])

    # Build a compact summary: top 5 topics, top 3 key points
    summary_lines = []
    if topics:
        summary_lines.append(f"Topics discussed: {', '.join(topics[:5])}")
    if key_points:
        summary_lines.append(f"Key points: {'. '.join(key_points[:3])}")
    return "\n".join(summary_lines) if summary_lines else "Earlier conversation about various topics."
```
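A quick usage sketch (illustrative only: the `agent` instance and the exact output text are assumptions; the interaction dicts carry the `user_input`/`response` keys the method reads):

```python
# Illustrative call; `agent` is assumed to be an existing SynthesisAgent instance.
older = [
    {"user_input": "Who is Sachin?",
     "response": "Sachin Ramesh Tendulkar is a legendary Indian cricketer. "
                 "He played international cricket from 1989 to 2013."},
    {"user_input": "Is he the greatest?",
     "response": "Many rank him among the greatest batsmen ever. "
                 "Comparisons with Don Bradman come up often."},
]
print(agent._summarize_interactions(older))
# Prints (abridged):
# Topics discussed: Who is Sachin?, Is he the greatest?
# Key points: Sachin Ramesh Tendulkar is a legendary Indian cricketer. ...
```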
2. Context Building Logic
Conditional Processing:
```python
if len(recent_interactions) > 8:
    # recent_interactions is ordered newest-first
    newest_interactions = recent_interactions[:8]  # newest 8, kept in full
    oldest_interactions = recent_interactions[8:]  # older ones (up to 12), summarized

    # Summarize older interactions
    summary = self._summarize_interactions(oldest_interactions)
    conversation_history = f"\n\nConversation Summary (earlier context):\n{summary}\n\n"
    conversation_history += "Recent conversation details:\n"

    # Include recent interactions in detail, oldest of the 8 first
    for i, interaction in enumerate(reversed(newest_interactions), 1):
        # Full Q&A pairs
        ...
else:
    # 8 or fewer interactions: show all in full detail
    # Full Q&A pairs for all
    ...
```
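For concreteness, here is a self-contained sketch of the same branch with the elided loop filled in (the `user_input`/`response` field names match the summarizer above; the exact wiring inside `_build_synthesis_prompt()` may differ):

```python
from typing import Any, Callable, Dict, List

def build_history(recent_interactions: List[Dict[str, Any]],
                  summarize: Callable[[List[Dict[str, Any]]], str]) -> str:
    """Sketch: build the history section; input is assumed newest-first."""
    if len(recent_interactions) > 8:
        newest = recent_interactions[:8]
        summary = summarize(recent_interactions[8:])
        history = f"\n\nConversation Summary (earlier context):\n{summary}\n\n"
        history += "Recent conversation details:\n"
        pairs = list(reversed(newest))            # oldest of the 8 first
    else:
        history = "\n\nPrevious conversation:\n"  # 8 or fewer: all in full
        pairs = list(reversed(recent_interactions))
    for i, interaction in enumerate(pairs, 1):
        history += f"Q{i}: {interaction.get('user_input', '')}\n"
        history += f"A{i}: {interaction.get('response', '')}\n"
    return history
```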
3. Prompt Structure
For 9+ interactions:
```
User Question: {current_question}

Conversation Summary (earlier context):
Topics discussed: Who is Sachin, Is he the greatest, Define greatness parameters
Key points: Sachin is a legendary Indian cricketer...

Recent conversation details:
Q1: Who is Sachin Tendulkar?
A1: Sachin Ramesh Tendulkar is a legendary Indian cricketer...
Q2: Is he the greatest? What about Don Bradman?
A2: The question of who is the greatest cricketer...
...

Instructions: Provide a comprehensive, helpful response...
```
For ≤8 interactions:
```
User Question: {current_question}

Previous conversation:
Q1: Who is Sachin?
A1: Sachin Ramesh Tendulkar is a legendary Indian cricketer...
...
```
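Both layouts share the same skeleton; only the history section differs. A minimal assembly sketch (variable names are assumptions):

```python
# `current_question` and `conversation_history` (built by the branch above)
# are assumed to be in scope.
prompt = (
    f"User Question: {current_question}\n"
    f"{conversation_history}\n"
    "Instructions: Provide a comprehensive, helpful response..."
)
```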
Benefits
1. Token Efficiency
- Without summarization: ~4000-8000 tokens (20 full Q&A pairs)
- With summarization: ~1500-3000 tokens (8 full + 12 summarized)
- Savings: ~60-70% reduction
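The savings figure can be sanity-checked with back-of-the-envelope numbers (the per-pair and summary sizes below are assumed averages, not measurements; actual savings depend on how long responses run):

```python
full_pair_tokens = 300       # assumed average size of one full Q&A pair
summary_block_tokens = 150   # assumed size of the whole summary section

before = 20 * full_pair_tokens                       # ~6000 tokens
after = 8 * full_pair_tokens + summary_block_tokens  # ~2550 tokens
print(f"savings: {1 - after / before:.1%}")
# roughly 57-58% with these assumptions; longer responses push the
# figure toward the quoted 60-70%
```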
2. Context Preservation
- ✅ Complete recent context (last 8 interactions in full)
- ✅ Summarized older context (topics and key points retained)
- ✅ Long-term memory (all 20+ interactions still in the database)
3. Performance Impact
- Faster inference (fewer tokens to process)
- Lower API costs (reduced token usage)
- Better response quality (focus on recent context, awareness of older topics)
4. UX Stability
- Maintains conversation flow
- Prevents topic drift
- Balances precision (recent) with breadth (older)
Example Flow
Scenario: 15 interactions about cricket
Memory (all 15):
```
I1:  Who is Sachin?               [OLD]
I2:  Is he the greatest?          [OLD]
...
I8:  Define greatness parameters  [RECENT]
I9:  Name a cricket journalist    [RECENT]
...
I15: What about IPL?              [CURRENT]
```
Sent to LLM:
```
User Question: What about IPL?

Conversation Summary (earlier context):
Topics discussed: Who is Sachin, Is he the greatest, ...
Key points: Sachin is a legendary Indian cricketer...

Recent conversation details:
Q1: Name a cricket journalist
A1: Some renowned cricket journalists include...
...
```
Edge Cases Handled
- 0-8 interactions: All shown in full detail
- Exactly 8 interactions: All shown in full detail
- 9 interactions: 8 full + 1 summarized
- 20 interactions: 8 full + 12 summarized
- 40+ interactions: 8 full + 12 summarized (memory buffer limit)
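These boundaries are easy to pin down with a small check of the split arithmetic (a standalone sketch mirroring the slicing shown earlier):

```python
def split_counts(n: int) -> tuple[int, int]:
    """Return (full, summarized) counts for n interactions."""
    window = min(n, 20)   # context window cap
    if window <= 8:
        return window, 0  # everything fits in full detail
    return 8, window - 8  # newest 8 full, the rest summarized

for n in (5, 8, 9, 20, 40):
    print(n, split_counts(n))
# 5 (5, 0) | 8 (8, 0) | 9 (8, 1) | 20 (8, 12) | 40 (8, 12)
```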
Files Modified
- ✅ src/agents/synthesis_agent.py
  - Added _summarize_interactions() method
  - Updated _build_synthesis_prompt() with split logic
- ✅ Research_AI_Assistant/src/agents/synthesis_agent.py
  - Same changes applied
Testing Recommendations
Test Scenarios
Short conversation (5 interactions):
- All 5 shown in full ✅
- No summarization
Medium conversation (10 interactions):
- Last 8 in full ✅
- First 2 summarized ✅
Long conversation (20 interactions):
- Last 8 in full ✅
- First 12 summarized ✅
- Efficient token usage ✅
Domain continuity test (see the pytest sketch below):
- Ask a series of cricket questions
- Verify the cricket context is maintained
- Check that summarization preserves the sport/topic
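The domain-continuity case can be automated along these lines (a pytest sketch; the import path and zero-argument constructor are assumptions):

```python
# test_summarization.py -- hypothetical test module
from src.agents.synthesis_agent import SynthesisAgent  # assumed import path

def test_summary_preserves_domain():
    agent = SynthesisAgent()  # assumed zero-arg constructor
    older = [
        {"user_input": "Who is Sachin Tendulkar?",
         "response": "Sachin Tendulkar is a legendary Indian cricketer. "
                     "He scored 100 international centuries."},
        {"user_input": "Is he the greatest cricketer?",
         "response": "Many rank him alongside Don Bradman. Opinions differ by era."},
    ]
    summary = agent._summarize_interactions(older)
    # The sport and the key subject should survive summarization
    assert "cricket" in summary.lower()
    assert "Sachin" in summary
```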
Technical Details
Summarization Algorithm
- Topic Extraction: First 100 chars of each user question
- Key Point Extraction: First 2 sentences of each response
- Compaction: Top 5 topics + top 3 key points
- Fallback: Generic message if no content
Memory Management
```
Memory Buffer: 40 interactions (database + in-memory)
              ↓
Context Window: 20 interactions (used)
              ↓
├─ Recent 8  → Full Q&A pairs (detail)
└─ Older 12 → Summarized (efficiency)
```
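If these tiers are made configurable, a small constants block keeps them in one place (names below are illustrative, not taken from the codebase):

```python
MEMORY_BUFFER_SIZE = 40   # interactions kept in the database + in-memory buffer
CONTEXT_WINDOW = 20       # interactions considered when building a prompt
FULL_DETAIL_COUNT = 8     # newest interactions sent as full Q&A pairs
SUMMARIZED_COUNT = CONTEXT_WINDOW - FULL_DETAIL_COUNT  # up to 12 summarized
```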
Impact
Before (20 full interactions):
- High token usage (~4000-8000 tokens)
- Slower inference
- Risk of hitting token limits
- Potential for irrelevant older context
After (8 full + 12 summarized):
- Reduced token usage (~1500-3000 tokens)
- Faster inference
- Well within token limits
- Focused on recent + topic awareness
Summary
The context summarization system intelligently balances:
- 📊 Depth: Recent 8 interactions in full detail
- 🎯 Breadth: Older 12 interactions summarized
- ⚡ Efficiency: ~60-70% token reduction
- ✅ Quality: Maintains conversation coherence
Result: Optimal UX with stable memory and efficient token usage