Togmal-demo / DEMO_README.md

HeTalksInMaths

Initial commit: ToGMAL Prompt Difficulty Analyzer with real MMLU data

d67728f about 2 months ago

	# 🧠 ToGMAL Prompt Difficulty Analyzer

	Real-time LLM capability boundary detection using vector similarity search.

	## 🎯 What This Does

	This system analyzes any prompt and tells you:
	1. How difficult it is for current LLMs (based on real benchmark data)
	2. Why it's difficult (shows similar benchmark questions)
	3. What to do about it (actionable recommendations)

	## 🔥 Key Innovation

	Instead of clustering questions by domain (for example, all math together), we cluster them by difficulty: what is actually hard for LLMs, regardless of domain.

	## 📊 Real Data

	- 14,042 MMLU questions with real success rates from top models
	- <50 ms query time for real-time analysis
	- Production-ready vector database

	## 🚀 Demo Links

	- Local: http://127.0.0.1:7860
	- Public: https://99b38fc2e31da2f83d.gradio.live (temporary Gradio share link; it may no longer be live)

	## 🧪 Example Results

	### Hard Questions (Low Success Rates)
	```
	Prompt: "Statement 1 | Every field is also a ring..."
	Risk: HIGH (23.9% success)
	Recommendation: Multi-step reasoning with verification

	Prompt: "Find all zeros of polynomial x³ + 2x + 2 in Z₇"
	Risk: MODERATE (43.8% success)
	Recommendation: Use chain-of-thought prompting
	```

	### Easy Questions (High Success Rates)
	```
	Prompt: "What is 2 + 2?"
	Risk: MINIMAL (100% success)
	Recommendation: Standard LLM response adequate

	Prompt: "What is the capital of France?"
	Risk: MINIMAL (100% success)
	Recommendation: Standard LLM response adequate
	```
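
The example results above pair a success rate with a risk level and a recommendation. The sketch below shows one way such a mapping could look. The cut-offs `HIGH_RISK_BELOW` and `MODERATE_RISK_BELOW` and the function name `assess` are hypothetical, chosen only so that the examples above land in the tiers shown; the actual thresholds used by the demo are not stated in this README.

```python
from typing import Tuple

# Hypothetical cut-offs (not taken from the demo's code): picked so that
# 23.9% -> HIGH, 43.8% -> MODERATE and 100% -> MINIMAL, as in the examples.
HIGH_RISK_BELOW = 0.30
MODERATE_RISK_BELOW = 0.70


def assess(success_rate: float) -> Tuple[str, str]:
    """Map an estimated success rate (0..1) to a (risk level, recommendation) pair."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate < HIGH_RISK_BELOW:
        return "HIGH", "Multi-step reasoning with verification"
    if success_rate < MODERATE_RISK_BELOW:
        return "MODERATE", "Use chain-of-thought prompting"
    return "MINIMAL", "Standard LLM response adequate"


if __name__ == "__main__":
    # Success rates taken from the example results above.
    for rate in (0.239, 0.438, 1.0):
        level, advice = assess(rate)
        print(f"Risk: {level} ({rate:.1%} success) -> {advice}")
```

Keeping the thresholds as named constants makes them easy to tune once more benchmark data is available.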

	## 🛠️ Technical Details

	### Architecture
	```
	User Prompt → Embedding Model → Vector DB → K Nearest Questions → Weighted Score
	```
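
The last two stages of the pipeline above (K nearest questions, then a weighted score) can be sketched in plain Python. This is a minimal illustration, not the demo's implementation: it assumes cosine similarity as the distance measure and similarity-proportional weights, and the names `cosine_similarity` and `weighted_success_rate` as well as the toy 2-D vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_success_rate(
    query: Sequence[float],
    indexed: List[Tuple[Sequence[float], float]],
    k: int = 3,
) -> float:
    """Similarity-weighted mean success rate of the k questions nearest to `query`.

    `indexed` holds (embedding, success_rate) pairs for benchmark questions.
    """
    if not indexed:
        raise ValueError("indexed must contain at least one question")
    # Rank all indexed questions by similarity to the query and keep the top k.
    scored = sorted(
        ((cosine_similarity(query, vec), rate) for vec, rate in indexed),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Clip negative similarities so they cannot contribute negative weight.
    weights = [max(sim, 0.0) for sim, _ in scored]
    total = sum(weights)
    if total == 0.0:
        # No neighbour is similar at all: fall back to an unweighted mean.
        return sum(rate for _, rate in scored) / len(scored)
    return sum(w * rate for w, (_, rate) in zip(weights, scored)) / total


if __name__ == "__main__":
    # Toy 2-D "embeddings" with made-up success rates, for illustration only.
    toy_index = [([1.0, 0.0], 0.2), ([0.0, 1.0], 0.9), ([1.0, 1.0], 0.5)]
    print(weighted_success_rate([1.0, 0.0], toy_index, k=2))
```

Weighting by similarity lets a near-duplicate benchmark question dominate the estimate, while distant neighbours contribute little.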

	### Components
	1. Sentence Transformers (all-MiniLM-L6-v2) for embeddings
	2. ChromaDB for vector storage
	3. Real MMLU data with success rates from top models
	4. Gradio for web interface
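
As a rough sketch of how the first three components might be wired together, the code below embeds questions with `all-MiniLM-L6-v2`, stores them in an in-memory ChromaDB collection, and retrieves the nearest questions for a prompt. The function names, the collection name `mmlu_questions`, and the input record layout (`id`, `text`, `success_rate`) are assumptions made for illustration; the demo's actual code may be organized differently.

```python
from typing import Dict, List


def build_records(questions: List[Dict]) -> Dict[str, list]:
    """Split question dicts into the parallel lists that a ChromaDB collection's add() takes."""
    return {
        "ids": [q["id"] for q in questions],
        "documents": [q["text"] for q in questions],
        "metadatas": [{"success_rate": q["success_rate"]} for q in questions],
    }


def index_questions(questions: List[Dict], collection_name: str = "mmlu_questions"):
    """Embed the questions and store them in an in-memory ChromaDB collection."""
    # Imported lazily so build_records() is usable without the heavy dependencies.
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.Client()  # in-memory client; nothing is persisted to disk
    collection = client.get_or_create_collection(name=collection_name)
    records = build_records(questions)
    embeddings = model.encode(records["documents"]).tolist()
    collection.add(embeddings=embeddings, **records)
    return model, collection


def nearest_questions(model, collection, prompt: str, k: int = 5):
    """Return (question text, metadata, distance) for the k questions nearest to the prompt."""
    query_embedding = model.encode([prompt]).tolist()
    result = collection.query(query_embeddings=query_embedding, n_results=k)
    # Results are nested per query; index 0 selects our single query.
    return list(
        zip(result["documents"][0], result["metadatas"][0], result["distances"][0])
    )
```

The success rates travel with each question as metadata, so the neighbours returned by `nearest_questions` carry everything needed to compute a difficulty score.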

	## 📈 Next Steps

	1. Add more benchmark datasets (GPQA, MATH)
	2. Fetch real per-question results from multiple top models
	3. Integrate with ToGMAL MCP server for Claude Desktop
	4. Deploy to HuggingFace Spaces for permanent hosting

	## 🚀 Quick Start

	```bash
	# Install dependencies
	uv pip install -r requirements.txt
	uv pip install gradio

	# Run the demo
	python demo_app.py
	```

	Visit http://127.0.0.1:7860 to use the web interface.