# Local Setup Guide - Lineage Graph Extractor
This guide provides detailed instructions for setting up and running the Lineage Graph Extractor agent locally.
## Table of Contents
1. [System Requirements](#system-requirements)
2. [Installation Methods](#installation-methods)
3. [Configuration](#configuration)
4. [Usage Scenarios](#usage-scenarios)
5. [Advanced Configuration](#advanced-configuration)
6. [Troubleshooting](#troubleshooting)
## System Requirements
### Minimum Requirements
- **OS**: Windows 10+, macOS 10.15+, or Linux
- **Python**: 3.9 or higher
- **Memory**: 2GB RAM minimum
- **Disk Space**: 100MB for agent files
### Recommended Requirements
- **Python**: 3.10+
- **Memory**: 4GB RAM
- **Internet**: Stable connection for API calls
## Installation Methods
### Method 1: Standalone Use (Recommended)
This method uses the agent configuration files with any platform that supports the Anthropic API.
1. **Download the agent**
```bash
# If you have a git repository
git clone <repository-url>
cd local_clone
# Or extract from downloaded archive
unzip lineage-graph-extractor.zip
cd lineage-graph-extractor
```
2. **Set up environment**
```bash
# Copy environment template
cp .env.example .env
```
3. **Edit .env file**
```bash
# Edit with your preferred editor
nano .env
# or
vim .env
# or
code .env # VS Code
```
Add your credentials:
```
ANTHROPIC_API_KEY=sk-ant-your-key-here
GOOGLE_CLOUD_PROJECT=your-gcp-project
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
4. **Install Python dependencies** (optional, for examples)
```bash
pip install anthropic google-cloud-bigquery requests pyyaml
```
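To confirm the installation succeeded, an import check like this should run without errors:
```python
# Sanity check: import each optional dependency installed above.
import anthropic
import requests
import yaml
from google.cloud import bigquery

print("All optional dependencies imported successfully")
```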
### Method 2: Claude Desktop Integration
If you're using Claude Desktop or similar platforms:
1. **Locate your agent configuration directory**
- Claude Desktop: `~/.config/claude/agents/` (Linux/Mac) or `%APPDATA%\claude\agents\` (Windows)
- Other platforms: Check platform documentation
2. **Copy the memories folder**
```bash
# Linux/Mac
mkdir -p ~/.config/claude/agents/lineage-extractor
cp -r memories ~/.config/claude/agents/lineage-extractor/
# Windows
xcopy /E /I memories %APPDATA%\claude\agents\lineage-extractor\
```
3. **Configure API credentials** in your platform's settings
4. **Restart the application**
### Method 3: Python Integration
To integrate into your own Python application:
1. **Install dependencies**
```bash
pip install anthropic python-dotenv
```
2. **Use the integration example**
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
# Initialize client
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
# Load agent configuration
with open("memories/agent.md", "r") as f:
    system_prompt = f.read()

# Use the agent
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from this metadata: ..."
    }]
)
print(response.content[0].text)
```
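If you call the agent repeatedly, it can help to wrap the request in a small helper. This is just a convenience sketch around the example above; the function name is ours, not part of the agent:
```python
def extract_lineage(client: Anthropic, system_prompt: str, metadata: str) -> str:
    """Send metadata to the agent and return its lineage analysis as text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Extract lineage from this metadata:\n\n{metadata}"
        }],
    )
    return response.content[0].text

# Example:
# print(extract_lineage(client, system_prompt, open("dbt_manifest.json").read()))
```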
## Configuration
### API Keys Setup
#### Anthropic API Key
1. Go to https://console.anthropic.com/
2. Create an account or sign in
3. Navigate to API Keys
4. Create a new key
5. Copy to `.env` file
#### Google Cloud (for BigQuery)
1. Go to https://console.cloud.google.com/
2. Create a project or select existing
3. Enable BigQuery API
4. Create a service account:
- Go to IAM & Admin β†’ Service Accounts
- Create service account
- Grant "BigQuery Data Viewer" role
- Create JSON key
5. Download JSON and reference in `.env`
#### Tavily (for web search)
1. Go to https://tavily.com/
2. Sign up for an account
3. Get your API key
4. Add to `.env` file
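Assuming the agent reads the key from a `TAVILY_API_KEY` environment variable (check your platform's documentation for the exact name), the `.env` entry would look like:
```
TAVILY_API_KEY=tvly-your-key-here
```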
### Tool Configuration
Edit `memories/tools.json` to customize available tools:
```json
{
  "tools": [
    "bigquery_execute_query",
    "read_url_content",
    "google_sheets_read_range",
    "tavily_web_search"
  ],
  "interrupt_config": {
    "bigquery_execute_query": false,
    "read_url_content": false,
    "google_sheets_read_range": false,
    "tavily_web_search": false
  }
}
```
**Available Tools:**
- `bigquery_execute_query`: Execute SQL queries on BigQuery
- `read_url_content`: Fetch content from URLs/APIs
- `google_sheets_read_range`: Read data from Google Sheets
- `tavily_web_search`: Perform web searches
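After editing, confirm the file is still valid JSON; a minimal check, assuming the layout shown above:
```python
import json

# json.load raises an error on any syntax problem, e.g. a stray comma.
with open("memories/tools.json") as f:
    config = json.load(f)

print("Configured tools:", config["tools"])
```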
### Subagent Configuration
Customize subagents by editing their configuration files:
**Metadata Parser** (`memories/subagents/metadata_parser/`)
- `agent.md`: Instructions for parsing metadata
- `tools.json`: Tools available to parser
**Graph Visualizer** (`memories/subagents/graph_visualizer/`)
- `agent.md`: Instructions for creating visualizations
- `tools.json`: Tools available to visualizer
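A quick way to confirm the subagents are wired up is to check that both files exist in each directory; a minimal sketch, assuming the layout above:
```python
from pathlib import Path

# Verify each subagent directory contains both configuration files.
for subagent in Path("memories/subagents").iterdir():
    if not subagent.is_dir():
        continue
    for filename in ("agent.md", "tools.json"):
        status = "ok" if (subagent / filename).is_file() else "MISSING"
        print(f"{subagent.name}/{filename}: {status}")
```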
## Usage Scenarios
### Scenario 1: BigQuery Lineage Extraction
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os

# Load .env so ANTHROPIC_API_KEY is available
load_dotenv()

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

with open("memories/agent.md", "r") as f:
    system_prompt = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from BigQuery project: my-project, dataset: analytics"
    }]
)
print(response.content[0].text)
```
### Scenario 2: File-Based Metadata
```python
# Read metadata from file
with open("dbt_manifest.json", "r") as f:
    metadata = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": f"Extract lineage from this dbt manifest:\n\n{metadata}"
    }]
)
```
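dbt manifests are often too large to send whole. One option is to forward only the dependency edges; this is a sketch assuming a standard `manifest.json` layout, where each entry under `nodes` lists its upstream references in `depends_on.nodes`:
```python
import json

with open("dbt_manifest.json") as f:
    manifest = json.load(f)

# Keep only node -> upstream edges; drop compiled SQL and other bulk.
edges = {
    name: node.get("depends_on", {}).get("nodes", [])
    for name, node in manifest.get("nodes", {}).items()
}
metadata = json.dumps(edges, indent=2)
```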
### Scenario 3: API Metadata
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from API: https://api.example.com/metadata"
    }]
)
```
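If the endpoint needs authentication, or you want to inspect the payload before sending it, you can fetch the metadata yourself and pass it inline instead of relying on the `read_url_content` tool (the URL below is the same placeholder as above):
```python
import requests

# Fetch the metadata directly, then embed it in the prompt.
resp = requests.get("https://api.example.com/metadata", timeout=30)
resp.raise_for_status()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": f"Extract lineage from this metadata:\n\n{resp.text}"
    }]
)
```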
## Advanced Configuration
### Custom Visualization Formats
To add custom visualization formats, edit `memories/subagents/graph_visualizer/agent.md`:
```markdown
### 4. Custom Format
Generate a custom format with:
- Your specific requirements
- Custom styling rules
- Special formatting needs
```
### Adding New Metadata Sources
To support new metadata sources:
1. Add the tool to `memories/tools.json` (see the example after this list)
2. Update `memories/agent.md` with source-specific instructions
3. Update `memories/subagents/metadata_parser/agent.md` if needed
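For example, adding a Postgres source might look like the entry below; `postgres_execute_query` is a hypothetical tool name, so use whatever identifier your platform actually exposes:
```json
{
  "tools": [
    "bigquery_execute_query",
    "read_url_content",
    "google_sheets_read_range",
    "tavily_web_search",
    "postgres_execute_query"
  ]
}
```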
### MCP Integration
To integrate with Model Context Protocol servers:
1. Check whether MCP tools are available in the `/tools` directory
2. Add MCP tools to `memories/tools.json`
3. Configure MCP server connection
4. See `memories/mcp_integration.md` (if available)
## Troubleshooting
### Common Issues
#### 1. Authentication Errors
**Problem**: API authentication fails
**Solutions**:
- Verify API key is correct in `.env`
- Check key hasn't expired
- Ensure environment variables are loaded
- Try regenerating the API key
```bash
# Check that the Anthropic client initializes with your key
python -c "from anthropic import Anthropic; import os; from dotenv import load_dotenv; load_dotenv(); client = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY')); print('βœ“ client initialized')"
```
#### 2. BigQuery Access Issues
**Problem**: Cannot access BigQuery
**Solutions**:
- Verify service account has BigQuery permissions
- Check project ID is correct
- Ensure JSON key file path is correct
- Test credentials:
```bash
# Test BigQuery access
gcloud auth activate-service-account --key-file=/path/to/key.json
bq ls --project_id=your-project-id
```
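The same check from Python, assuming `GOOGLE_CLOUD_PROJECT` and `GOOGLE_APPLICATION_CREDENTIALS` are set in your `.env`:
```python
import os
from dotenv import load_dotenv
from google.cloud import bigquery

load_dotenv()

# The client picks up GOOGLE_APPLICATION_CREDENTIALS automatically.
client = bigquery.Client(project=os.getenv("GOOGLE_CLOUD_PROJECT"))
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```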
#### 3. Import Errors
**Problem**: `ModuleNotFoundError`
**Solutions**:
```bash
# Install missing packages
pip install anthropic google-cloud-bigquery requests pyyaml python-dotenv
# Or install all at once
pip install -r requirements.txt # if you create one
```
#### 4. Environment Variables Not Loading
**Problem**: `.env` file not being read
**Solutions**:
```python
# Explicitly load .env
from dotenv import load_dotenv
load_dotenv()
# Or specify path
load_dotenv(".env")
# Verify loading
import os
print(os.getenv("ANTHROPIC_API_KEY")) # Should not be None
```
#### 5. File Path Issues
**Problem**: Cannot find `memories/agent.md`
**Solutions**:
```python
# Use absolute path
import os
base_dir = os.path.dirname(os.path.abspath(__file__))
agent_path = os.path.join(base_dir, "memories", "agent.md")
# Or change working directory
os.chdir("/path/to/local_clone")
```
### Performance Issues
#### Slow Response Times
**Causes**:
- Large metadata files
- Complex lineage graphs
- Network latency
**Solutions**:
- Break large metadata into chunks (see the sketch after this list)
- Use filtering to focus on specific entities
- Increase API timeout settings
- Cache frequently used results
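Chunking can be as simple as slicing the payload by character count and sending one chunk per request; a naive sketch (the limit is arbitrary, so tune it to your model's context window):
```python
def chunk_text(text: str, max_chars: int = 50_000):
    """Yield successive slices of a large metadata payload."""
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]

# Example:
# for chunk in chunk_text(metadata):
#     send each chunk in its own request and merge the results
```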
### Debugging Tips
1. **Enable verbose logging**
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
2. **Test each component separately**
- Test API connection first
- Test metadata retrieval
- Test parsing separately
- Test visualization separately
3. **Validate metadata format**
- Ensure JSON is valid
- Check for required fields
- Verify structure matches expected format
4. **Check agent configuration**
- Verify `memories/agent.md` is readable
- Check `tools.json` syntax
- Ensure subagent files exist
## Getting Help
### Documentation
- Agent instructions: `memories/agent.md`
- Subagent docs: `memories/subagents/*/agent.md`
- Anthropic API: https://docs.anthropic.com/
### Testing Your Setup
Run this complete test:
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os
# Load environment
load_dotenv()
# Test 1: Client Initialization
try:
    client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    print("βœ“ Anthropic client initialized")
except Exception as e:
    print(f"βœ— Client initialization failed: {e}")
    exit(1)

# Test 2: Load Agent Config
try:
    with open("memories/agent.md", "r") as f:
        system_prompt = f.read()
    print("βœ“ Agent configuration loaded")
except Exception as e:
    print(f"βœ— Failed to load agent config: {e}")
    exit(1)

# Test 3: Simple Query
try:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": "Hello, what can you help me with?"
        }]
    )
    print("βœ“ Agent response successful")
    print(f"\nAgent says: {response.content[0].text}")
except Exception as e:
    print(f"βœ— Agent query failed: {e}")
    exit(1)
print("\nβœ“ All tests passed! Your setup is ready.")
```
Save as `test_setup.py` and run:
```bash
python test_setup.py
```
## Next Steps
1. βœ… Complete setup
2. βœ… Test with sample metadata
3. πŸ“Š Extract your first lineage
4. 🎨 Customize visualization preferences
5. πŸ”§ Integrate with your workflow
---
**Setup complete?** Try the usage examples in README.md or run your own lineage extraction!