# Local Setup Guide - Lineage Graph Extractor
This guide provides detailed instructions for setting up and running the Lineage Graph Extractor agent locally.
## Table of Contents
1. [System Requirements](#system-requirements)
2. [Installation Methods](#installation-methods)
3. [Configuration](#configuration)
4. [Usage Scenarios](#usage-scenarios)
5. [Advanced Configuration](#advanced-configuration)
6. [Troubleshooting](#troubleshooting)
## System Requirements
### Minimum Requirements
- **OS**: Windows 10+, macOS 10.15+, or Linux
- **Python**: 3.9 or higher
- **Memory**: 2GB RAM minimum
- **Disk Space**: 100MB for agent files
### Recommended Requirements
- **Python**: 3.10+
- **Memory**: 4GB RAM
- **Internet**: Stable connection for API calls
## Installation Methods
### Method 1: Standalone Use (Recommended)
This method uses the agent configuration files with any platform that supports the Anthropic API.
1. **Download the agent**
```bash
# If you have a git repository
git clone <repository-url>
cd local_clone
# Or extract from downloaded archive
unzip lineage-graph-extractor.zip
cd lineage-graph-extractor
```
2. **Set up environment**
```bash
# Copy environment template
cp .env.example .env
```
3. **Edit .env file**
```bash
# Edit with your preferred editor
nano .env
# or
vim .env
# or
code .env # VS Code
```
Add your credentials:
```
ANTHROPIC_API_KEY=sk-ant-your-key-here
GOOGLE_CLOUD_PROJECT=your-gcp-project
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
```
4. **Install Python dependencies** (optional, for examples)
```bash
pip install anthropic google-cloud-bigquery requests pyyaml
```
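To confirm the installation succeeded, an import check like this should run without errors:
```python
# Sanity check: import each optional dependency installed above.
import anthropic
import requests
import yaml
from google.cloud import bigquery

print("All optional dependencies imported successfully")
```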
### Method 2: Claude Desktop Integration
If you're using Claude Desktop or similar platforms:
1. **Locate your agent configuration directory**
- Claude Desktop: `~/.config/claude/agents/` (Linux/Mac) or `%APPDATA%\claude\agents\` (Windows)
- Other platforms: Check platform documentation
2. **Copy the memories folder**
```bash
# Linux/Mac
mkdir -p ~/.config/claude/agents/lineage-extractor
cp -r memories ~/.config/claude/agents/lineage-extractor/
# Windows
xcopy /E /I memories %APPDATA%\claude\agents\lineage-extractor\
```
3. **Configure API credentials** in your platform's settings
4. **Restart the application**
### Method 3: Python Integration
To integrate into your own Python application:
1. **Install dependencies**
```bash
pip install anthropic python-dotenv
```
2. **Use the integration example**
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os
# Load environment variables
load_dotenv()
# Initialize client
client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
# Load agent configuration
with open("memories/agent.md", "r") as f:
    system_prompt = f.read()

# Use the agent
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from this metadata: ..."
    }]
)
print(response.content[0].text)
```
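If you call the agent repeatedly, it can help to wrap the request in a small helper. This is just a convenience sketch around the example above; the function name is ours, not part of the agent:
```python
def extract_lineage(client: Anthropic, system_prompt: str, metadata: str) -> str:
    """Send metadata to the agent and return its lineage analysis as text."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=4000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": f"Extract lineage from this metadata:\n\n{metadata}"
        }],
    )
    return response.content[0].text

# Example:
# print(extract_lineage(client, system_prompt, open("dbt_manifest.json").read()))
```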
## Configuration
### API Keys Setup
#### Anthropic API Key
1. Go to https://console.anthropic.com/
2. Create an account or sign in
3. Navigate to API Keys
4. Create a new key
5. Copy to `.env` file
#### Google Cloud (for BigQuery)
1. Go to https://console.cloud.google.com/
2. Create a project or select existing
3. Enable BigQuery API
4. Create a service account:
- Go to IAM & Admin β†’ Service Accounts
- Create service account
- Grant "BigQuery Data Viewer" role
- Create JSON key
5. Download JSON and reference in `.env`
#### Tavily (for web search)
1. Go to https://tavily.com/
2. Sign up for an account
3. Get your API key
4. Add to `.env` file
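Assuming the agent reads the key from a `TAVILY_API_KEY` environment variable (check your platform's documentation for the exact name), the `.env` entry would look like:
```
TAVILY_API_KEY=tvly-your-key-here
```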
### Tool Configuration
Edit `memories/tools.json` to customize available tools:
```json
{
  "tools": [
    "bigquery_execute_query",
    "read_url_content",
    "google_sheets_read_range",
    "tavily_web_search"
  ],
  "interrupt_config": {
    "bigquery_execute_query": false,
    "read_url_content": false,
    "google_sheets_read_range": false,
    "tavily_web_search": false
  }
}
```
**Available Tools:**
- `bigquery_execute_query`: Execute SQL queries on BigQuery
- `read_url_content`: Fetch content from URLs/APIs
- `google_sheets_read_range`: Read data from Google Sheets
- `tavily_web_search`: Perform web searches
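After editing, confirm the file is still valid JSON; a minimal check, assuming the layout shown above:
```python
import json

# json.load raises an error on any syntax problem, e.g. a stray comma.
with open("memories/tools.json") as f:
    config = json.load(f)

print("Configured tools:", config["tools"])
```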
### Subagent Configuration
Customize subagents by editing their configuration files:
**Metadata Parser** (`memories/subagents/metadata_parser/`)
- `agent.md`: Instructions for parsing metadata
- `tools.json`: Tools available to parser
**Graph Visualizer** (`memories/subagents/graph_visualizer/`)
- `agent.md`: Instructions for creating visualizations
- `tools.json`: Tools available to visualizer
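A quick way to confirm the subagents are wired up is to check that both files exist in each directory; a minimal sketch, assuming the layout above:
```python
from pathlib import Path

# Verify each subagent directory contains both configuration files.
for subagent in Path("memories/subagents").iterdir():
    if not subagent.is_dir():
        continue
    for filename in ("agent.md", "tools.json"):
        status = "ok" if (subagent / filename).is_file() else "MISSING"
        print(f"{subagent.name}/{filename}: {status}")
```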
## Usage Scenarios
### Scenario 1: BigQuery Lineage Extraction
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os

# Load .env so ANTHROPIC_API_KEY is available
load_dotenv()

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

with open("memories/agent.md", "r") as f:
    system_prompt = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from BigQuery project: my-project, dataset: analytics"
    }]
)
print(response.content[0].text)
```
### Scenario 2: File-Based Metadata
```python
# Read metadata from file
with open("dbt_manifest.json", "r") as f:
    metadata = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": f"Extract lineage from this dbt manifest:\n\n{metadata}"
    }]
)
```
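dbt manifests are often too large to send whole. One option is to forward only the dependency edges; this is a sketch assuming a standard `manifest.json` layout, where each entry under `nodes` lists its upstream references in `depends_on.nodes`:
```python
import json

with open("dbt_manifest.json") as f:
    manifest = json.load(f)

# Keep only node -> upstream edges; drop compiled SQL and other bulk.
edges = {
    name: node.get("depends_on", {}).get("nodes", [])
    for name, node in manifest.get("nodes", {}).items()
}
metadata = json.dumps(edges, indent=2)
```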
### Scenario 3: API Metadata
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from API: https://api.example.com/metadata"
    }]
)
```
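If the endpoint needs authentication, or you want to inspect the payload before sending it, you can fetch the metadata yourself and pass it inline instead of relying on the `read_url_content` tool (the URL below is the same placeholder as above):
```python
import requests

# Fetch the metadata directly, then embed it in the prompt.
resp = requests.get("https://api.example.com/metadata", timeout=30)
resp.raise_for_status()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": f"Extract lineage from this metadata:\n\n{resp.text}"
    }]
)
```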
## Advanced Configuration
### Custom Visualization Formats
To add custom visualization formats, edit `memories/subagents/graph_visualizer/agent.md`:
```markdown
### 4. Custom Format
Generate a custom format with:
- Your specific requirements
- Custom styling rules
- Special formatting needs
```
### Adding New Metadata Sources
To support new metadata sources:
1. Add the tool to `memories/tools.json` (see the example after this list)
2. Update `memories/agent.md` with source-specific instructions
3. Update `memories/subagents/metadata_parser/agent.md` if needed
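For example, adding a Postgres source might look like the entry below; `postgres_execute_query` is a hypothetical tool name, so use whatever identifier your platform actually exposes:
```json
{
  "tools": [
    "bigquery_execute_query",
    "read_url_content",
    "google_sheets_read_range",
    "tavily_web_search",
    "postgres_execute_query"
  ]
}
```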
### MCP Integration
To integrate with Model Context Protocol servers:
1. Check whether MCP tools are available in the `/tools` directory
2. Add MCP tools to `memories/tools.json`
3. Configure MCP server connection
4. See `memories/mcp_integration.md` (if available)
## Troubleshooting
### Common Issues
#### 1. Authentication Errors
**Problem**: API authentication fails
**Solutions**:
- Verify API key is correct in `.env`
- Check key hasn't expired
- Ensure environment variables are loaded
- Try regenerating the API key
```bash
# Check that the Anthropic client initializes with your key
python -c "from anthropic import Anthropic; import os; from dotenv import load_dotenv; load_dotenv(); client = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY')); print('βœ“ client initialized')"
```
#### 2. BigQuery Access Issues
**Problem**: Cannot access BigQuery
**Solutions**:
- Verify service account has BigQuery permissions
- Check project ID is correct
- Ensure JSON key file path is correct
- Test credentials:
```bash
# Test BigQuery access
gcloud auth activate-service-account --key-file=/path/to/key.json
bq ls --project_id=your-project-id
```
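The same check from Python, assuming `GOOGLE_CLOUD_PROJECT` and `GOOGLE_APPLICATION_CREDENTIALS` are set in your `.env`:
```python
import os
from dotenv import load_dotenv
from google.cloud import bigquery

load_dotenv()

# The client picks up GOOGLE_APPLICATION_CREDENTIALS automatically.
client = bigquery.Client(project=os.getenv("GOOGLE_CLOUD_PROJECT"))
for dataset in client.list_datasets():
    print(dataset.dataset_id)
```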
#### 3. Import Errors
**Problem**: `ModuleNotFoundError`
**Solutions**:
```bash
# Install missing packages
pip install anthropic google-cloud-bigquery requests pyyaml python-dotenv
# Or install all at once
pip install -r requirements.txt # if you create one
```
#### 4. Environment Variables Not Loading
**Problem**: `.env` file not being read
**Solutions**:
```python
# Explicitly load .env
from dotenv import load_dotenv
load_dotenv()
# Or specify path
load_dotenv(".env")
# Verify loading
import os
print(os.getenv("ANTHROPIC_API_KEY")) # Should not be None
```
#### 5. File Path Issues
**Problem**: Cannot find `memories/agent.md`
**Solutions**:
```python
# Use absolute path
import os
base_dir = os.path.dirname(os.path.abspath(__file__))
agent_path = os.path.join(base_dir, "memories", "agent.md")
# Or change working directory
os.chdir("/path/to/local_clone")
```
### Performance Issues
#### Slow Response Times
**Causes**:
- Large metadata files
- Complex lineage graphs
- Network latency
**Solutions**:
- Break large metadata into chunks (see the sketch after this list)
- Use filtering to focus on specific entities
- Increase API timeout settings
- Cache frequently used results
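Chunking can be as simple as slicing the payload by character count and sending one chunk per request; a naive sketch (the limit is arbitrary, so tune it to your model's context window):
```python
def chunk_text(text: str, max_chars: int = 50_000):
    """Yield successive slices of a large metadata payload."""
    for start in range(0, len(text), max_chars):
        yield text[start:start + max_chars]

# Example:
# for chunk in chunk_text(metadata):
#     send each chunk in its own request and merge the results
```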
### Debugging Tips
1. **Enable verbose logging**
```python
import logging
logging.basicConfig(level=logging.DEBUG)
```
2. **Test each component separately**
- Test API connection first
- Test metadata retrieval
- Test parsing separately
- Test visualization separately
3. **Validate metadata format**
- Ensure JSON is valid
- Check for required fields
- Verify structure matches expected format
4. **Check agent configuration**
- Verify `memories/agent.md` is readable
- Check `tools.json` syntax
- Ensure subagent files exist
## Getting Help
### Documentation
- Agent instructions: `memories/agent.md`
- Subagent docs: `memories/subagents/*/agent.md`
- Anthropic API: https://docs.anthropic.com/
### Testing Your Setup
Run this complete test:
```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os
# Load environment
load_dotenv()
# Test 1: Client Initialization
try:
    client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    print("βœ“ Anthropic client initialized")
except Exception as e:
    print(f"βœ— Client initialization failed: {e}")
    exit(1)

# Test 2: Load Agent Config
try:
    with open("memories/agent.md", "r") as f:
        system_prompt = f.read()
    print("βœ“ Agent configuration loaded")
except Exception as e:
    print(f"βœ— Failed to load agent config: {e}")
    exit(1)

# Test 3: Simple Query
try:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": "Hello, what can you help me with?"
        }]
    )
    print("βœ“ Agent response successful")
    print(f"\nAgent says: {response.content[0].text}")
except Exception as e:
    print(f"βœ— Agent query failed: {e}")
    exit(1)
print("\nβœ“ All tests passed! Your setup is ready.")
```
Save as `test_setup.py` and run:
```bash
python test_setup.py
```
## Next Steps
1. βœ… Complete setup
2. βœ… Test with sample metadata
3. πŸ“Š Extract your first lineage
4. 🎨 Customize visualization preferences
5. πŸ”§ Integrate with your workflow
---
**Setup complete?** Try the usage examples in README.md or run your own lineage extraction!