# Lineage Graph Accelerator - User Guide
A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.
---
## Table of Contents
1. [Getting Started](#getting-started)
2. [Input Formats](#input-formats)
3. [Sample Lineage Examples](#sample-lineage-examples)
4. [Export to Data Catalogs](#export-to-data-catalogs)
5. [MCP Server Integration](#mcp-server-integration)
6. [Troubleshooting](#troubleshooting)
7. [FAQ](#faq)
---
## Getting Started
### Quick Start (3 Steps)
1. **Open the App**: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
2. **Load Sample Data**: Click "Load Sample" to try pre-built examples
3. **Extract Lineage**: Click "Extract Lineage" to visualize the data flow
### Interface Overview
The application has four main tabs:
| Tab | Purpose |
|-----|---------|
| **Text/File Metadata** | Paste or upload metadata directly |
| **BigQuery** | Connect to Google BigQuery for schema extraction |
| **URL/API** | Fetch metadata from REST APIs |
| **Demo Gallery** | One-click demos of various lineage scenarios |
---
## Input Formats
The Lineage Graph Accelerator supports multiple metadata formats:
### 1. Simple JSON (Nodes & Edges)
The simplest format with explicit nodes and edges:
```json
{
"nodes": [
{"id": "raw_customers", "type": "table", "name": "raw_customers"},
{"id": "clean_customers", "type": "table", "name": "clean_customers"},
{"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
],
"edges": [
{"from": "raw_customers", "to": "clean_customers"},
{"from": "clean_customers", "to": "analytics_customers"}
]
}
```
**Result**: A linear graph showing `raw_customers → clean_customers → analytics_customers`
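Before pasting a nodes-and-edges document, it can help to confirm that every edge points at a declared node. The following is a minimal sketch, not the app's own validation: the function name `check_simple_lineage` is hypothetical, and it only assumes the `nodes`/`edges`/`id`/`from`/`to` keys shown above.

```python
import json


def check_simple_lineage(text):
    """Return a list of problems found in a nodes-and-edges JSON document."""
    doc = json.loads(text)
    node_ids = {node["id"] for node in doc.get("nodes", [])}
    problems = []
    for index, edge in enumerate(doc.get("edges", [])):
        # Both endpoints of an edge should name a declared node id.
        for end in ("from", "to"):
            if edge.get(end) not in node_ids:
                problems.append(
                    f"edge {index}: {end!r} points at unknown node {edge.get(end)!r}"
                )
    return problems


# Same shape as the example above, but analytics_customers is left undeclared.
sample = """
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"}
  ]
}
"""

for problem in check_simple_lineage(sample):
    print(problem)
```

In this sample the second edge is reported, because `analytics_customers` never appears in `nodes`. An empty result means every edge endpoint is declared.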
---
### 2. dbt Manifest Format
Extract lineage from dbt's `manifest.json`:
```json
{
"metadata": {
"dbt_version": "1.7.0",
"project_name": "my_project"
},
"nodes": {
"source.my_project.raw.customers": {
"resource_type": "source",
"name": "customers",
"schema": "raw"
},
"model.my_project.stg_customers": {
"resource_type": "model",
"name": "stg_customers",
"schema": "staging",
"depends_on": {
"nodes": ["source.my_project.raw.customers"]
}
},
"model.my_project.dim_customers": {
"resource_type": "model",
"name": "dim_customers",
"schema": "marts",
"depends_on": {
"nodes": ["model.my_project.stg_customers"]
}
}
}
}
```
**Result**: A graph showing the dbt model dependencies from source to staging to marts.
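To see which dependencies a manifest of this shape implies, the `depends_on.nodes` lists can be walked directly. This is a rough sketch under the assumption that the input follows the structure shown above; `dbt_edges` is a hypothetical helper, not a function from dbt or from this project.

```python
def dbt_edges(manifest):
    """Return (upstream, downstream) pairs from a manifest-shaped dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        # Sources have no depends_on block, so default to an empty list.
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent, node_id))
    return edges


manifest = {
    "nodes": {
        "source.my_project.raw.customers": {"resource_type": "source"},
        "model.my_project.stg_customers": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.my_project.raw.customers"]},
        },
        "model.my_project.dim_customers": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.my_project.stg_customers"]},
        },
    }
}

for upstream, downstream in dbt_edges(manifest):
    print(f"{upstream} -> {downstream}")
```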
---
### 3. Airflow DAG Format
Extract task dependencies from Airflow DAGs:
```json
{
"dag_id": "etl_pipeline",
"tasks": [
{
"task_id": "extract_data",
"operator": "PythonOperator",
"upstream_dependencies": []
},
{
"task_id": "transform_data",
"operator": "SparkSubmitOperator",
"upstream_dependencies": ["extract_data"]
},
{
"task_id": "load_data",
"operator": "SnowflakeOperator",
"upstream_dependencies": ["transform_data"]
}
]
}
```
**Result**: A DAG visualization showing `extract_data → transform_data → load_data`
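The task graph in this format is implied by each task's `upstream_dependencies` list. A small sketch of that reading, assuming the JSON layout above (the helper name `airflow_edges` is made up for illustration and is not part of Airflow):

```python
def airflow_edges(dag):
    """Return (upstream_task, task) pairs from a DAG-shaped dict."""
    return [
        (upstream, task["task_id"])
        for task in dag.get("tasks", [])
        for upstream in task.get("upstream_dependencies", [])
    ]


dag = {
    "dag_id": "etl_pipeline",
    "tasks": [
        {"task_id": "extract_data", "upstream_dependencies": []},
        {"task_id": "transform_data", "upstream_dependencies": ["extract_data"]},
        {"task_id": "load_data", "upstream_dependencies": ["transform_data"]},
    ],
}

print(airflow_edges(dag))
```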
---
### 4. Data Warehouse Lineage Format
For Snowflake, BigQuery, or other warehouse lineage:
```json
{
"warehouse": {
"platform": "Snowflake",
"database": "ANALYTICS_DW"
},
"lineage": {
"datasets": [
{"id": "raw.customers", "type": "table", "schema": "RAW"},
{"id": "staging.customers", "type": "view", "schema": "STAGING"},
{"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
],
"relationships": [
{"source": "raw.customers", "target": "staging.customers", "type": "transform"},
{"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
]
}
}
```
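The warehouse format carries the same information as the simple nodes-and-edges format under different key names. The sketch below shows one possible mapping between the two. The function `warehouse_to_simple` and the fallback values for a missing `type` are assumptions for illustration, not documented behavior of the app.

```python
def warehouse_to_simple(doc):
    """Map a warehouse-lineage dict onto the simple nodes-and-edges shape."""
    lineage = doc.get("lineage", {})
    nodes = [
        # Assumption: fall back to "table" when a dataset has no type.
        {"id": d["id"], "type": d.get("type", "table"), "name": d["id"]}
        for d in lineage.get("datasets", [])
    ]
    edges = [
        # Assumption: fall back to "transform" when a relationship has no type.
        {"from": r["source"], "to": r["target"], "type": r.get("type", "transform")}
        for r in lineage.get("relationships", [])
    ]
    return {"nodes": nodes, "edges": edges}


warehouse_doc = {
    "warehouse": {"platform": "Snowflake", "database": "ANALYTICS_DW"},
    "lineage": {
        "datasets": [
            {"id": "raw.customers", "type": "table", "schema": "RAW"},
            {"id": "staging.customers", "type": "view", "schema": "STAGING"},
            {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"},
        ],
        "relationships": [
            {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
            {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"},
        ],
    },
}

simple = warehouse_to_simple(warehouse_doc)
print(len(simple["nodes"]), "nodes,", len(simple["edges"]), "edges")
```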
---
### 5. ETL Pipeline Format
For complex multi-stage ETL pipelines:
```json
{
"pipeline": {
"name": "customer_analytics",
"schedule": "daily"
},
"stages": [
{
"id": "extract",
"steps": [
{"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
{"id": "ext_payments", "name": "Extract Payments", "inputs": []}
]
},
{
"id": "transform",
"steps": [
{"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
]
},
{
"id": "load",
"steps": [
{"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
]
}
]
}
```
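In the pipeline format, lineage comes from each step's `inputs` list, and the stage grouping is organizational. A brief sketch of flattening stages into step-to-step edges, assuming the layout above (`pipeline_edges` is a hypothetical name):

```python
def pipeline_edges(doc):
    """Return (input_step, step) pairs across all stages of a pipeline dict."""
    edges = []
    for stage in doc.get("stages", []):
        for step in stage.get("steps", []):
            # A step with an empty inputs list is a starting point.
            for upstream in step.get("inputs", []):
                edges.append((upstream, step["id"]))
    return edges


pipeline = {
    "stages": [
        {"id": "extract", "steps": [
            {"id": "ext_crm", "inputs": []},
            {"id": "ext_payments", "inputs": []},
        ]},
        {"id": "transform", "steps": [
            {"id": "tfm_customers", "inputs": ["ext_crm", "ext_payments"]},
        ]},
        {"id": "load", "steps": [
            {"id": "load_warehouse", "inputs": ["tfm_customers"]},
        ]},
    ]
}

print(pipeline_edges(pipeline))
```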
---
## Sample Lineage Examples
### Example 1: Simple E-Commerce Lineage
**Scenario**: Track data flow from raw transaction data to analytics reports.
```
Source Systems → Raw Layer → Staging → Data Marts → Reports
```
**Input**:
```json
{
"nodes": [
{"id": "shopify_api", "type": "source", "name": "Shopify API"},
{"id": "raw_orders", "type": "table", "name": "raw.orders"},
{"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
{"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
{"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
],
"edges": [
{"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
{"from": "raw_orders", "to": "stg_orders", "type": "transform"},
{"from": "stg_orders", "to": "fct_orders", "type": "transform"},
{"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
]
}
```
**Expected Output**: A Mermaid diagram showing the complete data flow with color-coded nodes by type.
---
### Example 2: Multi-Source Customer 360
**Scenario**: Combine data from multiple sources to create a unified customer view.
```
CRM + Payments + Website → Identity Resolution → Customer 360
```
**Input**:
```json
{
"nodes": [
{"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
{"id": "stripe", "type": "source", "name": "Stripe Payments"},
{"id": "ga4", "type": "source", "name": "Google Analytics"},
{"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
{"id": "customer_360", "type": "dimension", "name": "Customer 360"}
],
"edges": [
{"from": "salesforce", "to": "identity_resolution"},
{"from": "stripe", "to": "identity_resolution"},
{"from": "ga4", "to": "identity_resolution"},
{"from": "identity_resolution", "to": "customer_360"}
]
}
```
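A common question for a graph like this is "what feeds `customer_360`?". The sketch below answers it with a breadth-first walk over the edges of the input above. The helper `upstream_of` is illustrative only and is not part of the application.

```python
from collections import deque


def upstream_of(doc, target):
    """Return the ids of every node that directly or indirectly feeds target."""
    parents = {}
    for edge in doc["edges"]:
        parents.setdefault(edge["to"], []).append(edge["from"])

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


customer_doc = {
    "edges": [
        {"from": "salesforce", "to": "identity_resolution"},
        {"from": "stripe", "to": "identity_resolution"},
        {"from": "ga4", "to": "identity_resolution"},
        {"from": "identity_resolution", "to": "customer_360"},
    ]
}

print(sorted(upstream_of(customer_doc, "customer_360")))
```

For this input the result contains the three sources plus `identity_resolution`.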
---
### Example 3: dbt Project with Multiple Layers
**Scenario**: A complete dbt project with staging, intermediate, and mart layers.
Load the "dbt Manifest" sample from the dropdown to see a full example with:
- 4 source tables
- 4 staging models
- 2 intermediate models
- 3 mart tables
- 2 reporting views
---
### Example 4: Airflow ETL Pipeline
**Scenario**: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.
Load the "Airflow DAG" sample to see:
- Parallel extract tasks
- Transform tasks with dependencies
- Load tasks to data warehouse
- Final notification task
---
## Export to Data Catalogs
The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.
### Supported Formats
| Format | Platform | Description |
|--------|----------|-------------|
| **OpenLineage** | Universal | Open standard, works with Marquez, Atlan, DataHub |
| **Collibra** | Collibra Data Intelligence | Enterprise data governance platform |
| **Purview** | Microsoft Purview | Azure native data governance |
| **Alation** | Alation Data Catalog | Self-service analytics catalog |
### How to Export
1. **Enter or load your metadata** in the Text/File Metadata tab
2. **Extract the lineage** to verify it looks correct
3. **Expand the "Export to Data Catalog" accordion**
4. **Select your format** from the dropdown
5. **Click "Generate Export"** to create the export file
6. **Copy or download** the JSON output
### Export Format Details
#### OpenLineage Export
The OpenLineage export follows the [OpenLineage specification](https://openlineage.io/):
```json
{
"producer": "lineage-accelerator",
"schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
"events": [
{
"eventType": "COMPLETE",
"job": {"namespace": "...", "name": "..."},
"inputs": [...],
"outputs": [...]
}
]
}
```
#### Collibra Export
Ready for Collibra's Import API:
```json
{
"community": {"name": "Data Lineage"},
"domain": {"name": "Physical Data Dictionary"},
"assets": [...],
"relations": [...]
}
```
#### Microsoft Purview Export
Compatible with Purview's bulk import:
```json
{
"collection": {"referenceName": "lineage-accelerator"},
"entities": [...],
"processes": [...]
}
```
#### Alation Export
Ready for Alation's bulk upload:
```json
{
"datasource": {"id": 1, "title": "..."},
"tables": [...],
"columns": [...],
"lineage": [...],
"dataflows": [...]
}
```
---
## MCP Server Integration
Connect to external MCP (Model Context Protocol) servers for enhanced processing.
### What is MCP?
MCP (Model Context Protocol) is an open protocol for connecting AI applications to external tools and data sources. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:
- Enhanced lineage extraction with AI
- Support for additional metadata formats
- Custom processing pipelines
### Configuration
1. **Expand "MCP Server Configuration"** at the top of the app
2. **Enter the MCP Server URL**: e.g., `https://your-space.hf.space/mcp`
3. **Add API Key** (if required)
4. **Click "Test Connection"** to verify
### Example MCP Servers
| Server | URL | Description |
|--------|-----|-------------|
| Demo Server | `http://localhost:9000/mcp` | Local testing |
| HuggingFace | `https://your-space.hf.space/mcp` | Production deployment |
### Running Your Own MCP Server
See `mcp_example/server.py` for a FastAPI-based MCP server example:
```bash
cd mcp_example
uvicorn server:app --reload --port 9000
```
---
## Troubleshooting
### Common Issues
#### "No data to display"
**Cause**: The input metadata couldn't be parsed.
**Solutions**:
1. Verify your JSON is valid (use a JSON validator)
2. Check that the format matches one of the supported types
3. Try loading a sample first to see the expected format
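For the first step, Python's standard `json` module can report where a syntax error sits. A minimal sketch (the function name `describe_json_error` is made up for this guide):

```python
import json


def describe_json_error(text):
    """Return a short description of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions of the failure.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


print(describe_json_error('{"nodes": [], "edges": []}'))  # valid input
print(describe_json_error('{"nodes": [}'))                # malformed input
```

A return value of `None` means the text parsed cleanly. Whether it matches a supported lineage format still has to be checked separately.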
#### "Export functionality not available"
**Cause**: The exporters module isn't loaded.
**Solutions**:
1. Ensure you're running the latest version
2. Check that the `exporters/` directory exists
3. Restart the application
#### MCP Connection Failed
**Cause**: Cannot reach the MCP server.
**Solutions**:
1. Verify the URL is correct
2. Check if the server is running
3. Ensure network/firewall allows the connection
4. Try without the API key first
#### Mermaid Diagram Not Rendering
**Cause**: JavaScript loading issue.
**Solutions**:
1. Refresh the page
2. Try a different browser
3. Check browser console for errors
4. Ensure JavaScript is enabled
### Error Messages
| Error | Meaning | Solution |
|-------|---------|----------|
| "JSONDecodeError" | Invalid JSON input | Fix JSON syntax |
| "KeyError" | Missing required field | Check input format |
| "Timeout" | MCP server slow/unreachable | Increase timeout or check server |
---
## FAQ
### General Questions
**Q: What file formats are supported?**
A: JSON is the primary format, including dbt manifests, Airflow DAGs, and custom formats. SQL DDL is also supported, with limitations.
**Q: Can I upload files?**
A: Currently, you need to paste content into the text box. File upload is planned for a future release.
**Q: Is my data stored?**
A: No. All processing happens in your browser session. No data is stored on servers.
### Export Questions
**Q: Which export format should I use?**
A:
- Use **OpenLineage** for universal compatibility
- Use **Collibra/Purview/Alation** if you use those specific platforms
**Q: Can I customize the export?**
A: The current exports use default settings. Advanced customization is available through the API.
### Technical Questions
**Q: What's the maximum graph size?**
A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.
**Q: Can I use this programmatically?**
A: Yes! See `integration_example.py` for API usage examples.
**Q: Is there a rate limit?**
A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.
---
## Support
- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- **Documentation**: This guide and README.md
- **Community**: HuggingFace Discussions
---
## Appendix: Complete Sample Data
### E-Commerce Platform (Complex)
This sample demonstrates a complete e-commerce analytics platform with:
- 9 source systems (Shopify, Stripe, GA4, etc.)
- 50+ nodes across all data layers
- 80+ lineage relationships
- Multiple output destinations (BI tools, reverse ETL)
Load the "Complex Demo" sample to explore the full graph.
### Node Types Reference
| Type | Color | Description |
|------|-------|-------------|
| `source` | Light Blue | External data sources |
| `table` | Light Green | Database tables |
| `view` | Light Purple | Database views |
| `model` | Light Orange | Transformation models |
| `report` | Light Pink | Reports and dashboards |
| `dimension` | Cyan | Dimension tables |
| `fact` | Light Yellow | Fact tables |
| `destination` | Light Red | Output destinations |
### Edge Types Reference
| Type | Arrow | Description |
|------|-------|-------------|
| `transform` | `-->` | Data transformation |
| `reference` | `-.->` | Reference/lookup |
| `ingest` | `-->` | Data ingestion |
| `export` | `-->` | Data export |
| `join` | `-->` | Table join |
| `aggregate` | `-->` | Aggregation |
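The arrow column can be read as a lookup from edge type to Mermaid link syntax. The sketch below turns a nodes-and-edges document into Mermaid text on that basis. It is an illustration of the table, not the app's renderer: `to_mermaid`, the `graph LR` direction, the fallback arrow for unknown types, and the `dim_region` node are all assumptions.

```python
# Only "reference" uses a dotted arrow in the table above; every other
# listed type uses the solid "-->" arrow.
ARROWS = {"reference": "-.->"}


def to_mermaid(doc):
    """Render the edges of a nodes-and-edges dict as Mermaid flowchart text."""
    lines = ["graph LR"]  # assumption: left-to-right layout
    for edge in doc["edges"]:
        # Assumption: edges with a missing or unlisted type get a solid arrow.
        arrow = ARROWS.get(edge.get("type"), "-->")
        lines.append(f"    {edge['from']} {arrow} {edge['to']}")
    return "\n".join(lines)


edge_doc = {
    "edges": [
        {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
        {"from": "dim_region", "to": "stg_orders", "type": "reference"},  # hypothetical lookup table
    ]
}

print(to_mermaid(edge_doc))
```

Node ids are emitted as-is, so ids containing dots or spaces may need quoting or renaming before Mermaid accepts them.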
---
*Last updated: November 2025*
*Version: 1.0.0*