| # Lineage Graph Accelerator - User Guide | |
| A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms. | |
| --- | |
| ## Table of Contents | |
| 1. [Getting Started](#getting-started) | |
| 2. [Input Formats](#input-formats) | |
| 3. [Sample Lineage Examples](#sample-lineage-examples) | |
| 4. [Export to Data Catalogs](#export-to-data-catalogs) | |
| 5. [MCP Server Integration](#mcp-server-integration) | |
| 6. [Troubleshooting](#troubleshooting) | |
| 7. [FAQ](#faq) | |
| --- | |
| ## Getting Started | |
| ### Quick Start (3 Steps) | |
| 1. **Open the App**: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces | |
| 2. **Load Sample Data**: Click "Load Sample" to try pre-built examples | |
| 3. **Extract Lineage**: Click "Extract Lineage" to visualize the data flow | |
| ### Interface Overview | |
| The application has four main tabs: | |
| | Tab | Purpose | | |
| |-----|---------| | |
| | **Text/File Metadata** | Paste or upload metadata directly | | |
| | **BigQuery** | Connect to Google BigQuery for schema extraction | | |
| | **URL/API** | Fetch metadata from REST APIs | | |
| | **Demo Gallery** | One-click demos of various lineage scenarios | | |
| --- | |
| ## Input Formats | |
| The Lineage Graph Accelerator supports multiple metadata formats: | |
| ### 1. Simple JSON (Nodes & Edges) | |
| The simplest format with explicit nodes and edges: | |
| ```json | |
| { | |
| "nodes": [ | |
| {"id": "raw_customers", "type": "table", "name": "raw_customers"}, | |
| {"id": "clean_customers", "type": "table", "name": "clean_customers"}, | |
| {"id": "analytics_customers", "type": "table", "name": "analytics_customers"} | |
| ], | |
| "edges": [ | |
| {"from": "raw_customers", "to": "clean_customers"}, | |
| {"from": "clean_customers", "to": "analytics_customers"} | |
| ] | |
| } | |
| ``` | |
| **Result**: A linear graph showing `raw_customers → clean_customers → analytics_customers` | |
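Before pasting a nodes-and-edges document, it can help to check that every edge points at a declared node. The following is a minimal sketch of such a check; `validate_simple_lineage` is a hypothetical helper written for this guide, not part of the application.

```python
import json


def validate_simple_lineage(text):
    """Return a list of problems found in a nodes/edges document (empty if none)."""
    doc = json.loads(text)
    problems = []
    node_ids = set()

    # Collect node ids, flagging missing or repeated ones.
    for node in doc.get("nodes", []):
        node_id = node.get("id")
        if node_id is None:
            problems.append("node without an 'id'")
        elif node_id in node_ids:
            problems.append(f"duplicate node id: {node_id}")
        else:
            node_ids.add(node_id)

    # Every edge endpoint must refer to a declared node.
    for edge in doc.get("edges", []):
        for end in ("from", "to"):
            if edge.get(end) not in node_ids:
                problems.append(
                    f"edge {end!r} references unknown node: {edge.get(end)}"
                )
    return problems


sample = """
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"}
  ]
}
"""

print(validate_simple_lineage(sample))
```

An empty list means the document is internally consistent; each string in a non-empty list names one issue to fix.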
| --- | |
| ### 2. dbt Manifest Format | |
| Extract lineage from dbt's `manifest.json`: | |
| ```json | |
| { | |
| "metadata": { | |
| "dbt_version": "1.7.0", | |
| "project_name": "my_project" | |
| }, | |
| "nodes": { | |
| "source.my_project.raw.customers": { | |
| "resource_type": "source", | |
| "name": "customers", | |
| "schema": "raw" | |
| }, | |
| "model.my_project.stg_customers": { | |
| "resource_type": "model", | |
| "name": "stg_customers", | |
| "schema": "staging", | |
| "depends_on": { | |
| "nodes": ["source.my_project.raw.customers"] | |
| } | |
| }, | |
| "model.my_project.dim_customers": { | |
| "resource_type": "model", | |
| "name": "dim_customers", | |
| "schema": "marts", | |
| "depends_on": { | |
| "nodes": ["model.my_project.stg_customers"] | |
| } | |
| } | |
| } | |
| } | |
| ``` | |
| **Result**: A graph showing the dbt model dependencies from source to staging to marts. | |
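The dependency information in this format lives in each node's `depends_on.nodes` list. As a rough illustration of how those lists turn into lineage edges, here is a small sketch; `dbt_manifest_to_edges` is a hypothetical name, and the app's own parser may work differently. Note that manifests written by dbt itself list sources under a separate top-level `sources` key; because the sketch only reads `depends_on`, it does not depend on where the sources are declared.

```python
def dbt_manifest_to_edges(manifest):
    """Build parent -> child edges from each node's depends_on.nodes list."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        # Sources have no depends_on entry, so default to an empty list.
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.append({"from": parent_id, "to": node_id})
    return edges


manifest = {
    "nodes": {
        "source.my_project.raw.customers": {"resource_type": "source"},
        "model.my_project.stg_customers": {
            "resource_type": "model",
            "depends_on": {"nodes": ["source.my_project.raw.customers"]},
        },
        "model.my_project.dim_customers": {
            "resource_type": "model",
            "depends_on": {"nodes": ["model.my_project.stg_customers"]},
        },
    }
}

for edge in dbt_manifest_to_edges(manifest):
    print(edge["from"], "->", edge["to"])
```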
| --- | |
| ### 3. Airflow DAG Format | |
| Extract task dependencies from Airflow DAGs: | |
| ```json | |
| { | |
| "dag_id": "etl_pipeline", | |
| "tasks": [ | |
| { | |
| "task_id": "extract_data", | |
| "operator": "PythonOperator", | |
| "upstream_dependencies": [] | |
| }, | |
| { | |
| "task_id": "transform_data", | |
| "operator": "SparkSubmitOperator", | |
| "upstream_dependencies": ["extract_data"] | |
| }, | |
| { | |
| "task_id": "load_data", | |
| "operator": "SnowflakeOperator", | |
| "upstream_dependencies": ["transform_data"] | |
| } | |
| ] | |
| } | |
| ``` | |
| **Result**: A DAG visualization showing `extract_data → transform_data → load_data` | |
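To sanity-check a task list before loading it, the `upstream_dependencies` fields can be sorted topologically. This sketch uses the standard library's `graphlib` (Python 3.9+); `airflow_task_order` is a hypothetical helper and does not reflect how the app processes DAGs internally.

```python
from graphlib import CycleError, TopologicalSorter


def airflow_task_order(dag):
    """Return task ids in an order that respects upstream_dependencies."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {
        task["task_id"]: set(task.get("upstream_dependencies", []))
        for task in dag["tasks"]
    }
    return list(TopologicalSorter(graph).static_order())


dag = {
    "dag_id": "etl_pipeline",
    "tasks": [
        {"task_id": "load_data", "upstream_dependencies": ["transform_data"]},
        {"task_id": "extract_data", "upstream_dependencies": []},
        {"task_id": "transform_data", "upstream_dependencies": ["extract_data"]},
    ],
}

try:
    print(airflow_task_order(dag))
except CycleError as err:
    print("cyclic dependencies:", err.args[1])
```

A `CycleError` here means the task list is not a valid DAG, which is worth fixing before attempting to extract lineage.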
| --- | |
| ### 4. Data Warehouse Lineage Format | |
| For Snowflake, BigQuery, or other warehouse lineage: | |
| ```json | |
| { | |
| "warehouse": { | |
| "platform": "Snowflake", | |
| "database": "ANALYTICS_DW" | |
| }, | |
| "lineage": { | |
| "datasets": [ | |
| {"id": "raw.customers", "type": "table", "schema": "RAW"}, | |
| {"id": "staging.customers", "type": "view", "schema": "STAGING"}, | |
| {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"} | |
| ], | |
| "relationships": [ | |
| {"source": "raw.customers", "target": "staging.customers", "type": "transform"}, | |
| {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"} | |
| ] | |
| } | |
| } | |
| ``` | |
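The warehouse format carries the same information as the simple nodes-and-edges format, under different key names. A sketch of the mapping, assuming `datasets` correspond to nodes and `relationships` to edges (`warehouse_to_simple` is a hypothetical helper):

```python
def warehouse_to_simple(doc):
    """Map lineage.datasets/relationships onto the simple nodes/edges shape."""
    lineage = doc["lineage"]
    nodes = [
        {"id": d["id"], "type": d.get("type", "table"), "name": d["id"]}
        for d in lineage["datasets"]
    ]
    edges = [
        {"from": r["source"], "to": r["target"], "type": r.get("type", "transform")}
        for r in lineage["relationships"]
    ]
    return {"nodes": nodes, "edges": edges}


warehouse_doc = {
    "warehouse": {"platform": "Snowflake", "database": "ANALYTICS_DW"},
    "lineage": {
        "datasets": [
            {"id": "raw.customers", "type": "table", "schema": "RAW"},
            {"id": "staging.customers", "type": "view", "schema": "STAGING"},
        ],
        "relationships": [
            {"source": "raw.customers", "target": "staging.customers", "type": "transform"}
        ],
    },
}

simple = warehouse_to_simple(warehouse_doc)
print(len(simple["nodes"]), "nodes,", len(simple["edges"]), "edges")
```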
| --- | |
| ### 5. ETL Pipeline Format | |
| For complex multi-stage ETL pipelines: | |
| ```json | |
| { | |
| "pipeline": { | |
| "name": "customer_analytics", | |
| "schedule": "daily" | |
| }, | |
| "stages": [ | |
| { | |
| "id": "extract", | |
| "steps": [ | |
| {"id": "ext_crm", "name": "Extract CRM Data", "inputs": []}, | |
| {"id": "ext_payments", "name": "Extract Payments", "inputs": []} | |
| ] | |
| }, | |
| { | |
| "id": "transform", | |
| "steps": [ | |
| {"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]} | |
| ] | |
| }, | |
| { | |
| "id": "load", | |
| "steps": [ | |
| {"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]} | |
| ] | |
| } | |
| ] | |
| } | |
| ``` | |
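In this format the lineage is implied by each step's `inputs` list rather than by explicit edges. The sketch below flattens the stages into `(upstream, downstream)` pairs; `pipeline_to_edges` is a hypothetical helper, shown only to make the structure concrete.

```python
def pipeline_to_edges(doc):
    """Flatten stages -> steps -> inputs into (upstream, downstream) pairs."""
    edges = []
    for stage in doc["stages"]:
        for step in stage["steps"]:
            for upstream in step.get("inputs", []):
                edges.append((upstream, step["id"]))
    return edges


pipeline_doc = {
    "pipeline": {"name": "customer_analytics", "schedule": "daily"},
    "stages": [
        {"id": "extract", "steps": [
            {"id": "ext_crm", "inputs": []},
            {"id": "ext_payments", "inputs": []},
        ]},
        {"id": "transform", "steps": [
            {"id": "tfm_customers", "inputs": ["ext_crm", "ext_payments"]},
        ]},
        {"id": "load", "steps": [
            {"id": "load_warehouse", "inputs": ["tfm_customers"]},
        ]},
    ],
}

for upstream, downstream in pipeline_to_edges(pipeline_doc):
    print(upstream, "->", downstream)
```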
| --- | |
| ## Sample Lineage Examples | |
| ### Example 1: Simple E-Commerce Lineage | |
| **Scenario**: Track data flow from raw transaction data to analytics reports. | |
| ``` | |
| Source Systems → Raw Layer → Staging → Data Marts → Reports | |
| ``` | |
| **Input**: | |
| ```json | |
| { | |
| "nodes": [ | |
| {"id": "shopify_api", "type": "source", "name": "Shopify API"}, | |
| {"id": "raw_orders", "type": "table", "name": "raw.orders"}, | |
| {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"}, | |
| {"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"}, | |
| {"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"} | |
| ], | |
| "edges": [ | |
| {"from": "shopify_api", "to": "raw_orders", "type": "ingest"}, | |
| {"from": "raw_orders", "to": "stg_orders", "type": "transform"}, | |
| {"from": "stg_orders", "to": "fct_orders", "type": "transform"}, | |
| {"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"} | |
| ] | |
| } | |
| ``` | |
| **Expected Output**: A Mermaid diagram showing the complete data flow with color-coded nodes by type. | |
| --- | |
| ### Example 2: Multi-Source Customer 360 | |
| **Scenario**: Combine data from multiple sources to create a unified customer view. | |
| ``` | |
| CRM + Payments + Website → Identity Resolution → Customer 360 | |
| ``` | |
| **Input**: | |
| ```json | |
| { | |
| "nodes": [ | |
| {"id": "salesforce", "type": "source", "name": "Salesforce CRM"}, | |
| {"id": "stripe", "type": "source", "name": "Stripe Payments"}, | |
| {"id": "ga4", "type": "source", "name": "Google Analytics"}, | |
| {"id": "identity_resolution", "type": "model", "name": "Identity Resolution"}, | |
| {"id": "customer_360", "type": "dimension", "name": "Customer 360"} | |
| ], | |
| "edges": [ | |
| {"from": "salesforce", "to": "identity_resolution"}, | |
| {"from": "stripe", "to": "identity_resolution"}, | |
| {"from": "ga4", "to": "identity_resolution"}, | |
| {"from": "identity_resolution", "to": "customer_360"} | |
| ] | |
| } | |
| ``` | |
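A common question for a fan-in graph like this one is "which sources feed a given asset?". The following sketch answers it with a breadth-first walk over the edge list; `upstream_of` is a hypothetical helper, not an application API.

```python
from collections import defaultdict, deque


def upstream_of(edges, target):
    """Return the ids of every node that directly or indirectly feeds target."""
    parents = defaultdict(set)
    for edge in edges:
        parents[edge["to"]].add(edge["from"])

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in seen:  # guards against revisiting shared ancestors
                seen.add(parent)
                queue.append(parent)
    return seen


edges = [
    {"from": "salesforce", "to": "identity_resolution"},
    {"from": "stripe", "to": "identity_resolution"},
    {"from": "ga4", "to": "identity_resolution"},
    {"from": "identity_resolution", "to": "customer_360"},
]

print(sorted(upstream_of(edges, "customer_360")))
```

For `customer_360` this returns the three source systems plus the intermediate `identity_resolution` model.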
| --- | |
| ### Example 3: dbt Project with Multiple Layers | |
| **Scenario**: A complete dbt project with staging, intermediate, and mart layers. | |
| Load the "dbt Manifest" sample from the dropdown to see a full example with: | |
| - 4 source tables | |
| - 4 staging models | |
| - 2 intermediate models | |
| - 3 mart tables | |
| - 2 reporting views | |
| --- | |
| ### Example 4: Airflow ETL Pipeline | |
| **Scenario**: A daily ETL pipeline with parallel extraction, sequential transformation, and loading. | |
| Load the "Airflow DAG" sample to see: | |
| - Parallel extract tasks | |
| - Transform tasks with dependencies | |
| - Load tasks to data warehouse | |
| - Final notification task | |
| --- | |
| ## Export to Data Catalogs | |
| The Lineage Graph Accelerator can export lineage to major enterprise data catalogs. | |
| ### Supported Formats | |
| | Format | Platform | Description | | |
| |--------|----------|-------------| | |
| | **OpenLineage** | Universal | Open standard, works with Marquez, Atlan, DataHub | | |
| | **Collibra** | Collibra Data Intelligence | Enterprise data governance platform | | |
| | **Purview** | Microsoft Purview | Azure native data governance | | |
| | **Alation** | Alation Data Catalog | Self-service analytics catalog | | |
| ### How to Export | |
| 1. **Enter or load your metadata** in the Text/File Metadata tab | |
| 2. **Extract the lineage** to verify it looks correct | |
| 3. **Expand the "Export to Data Catalog" accordion** | |
| 4. **Select your format** from the dropdown | |
| 5. **Click "Generate Export"** to create the export file | |
| 6. **Copy or download** the JSON output | |
| ### Export Format Details | |
| #### OpenLineage Export | |
| The OpenLineage export follows the [OpenLineage specification](https://openlineage.io/): | |
| ```json | |
| { | |
| "producer": "lineage-accelerator", | |
| "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json", | |
| "events": [ | |
| { | |
| "eventType": "COMPLETE", | |
| "job": {"namespace": "...", "name": "..."}, | |
| "inputs": [...], | |
| "outputs": [...] | |
| } | |
| ] | |
| } | |
| ``` | |
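To make the shape above more concrete, here is a sketch that builds a document with the same top-level keys from a list of edges. How the exporter actually groups edges into events is not documented in this guide; the sketch assumes one event per target node, and the `build_<target>` job names and the default namespace are hypothetical. A full OpenLineage run event also carries fields such as `eventTime` and `run`, which are omitted here to mirror the abbreviated example.

```python
from collections import defaultdict


def to_openlineage_like(edges, namespace="lineage-accelerator"):
    """Group edges by target node and emit one event per target (assumed layout)."""
    parents = defaultdict(list)
    for edge in edges:
        parents[edge["to"]].append(edge["from"])

    events = []
    for target, sources in parents.items():
        events.append({
            "eventType": "COMPLETE",
            # Hypothetical job naming scheme, for illustration only.
            "job": {"namespace": namespace, "name": f"build_{target}"},
            "inputs": [{"namespace": namespace, "name": s} for s in sources],
            "outputs": [{"namespace": namespace, "name": target}],
        })
    return {
        "producer": "lineage-accelerator",
        "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
        "events": events,
    }


export = to_openlineage_like([
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"},
])
print(len(export["events"]), "events")
```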
| #### Collibra Export | |
| Ready for Collibra's Import API: | |
| ```json | |
| { | |
| "community": {"name": "Data Lineage"}, | |
| "domain": {"name": "Physical Data Dictionary"}, | |
| "assets": [...], | |
| "relations": [...] | |
| } | |
| ``` | |
| #### Microsoft Purview Export | |
| Compatible with Purview's bulk import: | |
| ```json | |
| { | |
| "collection": {"referenceName": "lineage-accelerator"}, | |
| "entities": [...], | |
| "processes": [...] | |
| } | |
| ``` | |
| #### Alation Export | |
| Ready for Alation's bulk upload: | |
| ```json | |
| { | |
| "datasource": {"id": 1, "title": "..."}, | |
| "tables": [...], | |
| "columns": [...], | |
| "lineage": [...], | |
| "dataflows": [...] | |
| } | |
| ``` | |
| --- | |
| ## MCP Server Integration | |
| Connect to external MCP (Model Context Protocol) servers for enhanced processing. | |
| ### What is MCP? | |
| MCP (Model Context Protocol) is an open standard for connecting AI models to external tools and data sources. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for: | |
| - Enhanced lineage extraction with AI | |
| - Support for additional metadata formats | |
| - Custom processing pipelines | |
| ### Configuration | |
| 1. **Expand "MCP Server Configuration"** at the top of the app | |
| 2. **Enter the MCP Server URL**: e.g., `https://your-space.hf.space/mcp` | |
| 3. **Add API Key** (if required) | |
| 4. **Click "Test Connection"** to verify | |
| ### Example MCP Servers | |
| | Server | URL | Description | | |
| |--------|-----|-------------| | |
| | Demo Server | `http://localhost:9000/mcp` | Local testing | | |
| | HuggingFace | `https://your-space.hf.space/mcp` | Production deployment | | |
| ### Running Your Own MCP Server | |
| See `mcp_example/server.py` for a FastAPI-based MCP server example: | |
| ```bash | |
| cd mcp_example | |
| uvicorn server:app --reload --port 9000 | |
| ``` | |
| --- | |
| ## Troubleshooting | |
| ### Common Issues | |
| #### "No data to display" | |
| **Cause**: The input metadata couldn't be parsed. | |
| **Solutions**: | |
| 1. Verify your JSON is valid (use a JSON validator) | |
| 2. Check that the format matches one of the supported types | |
| 3. Try loading a sample first to see the expected format | |
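The first two checks can be done locally before pasting anything into the app. The sketch below reports where a JSON syntax error occurs and otherwise guesses which of the five input formats a document resembles. `detect_format` and its key-based rules are assumptions derived from the examples in this guide, not the application's actual detection logic.

```python
import json


def detect_format(text):
    """Return a format label, or a message locating the JSON syntax error."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"

    if not isinstance(doc, dict):
        return "unknown"
    # Heuristics based on the top-level keys used in the Input Formats section.
    if "dag_id" in doc and "tasks" in doc:
        return "airflow"
    if "lineage" in doc:
        return "warehouse"
    if "stages" in doc:
        return "etl"
    if isinstance(doc.get("nodes"), dict):
        return "dbt"
    if isinstance(doc.get("nodes"), list) and "edges" in doc:
        return "simple"
    return "unknown"


print(detect_format('{"nodes": [], "edges": []}'))
print(detect_format('{"dag_id": "etl_pipeline", "tasks": []}'))
print(detect_format('{"nodes": [}'))
```

A result of `unknown` suggests the document is valid JSON but does not match any of the documented layouts.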
| #### "Export functionality not available" | |
| **Cause**: The exporters module isn't loaded. | |
| **Solutions**: | |
| 1. Ensure you're running the latest version | |
| 2. Check that the `exporters/` directory exists | |
| 3. Restart the application | |
| #### MCP Connection Failed | |
| **Cause**: Cannot reach the MCP server. | |
| **Solutions**: | |
| 1. Verify the URL is correct | |
| 2. Check if the server is running | |
| 3. Ensure your network and firewall allow the connection | |
| 4. Try without the API key first | |
| #### Mermaid Diagram Not Rendering | |
| **Cause**: JavaScript loading issue. | |
| **Solutions**: | |
| 1. Refresh the page | |
| 2. Try a different browser | |
| 3. Check browser console for errors | |
| 4. Ensure JavaScript is enabled | |
| ### Error Messages | |
| | Error | Meaning | Solution | | |
| |-------|---------|----------| | |
| | "JSONDecodeError" | Invalid JSON input | Fix JSON syntax | | |
| | "KeyError" | Missing required field | Check input format | | |
| | "Timeout" | MCP server slow/unreachable | Increase timeout or check server | | |
| --- | |
| ## FAQ | |
| ### General Questions | |
| **Q: What file formats are supported?** | |
| A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats. | |
| **Q: Can I upload files?** | |
| A: Currently, you need to paste content into the text box. File upload is planned for a future release. | |
| **Q: Is my data stored?** | |
| A: No. All processing happens in your browser session. No data is stored on servers. | |
| ### Export Questions | |
| **Q: Which export format should I use?** | |
| A: | |
| - Use **OpenLineage** for universal compatibility | |
| - Use **Collibra/Purview/Alation** if you use those specific platforms | |
| **Q: Can I customize the export?** | |
| A: The current exports use default settings. Advanced customization is available through the API. | |
| ### Technical Questions | |
| **Q: What's the maximum graph size?** | |
| A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render. | |
| **Q: Can I use this programmatically?** | |
| A: Yes! See `integration_example.py` for API usage examples. | |
| **Q: Is there a rate limit?** | |
| A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance. | |
| --- | |
| ## Support | |
| - **Issues**: [GitHub Issues](https://github.com/your-repo/issues) | |
| - **Documentation**: This guide and README.md | |
| - **Community**: HuggingFace Discussions | |
| --- | |
| ## Appendix: Complete Sample Data | |
| ### E-Commerce Platform (Complex) | |
| This sample demonstrates a complete e-commerce analytics platform with: | |
| - 9 source systems (Shopify, Stripe, GA4, etc.) | |
| - 50+ nodes across all data layers | |
| - 80+ lineage relationships | |
| - Multiple output destinations (BI tools, reverse ETL) | |
| Load the "Complex Demo" sample to explore the full graph. | |
| ### Node Types Reference | |
| | Type | Color | Description | | |
| |------|-------|-------------| | |
| | `source` | Light Blue | External data sources | | |
| | `table` | Light Green | Database tables | | |
| | `view` | Light Purple | Database views | | |
| | `model` | Light Orange | Transformation models | | |
| | `report` | Light Pink | Reports and dashboards | | |
| | `dimension` | Cyan | Dimension tables | | |
| | `fact` | Light Yellow | Fact tables | | |
| | `destination` | Light Red | Output destinations | | |
| ### Edge Types Reference | |
| | Type | Arrow | Description | | |
| |------|-------|-------------| | |
| | `transform` | `-->` | Data transformation | | |
| | `reference` | `-.->` | Reference/lookup | | |
| | `ingest` | `-->` | Data ingestion | | |
| | `export` | `-->` | Data export | | |
| | `join` | `-->` | Table join | | |
| | `aggregate` | `-->` | Aggregation | | |
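As an illustration of how the arrow styles in the table map onto Mermaid syntax, the sketch below renders a nodes-and-edges document as Mermaid flowchart text. `to_mermaid` is a hypothetical helper; it assumes node ids are already valid Mermaid identifiers, falls back to `-->` for unknown or missing edge types, and leaves out node colors because this guide does not give exact color values.

```python
# Arrow style per edge type, copied from the Edge Types Reference table.
EDGE_ARROWS = {
    "transform": "-->",
    "reference": "-.->",
    "ingest": "-->",
    "export": "-->",
    "join": "-->",
    "aggregate": "-->",
}


def to_mermaid(doc):
    """Render a nodes/edges document as left-to-right Mermaid flowchart text."""
    lines = ["graph LR"]
    for node in doc["nodes"]:
        label = node.get("name", node["id"])
        lines.append(f'    {node["id"]}["{label}"]')
    for edge in doc["edges"]:
        arrow = EDGE_ARROWS.get(edge.get("type"), "-->")
        lines.append(f'    {edge["from"]} {arrow} {edge["to"]}')
    return "\n".join(lines)


diagram = to_mermaid({
    "nodes": [
        {"id": "raw_orders", "type": "table", "name": "raw.orders"},
        {"id": "dim_currency", "type": "dimension", "name": "dim_currency"},
        {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
    ],
    "edges": [
        {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
        {"from": "dim_currency", "to": "stg_orders", "type": "reference"},
    ],
})
print(diagram)
```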
| --- | |
| *Last updated: November 2025* | |
| *Version: 1.0.0* | |