
Architecture & Pipeline

How Graphify transforms files into a knowledge graph, step by step.

The 7-Stage Pipeline

Each stage is a pure function in its own module. They communicate through plain Python dicts and NetworkX graphs — no shared state, no side effects outside graphify-out/.

detect (detect.py) → extract (extract.py) → build (build.py) → cluster (cluster.py) → analyze (analyze.py) → report (report.py) → export (export.py)

1. Detect

Input: Directory path
Output: Classified file list {code: [...], doc: [...], paper: [...], image: [...]}
What happens: Recursively walks the directory, classifies files by extension and heuristics, filters sensitive files, applies .graphifyignore patterns, reports corpus health.

2. Extract

Input: File paths
Output: {nodes: [...], edges: [...]} dicts
What happens: Code files get tree-sitter AST parsing (Pass 1, deterministic). Docs/papers/images get Claude semantic extraction (Pass 2, via parallel subagents). Both produce the same schema.

3. Build

Input: List of extraction dicts
Output: nx.Graph
What happens: Validates extractions against schema, merges all nodes and edges into a single NetworkX graph. Handles deduplication and dangling edge removal.
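The merge logic can be sketched with plain dicts standing in for the `nx.Graph` that the real build step returns; `merge_extractions` is a hypothetical name, not Graphify's API.

```python
def merge_extractions(extractions):
    """Merge {nodes, edges} dicts: dedupe nodes by id, drop dangling edges.
    Plain-dict stand-in for the NetworkX graph the real build step produces."""
    nodes, edges = {}, []
    for ex in extractions:
        for node in ex.get("nodes", []):
            nodes.setdefault(node["id"], node)  # first occurrence wins
        edges.extend(ex.get("edges", []))
    # An edge is dangling if either endpoint has no node entry.
    kept = [e for e in edges if e["source"] in nodes and e["target"] in nodes]
    return {"nodes": list(nodes.values()), "edges": kept}
```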

4. Cluster

Input: nx.Graph
Output: Graph with community attribute on each node
What happens: Runs Leiden (or Louvain fallback) community detection. Splits oversized communities. Computes cohesion scores.
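The cohesion score is intra-community edge density. Here is a sketch over a plain edge set; the real `cohesion_score` takes an `nx.Graph`.

```python
def cohesion_score(edges, members):
    """Fraction of possible intra-community edges that actually exist.
    `edges` is a set of (u, v) pairs; `members` the community's node ids.
    Simplified stand-in for the real function, which works on an nx.Graph."""
    members = set(members)
    n = len(members)
    possible = n * (n - 1) // 2
    if possible == 0:
        return 0.0
    intra = sum(1 for u, v in edges if u in members and v in members)
    return intra / possible
```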

5. Analyze

Input: Graph + communities
Output: Analysis dict (god nodes, surprises, questions)
What happens: Identifies highest-degree nodes, ranks cross-community edges by surprise score, generates suggested questions, computes graph diff (if update mode).
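God-node detection is, at its core, a degree ranking. A sketch over a plain edge list; the real version filters uninteresting nodes and operates on an `nx.Graph`.

```python
from collections import Counter

def god_nodes(edges, top_n=3):
    """Rank nodes by degree over an (u, v) edge list -- a sketch of the
    god-node heuristic without the filtering the real analyzer applies."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return [node for node, _ in degree.most_common(top_n)]
```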

6. Report

Input: Graph + analysis
Output: GRAPH_REPORT.md string
What happens: Renders a human-readable audit report with god nodes, surprising connections, hyperedges, suggested questions, rationale nodes, and token cost.

7. Export

Input: Graph + communities + output dir
Output: graph.json, graph.html, optionally Obsidian vault, SVG, GraphML, Cypher
What happens: Serializes graph to multiple formats. HTML uses vis.js for interactive visualization. Obsidian creates backlinked markdown notes.

Module Map

Core Pipeline

detect.py — File Discovery
collect_files(root) → directory → filtered [Path] list
detect(root) → classified dict with file counts & health warnings
detect_incremental(root) → only new/modified files via manifest mtime comparison
classify_file(path) → FileType enum (CODE, DOCUMENT, PAPER, IMAGE)
extract.py — AST Extraction
extract(paths) → merged {nodes, edges} from all files
_extract_generic(path, config) → generic tree-sitter walker for all 19 languages
extract_python(path) → Python-specific extraction with decorators & inheritance
Each language defined as a LanguageConfig dataclass instance
build.py — Graph Assembly
build_from_json(extraction) → single extraction dict → nx.Graph
build(extractions) → multiple dicts → merged graph
Validates schema, drops dangling edges, preserves direction metadata
cluster.py — Community Detection
cluster(G) → {community_id: [node_ids]}
_partition(G) → Leiden (graspologic) or Louvain (NetworkX) fallback
_split_community(G, nodes) → recursive split for oversized clusters
cohesion_score(G, nodes) → intra-edge density (0.0–1.0)
analyze.py — Structural Analysis
god_nodes(G, top_n) → highest-degree nodes after filtering
surprising_connections(G, communities, top_n) → cross-community edges by surprise score
suggest_questions(G, communities) → 4–7 targeted questions
graph_diff(G_old, G_new) → added/removed nodes & edges
report.py — Audit Report
render_report(G, analysis) → GRAPH_REPORT.md markdown string
Sections: corpus check, summary stats, god nodes, surprises, hyperedges, questions, rationale, token cost
export.py — Multi-Format Output
export(G, out_dir, ...) → graph.json, graph.html, and optional formats
HTML: vis.js force-directed layout with sidebar search, community legend, node info
JSON: NetworkX node-link format
Also: SVG, GraphML, Cypher, Obsidian vault

Supporting Modules

cache.py — Extraction Caching
SHA256 content hashing. check_semantic_cache(files) splits into cached/uncached. save_semantic_cache() writes per-file cache entries. Atomic writes via .tmp → rename.
security.py — Input Validation
validate_url() blocks SSRF. safe_fetch() enforces size caps. validate_graph_path() prevents path traversal. sanitize_label() prevents XSS/injection.
validate.py — Schema Validation
Enforces extraction output schema: required node fields (id, label, file_type, source_file), required edge fields (source, target, relation, confidence), enum validation, reference checks.
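A simplified version of these checks, with the required-field sets taken from the description above (`validate_extraction` is a hypothetical name; the real module also validates enums):

```python
NODE_REQUIRED = {"id", "label", "file_type", "source_file"}
EDGE_REQUIRED = {"source", "target", "relation", "confidence"}

def validate_extraction(extraction):
    """Return a list of problems: missing required fields, plus edges that
    reference unknown node ids (the reference check)."""
    problems = []
    ids = set()
    for i, node in enumerate(extraction.get("nodes", [])):
        missing = NODE_REQUIRED - node.keys()
        if missing:
            problems.append(f"node {i}: missing {sorted(missing)}")
        ids.add(node.get("id"))
    for i, edge in enumerate(extraction.get("edges", [])):
        missing = EDGE_REQUIRED - edge.keys()
        if missing:
            problems.append(f"edge {i}: missing {sorted(missing)}")
        for end in ("source", "target"):
            if edge.get(end) not in ids:
                problems.append(f"edge {i}: {end} is not a known node id")
    return problems
```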
serve.py, watch.py, hooks.py, ingest.py, benchmark.py, wiki.py
serve.py: MCP stdio server for agent queries (BFS/DFS traversal + token budgets)
watch.py: File system monitoring via watchdog (code: instant rebuild, docs: notify user)
hooks.py: Git post-commit/post-checkout hook install/uninstall
ingest.py: URL fetching (tweets, papers, webpages) with security validation
benchmark.py: Token reduction measurement (corpus tokens vs graph query tokens)
wiki.py: Wikipedia-style markdown articles per community

Data Flow

Main Pipeline

1. detect(root)

Walk directory, classify files into {code, doc, paper, image}, filter sensitive files, apply .graphifyignore

2. extract(code_files) — Pass 1

Deterministic AST parsing via tree-sitter. Free, fast, reproducible. Produces {nodes, edges} with EXTRACTED confidence.

3. [parallel subagents] — Pass 2

Claude extracts concepts from docs, papers, images. Runs in parallel. Produces INFERRED and AMBIGUOUS edges with confidence scores.

4. build([code_extraction, semantic_extraction])

Merge all extractions into a single NetworkX graph. Validate schema. Remove dangling edges.

5. cluster(G)

Leiden community detection. Split oversized communities. Compute cohesion scores.

6. analyze(G) + report(G)

God nodes, surprising connections, suggested questions → GRAPH_REPORT.md

7. export(G)

Write graph.json, graph.html, and optional formats to graphify-out/

Incremental Update (--update)

1. detect_incremental(root)

Compare file mtimes against manifest. Return only new/modified files.

2. Re-extract only changed files

Unchanged files loaded from SHA256 cache. Only new/modified files go through extraction.

3. Merge & rebuild

Load existing graph.json, merge with new extractions, re-cluster, re-analyze, re-export.

Design Principles

Pure Composition, No Shared State

Each module consumes dicts/graphs and produces dicts/graphs. No globals, no hidden dependencies. This enables:

  • Testing: Unit tests with fixtures, no complex setup
  • Parallelism: Subagent dispatch for semantic extraction
  • Composability: Use functions programmatically (from graphify import god_nodes)
  • Debugging: Inspect/modify/replay intermediate outputs

Two-Pass Extraction: Why?

Pass 1: Deterministic AST
  • Free (no LLM calls)
  • Fast (milliseconds)
  • Reproducible (same input = same output)
  • 19 languages via tree-sitter
Pass 2: LLM Semantic
  • Costs tokens (Claude/GPT-4)
  • Slower (seconds per file)
  • Probabilistic (confidence scores)
  • Handles docs, papers, images

The hybrid approach means code is always free. You only pay LLM costs for unstructured content that can't be parsed deterministically.

Lazy Imports for Skill Bootstrap

__init__.py uses __getattr__ to defer imports until needed. Why? Because graphify install must work before heavy dependencies (tree-sitter, NetworkX, graspologic) are installed. The skill framework may be installed in a minimal environment.

Language-Agnostic Extraction Schema

One LanguageConfig dataclass template supports all 19 languages. Customization points are AST node type names, field names, import handlers, and post-processing functions. Adding a new language = defining a new LanguageConfig instance.
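A sketch of what such a dataclass might look like; the field names are illustrative, and the real `LanguageConfig` in extract.py also carries import handlers and post-processing hooks.

```python
from dataclasses import dataclass

@dataclass
class LanguageConfig:
    """Per-language customization points (illustrative field names)."""
    name: str
    function_node: str       # tree-sitter node type for function definitions
    class_node: str          # tree-sitter node type for class-like definitions
    name_field: str = "name" # field holding the definition's identifier
    import_nodes: tuple = () # node types that introduce imports

# Adding a language is just another instance:
PYTHON = LanguageConfig(
    name="python",
    function_node="function_definition",
    class_node="class_definition",
    import_nodes=("import_statement", "import_from_statement"),
)
```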

Extraction Schema

Every extractor (AST or semantic) returns the same shape:

{
  "nodes": [
    {
      "id": "auth.py::UserService",
      "label": "UserService",
      "file_type": "code",
      "source_file": "auth.py",
      "source_location": "L12-L45",
      "docstring": "Handles user authentication...",
      "rationale": ""
    }
  ],
  "edges": [
    {
      "source": "auth.py::UserService",
      "target": "auth.py::hash_password",
      "relation": "calls",
      "confidence": "EXTRACTED",
      "confidence_score": 1.0,
      "source_file": "auth.py"
    }
  ]
}