Architecture & Pipeline
How Graphify transforms files into a knowledge graph, step by step.
The 7-Stage Pipeline
Each stage is a pure function in its own module. They communicate through plain Python dicts and NetworkX graphs — no shared state, no side effects outside graphify-out/.
1. Detect
Input: Directory path
Output: Classified file list {code: [...], doc: [...], paper: [...], image: [...]}
What happens: Recursively walks the directory, classifies files by extension and heuristics, filters sensitive files, applies .graphifyignore patterns, reports corpus health.
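The classification step can be sketched as a simple extension lookup. The extension sets below are illustrative assumptions, and the real classifier also applies heuristics and sensitive-file filtering:

```python
from pathlib import Path

# Illustrative extension sets -- the actual lists are assumptions here.
CODE_EXTS = {".py", ".js", ".ts", ".go", ".rs"}
DOC_EXTS = {".md", ".rst", ".txt"}
PAPER_EXTS = {".pdf"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".svg"}

def classify_file(path: Path) -> str:
    """Bucket a file into code/doc/paper/image by extension."""
    ext = path.suffix.lower()
    for bucket, exts in (("code", CODE_EXTS), ("doc", DOC_EXTS),
                         ("paper", PAPER_EXTS), ("image", IMAGE_EXTS)):
        if ext in exts:
            return bucket
    return "other"
```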
2. Extract
Input: File paths
Output: {nodes: [...], edges: [...]} dicts
What happens: Code files get tree-sitter AST parsing (Pass 1, deterministic). Docs/papers/images get Claude semantic extraction (Pass 2, via parallel subagents). Both produce the same schema.
3. Build
Input: List of extraction dicts
Output: nx.Graph
What happens: Validates extractions against schema, merges all nodes and edges into a single NetworkX graph. Handles deduplication and dangling edge removal.
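A minimal sketch of the merge step. Nodes are added first so cross-file edges resolve, then edges whose endpoints never appear anywhere are dropped; the real build() also validates against the schema:

```python
import networkx as nx

def build(extractions: list) -> nx.Graph:
    """Merge {nodes, edges} dicts into one graph, dedup nodes, drop dangling edges."""
    G = nx.Graph()
    # First pass: add every node so edges can resolve across files.
    for ex in extractions:
        for node in ex.get("nodes", []):
            attrs = {k: v for k, v in node.items() if k != "id"}
            G.add_node(node["id"], **attrs)  # re-adding merges attributes (dedup)
    # Second pass: keep only edges whose endpoints exist (dangling edge removal).
    for ex in extractions:
        for edge in ex.get("edges", []):
            if edge["source"] in G and edge["target"] in G:
                G.add_edge(edge["source"], edge["target"],
                           relation=edge.get("relation"))
    return G
```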
4. Cluster
Input: nx.Graph
Output: Graph with community attribute on each node
What happens: Runs Leiden (or Louvain fallback) community detection. Splits oversized communities. Computes cohesion scores.
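A sketch of the Louvain fallback path using NetworkX's built-in implementation (the Leiden branch via graspologic is omitted here):

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cluster(G: nx.Graph) -> dict:
    """Partition G and write a community id onto every node."""
    partition = louvain_communities(G, seed=42)  # list of node sets
    for cid, members in enumerate(partition):
        for node in members:
            G.nodes[node]["community"] = cid
    return {cid: sorted(members) for cid, members in enumerate(partition)}
```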
5. Analyze
Input: Graph + communities
Output: Analysis dict (god nodes, surprises, questions)
What happens: Identifies highest-degree nodes, ranks cross-community edges by surprise score, generates suggested questions, computes graph diff (if update mode).
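Two of the analysis primitives are easy to sketch. The surprise score below (product of endpoint degrees) is a stand-in assumption, not necessarily Graphify's actual formula:

```python
import networkx as nx

def god_nodes(G: nx.Graph, top_n: int = 5):
    """Highest-degree nodes: the hubs everything touches."""
    return sorted(G.degree, key=lambda pair: pair[1], reverse=True)[:top_n]

def surprising_connections(G: nx.Graph, top_n: int = 5):
    """Cross-community edges, ranked by a placeholder surprise score."""
    cross = [(u, v) for u, v in G.edges
             if G.nodes[u].get("community") != G.nodes[v].get("community")]
    return sorted(cross, key=lambda e: G.degree(e[0]) * G.degree(e[1]),
                  reverse=True)[:top_n]
```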
6. Report
Input: Graph + analysis
Output: GRAPH_REPORT.md string
What happens: Renders a human-readable audit report with god nodes, surprising connections, hyperedges, suggested questions, rationale nodes, and token cost.
7. Export
Input: Graph + communities + output dir
Output: graph.json, graph.html, optionally Obsidian vault, SVG, GraphML, Cypher
What happens: Serializes graph to multiple formats. HTML uses vis.js for interactive visualization. Obsidian creates backlinked markdown notes.
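The JSON side of export is straightforward; a sketch using NetworkX's node-link serialization, the format named above for graph.json:

```python
import json
import networkx as nx

def export_json(G: nx.Graph, out_path: str) -> None:
    """Write the graph in NetworkX node-link format."""
    data = nx.node_link_data(G)
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```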
Module Map
Core Pipeline
- collect_files(root) → directory → filtered [Path] list
- detect(root) → classified dict with file counts & health warnings
- detect_incremental(root) → only new/modified files via manifest mtime comparison
- classify_file(path) → FileType enum (CODE, DOCUMENT, PAPER, IMAGE)
- extract(paths) → merged {nodes, edges} from all files
- _extract_generic(path, config) → generic tree-sitter walker for all 19 languages; each language defined as a LanguageConfig dataclass instance
- extract_python(path) → Python-specific extraction with decorators & inheritance
- build_from_json(extraction) → single extraction dict → nx.Graph
- build(extractions) → multiple dicts → merged graph; validates schema, drops dangling edges, preserves direction metadata
- cluster(G) → {community_id: [node_ids]}
- _partition(G) → Leiden (graspologic) or Louvain (NetworkX) fallback
- _split_community(G, nodes) → recursive split for oversized clusters
- cohesion_score(G, nodes) → intra-edge density (0.0–1.0)
- god_nodes(G, top_n) → highest-degree nodes after filtering
- surprising_connections(G, communities, top_n) → cross-community edges by surprise score
- suggest_questions(G, communities) → 4–7 targeted questions
- graph_diff(G_old, G_new) → added/removed nodes & edges
- render_report(G, analysis) → GRAPH_REPORT.md markdown string; sections: corpus check, summary stats, god nodes, surprises, hyperedges, questions, rationale, token cost
- export(G, out_dir, ...) → graph.json, graph.html, and optional formats
  - HTML: vis.js force-directed layout with sidebar search, community legend, node info
  - JSON: NetworkX node-link format
  - Also: SVG, GraphML, Cypher, Obsidian vault
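One of the helpers above, cohesion_score, is simple enough to sketch directly as intra-edge density:

```python
import networkx as nx

def cohesion_score(G: nx.Graph, nodes) -> float:
    """Edges inside the node set divided by the maximum possible,
    yielding a value in [0.0, 1.0]."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    if n < 2:
        return 0.0
    return sub.number_of_edges() / (n * (n - 1) / 2)
```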
Supporting Modules
check_semantic_cache(files) splits files into cached/uncached; save_semantic_cache() writes per-file cache entries; atomic writes via .tmp → rename
validate_url() blocks SSRF; safe_fetch() enforces size caps; validate_graph_path() prevents path traversal; sanitize_label() prevents XSS/injection
watch.py: File system monitoring via watchdog (code: instant rebuild, docs: notify user)
hooks.py: Git post-commit/post-checkout hook install/uninstall
ingest.py: URL fetching (tweets, papers, webpages) with security validation
benchmark.py: Token reduction measurement (corpus tokens vs graph query tokens)
wiki.py: Wikipedia-style markdown articles per community
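The atomic cache write (.tmp → rename) mentioned above can be sketched as follows; the function name and manifest layout are illustrative, not Graphify's exact API:

```python
import json
import os
from pathlib import Path

def save_cache_entry(cache_dir: Path, key: str, payload: dict) -> None:
    """Write a per-file cache entry atomically: dump to a .tmp sibling,
    then rename into place so a crash never leaves a half-written entry."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    final = cache_dir / f"{key}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, final)  # atomic rename on POSIX and Windows
```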
Data Flow
Main Pipeline
Walk directory, classify files into {code, doc, paper, image}, filter sensitive files, apply .graphifyignore
Deterministic AST parsing via tree-sitter. Free, fast, reproducible. Produces {nodes, edges} with EXTRACTED confidence.
Claude extracts concepts from docs, papers, images. Runs in parallel. Produces INFERRED and AMBIGUOUS edges with confidence scores.
Merge all extractions into a single NetworkX graph. Validate schema. Remove dangling edges.
Leiden community detection. Split oversized communities. Compute cohesion scores.
God nodes, surprising connections, suggested questions → GRAPH_REPORT.md
Write graph.json, graph.html, and optional formats to graphify-out/
Incremental Update (--update)
Compare file mtimes against manifest. Return only new/modified files.
Unchanged files loaded from SHA256 cache. Only new/modified files go through extraction.
Load existing graph.json, merge with new extractions, re-cluster, re-analyze, re-export.
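The mtime comparison driving incremental mode can be sketched like this; the manifest format ({relative_path: mtime}) is an assumption for illustration:

```python
import json
from pathlib import Path

def detect_incremental(root: Path, manifest_path: Path) -> list:
    """Return files that are new or whose mtime differs from the manifest."""
    try:
        manifest = json.loads(manifest_path.read_text())
    except FileNotFoundError:
        manifest = {}  # no manifest yet: everything counts as new
    changed = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        key = str(path.relative_to(root))
        if manifest.get(key) != path.stat().st_mtime:
            changed.append(path)
    return changed
```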
Design Principles
Pure Composition, No Shared State
Each module consumes dicts/graphs and produces dicts/graphs. No globals, no hidden dependencies. This enables:
- Testing: Unit tests with fixtures, no complex setup
- Parallelism: Subagent dispatch for semantic extraction
- Composability: Use functions programmatically (from graphify import god_nodes)
- Debugging: Inspect/modify/replay intermediate outputs
Two-Pass Extraction: Why?
Pass 1 (AST):
- Free (no LLM calls)
- Fast (milliseconds)
- Reproducible (same input = same output)
- 19 languages via tree-sitter
Pass 2 (Semantic):
- Costs tokens (Claude/GPT-4)
- Slower (seconds per file)
- Probabilistic (confidence scores)
- Handles docs, papers, images
The hybrid approach means code is always free. You only pay LLM costs for unstructured content that can't be parsed deterministically.
Lazy Imports for Skill Bootstrap
__init__.py uses __getattr__ to defer imports until needed. Why? Because graphify install must work before heavy dependencies (tree-sitter, NetworkX, graspologic) are installed. The skill framework may be installed in a minimal environment.
Language-Agnostic Extraction Schema
One LanguageConfig dataclass template supports all 19 languages. Customization points are AST node type names, field names, import handlers, and post-processing functions. Adding a new language = defining a new LanguageConfig instance.
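The shape of the template can be sketched as follows; the field names are assumptions for illustration, not Graphify's exact attributes:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LanguageConfig:
    """Illustrative per-language template (field names assumed)."""
    name: str
    function_nodes: tuple            # tree-sitter node types for functions
    class_nodes: tuple               # node types for classes/types
    name_field: str = "name"         # AST field that holds the identifier
    import_handler: Optional[Callable] = None
    post_process: Optional[Callable] = None

# Adding a language is then just defining a new instance:
GO = LanguageConfig(
    name="go",
    function_nodes=("function_declaration", "method_declaration"),
    class_nodes=("type_declaration",),
)
```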
Extraction Schema
Every extractor (AST or semantic) returns the same shape:
{
"nodes": [
{
"id": "auth.py::UserService",
"label": "UserService",
"file_type": "code",
"source_file": "auth.py",
"source_location": "L12-L45",
"docstring": "Handles user authentication...",
"rationale": ""
}
],
"edges": [
{
"source": "auth.py::UserService",
"target": "auth.py::hash_password",
"relation": "calls",
"confidence": "EXTRACTED",
"confidence_score": 1.0,
"source_file": "auth.py"
}
]
}
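A minimal structural check against this shape; the required-key sets below are read off the example above, and the real validator may enforce more:

```python
NODE_REQUIRED = {"id", "label", "file_type", "source_file"}
EDGE_REQUIRED = {"source", "target", "relation", "confidence"}

def validate_extraction(extraction: dict) -> bool:
    """Raise ValueError if any node or edge is missing a required key."""
    for node in extraction.get("nodes", []):
        missing = NODE_REQUIRED - node.keys()
        if missing:
            raise ValueError(f"node {node.get('id')!r} missing {sorted(missing)}")
    for edge in extraction.get("edges", []):
        missing = EDGE_REQUIRED - edge.keys()
        if missing:
            raise ValueError(f"edge missing {sorted(missing)}")
    return True
```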