Architecture & Pipeline
How Graphify transforms files into a knowledge graph, step by step.
The 7-Stage Pipeline
Each stage is a pure function in its own module. They communicate through plain Python dicts and NetworkX graphs — no shared state, no side effects outside graphify-out/.
1. Detect
Input: Directory path
Output: Classified file list {code: [...], doc: [...], paper: [...], image: [...]}
What happens: Recursively walks the directory, classifies files by extension and heuristics, filters sensitive files, applies .graphifyignore patterns, reports corpus health.
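The classification step can be sketched as a simple extension lookup. The extension sets below are illustrative assumptions, and the real classifier also applies heuristics and sensitive-file filtering:

```python
from pathlib import Path

# Illustrative extension sets -- the actual lists are assumptions here.
CODE_EXTS = {".py", ".js", ".ts", ".go", ".rs"}
DOC_EXTS = {".md", ".rst", ".txt"}
PAPER_EXTS = {".pdf"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".svg"}

def classify_file(path: Path) -> str:
    """Bucket a file into code/doc/paper/image by extension."""
    ext = path.suffix.lower()
    for bucket, exts in (("code", CODE_EXTS), ("doc", DOC_EXTS),
                         ("paper", PAPER_EXTS), ("image", IMAGE_EXTS)):
        if ext in exts:
            return bucket
    return "other"
```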
2. Extract
Input: File paths
Output: {nodes: [...], edges: [...]} dicts
What happens: Code files get tree-sitter AST parsing (Pass 1, deterministic). Docs/papers/images get Claude semantic extraction (Pass 2, via parallel subagents). Both produce the same schema.
3. Build
Input: List of extraction dicts
Output: nx.Graph
What happens: Validates extractions against schema, merges all nodes and edges into a single NetworkX graph. Handles deduplication and dangling edge removal.
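A minimal sketch of the merge step. Nodes are added first so cross-file edges resolve, then edges whose endpoints never appear anywhere are dropped; the real build() also validates against the schema:

```python
import networkx as nx

def build(extractions: list) -> nx.Graph:
    """Merge {nodes, edges} dicts into one graph, dedup nodes, drop dangling edges."""
    G = nx.Graph()
    # First pass: add every node so edges can resolve across files.
    for ex in extractions:
        for node in ex.get("nodes", []):
            attrs = {k: v for k, v in node.items() if k != "id"}
            G.add_node(node["id"], **attrs)  # re-adding merges attributes (dedup)
    # Second pass: keep only edges whose endpoints exist (dangling edge removal).
    for ex in extractions:
        for edge in ex.get("edges", []):
            if edge["source"] in G and edge["target"] in G:
                G.add_edge(edge["source"], edge["target"],
                           relation=edge.get("relation"))
    return G
```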
4. Cluster
Input: nx.Graph
Output: Graph with community attribute on each node
What happens: Runs Leiden (or Louvain fallback) community detection. Splits oversized communities. Computes cohesion scores.
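A sketch of the Louvain fallback path using NetworkX's built-in implementation (the Leiden branch via graspologic is omitted here):

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

def cluster(G: nx.Graph) -> dict:
    """Partition G and write a community id onto every node."""
    partition = louvain_communities(G, seed=42)  # list of node sets
    for cid, members in enumerate(partition):
        for node in members:
            G.nodes[node]["community"] = cid
    return {cid: sorted(members) for cid, members in enumerate(partition)}
```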
5. Analyze
Input: Graph + communities
Output: Analysis dict (god nodes, surprises, questions)
What happens: Identifies highest-degree nodes, ranks cross-community edges by surprise score, generates suggested questions, computes graph diff (if update mode).
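Two of the analysis primitives are easy to sketch. The surprise score below (product of endpoint degrees) is a stand-in assumption, not necessarily Graphify's actual formula:

```python
import networkx as nx

def god_nodes(G: nx.Graph, top_n: int = 5):
    """Highest-degree nodes: the hubs everything touches."""
    return sorted(G.degree, key=lambda pair: pair[1], reverse=True)[:top_n]

def surprising_connections(G: nx.Graph, top_n: int = 5):
    """Cross-community edges, ranked by a placeholder surprise score."""
    cross = [(u, v) for u, v in G.edges
             if G.nodes[u].get("community") != G.nodes[v].get("community")]
    return sorted(cross, key=lambda e: G.degree(e[0]) * G.degree(e[1]),
                  reverse=True)[:top_n]
```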
6. Report
Input: Graph + analysis
Output: GRAPH_REPORT.md string
What happens: Renders a human-readable audit report with god nodes, surprising connections, hyperedges, suggested questions, rationale nodes, and token cost.
7. Export
Input: Graph + communities + output dir
Output: graph.json, graph.html, optionally Obsidian vault, SVG, GraphML, Cypher
What happens: Serializes graph to multiple formats. HTML uses vis.js for interactive visualization. Obsidian creates backlinked markdown notes.
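The JSON side of export is straightforward; a sketch using NetworkX's node-link serialization, the format named above for graph.json:

```python
import json
import networkx as nx

def export_json(G: nx.Graph, out_path: str) -> None:
    """Write the graph in NetworkX node-link format."""
    data = nx.node_link_data(G)
    with open(out_path, "w") as f:
        json.dump(data, f, indent=2)
```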
Module Map
Core Pipeline
- collect_files(root) → directory → filtered [Path] list
- detect(root) → classified dict with file counts & health warnings
- detect_incremental(root) → only new/modified files via manifest mtime comparison
- classify_file(path) → FileType enum (CODE, DOCUMENT, PAPER, IMAGE)
- extract(paths) → merged {nodes, edges} from all files
- _extract_generic(path, config) → generic tree-sitter walker for all 19 languages; each language defined as a LanguageConfig dataclass instance
- extract_python(path) → Python-specific extraction with decorators & inheritance
- build_from_json(extraction) → single extraction dict → nx.Graph
- build(extractions) → multiple dicts → merged graph; validates schema, drops dangling edges, preserves direction metadata
- cluster(G) → {community_id: [node_ids]}
- _partition(G) → Leiden (graspologic) or Louvain (NetworkX) fallback
- _split_community(G, nodes) → recursive split for oversized clusters
- cohesion_score(G, nodes) → intra-edge density (0.0–1.0)
- god_nodes(G, top_n) → highest-degree nodes after filtering
- surprising_connections(G, communities, top_n) → cross-community edges by surprise score
- suggest_questions(G, communities) → 4–7 targeted questions
- graph_diff(G_old, G_new) → added/removed nodes & edges
- render_report(G, analysis) → GRAPH_REPORT.md markdown string; sections: corpus check, summary stats, god nodes, surprises, hyperedges, questions, rationale, token cost
- export(G, out_dir, ...) → graph.json, graph.html, and optional formats
  - HTML: vis.js force-directed layout with sidebar search, community legend, node info
  - JSON: NetworkX node-link format
  - Also: SVG, GraphML, Cypher, Obsidian vault
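One of the helpers above, cohesion_score, is simple enough to sketch directly as intra-edge density:

```python
import networkx as nx

def cohesion_score(G: nx.Graph, nodes) -> float:
    """Edges inside the node set divided by the maximum possible,
    yielding a value in [0.0, 1.0]."""
    sub = G.subgraph(nodes)
    n = sub.number_of_nodes()
    if n < 2:
        return 0.0
    return sub.number_of_edges() / (n * (n - 1) / 2)
```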
Supporting Modules
check_semantic_cache(files) splits files into cached/uncached; save_semantic_cache() writes per-file cache entries; atomic writes via .tmp → rename
validate_url() blocks SSRF; safe_fetch() enforces size caps; validate_graph_path() prevents path traversal; sanitize_label() prevents XSS/injection
watch.py: File system monitoring via watchdog (code: instant rebuild, docs: notify user)
hooks.py: Git post-commit/post-checkout hook install/uninstall
ingest.py: URL fetching (tweets, papers, webpages) with security validation
benchmark.py: Token reduction measurement (corpus tokens vs graph query tokens)
wiki.py: Wikipedia-style markdown articles per community
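The atomic cache write (.tmp → rename) mentioned above can be sketched as follows; the function name and manifest layout are illustrative, not Graphify's exact API:

```python
import json
import os
from pathlib import Path

def save_cache_entry(cache_dir: Path, key: str, payload: dict) -> None:
    """Write a per-file cache entry atomically: dump to a .tmp sibling,
    then rename into place so a crash never leaves a half-written entry."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    final = cache_dir / f"{key}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload))
    os.replace(tmp, final)  # atomic rename on POSIX and Windows
```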
Data Flow
Main Pipeline
Walk directory, classify files into {code, doc, paper, image}, filter sensitive files, apply .graphifyignore
Deterministic AST parsing via tree-sitter. Free, fast, reproducible. Produces {nodes, edges} with EXTRACTED confidence.
Claude extracts concepts from docs, papers, images. Runs in parallel. Produces INFERRED and AMBIGUOUS edges with confidence scores.
Merge all extractions into a single NetworkX graph. Validate schema. Remove dangling edges.
Leiden community detection. Split oversized communities. Compute cohesion scores.
God nodes, surprising connections, suggested questions → GRAPH_REPORT.md
Write graph.json, graph.html, and optional formats to graphify-out/
Incremental Update (--update)
Compare file mtimes against manifest. Return only new/modified files.
Unchanged files loaded from SHA256 cache. Only new/modified files go through extraction.
Load existing graph.json, merge with new extractions, re-cluster, re-analyze, re-export.
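The mtime comparison driving incremental mode can be sketched like this; the manifest format ({relative_path: mtime}) is an assumption for illustration:

```python
import json
from pathlib import Path

def detect_incremental(root: Path, manifest_path: Path) -> list:
    """Return files that are new or whose mtime differs from the manifest."""
    try:
        manifest = json.loads(manifest_path.read_text())
    except FileNotFoundError:
        manifest = {}  # no manifest yet: everything counts as new
    changed = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        key = str(path.relative_to(root))
        if manifest.get(key) != path.stat().st_mtime:
            changed.append(path)
    return changed
```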
Design Principles
Pure Composition, No Shared State
Each module consumes dicts/graphs and produces dicts/graphs. No globals, no hidden dependencies. This enables:
- Testing: Unit tests with fixtures, no complex setup
- Parallelism: Subagent dispatch for semantic extraction
- Composability: Use functions programmatically (from graphify import god_nodes)
- Debugging: Inspect/modify/replay intermediate outputs
Two-Pass Extraction: Why?
Pass 1 (AST):
- Free (no LLM calls)
- Fast (milliseconds)
- Reproducible (same input = same output)
- 19 languages via tree-sitter
Pass 2 (Semantic):
- Costs tokens (Claude/GPT-4)
- Slower (seconds per file)
- Probabilistic (confidence scores)
- Handles docs, papers, images
The hybrid approach means code is always free. You only pay LLM costs for unstructured content that can't be parsed deterministically.
Lazy Imports for Skill Bootstrap
__init__.py uses __getattr__ to defer imports until needed. Why? Because graphify install must work before heavy dependencies (tree-sitter, NetworkX, graspologic) are installed. The skill framework may be installed in a minimal environment.
Language-Agnostic Extraction Schema
One LanguageConfig dataclass template supports all 19 languages. Customization points are AST node type names, field names, import handlers, and post-processing functions. Adding a new language = defining a new LanguageConfig instance.
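The shape of the template can be sketched as follows; the field names are assumptions for illustration, not Graphify's exact attributes:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LanguageConfig:
    """Illustrative per-language template (field names assumed)."""
    name: str
    function_nodes: tuple            # tree-sitter node types for functions
    class_nodes: tuple               # node types for classes/types
    name_field: str = "name"         # AST field that holds the identifier
    import_handler: Optional[Callable] = None
    post_process: Optional[Callable] = None

# Adding a language is then just defining a new instance:
GO = LanguageConfig(
    name="go",
    function_nodes=("function_declaration", "method_declaration"),
    class_nodes=("type_declaration",),
)
```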
Extraction Schema
Every extractor (AST or semantic) returns the same shape:
{
"nodes": [
{
"id": "auth.py::UserService",
"label": "UserService",
"file_type": "code",
"source_file": "auth.py",
"source_location": "L12-L45",
"docstring": "Handles user authentication...",
"rationale": ""
}
],
"edges": [
{
"source": "auth.py::UserService",
"target": "auth.py::hash_password",
"relation": "calls",
"confidence": "EXTRACTED",
"confidence_score": 1.0,
"source_file": "auth.py"
}
]
}
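A minimal structural check against this shape; the required-key sets below are read off the example above, and the real validator may enforce more:

```python
NODE_REQUIRED = {"id", "label", "file_type", "source_file"}
EDGE_REQUIRED = {"source", "target", "relation", "confidence"}

def validate_extraction(extraction: dict) -> bool:
    """Raise ValueError if any node or edge is missing a required key."""
    for node in extraction.get("nodes", []):
        missing = NODE_REQUIRED - node.keys()
        if missing:
            raise ValueError(f"node {node.get('id')!r} missing {sorted(missing)}")
    for edge in extraction.get("edges", []):
        missing = EDGE_REQUIRED - edge.keys()
        if missing:
            raise ValueError(f"edge missing {sorted(missing)}")
    return True
```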