Implementation Deep Dive
How the key modules work under the hood.
File Detection
The detect.py module classifies files using a FileType enum: CODE, DOCUMENT, PAPER, IMAGE.
Paper Detection Heuristic
PDFs aren't automatically classified as PAPER. Graphify scans the first 3000 characters and requires at least three of these signals:
- arXiv ID patterns (`arXiv:1706.03762`)
- `doi:` references
- "abstract" and section headers
- Citation patterns (`[1]`, `(Smith et al.)`)
- "we propose", "literature" keywords
This avoids treating every PDF invoice as an academic paper.
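The heuristic can be sketched as follows. This is a minimal illustration, not Graphify's actual `detect.py`: the regex patterns are a subset of the signals listed above, and the threshold default mirrors the "3+ signals" rule.

```python
import re

# Illustrative signal patterns -- the real detect.py may use more or different ones.
SIGNALS = [
    re.compile(r"arXiv:\d{4}\.\d{4,5}", re.IGNORECASE),  # arXiv ID
    re.compile(r"\bdoi:", re.IGNORECASE),                # DOI reference
    re.compile(r"\babstract\b", re.IGNORECASE),          # abstract header
    re.compile(r"\[\d+\]"),                              # [1]-style citation
    re.compile(r"\bwe propose\b", re.IGNORECASE),        # paper phrasing
]

def looks_like_paper(text: str, min_signals: int = 3) -> bool:
    head = text[:3000]  # only scan the first 3000 characters
    hits = sum(1 for pattern in SIGNALS if pattern.search(head))
    return hits >= min_signals

# A paper abstract trips several signals; an invoice trips none.
```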
Security Filtering
The _is_sensitive() function blocks files matching patterns for:
| Pattern | Example |
|---|---|
| `.env*` | `.env`, `.env.local` |
| `*.pem`, `*.key` | `server.key`, `cert.pem` |
| `credentials*` | `credentials.json` |
| `*service-account*` | `gcp-service-account.json` |
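A glob-based filter like the one above can be sketched with the standard library's `fnmatch`; the real `_is_sensitive()` may use more patterns or a different matching strategy.

```python
from fnmatch import fnmatch

# Patterns from the table above; illustrative, not the exhaustive list.
SENSITIVE_PATTERNS = [".env*", "*.pem", "*.key", "credentials*", "*service-account*"]

def is_sensitive(filename: str) -> bool:
    # Lowercase so CREDENTIALS.JSON is caught on case-sensitive filesystems too
    name = filename.lower()
    return any(fnmatch(name, pattern) for pattern in SENSITIVE_PATTERNS)
```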
AST Extraction
The core of Pass 1: tree-sitter parses source code into a concrete syntax tree, then Graphify walks the tree to extract nodes and edges.
See how Python code maps to graph nodes and edges.
Source Code
```python
class UserService:
    """Handles user authentication"""

    def login(self, email, pw):
        # WHY: bcrypt chosen for timing-attack resistance
        hashed = hash_password(pw)
        return self.create_session(email)

    def create_session(self, email):
        return Session(email)
```
Extracted Graph
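The graph for the snippet above can be written out as node and edge lists. This is a hedged sketch: the IDs, attribute names, and edge labels are illustrative, not Graphify's exact schema.

```python
# Illustrative extraction result for the UserService example.
nodes = [
    {"id": "UserService", "kind": "class", "doc": "Handles user authentication"},
    {"id": "UserService.login", "kind": "function"},
    {"id": "UserService.create_session", "kind": "function"},
]
edges = [
    ("UserService", "UserService.login", "contains"),
    ("UserService", "UserService.create_session", "contains"),
    # Pass 2 resolves self.create_session(...) to a "calls" edge
    ("UserService.login", "UserService.create_session", "calls"),
]
```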
The LanguageConfig Pattern
All 19 languages share one generic walker (_extract_generic) parameterized by a LanguageConfig dataclass:
```python
class LanguageConfig:
    ts_module: str                       # "tree_sitter_python"
    class_types: frozenset               # {"class_definition"}
    function_types: frozenset            # {"function_definition"}
    call_types: frozenset                # {"call"}
    import_types: frozenset              # {"import_statement", "import_from_statement"}
    name_field: str                      # "name"
    body_field: str                      # "body"
    call_function_field: str             # "function"
    call_accessor_node_types: frozenset  # {"attribute"}
    resolve_function_name_fn             # C/C++ declarator handling
    extra_walk_fn                        # JS arrow functions, C# namespaces
```
Language-specific quirks are handled through the customization points rather than separate extraction functions.
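The pattern can be illustrated with a toy tree standing in for the tree-sitter CST. `ToyConfig` and `extract_generic` below are simplified stand-ins for `LanguageConfig` and `_extract_generic`, showing only the parameterization idea: one walker, per-language configs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToyConfig:
    class_types: frozenset
    function_types: frozenset
    name_field: str

# A per-language config: only the node-type names change across languages.
PY_CONFIG = ToyConfig(
    class_types=frozenset({"class_definition"}),
    function_types=frozenset({"function_definition"}),
    name_field="name",
)

def extract_generic(node: dict, config: ToyConfig, out: list) -> list:
    kind = node["type"]
    if kind in config.class_types or kind in config.function_types:
        out.append((kind, node[config.name_field]))
    for child in node.get("children", []):
        extract_generic(child, config, out)  # depth-first walk
    return out

tree = {"type": "module", "children": [
    {"type": "class_definition", "name": "UserService", "children": [
        {"type": "function_definition", "name": "login", "children": []},
    ]},
]}
```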
Two-Pass Call Graph
Pass 1: Walk AST, collect basic structure (classes, functions, imports).
Pass 2: Walk function bodies to find call sites. Build a "call name" by concatenating the callee expression:
- `helper()` → `helper`
- `self.validate()` → `self_validate`
- `auth.service.login()` → `auth_service_login`
Then match against known function IDs. Matches become INFERRED "calls" edges with confidence based on name uniqueness.
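Pass 2's name construction and matching can be sketched as below. The `match_call` helper and the `1 / candidate-count` confidence formula are illustrative instantiations of "confidence based on name uniqueness", not Graphify's exact code.

```python
def call_name(callee_expr: str) -> str:
    # "auth.service.login" -> "auth_service_login"
    return callee_expr.replace(".", "_")

def match_call(callee_expr: str, known_ids: dict) -> tuple:
    """Return (target_id, confidence); ambiguous names get lower confidence."""
    name = call_name(callee_expr)
    candidates = known_ids.get(name, [])
    if not candidates:
        return (None, 0.0)       # no INFERRED edge emitted
    # Unique name -> confidence 1.0; name shared by N functions -> 1/N
    return (candidates[0], 1.0 / len(candidates))

# known_ids maps a call name to the function IDs that carry it (illustrative)
known = {
    "auth_service_login": ["src/auth.py::login"],
    "helper": ["src/a.py::helper", "src/b.py::helper"],
}
```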
Node Deduplication (Three Layers)
- Within file: a `seen_ids` set tracks emitted node IDs per file
- Between files: NetworkX `add_node()` is idempotent; semantic nodes intentionally overwrite AST nodes
- Pre-build merge: the Skill deduplicates cached and new semantic extractions
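The first two layers can be sketched in pure Python, with a dict emulating NetworkX's idempotent `add_node()` upsert semantics (re-adding a node merges attributes, so later semantic attributes win). Function names here are illustrative.

```python
graph_nodes: dict = {}
seen_ids: set = set()  # layer 1: per-file dedup

def emit_node(node_id: str, **attrs) -> None:
    if node_id in seen_ids:   # within-file: skip duplicate emissions
        return
    seen_ids.add(node_id)
    add_node(node_id, **attrs)

def add_node(node_id: str, **attrs) -> None:
    # between-files: idempotent upsert, like NetworkX add_node();
    # attributes from later calls (semantic pass) overwrite earlier (AST) ones
    graph_nodes.setdefault(node_id, {}).update(attrs)

add_node("svc.login", kind="ast")
add_node("svc.login", kind="semantic", summary="handles login")
# "svc.login" now carries kind="semantic": the semantic node won
```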
Community Detection
Leiden Algorithm
Leiden optimizes modularity — a measure of how densely connected nodes are within communities vs. between them. It guarantees well-connected communities (unlike Louvain, which can produce poorly-connected ones).
Implementation: via the graspologic library. Falls back to NetworkX Louvain with `max_level=10`, `threshold=1e-4` (tuned to prevent hangs on large sparse graphs).
Oversized Community Splitting
```python
if len(community) > 0.25 * len(G.nodes) and len(community) >= 10:
    subgraph = G.subgraph(community)
    sub_communities = leiden(subgraph)  # recursive split
```
This prevents one giant cluster from dominating. Communities are re-indexed by size (community 0 = largest).
Cohesion Score
```python
cohesion = actual_intra_edges / max_possible_edges
# where max_possible_edges = n * (n - 1) / 2 for n nodes
```
Range 0.0–1.0. Communities with cohesion below 0.15 are flagged in suggested questions as splitting candidates.
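A worked example of the score, using the formula above (numbers are illustrative):

```python
def cohesion(n_nodes: int, intra_edges: int) -> float:
    max_possible = n_nodes * (n_nodes - 1) / 2
    return intra_edges / max_possible if max_possible else 0.0

# 10 nodes with 12 internal edges out of a possible 45:
score = cohesion(10, 12)  # about 0.267 -- above the 0.15 threshold
loose = cohesion(10, 5)   # about 0.111 -- flagged as a splitting candidate
```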
Caching Strategy
SHA256 Content Hashing
Cache key = SHA256 of the file contents, not the path. This means:
- Identical files in different locations share one cache entry
- Renaming a file (same content) = cache hit
- Modifying a file (different content) = cache miss
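The key derivation can be sketched with `hashlib`; the real implementation may mix in extra fields (such as an extractor version), so treat this as a minimal sketch.

```python
import hashlib

def cache_key(contents: bytes) -> str:
    # Content-addressed: the same bytes hash to the same key anywhere on disk
    return hashlib.sha256(contents).hexdigest()

a = cache_key(b"def f(): pass")
b = cache_key(b"def f(): pass")      # same content, any path or name
c = cache_key(b"def f(): return 1")  # one-character edit
# a == b (rename/move = hit), a != c (modification = miss)
```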
A cache hit means no LLM cost; on a miss, the extraction runs and the result is saved to cache.
Atomic writes: Results write to .tmp file first, then rename. Prevents corruption from interrupted writes.
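The write-then-rename pattern can be sketched as follows; the function name and JSON payload are illustrative, not Graphify's actual writer.

```python
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    # Temp file must live in the same directory so the rename stays atomic
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    # Atomic rename: readers see the old file or the new one, never a partial write
    os.replace(tmp_path, path)
```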
Security Model
Defense-in-depth approach: all external input is validated before use.
| Layer | Function | Threat |
|---|---|---|
| URL | validate_url() | SSRF — blocks file://, private IPs, cloud metadata |
| Download | safe_fetch() | DoS — streaming with 50MB cap |
| Path | validate_graph_path() | Path traversal — must resolve inside graphify-out/ |
| Label | sanitize_label() | XSS + prompt injection — strip control chars, HTML-escape, 256 char cap |
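The label layer can be sketched with the standard library; the real `sanitize_label()` may differ in ordering and in which characters it strips.

```python
import html
import re

_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # strip ASCII control characters

def sanitize_label(label: str, max_len: int = 256) -> str:
    cleaned = _CONTROL_CHARS.sub("", label)
    # HTML-escape to neutralize <script> etc., then cap the length
    return html.escape(cleaned)[:max_len]
```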
Advanced Security Details
- Redirect safety: `_NoFileRedirectHandler` intercepts HTTP redirects and blocks any resolving to `file://` (open redirect → SSRF)
- YAML injection: `_yaml_str()` escapes quotes and newlines in frontmatter values when ingesting URLs
- Cypher injection: node labels are sanitized before generating Neo4j `CREATE` statements
- IP resolution: hostnames are resolved and checked against `ipaddress` module ranges (127.x, 10.x, 172.16.x–172.31.x, 192.168.x, 169.254.x)
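The resolution check can be sketched with `socket` and `ipaddress`; the real validator likely covers more cases (IPv6, multiple A records), so this is a minimal sketch.

```python
import ipaddress
import socket

def is_blocked_host(hostname: str) -> bool:
    try:
        # Resolve the hostname, then classify the resulting address
        addr = ipaddress.ip_address(socket.gethostbyname(hostname))
    except (socket.gaierror, ValueError):
        return True  # fail closed on unresolvable hosts
    # is_private covers 10.x, 172.16/12, 192.168.x; link-local covers
    # 169.254.x (cloud metadata); loopback covers 127.x
    return addr.is_private or addr.is_loopback or addr.is_link_local
```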
Testing Philosophy
All tests are pure unit tests — no network calls, no file system side effects outside tmp_path.
- 1:1 mapping: Each module has a corresponding test file
- Fixture-based: sample files in `tests/fixtures/` (Python, JS, JSON extractions)
- CI matrix: Python 3.10 + 3.12 via GitHub Actions
- No mocks for core logic: Functions are pure, so fixtures suffice
| Test File | Covers |
|---|---|
| `test_detect.py` | File classification, `.graphifyignore`, symlink safety |
| `test_extract.py` | AST extraction for all 19 languages |
| `test_build.py` | Graph assembly, deduplication, hyperedges |
| `test_cluster.py` | Community detection, cohesion, splitting |
| `test_security.py` | URL validation, path traversal, sanitization |
| `test_languages.py` | Multi-language extraction |