
Implementation Deep Dive

How the key modules work under the hood.

File Detection

The detect.py module classifies files using a FileType enum: CODE, DOCUMENT, PAPER, IMAGE.

Paper Detection Heuristic

PDFs aren't automatically classified as PAPER. Graphify scans the first 3000 characters for 3+ signals:

  • arXiv ID patterns (arXiv:1706.03762)
  • "doi:" references
  • "abstract" section headers
  • Citation patterns ([1], (Smith et al.))
  • "we propose", "literature" keywords

This avoids treating every PDF invoice as an academic paper.
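The signal-counting heuristic above can be sketched as follows. This is an illustrative reconstruction, not Graphify's actual code: the function name, regexes, and signal list are assumptions based on the bullets above.

```python
import re

# Hypothetical signal patterns mirroring the bullet list above
PAPER_SIGNALS = [
    re.compile(r"arXiv:\d{4}\.\d{4,5}", re.IGNORECASE),   # arXiv IDs
    re.compile(r"\bdoi:", re.IGNORECASE),                 # DOI references
    re.compile(r"\babstract\b", re.IGNORECASE),           # abstract header
    re.compile(r"\[\d+\]|\([A-Z][a-z]+ et al\.\)"),       # citation patterns
    re.compile(r"we propose|literature", re.IGNORECASE),  # keyword signals
]

def looks_like_paper(text: str, threshold: int = 3) -> bool:
    head = text[:3000]  # only scan the first 3000 characters
    hits = sum(1 for pat in PAPER_SIGNALS if pat.search(head))
    return hits >= threshold
```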

Security Filtering

The _is_sensitive() function blocks files matching patterns for:

Pattern              Example
.env*                .env, .env.local
*.pem, *.key         server.key, cert.pem
credentials*         credentials.json
*service-account*    gcp-service-account.json
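A minimal sketch of glob-style filtering over these patterns, assuming `fnmatch`-style matching on the basename; the function and pattern-list names are illustrative, not Graphify's actual `_is_sensitive()` implementation.

```python
import fnmatch
import os

# Pattern list mirrors the table above (assumed, for illustration)
SENSITIVE_PATTERNS = [".env*", "*.pem", "*.key", "credentials*", "*service-account*"]

def is_sensitive(path: str) -> bool:
    # Match on the lowercased basename so directory names don't interfere
    name = os.path.basename(path).lower()
    return any(fnmatch.fnmatch(name, pat) for pat in SENSITIVE_PATTERNS)
```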

AST Extraction

The core of Pass 1: tree-sitter parses source code into a concrete syntax tree, then Graphify walks the tree to extract nodes and edges.

Code → Extraction

See how Python code maps to graph nodes and edges.

Source Code

class UserService:
    """Handles user authentication"""

    def login(self, email, pw):
        # WHY: bcrypt chosen for timing-attack resistance
        hashed = hash_password(pw)
        return self.create_session(email)

    def create_session(self, email):
        return Session(email)

Extracted Graph

Node: CLASS UserService (L1-L9)
Node: FUNC login() (L4-L7)
Node: FUNC create_session() (L9-L10)
Edge: login → calls → hash_password INFERRED
Edge: login → calls → create_session EXTRACTED
Edge: UserService → contains → login EXTRACTED
Rationale: WHY bcrypt chosen for timing-attack resistance

The LanguageConfig Pattern

All 19 languages share one generic walker (_extract_generic) parameterized by a LanguageConfig dataclass:

class LanguageConfig:
    ts_module: str              # "tree_sitter_python"
    class_types: frozenset      # {"class_definition"}
    function_types: frozenset   # {"function_definition"}
    call_types: frozenset       # {"call"}
    import_types: frozenset     # {"import_statement", "import_from_statement"}
    name_field: str             # "name"
    body_field: str             # "body"
    call_function_field: str    # "function"
    call_accessor_node_types: frozenset  # {"attribute"}
    resolve_function_name_fn    # C/C++ declarator handling
    extra_walk_fn               # JS arrow functions, C# namespaces

Language-specific quirks are handled through the customization points rather than separate extraction functions.
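As a concrete illustration, a Python configuration under this pattern might look like the sketch below. The dataclass and the `PYTHON_CONFIG` instance are hypothetical simplifications of the fields listed above, not Graphify's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class LanguageConfig:
    ts_module: str
    class_types: frozenset
    function_types: frozenset
    call_types: frozenset
    import_types: frozenset
    name_field: str = "name"
    body_field: str = "body"
    call_function_field: str = "function"
    call_accessor_node_types: frozenset = frozenset()
    resolve_function_name_fn: Optional[Callable] = None  # C/C++ declarators
    extra_walk_fn: Optional[Callable] = None             # JS arrows, C# namespaces

# Assumed values for Python, matching the inline comments above
PYTHON_CONFIG = LanguageConfig(
    ts_module="tree_sitter_python",
    class_types=frozenset({"class_definition"}),
    function_types=frozenset({"function_definition"}),
    call_types=frozenset({"call"}),
    import_types=frozenset({"import_statement", "import_from_statement"}),
    call_accessor_node_types=frozenset({"attribute"}),
)
```

The generic walker would receive one such config per language, so adding a language means writing a config, not a new extractor.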

Two-Pass Call Graph

Pass 1: Walk AST, collect basic structure (classes, functions, imports).

Pass 2: Walk function bodies to find call sites. Build a "call name" by concatenating the callee expression:

  • helper() → helper
  • self.validate() → self_validate
  • auth.service.login() → auth_service_login

Then match against known function IDs. Matches become INFERRED "calls" edges with confidence based on name uniqueness.
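Pass 2's name construction and matching can be sketched as below. The helper names and the exact confidence rule are illustrative assumptions; the source only says confidence is "based on name uniqueness".

```python
def build_call_name(callee_expr: str) -> str:
    # "auth.service.login()" -> "auth_service_login"
    return callee_expr.rstrip("()").replace(".", "_")

def match_call(call_name: str, known_funcs: dict):
    """known_funcs maps call names to lists of candidate function IDs."""
    candidates = known_funcs.get(call_name, [])
    if not candidates:
        return None  # unresolved call: no edge emitted
    # Assumed rule: a unique name yields full confidence; ambiguity divides it
    return candidates[0], 1.0 / len(candidates)
```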

Node Deduplication (Three Layers)

  1. Within file: seen_ids set tracks emitted node IDs per file
  2. Between files: NetworkX add_node() is idempotent; semantic nodes intentionally overwrite AST nodes
  3. Pre-build merge: Skill deduplicates cached and new semantic extractions
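Layers 1 and 2 can be sketched with a plain dict standing in for the NetworkX graph (dict assignment, like `add_node()`, is idempotent). The function name is an assumption.

```python
def emit_nodes(graph_nodes: dict, file_nodes: list) -> None:
    seen_ids = set()  # layer 1: per-file duplicate tracking
    for node_id, attrs in file_nodes:
        if node_id in seen_ids:
            continue  # skip duplicates within one file
        seen_ids.add(node_id)
        # Layer 2: re-adding across files merges; a later semantic node
        # for the same id deliberately overwrites the AST node
        graph_nodes[node_id] = attrs
```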

Community Detection

Leiden Algorithm

Leiden optimizes modularity — a measure of how densely connected nodes are within communities vs. between them. It guarantees well-connected communities (unlike Louvain, which can produce poorly-connected ones).

Implementation: Via graspologic library. Falls back to NetworkX Louvain with max_level=10, threshold=1e-4 (tuned to prevent hangs on large sparse graphs).

Oversized Community Splitting

if len(community) > 0.25 * len(G.nodes) and len(community) >= 10:
    subgraph = G.subgraph(community)
    sub_communities = leiden(subgraph)  # recursive split

This prevents one giant cluster from dominating. Communities are re-indexed by size (community 0 = largest).

Cohesion Score

cohesion = actual_intra_edges / max_possible_edges
# where max_possible = n * (n - 1) / 2 for n nodes

Range 0.0–1.0. Communities below 0.15 are flagged in suggested questions as splitting candidates.
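The formula above transcribes directly to code (undirected edges, no self-loops):

```python
def cohesion(n_nodes: int, intra_edges: int) -> float:
    # max_possible = n * (n - 1) / 2 for an undirected community of n nodes
    max_possible = n_nodes * (n_nodes - 1) / 2
    return intra_edges / max_possible if max_possible else 0.0
```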

Caching Strategy

SHA256 Content Hashing

Cache key = SHA256 of the file contents (the resolved path is recorded with the entry, not hashed into the key). Keying on content alone means:

  • Identical files in different locations share one cache entry
  • Renaming a file (same content) = cache hit
  • Modifying a file (different content) = cache miss

Cache Flow

  1. File list arrives from detect()
  2. check_semantic_cache() performs a SHA256 lookup
  3. Cache hit → load cached {nodes, edges}; no LLM cost
  4. Cache miss → send to LLM extraction, then save the result to cache
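A minimal sketch of the lookup, assuming one JSON file per content hash; the cache directory name and function signatures are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".graphify-cache")  # assumed location

def cache_key(content: bytes) -> str:
    # Content-addressed: identical bytes hash to the same key
    # regardless of where the file lives
    return hashlib.sha256(content).hexdigest()

def check_semantic_cache(content: bytes):
    """Return cached {nodes, edges} on a hit, else None (miss -> LLM)."""
    entry = CACHE_DIR / f"{cache_key(content)}.json"
    if entry.exists():
        return json.loads(entry.read_text())  # hit: no LLM cost
    return None
```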

Atomic writes: Results write to .tmp file first, then rename. Prevents corruption from interrupted writes.
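The tmp-then-rename pattern looks like this sketch; `os.replace()` is atomic on both POSIX and Windows, so readers never observe a half-written file. The helper name is an assumption.

```python
import os
from pathlib import Path

def atomic_write(path: Path, data: str) -> None:
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(data)       # interrupted here? final file untouched
    os.replace(tmp, path)      # atomic rename over the final name
```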

Security Model

Defense-in-depth approach: all external input is validated before use.

Layer     Function               Threat
URL       validate_url()         SSRF — blocks file://, private IPs, cloud metadata
Download  safe_fetch()           DoS — streaming with 50MB cap
Path      validate_graph_path()  Path traversal — must resolve inside graphify-out/
Label     sanitize_label()       XSS + prompt injection — strip control chars, HTML-escape, 256 char cap

Advanced Security Details

  • Redirect safety: _NoFileRedirectHandler intercepts HTTP redirects and blocks any resolving to file:// (open redirect → SSRF)
  • YAML injection: _yaml_str() escapes quotes and newlines in frontmatter values when ingesting URLs
  • Cypher injection: Node labels sanitized before generating Neo4j CREATE statements
  • IP resolution: Hostnames resolved and checked against private/reserved ranges via the ipaddress module (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16)
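The resolved-IP check can be sketched with the stdlib ipaddress module; real code would resolve the hostname first (e.g. via socket.getaddrinfo) and test every address it returns. The function name is an assumption.

```python
import ipaddress

def is_blocked_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private       # 10/8, 172.16/12, 192.168/16
        or addr.is_loopback   # 127/8
        or addr.is_link_local # 169.254/16 — cloud metadata lives here
    )
```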

Testing Philosophy

All tests are pure unit tests — no network calls, no file system side effects outside tmp_path.

  • 1:1 mapping: Each module has a corresponding test file
  • Fixture-based: Sample files in tests/fixtures/ (Python, JS, JSON extractions)
  • CI matrix: Python 3.10 + 3.12 via GitHub Actions
  • No mocks for core logic: Functions are pure, so fixtures suffice

Test File          Covers
test_detect.py     File classification, .graphifyignore, symlink safety
test_extract.py    AST extraction for all 19 languages
test_build.py      Graph assembly, deduplication, hyperedges
test_cluster.py    Community detection, cohesion, splitting
test_security.py   URL validation, path traversal, sanitization
test_languages.py  Multi-language extraction