
Implementation Deep Dive

How the key modules work under the hood.

File Detection

The detect.py module classifies files using a FileType enum: CODE, DOCUMENT, PAPER, IMAGE.

Paper Detection Heuristic

PDFs aren't automatically classified as PAPER. Graphify scans the first 3000 characters for 3+ signals:

  • arXiv ID patterns (arXiv:1706.03762)
  • "doi:" references
  • "abstract" section headers
  • Citation patterns ([1], (Smith et al.))
  • "we propose", "literature" keywords

This avoids treating every PDF invoice as an academic paper.
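The signal-counting heuristic above can be sketched as follows. This is an illustrative reconstruction, not Graphify's actual code: the function name, regexes, and signal list are assumptions based on the bullets above.

```python
import re

# Hypothetical signal patterns mirroring the bullet list above
PAPER_SIGNALS = [
    re.compile(r"arXiv:\d{4}\.\d{4,5}", re.IGNORECASE),   # arXiv IDs
    re.compile(r"\bdoi:", re.IGNORECASE),                 # DOI references
    re.compile(r"\babstract\b", re.IGNORECASE),           # abstract header
    re.compile(r"\[\d+\]|\([A-Z][a-z]+ et al\.\)"),       # citation patterns
    re.compile(r"we propose|literature", re.IGNORECASE),  # keyword signals
]

def looks_like_paper(text: str, threshold: int = 3) -> bool:
    head = text[:3000]  # only scan the first 3000 characters
    hits = sum(1 for pat in PAPER_SIGNALS if pat.search(head))
    return hits >= threshold
```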

Security Filtering

The _is_sensitive() function blocks files matching patterns for:

Pattern              Example
.env*                .env, .env.local
*.pem, *.key         server.key, cert.pem
credentials*         credentials.json
*service-account*    gcp-service-account.json
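A minimal sketch of glob-style filtering over these patterns, assuming `fnmatch`-style matching on the basename; the function and pattern-list names are illustrative, not Graphify's actual `_is_sensitive()` implementation.

```python
import fnmatch
import os

# Pattern list mirrors the table above (assumed, for illustration)
SENSITIVE_PATTERNS = [".env*", "*.pem", "*.key", "credentials*", "*service-account*"]

def is_sensitive(path: str) -> bool:
    # Match on the lowercased basename so directory names don't interfere
    name = os.path.basename(path).lower()
    return any(fnmatch.fnmatch(name, pat) for pat in SENSITIVE_PATTERNS)
```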

AST Extraction

The core of Pass 1: tree-sitter parses source code into a concrete syntax tree, then Graphify walks the tree to extract nodes and edges.

Code → Extraction

See how Python code maps to graph nodes and edges.

Source Code

class UserService:
    """Handles user authentication"""

    def login(self, email, pw):
        # WHY: bcrypt chosen for timing-attack resistance
        hashed = hash_password(pw)
        return self.create_session(email)

    def create_session(self, email):
        return Session(email)

Extracted Graph

Node: CLASS UserService (L1-L9)
Node: FUNC login() (L4-L7)
Node: FUNC create_session() (L9-L10)
Edge: login → calls → hash_password INFERRED
Edge: login → calls → create_session EXTRACTED
Edge: UserService → contains → login EXTRACTED
Rationale: WHY bcrypt chosen for timing-attack resistance

The LanguageConfig Pattern

All 19 languages share one generic walker (_extract_generic) parameterized by a LanguageConfig dataclass:

class LanguageConfig:
    ts_module: str              # "tree_sitter_python"
    class_types: frozenset      # {"class_definition"}
    function_types: frozenset   # {"function_definition"}
    call_types: frozenset       # {"call"}
    import_types: frozenset     # {"import_statement", "import_from_statement"}
    name_field: str             # "name"
    body_field: str             # "body"
    call_function_field: str    # "function"
    call_accessor_node_types: frozenset  # {"attribute"}
    resolve_function_name_fn    # C/C++ declarator handling
    extra_walk_fn               # JS arrow functions, C# namespaces

Language-specific quirks are handled through the customization points rather than separate extraction functions.
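As a concrete illustration, a Python configuration under this pattern might look like the sketch below. The dataclass and the `PYTHON_CONFIG` instance are hypothetical simplifications of the fields listed above, not Graphify's actual definitions.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class LanguageConfig:
    ts_module: str
    class_types: frozenset
    function_types: frozenset
    call_types: frozenset
    import_types: frozenset
    name_field: str = "name"
    body_field: str = "body"
    call_function_field: str = "function"
    call_accessor_node_types: frozenset = frozenset()
    resolve_function_name_fn: Optional[Callable] = None  # C/C++ declarators
    extra_walk_fn: Optional[Callable] = None             # JS arrows, C# namespaces

# Assumed values for Python, matching the inline comments above
PYTHON_CONFIG = LanguageConfig(
    ts_module="tree_sitter_python",
    class_types=frozenset({"class_definition"}),
    function_types=frozenset({"function_definition"}),
    call_types=frozenset({"call"}),
    import_types=frozenset({"import_statement", "import_from_statement"}),
    call_accessor_node_types=frozenset({"attribute"}),
)
```

The generic walker would receive one such config per language, so adding a language means writing a config, not a new extractor.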

Two-Pass Call Graph

Pass 1: Walk AST, collect basic structure (classes, functions, imports).

Pass 2: Walk function bodies to find call sites. Build a "call name" by concatenating the callee expression:

  • helper() → helper
  • self.validate() → self_validate
  • auth.service.login() → auth_service_login

Then match against known function IDs. Matches become INFERRED "calls" edges with confidence based on name uniqueness.
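Pass 2's name construction and matching can be sketched as below. The helper names and the exact confidence rule are illustrative assumptions; the source only says confidence is "based on name uniqueness".

```python
def build_call_name(callee_expr: str) -> str:
    # "auth.service.login()" -> "auth_service_login"
    return callee_expr.rstrip("()").replace(".", "_")

def match_call(call_name: str, known_funcs: dict):
    """known_funcs maps call names to lists of candidate function IDs."""
    candidates = known_funcs.get(call_name, [])
    if not candidates:
        return None  # unresolved call: no edge emitted
    # Assumed rule: a unique name yields full confidence; ambiguity divides it
    return candidates[0], 1.0 / len(candidates)
```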

Node Deduplication (Three Layers)

  1. Within file: seen_ids set tracks emitted node IDs per file
  2. Between files: NetworkX add_node() is idempotent; semantic nodes intentionally overwrite AST nodes
  3. Pre-build merge: Skill deduplicates cached and new semantic extractions
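Layers 1 and 2 can be sketched with a plain dict standing in for the NetworkX graph (dict assignment, like `add_node()`, is idempotent). The function name is an assumption.

```python
def emit_nodes(graph_nodes: dict, file_nodes: list) -> None:
    seen_ids = set()  # layer 1: per-file duplicate tracking
    for node_id, attrs in file_nodes:
        if node_id in seen_ids:
            continue  # skip duplicates within one file
        seen_ids.add(node_id)
        # Layer 2: re-adding across files merges; a later semantic node
        # for the same id deliberately overwrites the AST node
        graph_nodes[node_id] = attrs
```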

Community Detection

Leiden Algorithm

Leiden optimizes modularity — a measure of how densely connected nodes are within communities vs. between them. It guarantees well-connected communities (unlike Louvain, which can produce poorly-connected ones).

Implementation: Via graspologic library. Falls back to NetworkX Louvain with max_level=10, threshold=1e-4 (tuned to prevent hangs on large sparse graphs).

Oversized Community Splitting

if len(community) > 0.25 * len(G.nodes) and len(community) >= 10:
    subgraph = G.subgraph(community)
    sub_communities = leiden(subgraph)  # recursive split

This prevents one giant cluster from dominating. Communities are re-indexed by size (community 0 = largest).

Cohesion Score

cohesion = actual_intra_edges / max_possible_edges
# where max_possible = n * (n - 1) / 2 for n nodes

Range 0.0–1.0. Communities below 0.15 are flagged in suggested questions as splitting candidates.
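The formula above transcribes directly to code (undirected edges, no self-loops):

```python
def cohesion(n_nodes: int, intra_edges: int) -> float:
    # max_possible = n * (n - 1) / 2 for an undirected community of n nodes
    max_possible = n_nodes * (n_nodes - 1) / 2
    return intra_edges / max_possible if max_possible else 0.0
```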

Caching Strategy

SHA256 Content Hashing

Cache key = SHA256 of the file contents (the resolved path is recorded with the entry, not hashed into the key). Keying on content alone means:

  • Identical files in different locations share one cache entry
  • Renaming a file (same content) = cache hit
  • Modifying a file (different content) = cache miss

Cache Flow

  1. File list arrives from detect()
  2. check_semantic_cache() performs a SHA256 lookup
  3. Cache hit → load cached {nodes, edges}; no LLM cost
  4. Cache miss → send to LLM extraction, then save the result to cache
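A minimal sketch of the lookup, assuming one JSON file per content hash; the cache directory name and function signatures are assumptions for illustration.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".graphify-cache")  # assumed location

def cache_key(content: bytes) -> str:
    # Content-addressed: identical bytes hash to the same key
    # regardless of where the file lives
    return hashlib.sha256(content).hexdigest()

def check_semantic_cache(content: bytes):
    """Return cached {nodes, edges} on a hit, else None (miss -> LLM)."""
    entry = CACHE_DIR / f"{cache_key(content)}.json"
    if entry.exists():
        return json.loads(entry.read_text())  # hit: no LLM cost
    return None
```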

Atomic writes: Results write to .tmp file first, then rename. Prevents corruption from interrupted writes.
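The tmp-then-rename pattern looks like this sketch; `os.replace()` is atomic on both POSIX and Windows, so readers never observe a half-written file. The helper name is an assumption.

```python
import os
from pathlib import Path

def atomic_write(path: Path, data: str) -> None:
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(data)       # interrupted here? final file untouched
    os.replace(tmp, path)      # atomic rename over the final name
```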

Security Model

Defense-in-depth approach: all external input is validated before use.

Layer     Function               Threat
URL       validate_url()         SSRF — blocks file://, private IPs, cloud metadata
Download  safe_fetch()           DoS — streaming with 50MB cap
Path      validate_graph_path()  Path traversal — must resolve inside graphify-out/
Label     sanitize_label()       XSS + prompt injection — strip control chars, HTML-escape, 256 char cap

Advanced Security Details

  • Redirect safety: _NoFileRedirectHandler intercepts HTTP redirects and blocks any resolving to file:// (open redirect → SSRF)
  • YAML injection: _yaml_str() escapes quotes and newlines in frontmatter values when ingesting URLs
  • Cypher injection: Node labels sanitized before generating Neo4j CREATE statements
  • IP resolution: Hostnames resolved and checked against private/reserved ranges via the ipaddress module (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16)
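The resolved-IP check can be sketched with the stdlib ipaddress module; real code would resolve the hostname first (e.g. via socket.getaddrinfo) and test every address it returns. The function name is an assumption.

```python
import ipaddress

def is_blocked_ip(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private       # 10/8, 172.16/12, 192.168/16
        or addr.is_loopback   # 127/8
        or addr.is_link_local # 169.254/16 — cloud metadata lives here
    )
```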

Testing Philosophy

All tests are pure unit tests — no network calls, no file system side effects outside tmp_path.

  • 1:1 mapping: Each module has a corresponding test file
  • Fixture-based: Sample files in tests/fixtures/ (Python, JS, JSON extractions)
  • CI matrix: Python 3.10 + 3.12 via GitHub Actions
  • No mocks for core logic: Functions are pure, so fixtures suffice

Test File          Covers
test_detect.py     File classification, .graphifyignore, symlink safety
test_extract.py    AST extraction for all 19 languages
test_build.py      Graph assembly, deduplication, hyperedges
test_cluster.py    Community detection, cohesion, splitting
test_security.py   URL validation, path traversal, sanitization
test_languages.py  Multi-language extraction