Why Token Matching
Rationale for our cross-domain linking strategy.
The Problem
How do you know env:DATABASE_URL in Python relates to output.database_url in Terraform?
There's no explicit link — it's a convention.
The Alternatives
| Approach | Accuracy | Effort | Coverage |
|---|---|---|---|
| Manual annotation | Perfect | High | Limited |
| Semantic analysis | High | Very high | Limited |
| Token matching | Good | Low | Broad |
Why Not Manual Annotation?
# Manual mapping file
links:
- source: env:DATABASE_URL
target: infra:output.database_url
- source: env:REDIS_HOST
target: infra:output.redis_endpoint
Problems:
- Doesn't scale — Hundreds of env vars × resources
- Gets stale — New vars aren't added
- Human error — Typos, missed entries
- Defeats the purpose — If you tracked it manually, you wouldn't need Jnkn
Why Not Semantic Analysis?
Semantic analysis would understand that:
- aws_rds_instance provides a database
- DATABASE_URL is a database connection string
- Therefore, they're related
Problems:
- Requires domain knowledge — What does
aws_elasticacheprovide? - Computationally expensive — Need embeddings, inference
- Brittle — Custom resources, unusual naming
- Overkill — Names usually tell you what you need
Why Token Matching Works
In practice, teams use consistent naming:
Python: DATABASE_URL
Terraform: output.database_url
Kubernetes: secret.database-url
Config file: database_url
Tokens: [database, url] — identical!
Token matching exploits this convention.
How It Works
Step 1: Tokenize
Split names on separators and case boundaries:
DATABASE_URL → [database, url]
databaseUrl → [database, url]
database-url → [database, url]
PAYMENT_DB_HOST → [payment, db, host]
Step 2: Normalize
Lowercase, remove noise:
Step 3: Compare
Calculate overlap:
Step 4: Score
Apply penalties for weak matches:
Source: [db, host]
Target: [db, host]
Overlap: 100%
Penalty: short tokens (db=2 chars)
Final: Medium confidence
When It Works Well
✅ Consistent naming — Teams follow conventions
✅ Descriptive names — PAYMENT_DATABASE_HOST not PDH
✅ Similar patterns — *_URL, *_HOST, *_KEY
When It Struggles
❌ Abbreviations — DB doesn't match database
❌ Generic names — HOST matches everything
❌ Different conventions — camelCase vs SCREAMING_SNAKE
We mitigate with:
- Blocked tokens (id, key, etc.)
- Confidence thresholds
- Suppressions for known false positives
The Tradeoff
Token matching is heuristic, not perfect.
This is acceptable because:
- False positives are visible — Developers review and suppress
- False negatives surface elsewhere — Tests, prod errors
- The alternative is nothing — No other tool catches this
Improving Accuracy
Organization-Level
- Enforce naming conventions
- Document env var → resource mappings
- Run Jnkn in CI to catch drift
Jnkn-Level
- Tune confidence thresholds
- Add suppressions for known issues
- Custom rules for domain-specific patterns