Confidence Levels

Understanding HIGH, MEDIUM, and LOW confidence scores in jnkn.

Current Design

Every cross-domain edge in jnkn's dependency graph carries a confidence score between 0.0 and 1.0. This score reflects how certain jnkn is that the relationship is real, not a false positive. The confidence system is designed around a core principle: every match must be explainable.

Confidence Tiers

Level	Score Range	Meaning	Example
HIGH	0.80 - 1.00	Strong evidence of relationship	Exact/normalized name match
MEDIUM	0.50 - 0.79	Likely related, some uncertainty	3+ significant token overlap
LOW	0.00 - 0.49	Weak evidence, may be false positive	Single token match with penalties

The default minimum threshold is 0.5, meaning only MEDIUM and HIGH confidence matches create edges.
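
The tier boundaries and the default threshold can be written as a small lookup. This is an illustrative sketch only: `confidence_tier` and `creates_edge` are hypothetical helper names, not part of jnkn's documented API, and the threshold comparison is assumed to be inclusive (a score of exactly 0.5 counts as MEDIUM and creates an edge).

```python
def confidence_tier(score: float) -> str:
    """Map a score in [0.0, 1.0] to the tier names used in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.80:
        return "HIGH"
    if score >= 0.50:
        return "MEDIUM"
    return "LOW"


def creates_edge(score: float, min_threshold: float = 0.5) -> bool:
    """Hypothetical helper: keep an edge only at or above the threshold (assumed inclusive)."""
    return score >= min_threshold


print(confidence_tier(0.90), creates_edge(0.90))  # HIGH True
print(confidence_tier(0.63), creates_edge(0.63))  # MEDIUM True
print(confidence_tier(0.09), creates_edge(0.09))  # LOW False
```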

Signal-Based Scoring

Confidence is built from signals, each a piece of evidence that two artifacts are related:

from enum import StrEnum  # requires Python 3.11+

class ConfidenceSignal(StrEnum):
    EXACT_MATCH = "exact_match"           # "db_host" == "db_host"
    NORMALIZED_MATCH = "normalized_match" # "DB_HOST" == "db_host" after normalization
    TOKEN_OVERLAP_HIGH = "token_overlap_high"    # 3+ significant tokens shared
    TOKEN_OVERLAP_MEDIUM = "token_overlap_medium" # 2 significant tokens shared
    SUFFIX_MATCH = "suffix_match"         # target ends with source
    PREFIX_MATCH = "prefix_match"         # target starts with source
    CONTAINS = "contains"                 # target contains source (weak)
    SINGLE_TOKEN = "single_token"         # Only 1 token match (weakest)

Each signal has a configurable weight:

signal_weights = {
    ConfidenceSignal.EXACT_MATCH: 1.0,
    ConfidenceSignal.NORMALIZED_MATCH: 0.9,
    ConfidenceSignal.TOKEN_OVERLAP_HIGH: 0.8,
    ConfidenceSignal.TOKEN_OVERLAP_MEDIUM: 0.6,
    ConfidenceSignal.SUFFIX_MATCH: 0.7,
    ConfidenceSignal.PREFIX_MATCH: 0.7,
    ConfidenceSignal.CONTAINS: 0.4,
    ConfidenceSignal.SINGLE_TOKEN: 0.2,
}

Penalty System

Penalties reduce confidence when matches have concerning characteristics:

class PenaltyType(StrEnum):
    SHORT_TOKEN = "short_token"      # Tokens < 4 chars are less reliable
    COMMON_TOKEN = "common_token"    # Generic tokens like "id", "host", "key"
    AMBIGUITY = "ambiguity"          # Multiple potential matches exist
    LOW_VALUE_TOKEN = "low_value_token"  # Cloud prefixes like "aws", "gcp"

Penalty multipliers are applied multiplicatively:

penalty_multipliers = {
    PenaltyType.SHORT_TOKEN: 0.5,     # Cuts score in half
    PenaltyType.COMMON_TOKEN: 0.7,    # 30% reduction
    PenaltyType.AMBIGUITY: 0.8,       # 20% base reduction; grows with the number of alternatives
    PenaltyType.LOW_VALUE_TOKEN: 0.6, # 40% reduction
}
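
To make "applied multiplicatively" concrete, here is a minimal sketch using plain strings instead of `PenaltyType`. The `apply_penalties` function is a stand-in written for this page, not jnkn's actual method, and it treats ambiguity as a flat 0.8 multiplier rather than scaling it by the number of alternatives.

```python
from math import prod

# Default multipliers copied from the mapping above.
PENALTY_MULTIPLIERS = {
    "short_token": 0.5,
    "common_token": 0.7,
    "ambiguity": 0.8,
    "low_value_token": 0.6,
}


def apply_penalties(base_score: float, penalties: list[str]) -> float:
    """Multiply the base score by the multiplier of every triggered penalty."""
    return base_score * prod(PENALTY_MULTIPLIERS[p] for p in penalties)


# Two penalties compound: 0.60 x 0.70 x 0.50
print(round(apply_penalties(0.60, ["common_token", "short_token"]), 2))  # 0.21
# No penalties leaves the base score unchanged
print(apply_penalties(0.90, []))  # 0.9
```

Because the multipliers compound, two moderate penalties are enough to push a MEDIUM base score into the LOW tier.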

Common and Low-Value Tokens

Certain tokens are flagged as providing weak signal:

Common tokens (very generic, match many things):

common_tokens = {
    "id", "db", "host", "url", "key", "name", "type", "data",
    "info", "temp", "test", "api", "app", "env", "var", "val",
    "config", "setting", "path", "port", "user", "password",
    "secret", "token", "auth", "log", "file", "dir", "src",
    "dst", "in", "out", "err", "msg", "str", "int", "num",
}

Low-value tokens (provide some signal but reduced):

low_value_tokens = {
    "aws", "gcp", "azure", "main", "default", "primary",
    "production", "prod", "staging", "dev", "development",
    "internal", "external", "public", "private", "local",
    "remote", "master", "slave", "read", "write",
}
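
The sketch below shows one way these token sets could drive the penalties. The exact trigger conditions are not specified on this page, so the rule used here (a penalty fires only when every matched token falls into the category) is an assumption, and `triggered_penalties` is a hypothetical helper. The token sets are abbreviated subsets of the lists above.

```python
COMMON_TOKENS = {"id", "db", "host", "url", "key", "name", "api"}      # subset of the list above
LOW_VALUE_TOKENS = {"aws", "gcp", "azure", "prod", "staging", "dev"}   # subset of the list above
MIN_TOKEN_LENGTH = 4  # tokens shorter than this count as "short"


def triggered_penalties(matched_tokens: list[str]) -> list[str]:
    """Return penalty names for a match.

    Assumption: a penalty fires only when *every* matched token
    belongs to the corresponding category.
    """
    if not matched_tokens:
        return []
    tokens = [t.lower() for t in matched_tokens]
    penalties = []
    if all(len(t) < MIN_TOKEN_LENGTH for t in tokens):
        penalties.append("short_token")
    if all(t in COMMON_TOKENS for t in tokens):
        penalties.append("common_token")
    if all(t in LOW_VALUE_TOKENS for t in tokens):
        penalties.append("low_value_token")
    return penalties


print(triggered_penalties(["key"]))                   # ['short_token', 'common_token']
print(triggered_penalties(["stripe", "api", "key"]))  # []
print(triggered_penalties(["aws"]))                   # ['short_token', 'low_value_token']
```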

Score Calculation

The ConfidenceCalculator combines signals and penalties:

from typing import List, Optional

class ConfidenceCalculator:
    def calculate(
        self,
        source_name: str,
        target_name: str,
        source_tokens: List[str],
        target_tokens: List[str],
        matched_tokens: Optional[List[str]] = None,
        alternative_match_count: int = 0,
    ) -> ConfidenceResult:
        # 1. Evaluate all signals
        signal_results = self._evaluate_signals(
            source_name, target_name,
            source_tokens, target_tokens,
            matched_tokens
        )

        # 2. Evaluate penalties
        penalty_results = self._evaluate_penalties(
            matched_tokens, alternative_match_count
        )

        # 3. Calculate base score (max signal, not sum)
        base_score = self._calculate_base_score(signal_results)

        # 4. Apply penalties multiplicatively
        final_score = self._apply_penalties(base_score, penalty_results)

        return ConfidenceResult(score=final_score, ...)

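Important: The base score starts from the maximum matched signal weight rather than the sum of all weights, which prevents multiple weak signals from inflating scores. Each additional matched signal adds only a 0.02 bonus, capped at 0.1 in total: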

def _calculate_base_score(self, signal_results: List[SignalResult]) -> float:
    matched_weights = [s.weight for s in signal_results if s.matched]
    if not matched_weights:
        return 0.0

    # Use max weight, with small bonus for additional signals
    max_weight = max(matched_weights)
    bonus = min(0.1, (len(matched_weights) - 1) * 0.02)

    return min(1.0, max_weight + bonus)
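
A worked example may help show the effect of max-plus-bonus compared with a plain sum. The function below restates the same arithmetic on bare floats so it can run on its own; `base_score` is a simplified stand-in, not the method itself.

```python
def base_score(matched_weights: list[float]) -> float:
    """Max matched weight plus 0.02 per extra signal (bonus capped at 0.1)."""
    if not matched_weights:
        return 0.0
    bonus = min(0.1, (len(matched_weights) - 1) * 0.02)
    return min(1.0, max(matched_weights) + bonus)


print(round(base_score([0.9]), 2))            # 0.9
print(round(base_score([0.9, 0.8]), 2))       # 0.92
print(round(base_score([0.2] * 8), 2))        # 0.3  (bonus capped at 0.1)
print(round(base_score([1.0, 0.9, 0.8]), 2))  # 1.0  (clamped to 1.0)
print(round(sum([0.2] * 8), 2))               # 1.6  (what a naive sum would give)
```

Eight weak 0.2 signals top out at 0.30, well below the default threshold, whereas summing them would have exceeded 1.0.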

Explainability

Every confidence result includes a human-readable explanation:

result = calculator.calculate(
    source_name="PAYMENT_DB_HOST",
    target_name="payment_db_host",
    source_tokens=["payment", "db", "host"],
    target_tokens=["payment", "db", "host"]
)

print(calculator.explain(result))

Output:

Match: PAYMENT_DB_HOST → payment_db_host
Confidence: 0.90

Signals:
  ✓ normalized_match (0.90)
    → 'paymentdbhost' == 'paymentdbhost'

Penalties: None

Score Breakdown:
  Base: 0.90
  Final: 0.90

Ambiguity Penalty Example

When multiple targets could match a source, confidence is reduced:

# Source: DB_HOST
# Potential targets: payment_db_host, orders_db_host, users_db_host

# Each match gets penalized for ambiguity
# penalty = 0.8 ** (1 + (alternative_count - 2) * 0.2)

# With 3 alternatives:
# penalty = 0.8 ** (1 + 0.2) = 0.8 ** 1.2 ≈ 0.77
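
The formula can be evaluated directly to see how the penalty grows with the number of candidates. This is a sketch under one assumption not stated above: with fewer than two candidates there is no ambiguity, so the multiplier is 1.0. The name `ambiguity_penalty` is illustrative.

```python
def ambiguity_penalty(alternative_count: int, base: float = 0.8, step: float = 0.2) -> float:
    """penalty = base ** (1 + (alternative_count - 2) * step).

    Assumption: fewer than two candidates means no ambiguity (multiplier 1.0).
    """
    if alternative_count < 2:
        return 1.0
    return base ** (1 + (alternative_count - 2) * step)


for n in (2, 3, 5):
    print(n, round(ambiguity_penalty(n), 2))
# 2 0.8
# 3 0.77
# 5 0.7
```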

Real-World Examples

HIGH Confidence (0.92)

Source: env:STRIPE_API_KEY
Target: infra:stripe_api_key

Signals:
  ✓ normalized_match (0.90)
    → 'stripeapikey' == 'stripeapikey'
  ✓ token_overlap_high (0.80)
    → 3 significant tokens: ['stripe', 'api', 'key']

Penalties: None

Score Breakdown:
  Base: 0.90 (max signal) + 0.02 (one additional signal) = 0.92
  Final: 0.92

Final Score: 0.92 (HIGH)

MEDIUM Confidence (0.60)

Source: env:ORDERS_QUEUE_TIMEOUT
Target: infra:orders_queue_arn

Signals:
  ✓ token_overlap_medium (0.60)
    → 2 significant tokens: ['orders', 'queue']

Penalties: None

Score Breakdown:
  Base: 0.60
  Final: 0.60

Final Score: 0.60 (MEDIUM)

LOW Confidence (0.05)

Source: env:API_KEY
Target: infra:payment_service_key

Signals:
  ✓ single_token (0.20)
    → Single token match: ['key']

Penalties:
  - common_token (×0.70)
    → All matched tokens are common: ['key']
  - short_token (×0.50)
    → Short tokens (< 4 chars): ['key']
  - ambiguity (×0.70)
    → Source has 5 potential matches

Score Breakdown:
  Base: 0.20
  After penalties: 0.20 × 0.70 × 0.50 × 0.70 ≈ 0.05

Final Score: 0.05 (LOW - filtered out)

Configuration

The minimum threshold, signal weights, penalty multipliers, and common-token list are configurable in .jnkn/config.yaml:

confidence:
  min_threshold: 0.5  # Only create edges above this score

  # Override signal weights
  signal_weights:
    exact_match: 1.0
    normalized_match: 0.9
    token_overlap_high: 0.8
    token_overlap_medium: 0.6

  # Override penalty multipliers
  penalty_multipliers:
    short_token: 0.5
    common_token: 0.7
    ambiguity: 0.8

  # Customize common tokens for your project
  common_tokens:
    - id
    - key
    - your_custom_prefix
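
How partial overrides combine with the defaults is not spelled out here. The sketch below assumes a per-key overlay, where any weight or multiplier not mentioned in the config file keeps its default, and it works on an already-parsed dictionary rather than reading the YAML file. `merge_confidence_config` and `DEFAULTS` are hypothetical names.

```python
from copy import deepcopy

# Defaults taken from the values shown on this page.
DEFAULTS = {
    "min_threshold": 0.5,
    "signal_weights": {"exact_match": 1.0, "normalized_match": 0.9,
                       "token_overlap_high": 0.8, "token_overlap_medium": 0.6},
    "penalty_multipliers": {"short_token": 0.5, "common_token": 0.7, "ambiguity": 0.8},
}


def merge_confidence_config(overrides: dict) -> dict:
    """Overlay user overrides on the defaults and validate numeric ranges."""
    merged = deepcopy(DEFAULTS)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key].update(value)  # per-key override; other defaults are kept
        else:
            merged[key] = value
    numbers = [merged["min_threshold"],
               *merged["signal_weights"].values(),
               *merged["penalty_multipliers"].values()]
    if not all(0.0 <= n <= 1.0 for n in numbers):
        raise ValueError("confidence settings must lie between 0.0 and 1.0")
    return merged


config = merge_confidence_config({"min_threshold": 0.6,
                                  "penalty_multipliers": {"common_token": 0.5}})
print(config["min_threshold"])                        # 0.6
print(config["penalty_multipliers"]["common_token"])  # 0.5 (overridden)
print(config["penalty_multipliers"]["short_token"])   # 0.5 (default kept)
```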

Future Ideas

Short-term: Confidence Explanations in Output

Include confidence breakdowns in CLI and JSON output:

{
  "edge": {
    "source": "infra:payment_db_host",
    "target": "env:PAYMENT_DB_HOST",
    "confidence": 0.90,
    "confidence_breakdown": {
      "base_score": 0.90,
      "signals": [
        {"type": "normalized_match", "weight": 0.90}
      ],
      "penalties": []
    }
  }
}

Short-term: Project-Specific Common Tokens

Auto-detect common tokens from the codebase:

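from collections import Counter
from typing import Set

def detect_common_tokens(graph: DependencyGraph, threshold: float = 0.2) -> Set[str]:
    """Return tokens that appear in more than `threshold` of all nodes (default 20%)."""
    total_nodes = graph.node_count
    if total_nodes == 0:
        return set()  # empty graph: nothing to detect, and avoids dividing by zero

    token_counts = Counter()
    for node in graph.iter_nodes():
        # Count each token once per node so repeats within one name don't skew the ratio
        token_counts.update(set(node.tokens))

    return {t for t, c in token_counts.items() if c / total_nodes > threshold}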

Medium-term: Confidence Calibration

Track false positive rates to adjust weights:

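from typing import Dict, List

class ConfidenceCalibrator:
    def __init__(self) -> None:
        # Feedback entries accumulated by record_feedback()
        self.feedback_log: List[dict] = []

    def record_feedback(self, edge: Edge, is_correct: bool) -> None:
        """Record user feedback on match quality."""
        self.feedback_log.append({
            "edge": edge,
            "predicted_confidence": edge.confidence,
            "actual_correct": is_correct
        })

    def calibrate(self) -> Dict[str, float]:
        """Adjust weights based on actual accuracy."""
        # If normalized_match has 95% accuracy but weight is 0.9,
        # maybe increase to 0.95
        raise NotImplementedError  # calibration strategy not designed yet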
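
One possible first step toward calibration is simply to compare predicted tiers with how often users confirm them. The sketch below is an assumption about how that could look, not a planned design: `observed_accuracy` is a hypothetical function that takes plain `(predicted_confidence, is_correct)` pairs instead of the feedback log entries.

```python
from collections import defaultdict


def observed_accuracy(feedback: list[tuple[float, bool]]) -> dict[str, float]:
    """Group (predicted_confidence, is_correct) pairs by tier and return
    the fraction of matches in each tier that users confirmed."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for confidence, is_correct in feedback:
        tier = "HIGH" if confidence >= 0.8 else "MEDIUM" if confidence >= 0.5 else "LOW"
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: round(correct[tier] / totals[tier], 2) for tier in totals}


feedback = [(0.90, True), (0.92, True), (0.85, False), (0.60, True), (0.55, False)]
print(observed_accuracy(feedback))  # {'HIGH': 0.67, 'MEDIUM': 0.5}
```

If the HIGH tier were confirmed far less often than its scores suggest, that would be a hint that some signal weight is set too high.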

Medium-term: Multi-Factor Scoring

Add additional signals beyond name matching:

Co-location: Files in same directory get bonus
Co-change: Files that change together in commits
Documentation: README or comments mentioning relationship
Import patterns: Transitive dependency chains

Long-term: ML-Based Confidence

Train a classifier on confirmed matches:

class MLConfidenceScorer:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)

    def predict(self, source: Node, target: Node) -> float:
        features = self._extract_features(source, target)
        # Features: token overlap, edit distance, file proximity,
        # node types, language pair, etc.
        return self.model.predict_proba(features)[0][1]

Long-term: Confidence Decay

Reduce confidence for stale matches:

def calculate_with_decay(self, edge: Edge, last_confirmed: datetime) -> float:
    """Reduce confidence for old, unconfirmed matches."""
    days_since_confirmation = (datetime.now() - last_confirmed).days
    decay_factor = 0.99 ** (days_since_confirmation / 30)  # 1% decay per month
    return edge.confidence * decay_factor
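
To get a feel for how quickly a 1% monthly decay bites, the factor can be tabulated for a few ages. This sketch restates the decay arithmetic on plain numbers; `decay_factor` is an illustrative name, and clamping negative ages to zero is an added assumption.

```python
def decay_factor(days_since_confirmation: int, monthly_rate: float = 0.99) -> float:
    """Compounded decay: multiply by `monthly_rate` once per 30 days elapsed."""
    # Assumption: a negative age (clock skew) is treated as zero days.
    return monthly_rate ** (max(days_since_confirmation, 0) / 30)


for days in (0, 30, 365, 1095):
    print(days, round(0.90 * decay_factor(days), 2))
# 0 0.9
# 30 0.89
# 365 0.8
# 1095 0.62
```

Under these numbers an edge scored 0.90 drifts to the HIGH/MEDIUM boundary after roughly a year without confirmation, and stays above the default 0.5 threshold even after three years.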