Static Analysis Challenges: A Better Way to Track Taint Sources

Modern apps constantly ingest untrusted data—from form fields, URLs, and cookies to API payloads and mobile sensor inputs—and risk often emerges only after that data winds through code to a sensitive operation. Taint analysis is meant to track that journey and flag the danger. But when models label too early and too broadly, teams end up with both noise and missed risks.

Let’s walk through how OpenRewrite’s advanced taint analysis—combining multi-type labeling with usage awareness—cuts noise and surfaces real risk, with at-scale execution and remediation by Moderne.

If it looks like a duck…it may not be a duck

In the world of static taint analysis, we’ve all accepted a fundamental assumption: when you see a piece of data, you can immediately determine what kind of security risk it represents. A string literal in a SQL query position? SQL injection source. User input flowing to HTML output? XSS source. It’s clean, it’s simple, and it works...until it doesn’t.

We recently hit this wall while building cryptography analysis for OpenRewrite. Here’s the code that broke our assumptions:

byte[] data = new byte[16];
SecretKeySpec key = new SecretKeySpec(data, "AES");  // It's key material!
// ... or is it?
IvParameterSpec iv = new IvParameterSpec(data);      // It's an IV!

The same 16-byte array could be cryptographic key material OR an initialization vector. Both are security-critical, both need tracking, but they have different security implications. And we can’t tell which one it is until we see how it’s used.

How the industry has always done static analysis

Most major taint analysis frameworks settled on a simple yes-or-no answer for source detection. Here are some examples:

FlowDroid (Android Security)

public boolean isSource(Stmt stmt, AccessPath ap) {
    // Returns true if this is a source, false otherwise
    // No way to express "this might be multiple types"
}

WALA (IBM Research)

public boolean isSource(CGNode node, int valueNumber) {
    // Same story - boolean return
}

Facebook's Mariana Trench

{
  "sources": [
    {
      "port": "Return",
      "method": "Lcom/example/Class;.getPassword()Ljava/lang/String;"
    }
  ]
}
// Sources are predefined by method signature - no ambiguity

This boolean approach has worked for decades because traditional taint sources are unambiguous:

request.getParameter() → Always user input
System.getenv() → Always environment data
new FileInputStream() → Always file system access

But once a source's meaning depends on how the data is used, a boolean model forces a choice: tag broadly and bury teams in noise, or tag narrowly and miss real problems.
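To make that trade-off concrete, here is a minimal sketch of a boolean source check. The class `BooleanSourceModel`, its `strict()` and `broad()` factories, and the string-based matching are hypothetical simplifications for illustration, not the API of any framework named above.

```java
import java.util.Set;

// Hypothetical, simplified boolean source model: an expression is a source
// if and only if it matches a known entry. Not a real framework API.
final class BooleanSourceModel {
    private final Set<String> knownSources;

    BooleanSourceModel(Set<String> knownSources) {
        this.knownSources = knownSources;
    }

    // A single yes/no answer; there is no way to say "key or IV, decide later".
    boolean isSource(String expression) {
        return knownSources.contains(expression);
    }

    // Strict model: only the classic, unambiguous sources.
    static BooleanSourceModel strict() {
        return new BooleanSourceModel(Set.of(
                "request.getParameter()",
                "System.getenv()",
                "new FileInputStream()"));
    }

    // Broad model: additionally labels every byte-array allocation.
    static BooleanSourceModel broad() {
        return new BooleanSourceModel(Set.of(
                "request.getParameter()",
                "System.getenv()",
                "new FileInputStream()",
                "new byte[]"));
    }
}
```

With this sketch, `BooleanSourceModel.strict().isSource("new byte[]")` is false, so key material built from a fresh array goes untracked, while `BooleanSourceModel.broad().isSource("new byte[]")` is true for every buffer in the codebase, including those never used for cryptography.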

Why cryptography breaks traditional taint analysis models

In modern code, the same-looking data can play very different roles depending on how it’s used. Cryptography makes this painfully obvious. The same byte array could be:

A secret key (must never be logged or exposed)
An initialization vector (should be random but can be public)
A salt (should be random, stored with the hash)
A nonce (must be unique, sometimes public)
Random padding (security-critical for timing attacks)

At the moment the data is created, you can’t tell which of these it will be—only the use site reveals its true role. The security implications vary dramatically, but the data looks identical at creation time.
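One way to picture "the use site reveals the role" is a lookup from the consuming class to the role it implies. The sketch below is an assumption-laden illustration: `UseSiteRoles`, its `Role` enum, and `roleAt` are hypothetical names, and the mapping covers only three standard `javax.crypto.spec` classes whose constructors take a byte array.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: infer the role of a byte array from the class whose
// constructor consumes it. Only a few standard JCA classes are listed.
final class UseSiteRoles {
    enum Role { KEY_MATERIAL, INITIALIZATION_VECTOR, SALT }

    private static final Map<String, Role> ROLE_BY_CONSUMER = Map.of(
            "javax.crypto.spec.SecretKeySpec", Role.KEY_MATERIAL,
            "javax.crypto.spec.IvParameterSpec", Role.INITIALIZATION_VECTOR,
            "javax.crypto.spec.PBEKeySpec", Role.SALT);

    // Returns the implied role, or empty when the consumer says nothing
    // about cryptographic usage.
    static Optional<Role> roleAt(String consumerClassName) {
        return Optional.ofNullable(ROLE_BY_CONSUMER.get(consumerClassName));
    }
}
```

The allocation `new byte[16]` never appears in this table. Only the consumer does, which is why a decision made at creation time has nothing to go on.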

Moderne solution: Multi-type taint sources

At Moderne, we’ve implemented a fundamental change to how taint analysis works. Instead of asking "is this a source," we let the code speak for itself by asking "what could this be?" We record every plausible source type up front and resolve the real one when usage makes it clear.

// Old way
boolean isSource(Cursor cursor);

// New way
Set<PotentialTaint> detectPotentialSources(Cursor cursor);

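public class PotentialTaint {
    private final String type;            // candidate source type, e.g. key or IV
    private final double confidence;      // how plausible this candidate is
    private final TypeResolver resolver;  // decides the final type from usage

    public PotentialTaint(String type, double confidence, TypeResolver resolver) {
        this.type = type;
        this.confidence = confidence;
        this.resolver = resolver;
    }

    // Resolver determines final type based on usage
    interface TypeResolver {
        Optional<String> resolveType(TaintPath path);
    }
}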

This allows us to:

Track multiple potential interpretations
Defer type resolution until we have usage context
Handle ambiguous cases gracefully
Maintain confidence scores for probabilistic analysis

By embracing overlap and deferring decisions to the point of use, you get fewer noisy alerts, more accurate risk signals, and a direct path to automated remediation.
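The flow described above can be sketched end to end in a few lines. This is a simplified illustration, not Moderne's implementation: `DeferredTaintDemo`, `Candidate`, the label strings, the 0.5 confidence values, and the representation of a taint path as a list of consumer names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of deferred resolution: record every plausible label
// at the source, then keep only those confirmed by the observed usage.
final class DeferredTaintDemo {
    static final class Candidate {
        final String type;               // candidate label
        final double confidence;         // illustrative prior, not a measured value
        final String confirmingConsumer; // use site that confirms this label

        Candidate(String type, double confidence, String confirmingConsumer) {
            this.type = type;
            this.confidence = confidence;
            this.confirmingConsumer = confirmingConsumer;
        }

        // The label is confirmed only if the data reaches its consumer.
        Optional<String> resolve(List<String> usagePath) {
            return usagePath.contains(confirmingConsumer)
                    ? Optional.of(type)
                    : Optional.empty();
        }
    }

    // For a fresh byte array, list what it could become; decide nothing yet.
    static List<Candidate> detectPotentialSources() {
        return List.of(
                new Candidate("CRYPTO_KEY", 0.5, "SecretKeySpec"),
                new Candidate("CRYPTO_IV", 0.5, "IvParameterSpec"));
    }

    // Resolve candidates against the consumers seen along the flow.
    static List<String> resolve(List<Candidate> candidates, List<String> usagePath) {
        List<String> resolved = new ArrayList<>();
        for (Candidate c : candidates) {
            c.resolve(usagePath).ifPresent(resolved::add);
        }
        return resolved;
    }
}
```

In this sketch, a path that reaches only `SecretKeySpec` resolves to `CRYPTO_KEY`, a path that reaches both consumers keeps both labels, and a path that reaches neither resolves to nothing, so the array produces no alert.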

Why improving taint analysis matters

This isn’t just about cryptography. As static analysis expands into more domains, we’re seeing more cases where source type depends on usage:

Machine learning: Is that tensor user data or model weights?
Cloud security: Is that string an AWS credential or a resource identifier?
Privacy analysis: Is that string PII or public data?
Supply chain: Is that dependency trusted or potentially malicious?

The boolean source model is showing its age. It served us well while sources were unambiguous, but modern security analysis needs a model that can hold several interpretations of the same data and settle on one at the use site. This is more than a bug fix; it is a rethinking of how taint analysis should work in modern codebases.

Learn more about Moderne’s advanced program analysis capabilities in the documentation, and request a demo to see it in action.

