Static Program Analysis For Security
Cambridge IB Tech Talks
Zayne Zhang
zz513@cam.ac.uk
March 12, 2025
1 / 34
Overview
1. Introduction to Static Program Analysis
2. Lattices and Fixed Points
3. Data Flow Analysis
4. CodeQL
2 / 34
How to find bugs in code?
Dynamic Analysis (e.g. Unit Testing, Fuzzing)
Done on a limited number of different inputs
Often reveals the presence of errors but cannot guarantee their absence
100% test coverage != bug-free code
Many tests are regressive and only added after a bug is found
Static Analysis: analyze code without executing it
Can check all possible executions and provide guarantees about its behavior
With the right tools, can catch bugs early in the development process
Particularly useful for testing the absence of security vulnerabilities
Introduction to Static Program Analysis 3 / 34
How to find bugs in code?
Introduction to Static Program Analysis 4 / 34
Data Flow Analysis
A particular form of static analysis that examines how data moves through a program
to answer questions such as:
What values can reach this point in the code?
Is this variable always initialized before it is used?
Does untrusted data ever reach an unsafe function?
Introduction to Static Program Analysis 5 / 34
Partial Orders
Definition
A partial order (S, ) is a set S equipped with a binary relation that is:
Reflexive: x S, x x
Transitive: x, y, z S, x y y z x z
Antisymmetric: x, y S, x y y x x = y
y S is an upper bound for X (X y) if x X , x y
y S is the least upper bound for X (X
F
y) if y is an upper bound for X and
z S, X z y z
y S is a lower bound for X (y X ) if x X , y x
y S is the greatest lower bound for X (y
F
X ) if y is a lower bound for X and
z S, z X z y
Lattices and Fixed Points 6 / 34
Lattices
Definition
A lattice (L, ) is a partial order (L, ) in which every pair of elements x, y L has a
least upper bound x y (join) and a greatest lower bound x y (meet).
A complete lattice is a lattice in which every subset has a least upper bound and a
greatest lower bound.
Lattices and Fixed Points 7 / 34
Sign Analysis
As an example, we want to find the possible signs of integer variables and expressions.
Consider the following abstract values for the sign of an integer:
: unknown sign
+: positive
: negative
0: zero
: not an integer, or unreachable
code
+
0
This partial order, with edges for ,
forms a complete lattice. e.g. +
means + is at least as precise as
Lattices and Fixed Points 8 / 34
Sign Analysis
Let’s create a map lattice State = Var Sign that describes the sign of each variable.
Derive a system of equations, one per line, using values from the lattice.
var a, b; // 1
a = 42; // 2
b = a + input(); // 3
a = a - b; // 4
x
1
= [a 7→ , b 7→ ]
x
2
= x
1
[a 7→ +]
x
3
= x
2
[b 7→ x
2
(a) + ]
x
4
= x
3
[a 7→ x
3
(a) x
3
(b)]
+
0
Sign
Lattices and Fixed Points 9 / 34
Sign Analysis
x
1
= [a 7→ , b 7→ ]
x
2
= x
1
[a 7→ +]
x
3
= x
2
[b 7→ x
2
(a) + ]
x
4
= x
3
[a 7→ x
3
(a) x
3
(b)]
f
1
(x
1
, . . . , x
n
) = [a 7→ , b 7→ ]
f
2
(x
1
, . . . , x
n
) = x
1
[a 7→ +]
f
3
(x
1
, . . . , x
n
) = x
2
[b 7→ x
2
(a) + ]
f
4
(x
1
, . . . , x
n
) = x
3
[a 7→ x
3
(a) x
3
(b)]
Generalised equation system over a lattice L, with functions f
i
: L
n
L:
x
1
= f
1
(x
1
, . . . , x
n
)
x
2
= f
2
(x
1
, . . . , x
n
)
.
.
.
x
n
= f
n
(x
1
, . . . , x
n
)
Lattices and Fixed Points 10 / 34
Monotonicity and Fixed Points
Generalised equation system over a lattice L, with functions f
i
: L
n
L:
x
1
= f
1
(x
1
, . . . , x
n
)
x
2
= f
2
(x
1
, . . . , x
n
)
.
.
.
x
n
= f
n
(x
1
, . . . , x
n
)
Combine the n functions into F : L
n
L
n
:
F (x
1
, . . . , x
n
) = (f
1
(x
1
, . . . , x
n
), . . . , f
n
(x
1
, . . . , x
n
))
= (x
1
, . . . , x
n
)
Then we are looking for x = F (x), i.e. a fixed point of F .
Lattices and Fixed Points 11 / 34
Monotonicity and Fixed Points
Definition
A function f : L
1
L
2
is monotone if x, y L
1
, x y f (x) f (y)
More precise input leads to more precise output
Theorem
Kleene’s Fixed Point Theorem: In a complete lattice L with finite height, every
monotone function f : L L has a unique least fixed point
F
i=0
f
i
()
These results generalise to functions that take multiple arguments f : L
n
L that are
monotone in each argument such as the ones we derived for sign analysis.
Corollary
For an equation system over complete lattices of finite height with monotone constraint
functions, a unique, most precise solution always exists
Lattices and Fixed Points 12 / 34
Computing the Least Fixed Point
Algorithm 1 Naive Fixed Point Algorithm
1: procedure NaiveFixedPoint(F )
2: x :=
3: while x = F (x) do
4: x := F (x)
5: end while
6: return x
7: end procedure
In each iteraction, all of f
1
, . . . , f
4
are applied. But f
2
depends only on x
1
, and the value
of x
1
is unchanged in most iterations. We’ll see a more efficient way later.
Lattices and Fixed Points 13 / 34
Data Flow Analysis
Main idea: we want to find the possible values of variables at each point in the
program.
In compilers: used for optimisations (e.g. constant propagation)
In security: used to find vulnerabilities (e.g. untrusted data reaching a sink)
Data Flow Analysis 14 / 34
Control Flow Graph
In our previous example, we had a
sequence of statements with no
branches
In general, we have a control flow
graph (CFG) with basic blocks
and edges
Data Flow Analysis 15 / 34
Abstract States
Recall: each element of the lattice
State = Var Sign is an abstract
state that maps variables to signs
For each CFG node v , let the
constraint variable [[v]] be the
abstract state at the program point
immediately after v
We have a lattice State
n
of
abstract states, where n is the
number of CFG nodes
Data Flow Analysis 16 / 34
Constraint Rules
We need to combine the abstract states
of the predecessors of a node to get the
abstract state of the node itself.
JOIN(v) =
G
upred(v)
[[u]]
JOIN([[a = c + 2]]) = [[c = b]] [[c = 5]]
Data Flow Analysis 17 / 34
Constraint Rules
[[c = b]] = [b 7→ +, c 7→ +]
[[c = 5]] = [b 7→ , c 7→ ]
JOIN([[a = c + 2]]) = [[c = b]] [[c = 5]]
= [b 7→ , c 7→ ]
[[a = c + 2]] = JOIN([[a = c + 2]])[a 7→ eval(JOIN([[a = c + 2]]), c + 2)]
= [a 7→ , b 7→ , c 7→ ]
Data Flow Analysis 18 / 34
Solving Data Flow Equations
Generalised equation system over a lattice L, with functions f
i
: L
n
L:
[[v
1
]] = f
v
1
([[v
1
]], . . . , [[v
n
]])
[[v
2
]] = f
v
2
([[v
1
]], . . . , [[v
n
]])
.
.
.
[[v
n
]] = f
v
n
([[v
1
]], . . . , [[v
n
]])
Combine the n functions into F : L
n
L
n
:
F ([[v
1
]], . . . , [[v
n
]]) = (f
v
1
([[v
1
]], . . . , [[v
n
]]), . . . , f
v
n
([[v
1
]], . . . , [[v
n
]))
= ([[v
1
]], . . . , [[v
n
]])
Then we are looking for x = F (x), i.e. a fixed point of F .
Data Flow Analysis 19 / 34
A More Efficient Algorithm
Algorithm 2 Simple Worklist Algorithm
1: procedure SimpleWorkList(F )
2: (x
1
, . . . , x
n
) := (, . . . , )
3: W := {v
1
, . . . , v
n
}
4: while W = do
5: v
i
:= W .pop()
6: y := f
v
i
(x
1
, . . . , x
n
)
7: if y = x
i
then
8: x
i
:= y
9: for v
j
dep(v
i
) do
10: W .add(v
j
)
11: end for
12: end if
13: end while
14: return (x
1
, . . . , x
n
)
15: end procedure
Insight: most f
v
i
will only read the
values from a few other variables,
instead of all [[v
1
]], . . . , [[v
n
]].
dep(v
i
) is the set of nodes that depend
on v
i
(i.e. the successors of v
i
)
Data Flow Analysis 20 / 34
Taint Tracking
Can we tell which computations may involve “tainted” data?
i.e. data that comes from an untrusted source
e.g. user input, HTTP responses, environment variables
Data Flow Analysis 21 / 34
Taint Tracking
The approach is similar! We just define different abstract values and equations.
Abstract taint values:
: Unknown taint status.
T : Tainted.
U: Untainted.
: Unreachable.
Ordering: U T
Abstract state: A mapping σ : Var {⊥, U, T , ⊤}.
Transfer functions: e.g. for an assignment x := y op z, define
f (σ) = σ[x 7→ σ(y) σ(z)]
i.e. if either operand is tainted, the result is tainted
Data Flow Analysis 22 / 34
In Practice: CodeQL
A tool developed by Semmle (a spin-out company from Oxf*rd), now acquired by GitHub.
Used with CLI or GitHub integration (free for all public repos!)
The source code is compiled into a relational database, which includes information
about the control flow graph, data flow, and other properties of the code.
The user writes queries in a high-level language called QL, which is executed by the
CodeQL engine.
The engine uses fixed-point algorithms to perform data flow analysis.
Results are exported into the SARIF format which can be consumed by CI tools or
custom integrations.
CodeQL 23 / 34
Let’s Go On A Little Adventure
Next.js, the most popular React framework, has some weird, poorly documented
URL parsing semantics that does not conform to the widely accepted WHATWG
URL standard
This is unexpected behaviour, and often results in wrong URL validation
Made a responsible disclosure 1 year ago, still not fixed
Let’s query open-source GitHub projects to find instances of this bug!
Common design pattern: unauthenticated user visits /admin, gets redirected to
/login?next=/admin, logs in, and gets redirected back to /admin
Use Next.js URL parsing trickery to turn a “normal” URL into
javascript:sendToAttacker(authToken) at the final step
CodeQL 24 / 34
Taint Tracking in CodeQL
We want to find all instances where untrusted user input (source) reaches a sensitive
function (sink) without being sanitized.
CodeQL 25 / 34
Defining Sources
You can also extend this with custom logic, to incorporate codebase-specific patterns
e.g. RPC calls, deserialization, etc.
CodeQL 26 / 34
Defining Sinks
isSink(node) call, receiver . call invokes the .push method on receiver
invocation . invocation is useRouter() invocation
receiver
node = args(call)[0]
CodeQL 27 / 34
Changing the Transfer Function
CodeQL 28 / 34
Changing the Transfer Function
Any results from DOMPurify.sanitize are treated as untainted. Know your assumptions!
CodeQL 29 / 34
Let’s Go Hunting
CodeQL 30 / 34
Let’s Go Hunting
CodeQL 31 / 34
Limitations
CodeQL is very useful as a CI integration to catch security issues early in the development
process, and provide guarantees about your code. But it’s really hard to get right . . .
We need to create custom taint specifications for third-party library APIs.
False positives: even if tainted data reaches a sink, it may not always be exploitable
some other conditions may need to be met
Requires a good understanding of the codebase and the problem domain, and lots of
fine-tuning to get good results only as good as the queries you write
CodeQL 32 / 34
Alternative Approaches
Symbolic execution: represent the program inputs symbolically and explore all
possible paths through the program, generating constraints on the inputs such that a
certain path is taken
Ziyang Li, Saikat Dutta, and Mayur Naik. LLM-assisted static analysis for detecting
security vulnerabilities, 2024
IRIS leverages LLMs to infer taint specifications and perform contextual analysis,
alleviating needs for human specifications and inspection . . .
A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnera-
bilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL’s
average false discovery rate by 5% points.
CodeQL 33 / 34
References
[1] Anders Møller and Michael I. Schwartzbach. Static Program Analysis. November
2020.
[2] Oege de Moor, Mathieu Verbaere, Elnar Hajiyev, Pavel Avgustinov, Torbjorn Ekman,
Neil Ongkingco, Damien Sereni, and Julian Tibble. Keynote address: .ql for source
code analysis. In Seventh IEEE International Working Conference on Source Code
Analysis and Manipulation (SCAM 2007), pages 3–16, 2007.
[3] Ziyang Li, Saikat Dutta, and Mayur Naik. LLM-assisted static analysis for detecting
security vulnerabilities, 2024.
CodeQL 34 / 34