Static Program Analysis For Security

Cambridge IB Tech Talks

Zayne Zhang

zz513@cam.ac.uk

March 12, 2025

1 / 34

Overview

1. Introduction to Static Program Analysis

2. Lattices and Fixed Points

3. Data Flow Analysis

4. CodeQL

2 / 34

How to ﬁnd bugs in code?

Dynamic Analysis (e.g. Unit Testing, Fuzzing)

•

Done on a limited number of diﬀerent inputs

•

Often reveals the presence of errors but cannot guarantee their absence

•

100% test coverage != bug-free code

•

Many tests are regressive and only added after a bug is found

Static Analysis: analyze code without executing it

•

Can check all possible executions and provide guarantees about its behavior

•

With the right tools, can catch bugs early in the development process

•

Particularly useful for testing the absence of security vulnerabilities

Introduction to Static Program Analysis 3 / 34

How to ﬁnd bugs in code?

Introduction to Static Program Analysis 4 / 34

Data Flow Analysis

A particular form of static analysis that examines how data moves through a program

to answer questions such as:

•

What values can reach this point in the code?

•

Is this variable always initialized before it is used?

•

Does untrusted data ever reach an unsafe function?

Introduction to Static Program Analysis 5 / 34

Partial Orders

Deﬁnition

A partial order (S, ⊑) is a set S equipped with a binary relation ⊑ that is:

•

Reﬂexive: ∀x ∈ S, x ⊑ x

•

Transitive: ∀x, y, z ∈ S, x ⊑ y ∧ y ⊑ z ⇒ x ⊑ z

•

Antisymmetric: ∀x, y ∈ S, x ⊑ y ∧ y ⊑ x ⇒ x = y

•

y ∈ S is an upper bound for X (X ⊑ y) if ∀x ∈ X , x ⊑ y

•

y ∈ S is the least upper bound for X (X

y) if y is an upper bound for X and

∀z ∈ S, X ⊑ z ⇒ y ⊑ z

•

y ∈ S is a lower bound for X (y ⊑ X ) if ∀x ∈ X , y ⊑ x

•

y ∈ S is the greatest lower bound for X (y

X ) if y is a lower bound for X and

∀z ∈ S, z ⊑ X ⇒ z ⊑ y

Lattices and Fixed Points 6 / 34

Lattices

Deﬁnition

A lattice (L, ⊑) is a partial order (L, ⊑) in which every pair of elements x, y ∈ L has a

least upper bound x ⊔ y (join) and a greatest lower bound x ⊓ y (meet).

A complete lattice is a lattice in which every subset has a least upper bound and a

greatest lower bound.

Lattices and Fixed Points 7 / 34

Sign Analysis

As an example, we want to ﬁnd the possible signs of integer variables and expressions.

Consider the following abstract values for the sign of an integer:

•

⊤: unknown sign

•

+: positive

•

−: negative

•

0: zero

•

⊥: not an integer, or unreachable

code

⊤

− +

⊥

This partial order, with edges for ⊑,

forms a complete lattice. e.g. + ⊑ ⊤

means + is at least as precise as ⊤

Lattices and Fixed Points 8 / 34

Sign Analysis

Let’s create a map lattice State = Var → Sign that describes the sign of each variable.

Derive a system of equations, one per line, using values from the lattice.

var a, b; // 1

a = 42; // 2

b = a + input(); // 3

a = a - b; // 4

= [a 7→ ⊤, b 7→ ⊤]

= x

[a 7→ +]

= x

[b 7→ x

(a) + ⊤]

= x

[a 7→ x

(a) − x

(b)]

⊤

− +

⊥

Sign

Lattices and Fixed Points 9 / 34

Sign Analysis

= [a 7→ ⊤, b 7→ ⊤]

= x

[a 7→ +]

= x

[b 7→ x

(a) + ⊤]

= x

[a 7→ x

(a) − x

(b)]

, . . . , x

) = [a 7→ ⊤, b 7→ ⊤]

, . . . , x

) = x

[a 7→ +]

, . . . , x

) = x

[b 7→ x

(a) + ⊤]

, . . . , x

) = x

[a 7→ x

(a) − x

(b)]

Generalised equation system over a lattice L, with functions f

: L

→ L:

= f

, . . . , x

)

= f

, . . . , x

)

= f

, . . . , x

)

Lattices and Fixed Points 10 / 34

Monotonicity and Fixed Points

Generalised equation system over a lattice L, with functions f

: L

→ L:

= f

, . . . , x

)

= f

, . . . , x

)

= f

, . . . , x

)

Combine the n functions into F : L

→ L

F (x

, . . . , x

) = (f

, . . . , x

), . . . , f

, . . . , x

))

= (x

, . . . , x

)

Then we are looking for x = F (x), i.e. a ﬁxed point of F .

Lattices and Fixed Points 11 / 34

Monotonicity and Fixed Points

Deﬁnition

A function f : L

→ L

is monotone if ∀x, y ∈ L

, x ⊑ y ⇒f (x) ⊑ f (y)

More precise input leads to more precise output

Theorem

Kleene’s Fixed Point Theorem: In a complete lattice L with ﬁnite height, every

monotone function f : L → L has a unique least ﬁxed point

∞

i=0

(⊥)

These results generalise to functions that take multiple arguments f : L

→ L that are

monotone in each argument – such as the ones we derived for sign analysis.

Corollary

For an equation system over complete lattices of ﬁnite height with monotone constraint

functions, a unique, most precise solution always exists

Lattices and Fixed Points 12 / 34

Computing the Least Fixed Point

Algorithm 1 Naive Fixed Point Algorithm

1: procedure NaiveFixedPoint(F )

2: x := ⊥

3: while x = F (x) do

4: x := F (x)

5: end while

6: return x

7: end procedure

In each iteraction, all of f

, . . . , f

are applied. But f

depends only on x

, and the value

of x

is unchanged in most iterations. We’ll see a more eﬃcient way later.

Lattices and Fixed Points 13 / 34

Data Flow Analysis

Main idea: we want to ﬁnd the possible values of variables at each point in the

program.

•

In compilers: used for optimisations (e.g. constant propagation)

•

In security: used to ﬁnd vulnerabilities (e.g. untrusted data reaching a sink)

Data Flow Analysis 14 / 34

Control Flow Graph

•

In our previous example, we had a

sequence of statements with no

branches

•

In general, we have a control ﬂow

graph (CFG) with basic blocks

and edges

Data Flow Analysis 15 / 34

Abstract States

•

Recall: each element of the lattice

State = Var → Sign is an abstract

state that maps variables to signs

•

For each CFG node v , let the

constraint variable [[v]] be the

abstract state at the program point

immediately after v

•

We have a lattice State

abstract states, where n is the

number of CFG nodes

Data Flow Analysis 16 / 34

Constraint Rules

We need to combine the abstract states

of the predecessors of a node to get the

abstract state of the node itself.

JOIN(v) =

u∈pred(v)

[[u]]

JOIN([[a = c + 2]]) = [[c = b]] ⊔ [[c = −5]]

Data Flow Analysis 17 / 34

Constraint Rules

[[c = b]] = [b 7→ +, c 7→ +]

[[c = −5]] = [b 7→ ⊤, c 7→ −]

JOIN([[a = c + 2]]) = [[c = b]] ⊔ [[c = −5]]

= [b 7→ ⊤, c 7→ ⊤]

[[a = c + 2]] = JOIN([[a = c + 2]])[a 7→ eval(JOIN([[a = c + 2]]), c + 2)]

= [a 7→ ⊤, b 7→ ⊤, c 7→ ⊤]

Data Flow Analysis 18 / 34

Solving Data Flow Equations

Generalised equation system over a lattice L, with functions f

: L

→ L:

[[v

]] = f

([[v

]], . . . , [[v

]])

[[v

]] = f

([[v

]], . . . , [[v

]])

[[v

]] = f

([[v

]], . . . , [[v

]])

Combine the n functions into F : L

→ L

F ([[v

]], . . . , [[v

]]) = (f

([[v

]], . . . , [[v

]]), . . . , f

([[v

]], . . . , [[v

]))

= ([[v

]], . . . , [[v

]])

Then we are looking for x = F (x), i.e. a ﬁxed point of F .

Data Flow Analysis 19 / 34

A More Eﬃcient Algorithm

Algorithm 2 Simple Worklist Algorithm

1: procedure SimpleWorkList(F )

2: (x

, . . . , x

) := (⊥, . . . , ⊥)

3: W := {v

, . . . , v

}

4: while W = ∅ do

5: v

:= W .pop()

6: y := f

, . . . , x

)

7: if y = x

then

8: x

:= y

9: for v

∈ dep(v

) do

10: W .add(v

)

11: end for

12: end if

13: end while

14: return (x

, . . . , x

)

15: end procedure

Insight: most f

will only read the

values from a few other variables,

instead of all [[v

]], . . . , [[v

]].

dep(v

) is the set of nodes that depend

on v

(i.e. the successors of v

)

Data Flow Analysis 20 / 34

Taint Tracking

•

Can we tell which computations may involve “tainted” data?

•

i.e. data that comes from an untrusted source

•

e.g. user input, HTTP responses, environment variables

Data Flow Analysis 21 / 34

Taint Tracking

The approach is similar! We just deﬁne diﬀerent abstract values and equations.

•

Abstract taint values:

•

⊤: Unknown taint status.

•

T : Tainted.

•

U: Untainted.

•

⊥: Unreachable.

•

Ordering: ⊥ ⊑ U ⊑ T ⊑ ⊤

•

Abstract state: A mapping σ : Var → {⊥, U, T , ⊤}.

•

Transfer functions: e.g. for an assignment x := y op z, deﬁne

f (σ) = σ[x 7→ σ(y) ⊔ σ(z)]

i.e. if either operand is tainted, the result is tainted

Data Flow Analysis 22 / 34

In Practice: CodeQL

A tool developed by Semmle (a spin-out company from Oxf*rd), now acquired by GitHub.

Used with CLI or GitHub integration (free for all public repos!)

•

The source code is compiled into a relational database, which includes information

about the control ﬂow graph, data ﬂow, and other properties of the code.

•

The user writes queries in a high-level language called QL, which is executed by the

CodeQL engine.

•

The engine uses ﬁxed-point algorithms to perform data ﬂow analysis.

•

Results are exported into the SARIF format which can be consumed by CI tools or

custom integrations.

CodeQL 23 / 34

Let’s Go On A Little Adventure

•

Next.js, the most popular React framework, has some weird, poorly documented

URL parsing semantics that does not conform to the widely accepted WHATWG

URL standard

•

This is unexpected behaviour, and often results in wrong URL validation

•

Made a responsible disclosure ≈ 1 year ago, still not ﬁxed

Let’s query open-source GitHub projects to ﬁnd instances of this bug!

•

Common design pattern: unauthenticated user visits /admin, gets redirected to

/login?next=/admin, logs in, and gets redirected back to /admin

•

Use Next.js URL parsing trickery to turn a “normal” URL into

javascript:sendToAttacker(authToken) at the ﬁnal step

CodeQL 24 / 34

Taint Tracking in CodeQL

We want to ﬁnd all instances where untrusted user input (source) reaches a sensitive

function (sink) without being sanitized.

CodeQL 25 / 34

Deﬁning Sources

You can also extend this with custom logic, to incorporate codebase-speciﬁc patterns

e.g. RPC calls, deserialization, etc.

CodeQL 26 / 34

Deﬁning Sinks

isSink(node) ≜ ∃ call, receiver . call invokes the .push method on receiver ∧

∃ invocation . invocation is useRouter() ∧ invocation →

∗

receiver ∧

node = args(call)[0]

CodeQL 27 / 34

Changing the Transfer Function

CodeQL 28 / 34

Changing the Transfer Function

Any results from DOMPurify.sanitize are treated as untainted. Know your assumptions!

CodeQL 29 / 34

Let’s Go Hunting

CodeQL 30 / 34

Let’s Go Hunting

CodeQL 31 / 34

Limitations

CodeQL is very useful as a CI integration to catch security issues early in the development

process, and provide guarantees about your code. But it’s really hard to get right . . .

•

We need to create custom taint speciﬁcations for third-party library APIs.

•

False positives: even if tainted data reaches a sink, it may not always be exploitable

– some other conditions may need to be met

•

Requires a good understanding of the codebase and the problem domain, and lots of

ﬁne-tuning to get good results – only as good as the queries you write

CodeQL 32 / 34

Alternative Approaches

•

Symbolic execution: represent the program inputs symbolically and explore all

possible paths through the program, generating constraints on the inputs such that a

certain path is taken

•

Ziyang Li, Saikat Dutta, and Mayur Naik. LLM-assisted static analysis for detecting

security vulnerabilities, 2024

IRIS leverages LLMs to infer taint speciﬁcations and perform contextual analysis,

alleviating needs for human speciﬁcations and inspection . . .

A state-of-the-art static analysis tool CodeQL detects only 27 of these vulnera-

bilities whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL’s

average false discovery rate by 5% points.

CodeQL 33 / 34

References

[1] Anders Møller and Michael I. Schwartzbach. Static Program Analysis. November

2020.

[2] Oege de Moor, Mathieu Verbaere, Elnar Hajiyev, Pavel Avgustinov, Torbjorn Ekman,

Neil Ongkingco, Damien Sereni, and Julian Tibble. Keynote address: .ql for source

code analysis. In Seventh IEEE International Working Conference on Source Code

Analysis and Manipulation (SCAM 2007), pages 3–16, 2007.

[3] Ziyang Li, Saikat Dutta, and Mayur Naik. LLM-assisted static analysis for detecting

security vulnerabilities, 2024.

CodeQL 34 / 34