clang: avoid uninitialized variables in eBPF (even if you compile with -Wall)
(This is an extended and updated version of my original LinkedIn post.)
eBPF program execution skipping code? 🤔
While working on the Cilium codebase [1], I realised that portions of some new code weren’t executed. Cilium compiles all its BPF code with -Wall -Werror [2], and the eBPF verifier also detects the use of uninitialized variables, so my first thought was “it must be an error in the flow control logic”.
In this (not so) small “repro”, you can see that nexthdr is uninitialized and that its address is passed to ipv6_hdrlen_offset(), which dereferences the pointer and assigns the result to nh. nh is then used to loop over IPv6 extension headers, if any:
    static __always_inline
    int ipv6_hdrlen_offset(struct __sk_buff *ctx,
                           __u8 *nexthdr, int l3_off)
    {
        /* ... */
        __u8 nh = *nexthdr;     /* reads whatever the caller left in nexthdr */
        /* ... */
    #pragma unroll
        for (i = 0; i < IPV6_MAX_HEADERS; i++) {
            switch (nh) {
            /* ... */
            }
            /* ... */
        }
        /* ... */
    }

    static __always_inline
    bool icmp6_ndisc_validate(struct __sk_buff *ctx, /* ... */)
    {
        __u8 nexthdr;           /* never initialized */
        /* ... */
        bpf_printk("pre ipv6_hdrlen_offset");
        l4_off = ipv6_hdrlen_offset(ctx, &nexthdr, l3_off);
        bpf_printk("post ipv6_hdrlen_offset");
        /* ... */
    }
Interestingly, when running the eBPF program, there was no other bpf_printk() output after bpf_printk("pre ipv6_hdrlen_offset"), not even the unconditional post ipv6_hdrlen_offset, as if that code had magically disappeared.
Looking at the LLVM disassembly, large parts of the code were indeed missing.
So what is going on?
The code is obviously buggy; nexthdr must not be left uninitialized at the point where ipv6_hdrlen_offset() is called.
Under the C and C++ standards, reading a variable that was not initialized on some path is generally undefined behavior, so optimizations (especially aggressive ones, like -O2 and -O3, hence the importance of footnote 2) can treat those paths as “unreachable” or safe to optimize out.
This resulted in clang removing the whole inlined body of ipv6_hdrlen_offset() and, given that the code below depended on it, all the code in icmp6_ndisc_validate() after that.
Contrary to what this LWN article says (they didn’t contact us), the verifier didn’t emit an error. The program loaded, as the resulting binary didn’t contain any invalid memory access.
But clang -Wall -Werror issued no warning and produced a binary, so the outcome was a completely abnormal and unpredictable execution given the source code.
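Reduced to plain C, the shape of the problem looks roughly like the sketch below. It is not the repro from this post or from the LLVM thread: classify(), caller_buggy() and caller_fixed() are hypothetical names, and the buggy caller is compiled out with #if 0 so that the block itself stays well-defined.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for an inlined helper such as ipv6_hdrlen_offset():
 * it reads *kind before doing anything else, so the caller must have
 * written a value there.
 */
static inline int classify(uint8_t *kind)
{
    switch (*kind) {
    case 6:             /* TCP */
        return 1;
    case 17:            /* UDP */
        return 2;
    default:
        return 0;
    }
}

#if 0
/*
 * Buggy shape, compiled out on purpose: 'kind' is never written, yet
 * classify() reads it. Everything derived from the result depends on an
 * uninitialized value, so the optimizer may drop it.
 */
int caller_buggy(void)
{
    uint8_t kind;

    return classify(&kind);
}
#endif

/* Fixed shape: the value comes from real data before the helper reads it. */
int caller_fixed(const uint8_t *hdr)
{
    uint8_t kind = hdr[6];  /* byte 6 of an IPv6 fixed header: next header */

    return classify(&kind);
}
```

The only difference between the two callers is the initializer, which is exactly what was missing for nexthdr in icmp6_ndisc_validate().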
Improvements in clang 21.1, but not yet there…
@yonghong-song started a thread in the LLVM Discourse community, with a (much) simplified repro, to discuss ways clang could better detect these uninitialized variable conditions in BPF programs. He also shared the issue on bpf@vger.kernel.org.
Since then, @yonghong-song has worked on a number of pull requests in llvm/llvm-project:
- ❌ [RFC][clang][BPF] Make trivial uninit var value to be 0 #125601
- ❌ [RFC][BPF] Report Unreachable Behavior from IR #126858
- ✅ [BPF] Handle traps with kfunc call __bpf_trap #131731
However, at least with the latest and greatest clang v21.1.1 (5a86dc996c26), which includes #131731, the repro still compiles without any warnings.
-ftrivial-auto-var-init=zero as a mitigation
@yonghong-song also suggested in the LLVM thread using -ftrivial-auto-var-init=zero as a mitigation strategy to, at the very least, prevent the compiler from aggressively optimizing out big parts of the code.
While this may result in a more predictable execution (code will execute sequentially), results might still be unexpected. For instance, in the example above, zero is not a safe default: nexthdr=0 denotes the Hop-by-Hop Options extension header, whereas the value must come from the skb itself, e.g. 6 for TCP.
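To make that last point concrete, here is a minimal userspace sketch, in plain C rather than eBPF, of what a zero-initialized nexthdr does to a header walk. It is a simplified stand-in for ipv6_hdrlen_offset(), not Cilium's implementation: walk_ext_headers(), build_tcp_packet() and the constants are hypothetical names, and only the Hop-by-Hop, Routing and Destination Options headers are handled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IPV6_FIXED_HDR_LEN 40   /* size of the IPv6 fixed header */
#define NEXTHDR_HOP        0    /* Hop-by-Hop Options */
#define NEXTHDR_TCP        6
#define NEXTHDR_ROUTING    43
#define NEXTHDR_DEST       60   /* Destination Options */
#define MAX_EXT_HEADERS    4    /* arbitrary bound for this sketch */
#define PKT_LEN            60   /* fixed header + minimal TCP header */

/*
 * Walk the extension headers that follow the fixed header.
 * *nexthdr is in/out: on entry it must hold the "next header" value of the
 * fixed header; on success it holds the L4 protocol.
 * Returns the L4 offset, or -1 on a truncated or too-long chain.
 */
int walk_ext_headers(const uint8_t *pkt, size_t len, uint8_t *nexthdr)
{
    size_t off = IPV6_FIXED_HDR_LEN;
    uint8_t nh = *nexthdr;
    int i;

    for (i = 0; i < MAX_EXT_HEADERS; i++) {
        switch (nh) {
        case NEXTHDR_HOP:
        case NEXTHDR_ROUTING:
        case NEXTHDR_DEST:
            if (off + 2 > len)
                return -1;
            nh = pkt[off];                          /* type of the following header */
            off += ((size_t)pkt[off + 1] + 1) * 8;  /* length in 8-octet units */
            break;
        default:
            *nexthdr = nh;
            return (int)off;
        }
    }
    return -1;
}

/* Build an IPv6 + TCP packet without extension headers (source port 443). */
void build_tcp_packet(uint8_t pkt[PKT_LEN])
{
    memset(pkt, 0, PKT_LEN);
    pkt[0] = 0x60;          /* version 6 */
    pkt[6] = NEXTHDR_TCP;   /* next header field of the fixed header */
    pkt[40] = 0x01;         /* TCP source port, high byte */
    pkt[41] = 0xbb;         /* TCP source port, low byte */
}
```

Seeding the walk with pkt[6] returns offset 40 and protocol 6. Seeding it with 0, which is what zero-initialization would provide, makes the walk read the first TCP bytes as a Hop-by-Hop Options header and return a bogus offset and protocol: predictable, but still wrong.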
Conclusion
For now, make sure all stack variables are initialized!
Acknowledgments
Thank you to @yonghong-song for pushing this discussion and working on upstream fixes in the LLVM community, and to @borkmann for his help.