clang: avoid uninitialized variables in eBPF

If your eBPF program executes strangely, double check all stack variables are initialized
(even if you compile with -Wall).

(This is an extended and updated version of my original LinkedIn post).

eBPF program execution skipping code? 🤔 #

While working on the Cilium codebase¹, I realised that portions of some new code weren’t being executed. Cilium compiles all its BPF code with -Wall -Werror², and the eBPF verifier also detects the use of uninitialized variables, so my first thought was “it must be an error in the control flow logic”.

In this (not so) small “repro”, you can see that nexthdr is uninitialized and passed by pointer to ipv6_hdrlen_offset(), which dereferences it and assigns the value to nh. nh is then used to loop over the IPv6 extension headers, if any:

static __always_inline
int ipv6_hdrlen_offset(struct __sk_buff *ctx,
		       __u8 *nexthdr, int l3_off)
{
	/* ... */
	__u8 nh = *nexthdr;
	/* ... */
#pragma unroll
	for (i = 0; i < IPV6_MAX_HEADERS; i++) {
		switch (nh) {
		/* ... */
}

static __always_inline
bool icmp6_ndisc_validate(struct __sk_buff *ctx, /* ... */)
{
	__u8 nexthdr;
	/* ... */
	bpf_printk("pre ipv6_hdrlen_offset");
	l4_off = ipv6_hdrlen_offset(ctx, &nexthdr, l3_off);
	bpf_printk("post ipv6_hdrlen_offset");
	/* ... */
}

Interestingly, when running the eBPF program, no bpf_printk() output appeared after bpf_printk("pre ipv6_hdrlen_offset"), not even the unconditional "post ipv6_hdrlen_offset", as if that code had magically disappeared.

Looking at the LLVM disassembly, large parts of the code were indeed missing.

So what is going on? #

The code is obviously buggy; nexthdr must not be left uninitialized at the point where ipv6_hdrlen_offset() is called.

The C and C++ standards make reading an uninitialized stack variable undefined behavior in most cases, so an optimizing compiler (especially at aggressive levels like -O2 and -O3, hence the importance of ²) may assume that such paths never execute, treating them as “unreachable” and safe to optimize out.

This resulted in clang removing the whole inlined body of ipv6_hdrlen_offset() and, given that the code below depended on it, all the code in icmp6_ndisc_validate() after that call.

Contrary to what this LWN article says (they didn’t contact us), the verifier didn’t emit an error. The program loaded, as the resulting binary didn’t contain any invalid memory access.

Yet clang -Wall -Werror issued no warning and still produced a binary, whose execution was completely abnormal and unpredictable given the source code.
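To make the shape of the bug easier to see, here is a minimal sketch in plain C. It is neither the Cilium code nor the simplified repro from the LLVM thread: hdrlen_offset(), l4_offset() and MAX_HEADERS are hypothetical stand-ins, and the "header walk" is faked. The buggy variant is only shown in a comment; the compiled code is the fixed variant, where the variable is seeded before its address is handed to the inlined helper.

```c
#include <stdint.h>

#define MAX_HEADERS 4 /* hypothetical bound, stands in for IPV6_MAX_HEADERS */

/* Hypothetical stand-in for ipv6_hdrlen_offset(): it READS *nexthdr first. */
static inline int hdrlen_offset(uint8_t *nexthdr, int l3_off)
{
	uint8_t nh = *nexthdr; /* only valid if the caller has set *nexthdr */
	int len = 40;          /* size of the fixed IPv6 header */
	int i;

	for (i = 0; i < MAX_HEADERS; i++) {
		switch (nh) {
		case 0:  /* pretend: one 8-byte hop-by-hop header, then TCP */
			len += 8;
			nh = 6;
			break;
		default: /* upper-layer protocol reached: done */
			*nexthdr = nh;
			return l3_off + len;
		}
	}
	return -1; /* too many extension headers */
}

/*
 * Buggy shape (what the repro does, do not copy):
 *
 *     uint8_t nexthdr;                          // never assigned
 *     l4_off = hdrlen_offset(&nexthdr, l3_off); // callee reads garbage
 *
 * Fixed shape: the variable is initialized on every path before the call.
 */
static inline int l4_offset(uint8_t ip6_nexthdr, int l3_off)
{
	uint8_t nexthdr = ip6_nexthdr; /* seeded from the packet's IPv6 header */

	return hdrlen_offset(&nexthdr, l3_off);
}
```

With an Ethernet header in front (l3_off = 14), l4_offset(6, 14) yields 54 and l4_offset(0, 14) yields 62 in this toy model. In the buggy shape the optimizer has no such value to work with, which is what allows it to discard the inlined body and everything that depends on it.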

Improvements in `clang` 21.1, but not yet there… #

@yonghong-song started a thread in the LLVM Discourse community - with a (much) simplified repro - to discuss how clang could better detect these uninitialized variable conditions in BPF programs. He also shared the issue on bpf@vger.kernel.org.

Since then, @yonghong-song has worked on a number of pull requests in llvm/llvm-project, but, at least with the latest and greatest clang v21.1.1 (5a86dc996c26), which includes #131731, the repro still compiles without any warnings.

`-ftrivial-auto-var-init=zero` as a mitigation #

@yonghong-song also suggested in the LLVM thread using -ftrivial-auto-var-init=zero as a mitigation strategy to, at the very least, prevent the compiler from aggressively optimizing out big parts of the code.

While this may result in a more predictable execution (code will execute sequentially), the results might still be unexpected. For instance, in the example above, nexthdr must come from the skb itself (e.g. 6 for TCP); a zero-initialized nexthdr would be interpreted as the Hop-by-Hop Options extension header.
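The following sketch illustrates that pitfall in user-space C, under stated assumptions: ext_hdr_walk() is a hypothetical, heavily simplified walker that only understands the Hop-by-Hop Options header, and l4_off_zero_init() merely emulates what a zero-filled nexthdr would look like. It is not the Cilium implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP  0  /* Hop-by-Hop Options extension header */
#define NEXTHDR_TCP  6
#define IPV6_HDR_LEN 40 /* fixed IPv6 header */
#define MAX_HEADERS  4  /* hypothetical bound on extension headers */

/* Simplified walker: skips hop-by-hop headers, starting from *nexthdr.
 * Returns the L4 offset within pkt, or -1 on error. */
static int ext_hdr_walk(const uint8_t *pkt, size_t pkt_len, uint8_t *nexthdr)
{
	size_t off = IPV6_HDR_LEN;
	uint8_t nh = *nexthdr;
	int i;

	for (i = 0; i < MAX_HEADERS; i++) {
		if (nh != NEXTHDR_HOP) {
			*nexthdr = nh;
			return (int)off;
		}
		if (off + 2 > pkt_len)
			return -1;
		nh = pkt[off]; /* Next Header field of the extension header */
		/* Hdr Ext Len counts 8-byte units beyond the first 8 bytes */
		off += ((size_t)pkt[off + 1] + 1) * 8;
	}
	return -1;
}

/* Emulates a zero-filled variable: the walker believes an extension
 * header follows, even if the packet says otherwise. */
static int l4_off_zero_init(const uint8_t *pkt, size_t pkt_len)
{
	uint8_t nexthdr = 0;

	return ext_hdr_walk(pkt, pkt_len, &nexthdr);
}

/* Correct: seed nexthdr from byte 6 of the IPv6 header (Next Header). */
static int l4_off_from_packet(const uint8_t *pkt, size_t pkt_len)
{
	uint8_t nexthdr;

	if (pkt_len < IPV6_HDR_LEN)
		return -1;
	nexthdr = pkt[6];
	return ext_hdr_walk(pkt, pkt_len, &nexthdr);
}
```

For a plain IPv6/TCP packet with no extension headers, l4_off_from_packet() returns 40, while l4_off_zero_init() parses the first bytes of the TCP header (the source port) as if they were an extension header and ends up at a bogus offset. The program now runs deterministically, but it still computes the wrong thing.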

Conclusion #

For now, make sure all stack variables are initialized!

Acknowledgments #

Thank you to @yonghong-song for pushing this discussion and working on upstream fixes in the LLVM community, and to @borkmann for his help.

¹ Doesn’t happen all that often 😅
² And -Wextra. Any C/C++ project should compile with at least -Wall -Werror.