This is the story of an investigation conducted by Jochen Sprickerhof, Helmut Grohne, and myself. It was true teamwork, and we would not have reached the bottom of the issue working individually. We think you will find it as interesting and fun as we did, so here is a brief writeup. A few of the steps mentioned here took several days, others just a few minutes. What is described as a natural progression of events did not always look so obvious in the moment.
Let us go through the Six Stages of Debugging together.
Official Debian GCC builds start failing on multiple architectures in late November.
The build error happens on the build servers when running the testsuite, but we know this cannot happen. GCC builds are not meant to fail in case of testsuite failures! Return codes are not making the build fail; make is being called with -k; it just cannot happen.
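For readers unfamiliar with -k, here is a tiny, purely illustrative sketch with a throwaway Makefile, nothing to do with the real GCC packaging: with -k, make keeps going past a failing recipe and only reports the overall failure at the end.

# Hypothetical two-target Makefile, just to show the -k semantics.
printf 'all: bad good\nbad:\n\tfalse\ngood:\n\ttrue\n' > /tmp/Makefile.demo
make -f /tmp/Makefile.demo -k    # "bad" fails, "good" still runs, make exits non-zero
make -f /tmp/Makefile.demo       # without -k, make stops at the first error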
A lot of the GCC tests are always failing in fact, and an extensive log of the results is posted to the debian-gcc mailing list, but the packages always build fine regardless.
Building on my machine running Bookworm is just fine. The Build Daemons run Bookworm and use a Sid chroot for the build environment, just like I do. Same kernel.
The only obvious difference between my setup and the Debian buildds is that I am using sbuild 0.85.0 from bookworm, and the buildds have 0.86.3~bpo12+1 from bookworm-backports. Trying again with 0.86.3~bpo12+1, the build fails on my system too. The build daemons were updated to the bookworm-backports version of sbuild at some point in late November. Ha.
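Reproducing that setup locally is a one-liner, assuming bookworm-backports is already among the apt sources:

sudo apt install -t bookworm-backports sbuild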
There are quite a few sbuild versions in between 0.85.0 and 0.86.3~bpo12+1, but looking at recent sbuild bugs shows that sbuild 0.86.0 was breaking "quite a number of packages". Indeed, with 0.86.0 the build still fails. Trying the version immediately before, 0.85.11, the build finishes correctly. This took more time than it sounds, one run including the tests takes several hours. We need a way to shorten this somehow.
The Debian packaging of GCC lets you specify which languages to skip, and by default it builds Ada, Go, C, C++, D, Fortran, Objective C, Objective C++, M2, and Rust. When running the tests sequentially, the build logs stop roughly around the tests of a runtime library for D, libphobos. So can we still reproduce the failure by skipping everything except for D? With
DEB_BUILD_OPTIONS=nolang=ada,go,c,c++,fortran,objc,obj-c++,m2,rust
the build still fails, and it fails faster than before. Several minutes, not
hours. This is progress, and time to file a bug. The report contains massive
spoilers, so no link. :-)
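For the curious, here is a rough sketch of what such a trimmed-down run can look like outside of sbuild; the dpkg-buildpackage flags are just the usual unsigned binary-only build, and the environment variable is the part doing the work:

# From an unpacked gcc source tree, skip every frontend except D.
export DEB_BUILD_OPTIONS=nolang=ada,go,c,c++,fortran,objc,obj-c++,m2,rust
dpkg-buildpackage -us -uc -b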
Something is causing the build to end prematurely. It’s not the OOM killer, and the kernel does not have anything useful to say in the logs. Can it be that the D language tests are sending signals to some process, and that is what’s killing make? We start tracing signals sent with bpftrace by writing the following script, signals.bt:
tracepoint:signal:signal_generate
{
    printf("%s (PID: %d) sent signal %d to PID %d\n", comm, pid, args->sig, args->pid);
}
And executing it with sudo bpftrace signals.bt.
The build takes its sweet time, and it fails. Looking at the trace output, there’s a suspicious process.exe terminating stuff.
process.exe (PID: 2868133) sent signal 15 to PID 711826
That looks interesting, but we have no clue what PID 711826 may be. Let’s change the script a bit, and trace signals received as well.
tracepoint:signal:signal_generate
{
    printf("PID %d (%s) sent signal %d to %d\n", pid, comm, args->sig, args->pid);
}

tracepoint:signal:signal_deliver
{
    printf("PID %d (%s) received signal %d\n", pid, comm, args->sig);
}
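A build takes long enough that it pays off to capture the whole trace and filter it afterwards; the log path and the grep pattern below are just one way of narrowing things down to the processes we care about:

sudo bpftrace signals.bt | tee /tmp/signals.log
grep -E 'process\.exe|std\.process|sleep' /tmp/signals.log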
The working version of sbuild was using dumb-init, whereas the new one features a little init in perl. We patch the current version of sbuild by making it use dumb-init instead, and trace two builds: one with the perl init, one with dumb-init.
Here are the signals observed when building with dumb-init.

PID 3590011 (process.exe) sent signal 2 to 3590014
PID 3590014 (sleep) received signal 9
PID 3590011 (process.exe) sent signal 15 to 3590063
PID 3590063 (std.process tem) received signal 9
PID 3590011 (process.exe) sent signal 9 to 3590065
PID 3590065 (std.process tem) received signal 9
And this is what happens with the new init in perl:
PID 3589274 (process.exe) sent signal 2 to 3589291
PID 3589291 (sleep) received signal 9
PID 3589274 (process.exe) sent signal 15 to 3589338
PID 3589338 (std.process tem) received signal 9
PID 3589274 (process.exe) sent signal 9 to 3589340
PID 3589340 (std.process tem) received signal 9
PID 3589274 (process.exe) sent signal 15 to 3589341
PID 3589274 (process.exe) sent signal 15 to 3589323
PID 3589274 (process.exe) sent signal 15 to 3589320
PID 3589274 (process.exe) sent signal 15 to 3589274
PID 3589274 (process.exe) received signal 9
PID 3589341 (sleep) received signal 9
PID 3589273 (sbuild-usernsex) sent signal 9 to 3589320
PID 3589273 (sbuild-usernsex) sent signal 9 to 3589323
There are a few additional SIGTERMs being sent when using the perl init, that’s helpful. At this point we are fairly convinced that process.exe is worth additional inspection. The source code of process.d shows something interesting:
1221 @system unittest
1222 {
[...]
1247     auto pid = spawnProcess(["sleep", "10000"],
[...]
1260     // kill the spawned process with SIGINT
1261     // and send its return code
1262     spawn((shared Pid pid) {
1263         auto p = cast() pid;
1264         kill(p, SIGINT);
So yes, there’s our sleep and the SIGINT (signal 2) right in the unit tests of process.d, just like we have observed in the bpftrace output.
Can we study the behavior of process.exe in isolation, separately from the build? Indeed we can. Let’s take the executable from a failed build, and try running it under /usr/libexec/sbuild-usernsexec. First, we prepare a chroot inside a suitable user namespace:
unshare --map-auto --setuid 0 --setgid 0 mkdir /tmp/rootfs
cd /tmp/rootfs
cat /home/ema/.cache/sbuild/unstable-arm64.tar | unshare --map-auto --setuid 0 --setgid 0 tar xf -
unshare --map-auto --setuid 0 --setgid 0 mkdir /tmp/rootfs/whatever
unshare --map-auto --setuid 0 --setgid 0 cp process.exe /tmp/rootfs/
Now we can run process.exe on its own using the perl init, and trace signals at will:
/usr/libexec/sbuild-usernsexec --pivotroot --nonet u:0:100000:65536 g:0:100000:65536 /tmp/rootfs ema /whatever -- /process.exe
We can compare the behavior of the perl init vis-a-vis the one using dumb-init in milliseconds instead of minutes.
Why process.exe sends more SIGTERMs when using the perl init is now the big question. We have a simple reproducer, so this is where using strace becomes possible.
sudo strace --user ema --follow-forks -o sbuild-dumb-init.strace ./sbuild-usernsexec-dumb-init --pivotroot --nonet u:0:100000:65536 g:0:100000:65536 /tmp/dumbroot ema /whatever -- /process.exe
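With the traces on disk, the kill() calls are easy to pull out; the second file name below is hypothetical, standing in for an equivalent run with the unpatched perl init:

grep -w kill sbuild-dumb-init.strace
grep -w kill sbuild-perl-init.strace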
Here is what process.exe does under dumb-init:
3593883 kill(-2, SIGTERM) = -1 ESRCH (No such process)
No such process. Under perl-init instead:
3593777 kill(-2, SIGTERM <unfinished ...>
The process is there under perl-init!
That is a kill with a negative pid. From the kill(2) man page:
If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.
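The shell exposes the same convention, by the way; this hypothetical invocation delivers SIGTERM to every member of process group 12345:

kill -TERM -- -12345    # the -- keeps the negative PID from being parsed as an option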
It would have been very useful to see this kill with a negative pid in the output of bpftrace; why didn’t we? The tracepoint used, tracepoint:signal:signal_generate, shows when signals are actually being sent, not the syscall being called. To confirm, one can trace tracepoint:syscalls:sys_enter_kill and see the negative PIDs, for example:
PID 312719 (bash) sent signal 2 to -312728
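Something along these lines shows the syscall-level view, negative PIDs included; the int32 cast is only a precaution to print the target PID as a signed value:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_kill {
    printf("PID %d (%s) sent signal %d to %d\n", pid, comm, args->sig, (int32)args->pid);
}'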
The obvious question at this point is: why is there no process group 2 when using dumb-init?
We know that process.exe sends a SIGTERM to every process in the process group with ID 2. To find out what this process group may be, we spawn a shell with dumb-init and observe under /proc PIDs 1, 16, and 17. With perl-init we have 1, 2, and 17. When running dumb-init, there are a few forks before launching the program, which explains the difference. Looking at /proc/2/cmdline we see that it’s bash, i.e. the program we are running under perl-init. When building a package, that is dpkg-buildpackage itself.
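A quick way to see this for yourself from a shell spawned under either init, nothing sbuild-specific about it:

# PID, parent, process group and command for everything visible in the namespace.
ps -o pid,ppid,pgid,comm -A
# The cmdline of PID 2 is NUL-separated, hence the tr.
tr '\0' ' ' < /proc/2/cmdline; echo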
The test is accidentally killing its own process group.
Now where does this -2 come from in the test?
2363     // Special values for _processID.
2364     enum invalid = -1, terminated = -2;
Oh. -2 is used as a special value for PID, meaning "terminated". And there’s a call to kill() later on:
2694     do { s = tryWait(pid); } while (!s.terminated);
[...]
2697     assertThrown!ProcessException(kill(pid));
What sets pid to terminated, you ask? Here is tryWait:
2568 auto tryWait(Pid pid) @safe
2569 {
2570     import std.typecons : Tuple;
2571     assert(pid !is null, "Called tryWait on a null Pid.");
2572     auto code = pid.performWait(false);
And performWait:
2306 _processID = terminated;
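Putting the pieces together: after tryWait() the Pid holds the special value -2, so the later kill(pid) ends up issuing kill(-2, SIGTERM). Under dumb-init there is no process group 2 and the call fails with ESRCH; under the perl init, process group 2 is the shell running the build, dpkg-buildpackage included, and it gets terminated. In shell terms, the test effectively does:

kill -TERM -- -2    # SIGTERM to process group 2, whoever happens to be in it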
PS: the bug report with spoilers for those interested is #1089007.