exec java ...
You can’t see, them but they can find you.
You can’t kill them, but they will kill you.
On a Saturday night, I found myself deep in strace logs checking [not a good word] syscalls made by our (CodeScene) on-prem app. This thing, that a customer reported, with "git zombies wandering around" on their machine had been worrying me. I couldn’t resist the temptation - I had to find what’s going on.
Months ago, when enabling JFR monitoring for our application running in a Docker container,
I noticed that the app isn’t responding to signals: if you called docker stop <codescene-container>
,
then it would halt for 10 seconds and then the whole container was SIGKILL-ed.
This is
a classic problem
when your application is running as a subprocess of a shell such as bash
.
The shell won’t pass signals to the subprocess and thus the subprocess cannot respond to it.
docker stop
first stends SIGTERM and waits for 10 seconds.
If the container isn’t stopped yet, then it sends SIGKILL which does its dirty job
and kills all the processes running in the container, immediately.
I was aware of the "signal-passing problem" and decided to fix it.
After a brief study and asking on a forum I believed I came up with a simple solution:
use exec
to execute our Java/JVM process as the PID 1 in the container,
effectively replacing the shell wrapper.
This way, the signals would be passed immediately to the CodeScene process
and thus the JVM can respond to SIGTERM and other signals as expected.
The fix was simple - prepend exec
to the java
command in the start.sh file used as ENTRYPOINT
in Dockerfile:
exec java ...
Who’s the father of zombies?
In the end, the quickest solution was to add --init
docker run flag or init: true
for docker-compose.
However, we cannot rely on customers doing that work for us.
So we decided to add tiny
straing into our docker image - that way it works in environments that don’t support the init flag
(like Kubernetes) and doesn’t require any work/awareness from the end user.
As a bonus, I also remapped exit code 143[1] to 0. 143 is returned by Java when it receives the SIGTERM signal
The best solution I find
I intended this to be a lead but perhaps better just as a conclusion.
Remember the last time you learned about a bug, or a "suboptimal solution", but you were too lazy (didn’t have time?) to fix it?
This one was a reminder for me: when you ignore bug it doesn’t fix itself. It will come, with no mercy, again. Don’t ignore problems and fix them when you find them (or at least plan for the fix).