[WIP] shepherd: Poll every 0.5s to find dead forked services

  • Done
  • quality assurance status badge
Details
2 participants
  • Carlo Zancanaro
  • Ludovic Courtès
Owner
unassigned
Submitted by
Carlo Zancanaro
Severity
normal
C
C
Carlo Zancanaro wrote on 27 Feb 2018 22:56
(address . guix-patches@gnu.org)
878tbe9jvx.fsf@zancanaro.id.au
Another patch for shepherd!

This one I'm much less happy about, but I'm sending it for
feedback. I'll explain the problem it solves, then the (hacky) way
that it solves it for now. I'm sure there's a better way to solve
this, but it was annoying me enough for me to do this now.

The problem is that shepherd, when run as a user process, can
"lose" services which fork away. Shepherd can still kill them, but
a SIGCHLD won't be delivered if they die, so shepherd can't
restart/disable them. My prime example is emacs, which I run with
--daemon. If I then kill emacs, shepherd will still think that it
is running.

To solve this problem, I have modified shepherd to poll each
process that it thinks is running. I think this is approximately
every 0.5s, but I don't quite understand guile's behaviour when it
comes to socket timeouts.

This involves a subtle change in the meaning of the running field
in <service> objects. My patch treats any `integer?` as a pid, and
if a corresponding process is not found it will attempt to restart
the service. I considered creating a new "pid" datatype to make it
clear when a number represents a pid, but didn't want to go
overboard without further feedback. So, thoughts? Can anyone
suggest a better way to solve this problem?

I'm also confused by my new test. If I run it myself then it
passes fine, but in a guix build container there is something
different which means that the processes don't get reaped
properly, so the test doesn't terminate. I'll keep trying to work
it out, but if anyone else has an idea I'd love to hear it.

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqV1BIACgkQqdyPv9aw
Ibz/Nw/5AdjbhliDDcHtRm/WW9Mf/cMZdbNQU2Dl2+/Ig/yQ4KgtZcIEtGxQV49E
0TYIXnbD0lJ7KWR/+4bttlzBcXe9xZSNObMC5OcM68Kvuikmrp9DsjjttNJ6kLcx
056Meze8bCoY9V9C+3MyESXhkipVA6AdvqSfR+z+0bsbl2xaebZyufMdbooEjTAg
Ift3U8A073NyGKnMtEn5JAcVbyn8djFx37tZkBbB+nYRGhzy5sodsMRb0EOREo/8
Wxfrpg8fIxzNE+vppEBPbk5E2F823j+cIVgQmztiT3Ia1epYlKR6ooj6sIzoFI2T
U9LT43swEFcsk0VazvjBx1ADGjtdT3cNhVXfdnCrHvBOhHsDg9mA9rJ4/ILpEosP
Wbc2W3Yym8N5DKc3yqDgSsJ1enwUhW6XP/RN6p9kdxOBbykhStp0s1xsILmXNzsW
3fDi9h14WVhq+ZbyWbaoHRQZ/hNKBsOxoOAjimI5/NjfGzIMWJWUcwyTSe1HN/MK
v3a4Uo04UEMva/CX+B6+1oi6O7X068elROfPmn1BhZkhRrUV9eDugaGqgasGRpgV
QEWc5rLidIwuSnXWeJqfMbi8wldcrWh5joqIchW0dKC8kuE4t2iXL8RYzSkEepkA
1Lde+ema8r1VA971yE4xNuftH38AGBhBtWdQ1PseMXaGS7uoCBk=
=yPdk
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 28 Feb 2018 23:06
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87y3jcu5v5.fsf@gnu.org
Hi Carlo,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (2 lines)
> Another patch for shepherd!

Always welcome! :-)

Toggle quote (6 lines)
> The problem is that shepherd, when run as a user process, can "lose"
> services which fork away. Shepherd can still kill them, but a SIGCHLD
> won't be delivered if they die, so shepherd can't restart/disable
> them. My prime example is emacs, which I run with --daemon. If I then
> kill emacs, shepherd will still think that it is running.

There are two issues here, I think.

1. shepherd cannot lose SIGCHLD: if a process dies immediately once
it’s been spawned, as is the case with “emacs --daemon” or any
other daemon-style program, it should receive SIGCHLD and process
it.

However, there could be a race condition: if SIGCHLD is handled
before the ‘running’ value has been set, then we’ll still get the
non-#f ‘running’ value even though the process died in the
meantime.

The code tries to prevent that (see (shepherd service) around line
320), but looking more closely, I think the race is still there.
Namely, the whole ‘let’ block, including the call to ‘start’,
should be in ‘call-with-blocked-asyncs’, I think. Could you check
if that helps for you?

2. shepherd currently can’t do much with real daemons. So what we do
in GuixSD is to either start programs in non-daemon mode, when
that’s an option, or pass #:pid-file to retrieve the forked process
PID. I think you should do one of these as well.

WDYT?

Thanks,
Ludo’.
C
C
Carlo Zancanaro wrote on 1 Mar 2018 23:37
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
87d10nwhfl.fsf@zancanaro.id.au
Hey Ludo,

On Wed, Feb 28 2018, Ludovic Courtès wrote:
Toggle quote (20 lines)
>> The problem is that shepherd, when run as a user process, can
>> "lose"
>> services which fork away. Shepherd can still kill them, but a
>> SIGCHLD
>> won't be delivered if they die, so shepherd can't
>> restart/disable
>> them. My prime example is emacs, which I run with --daemon. If
>> I then
>> kill emacs, shepherd will still think that it is running.
>
> There are two issues here, I think.
>
> 1. shepherd cannot lose SIGCHLD: if a process dies immediately
> once
> it’s been spawned, as is the case with “emacs --daemon” or
> any
> other daemon-style program, it should receive SIGCHLD and
> process
> it.

Yeah, that's true, but the problem is that shepherd only processes
the SIGCHLD if there is a service with its `running` slot set to
the pid. When emacs forks, the original process may have its
SIGCHLD handled, but that doesn't affect shepherd's service state
(as it shouldn't, because it's using #:pid-file to track the
forked process).

Toggle quote (8 lines)
> 2. shepherd currently can’t do much with real daemons. So
> what we do
> in GuixSD is to either start programs in non-daemon mode,
> when
> that’s an option, or pass #:pid-file to retrieve the forked
> process
> PID. I think you should do one of these as well.

I am doing that. The problem is that when a service dies (crashes,
quits, etc.) the `respawn?` option cannot be honoured because
shepherd is not notified that the process has terminated (because
it never receives a SIGCHLD for the forked pid). My patch polls
for the processes we expect, to make up for the lack of
notification. I would much rather it receive an event/signal to
notify that the forked process has died, but I don't know how to
do that in a robust, portable way so I chose to poll instead.

If you look at my test case in tests/respawn-service.sh (which can
be read in its entirety in the diff attached to my previous email)
you can see the problem that this patch solves. The test will fail
without the rest of my patch, but will pass with them (guix build
container issue notwithstanding).

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqYgL8ACgkQqdyPv9aw
IbzJhxAAoJLq07/eAo2bipD6ie8voBqtBfrlqn5PbwcAiHUtN1WEJQ8xyl0fW3bt
7XNbH8DNIkHWXj1KTrCVRVgz8rOtCav1APoHh8n3CMx+jhqQV2dtAON/Fizj1zHB
rKYooZt1WprbWzDm6KRm4NokwIFNAYG2vN19DNqtuhHj9rC+tv7ZIN4jSJKb6vdY
5mwWwy1nE1yNI6CUl6x/oIQL0oNfnKKFM0SDLYKihgbjTIjykRVbMOJnDdze+MYI
rbrsWmrZaese2Jj2NC2ZiEBFrKx2igvieu2HEI+zijV07QiQG+XPR0pdnf/lCng4
b/wVOZBunW4ouSVvOIaSqo3adOyu6J1mZEK7MXf1+DbpZB8TYB5SeSkEJ0Qlhn8U
Chq/Xipggpf4Wwpd6Fu6jrNwQEAk3F+pFqqpDrQLo3bFGlb26hhJpyAScDQ3UGAY
ktCPwEyAsIXoiTQtsYEoVaysnmxW3ZnCaSF/zwB56gV50bD45TUGgTcf2R7W1TkH
etUHfUQ74P8G2OMuvF1VkuuUFCxjq1bDUgZkxNWeQhBqjhnZy9B6XtUTcas1We/h
bMlllsBU3JFVPbGtLbsTTC2FJof7yzyrcJ3XCMew5xutY4GsZKRauNNdYR/5ybEW
z9X38N2oWQ8l6U9t4leKjrxsEk5AvFdwBnfUj8CoEFOFRaD1VaI=
=TndF
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 2 Mar 2018 10:44
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87r2p2izgz.fsf@gnu.org
Hi Carlo,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (24 lines)
> On Wed, Feb 28 2018, Ludovic Courtès wrote:
>>> The problem is that shepherd, when run as a user process, can
>>> "lose"
>>> services which fork away. Shepherd can still kill them, but a
>>> SIGCHLD
>>> won't be delivered if they die, so shepherd can't restart/disable
>>> them. My prime example is emacs, which I run with --daemon. If I
>>> then
>>> kill emacs, shepherd will still think that it is running.
>>
>> There are two issues here, I think.
>>
>> 1. shepherd cannot lose SIGCHLD: if a process dies immediately
>> once
>> it’s been spawned, as is the case with “emacs --daemon” or
>> any
>> other daemon-style program, it should receive SIGCHLD and
>> process
>> it.
>
> Yeah, that's true, but the problem is that shepherd only processes the
> SIGCHLD if there is a service with its `running` slot set to the
> pid.

Well, it does call ‘waitpid’ every time it gets a SIGCHLD, but it’s true
that it doesn’t do anything beyond that if it doesn’t know what service
a PID corresponds to.

Toggle quote (18 lines)
> When emacs forks, the original process may have its SIGCHLD handled,
> but that doesn't affect shepherd's service state (as it shouldn't,
> because it's using #:pid-file to track the forked process).
>
>> 2. shepherd currently can’t do much with real daemons. So what
>> we do
>> in GuixSD is to either start programs in non-daemon mode,
>> when
>> that’s an option, or pass #:pid-file to retrieve the forked
>> process
>> PID. I think you should do one of these as well.
>
> I am doing that. The problem is that when a service dies (crashes,
> quits, etc.) the `respawn?` option cannot be honoured because shepherd
> is not notified that the process has terminated (because it never
> receives a SIGCHLD for the forked pid). My patch polls for the
> processes we expect, to make up for the lack of notification.

I see.

Actually, thinking more about it, we should be using
PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly for
that.

So what about this plan:

1. Add FFI bindings in (shepherd system) for prctl(2). We should
arrange for it to throw to 'system-error when the ‘prctl’ symbol is
missing, as is the case on GNU/Hurd.

2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’. Here we must
‘catch-system-error’ around that call to cater to GNU/Hurd.

That would address the main issue without having to resort to polling.
Respawning will work only when #:pid-file is used though, but that’s
already an improvement.

Thoughts?

Thanks,
Ludo’.
C
C
Carlo Zancanaro wrote on 2 Mar 2018 11:13
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
87371ihjj2.fsf@zancanaro.id.au
Hey Ludo,

On Fri, Mar 02 2018, Ludovic Courtès wrote:
Toggle quote (13 lines)
>> I am doing that. The problem is that when a service dies
>> (crashes, quits, etc.) the `respawn?` option cannot be honoured
>> because shepherd is not notified that the process has
>> terminated (because it never receives a SIGCHLD for the forked
>> pid). My patch polls for the processes we expect, to make up
>> for the lack of notification.
>
> I see.
>
> Actually, thinking more about it, we should be using
> PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly
> for that.

Excellent! This is exactly the information that I needed. This is
what I've been looking for, but without enough knowledge to be
able to find it. Thanks!

Toggle quote (6 lines)
> So what about this plan:
>
> 1. Add FFI bindings in (shepherd system) for prctl(2). We
> should arrange for it to throw to 'system-error when the
> ‘prctl’ symbol is missing, as is the case on GNU/Hurd.

Are we okay with having this just not work on GNU/Hurd (or kernels
older than 3.4, according to the prctl manpage)? We could fall
back to a polling approach if prctl isn't available? I don't
really like the idea of this working on some kernels but not
others, given that process supervision is one of the main jobs of
shepherd.

Toggle quote (4 lines)
> 2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’. Here we
> must ‘catch-system-error’ around that call to cater to
> GNU/Hurd.

Why would we need to set it in exec-command? It looks like it
modifies the state of the calling process, which means we'd want
to set it in the shepherd service, not in each of the child
processes.

Toggle quote (6 lines)
> That would address the main issue without having to resort to
> polling. Respawning will work only when #:pid-file is used
> though, but that’s already an improvement.
>
> Thoughts?

I'll try to get this working in the next few days. Hopefully
you'll see a patch from me soon.

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqZI+IACgkQqdyPv9aw
Iby6/A/+OoaURMIb/e8YD50Iz/jnEC/KFy8hw629phIITlVHwGIglx25aWdcTgha
zuJj+mNBuT09QaX+HY8lyhOiSGQaLvphdXHPykr7CjNsDYES7dwyYTSIuteeYvI0
0Srg1KE44jumPmTdK0jTu4EdSLnfzVRKVyb3UzUGM6GarjRM0DDYYM8dpQ0Tde7P
F5YzJ6LvG+T638hDqpz8oideSl9+FWTOJ81TDHbXMHir76JuYhf4ir7JZUQovRTt
1pVa9S0d7L7HYwRun5NWI7v5EvRdAWSiUl6XKjzG25T/YfGEfmHW4yxdSnmGfIit
Q9K4vw7U2MYq68DBbPIIt63NUyA5sxpfY5I+sH+9pfqU2cl9ZhiS/y4CkippFASl
dYRkYDRnHyUySL3kWtRJNwibK/SdKi7GOKcfJftJuW5tON40OgKGKyUwiDh8Po1y
QS/p88uLz6FfsiiEvgVQ+rLF2yIRM489TPZq5vfi7yGyens6eOJ+d7qdYsTGomP8
kI+3+Ul+lFi5BMxlQfvkvaGSc0OqSvf/S9Ith333UzSpWsxIqaO3ZenrVurC7oT1
/d41BLDpy7IAf7jdT6SJRHLVQr6V7bOZ87FwPn217QW6tOXHRnCoLkqSJhURu6VS
opTUOp+U95nVBqkCOnHG+iNdW/Uy4pkb/CNoxHjwTHM6EU+Hz2Y=
=pyZA
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 2 Mar 2018 13:42
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87po4mhcn2.fsf@gnu.org
Hello!

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (2 lines)
> On Fri, Mar 02 2018, Ludovic Courtès wrote:

[...]

Toggle quote (12 lines)
>> So what about this plan:
>>
>> 1. Add FFI bindings in (shepherd system) for prctl(2). We should
>> arrange for it to throw to 'system-error when the ‘prctl’ symbol
>> is missing, as is the case on GNU/Hurd.
>
> Are we okay with having this just not work on GNU/Hurd (or kernels
> older than 3.4, according to the prctl manpage)? We could fall back to
> a polling approach if prctl isn't available? I don't really like the
> idea of this working on some kernels but not others, given that
> process supervision is one of the main jobs of shepherd.

Yeah, I agree.

The ‘prctl’ procedure itself should simply throw to 'system-error on
GNU/Hurd. But then, in (shepherd), we could add the polling thing when
(not (string-contains %host-type "linux")).

WDYT?

Toggle quote (3 lines)
>> 2. Use prctl/PR_SET_CHILD_SUBREAPER in ‘exec-command’. Here we
>> must ‘catch-system-error’ around that call to cater to GNU/Hurd.

Actually this should be done in ‘fork+exec-command’ in the child
process.

Toggle quote (4 lines)
> Why would we need to set it in exec-command? It looks like it modifies
> the state of the calling process, which means we'd want to set it in
> the shepherd service, not in each of the child processes.

We want to set the “reaper” of child processes to Shepherd itself, so we
must do that in child processes. The shepherd process cannot be its own
reaper I suppose.

BTW, we should do PR_SET_CHILD_SUBREAPER only when (not (= 1 (getpid))).

Toggle quote (3 lines)
> I'll try to get this working in the next few days. Hopefully you'll
> see a patch from me soon.

Awesome, thank you!

Ludo’.
C
C
Carlo Zancanaro wrote on 3 Mar 2018 08:58
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
87inadr3np.fsf@zancanaro.id.au
Hey Ludo,

I've re-written my patch, and it's attached in two commits. The
first one adds the necessary calls to prctl, and the second adds
the fallback to polling.

On Fri, Mar 02 2018, Ludovic Courtès wrote:
Toggle quote (7 lines)
> The ‘prctl’ procedure itself should simply throw to
> 'system-error on GNU/Hurd. But then, in (shepherd), we could add
> the polling thing when (not (string-contains %host-type
> "linux")).
>
> WDYT?

I don't like the idea of doing this based on the host type. In my
patch I've done it based on whether the prctl call succeeded. If
the prctl call throws a system-error then we poll, otherwise we
rely on SIGCHLD. I don't have a system set up with another kernel,
though, so I don't know how I can easily test whether the fallback
logic is working properly. When I replaced the prctl call with
(throw 'system-error) it seemed to work.

The fallback code still fails in the guix build environment (as my
previous patch did), but when it's using prctl it works properly.
This means that a build on Linux pre-3.4, or on Hurd, will fail,
which probably isn't acceptable given that shepherd is a hard
dependency for starting a GuixSD system. As far as I can tell the
test fails because the processes stick around as zombies, so I
assume that pid 1 in the build container isn't properly reaping
zombie processes. Do you have any ideas how I can force it to do
so?

Toggle quote (4 lines)
> We want to set the “reaper” of child processes to Shepherd
> itself, so we must do that in child processes. The shepherd
> process cannot be its own reaper I suppose.

Reading the manpage, and then running some code, I think you're
wrong about this. Using prctl with PR_SET_CHILD_SUBREAPER marks
the calling process as a child subreaper. That means that any
processes that are orphaned below the current process get
reparented under the current process (or a closer child subreaper,
if there's one further down). If there are no processes marked as
child subreapers, then the orphaned process is reparented under
pid 1. We should only need to call prctl in two places: when
shepherd initially starts, or when we fork for daemonize.

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqaVbsACgkQqdyPv9aw
IbwCGg//RSh6Wfle+Rmt4NpdKbr/qstpFYk0KWawZbWGxrWAjyFVYx3b4V3nNGt5
Lol8VwTUeNyeV88iE/7wL5HUEF+P/tzExMUzr6TWCJVGNHYUrjAYMG6FV0nXX6dz
2tHUeCUHBlicYIgqN6XyK3vkccv/hTRANq3W4ZEI5Jrwf43FV9CHZ8aXsG4tRoq9
nQGS84OS//vGOBE7kvS4o/+IAvVAn10Uuz41EYhSFpsiJGtgSO+IA8EIHWmL8lZn
/6jPPZ7P6WQSOeDf7P5/3mcvrXP7S9azIed6q8/EYz+75GkV/EeQbR6uCYuqP+WN
nSg6Et/FhACgxIRmvLqkVxeC+ijHRP9SvtYeg5Fx1H6yz9HV6bHMZa35FrN7HrGC
WjmL6l+0+xRBMC0DuOiOk3W0qdRDR8c88tWD7OCopfsv8Lcx3Ft0YNTOYoV6Lver
u33m7KCfEqOo8l5XQv9QXBzJt6Kh7cDEPi+rW02x6LLXTHg0vdr72qhpZMaXrTaD
J6WSYpBTFW408SqIYvJ23dmxNMB47cDs5zeowQZH/scNE7VHvo88GA7LGz8Yg0qz
/p7bpm4V3kWC++df5yahwoNnfEB9Gmp555t1Ag5JHY0PI3KJbuu0FG6M0xbZI/Fo
LyBoLtXvTO7VBOhvs4WkluzYiQxdR2BfcPPjHODIZnSV7qNbCBs=
=pSfE
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 3 Mar 2018 16:21
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87h8px9od8.fsf@gnu.org
Hi Carlo,

Overall LGTM! It’s a long reply though, but that’s because there are
lots of details to pay attention to in this Unix quagmire. :-)

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (4 lines)
> I've re-written my patch, and it's attached in two commits. The first
> one adds the necessary calls to prctl, and the second adds the
> fallback to polling.

If possible I would prefer a commit that only adds prctl, and the next
one would actually use it. I find it clearer and more convenient if we
need to bisect or revert.

Toggle quote (10 lines)
> On Fri, Mar 02 2018, Ludovic Courtès wrote:
>> The ‘prctl’ procedure itself should simply throw to 'system-error on
>> GNU/Hurd. But then, in (shepherd), we could add the polling thing
>> when (not (string-contains %host-type "linux")).
>>
>> WDYT?
>
> I don't like the idea of doing this based on the host type. In my
> patch I've done it based on whether the prctl call succeeded.

Right, I actually agree with feature-based checks. :-)

More on that inline below.

Toggle quote (7 lines)
> The fallback code still fails in the guix build environment (as my
> previous patch did), but when it's using prctl it works properly. This
> means that a build on Linux pre-3.4, or on Hurd, will fail, which
> probably isn't acceptable given that shepherd is a hard dependency for
> starting a GuixSD system. As far as I can tell the test fails because
> the processes stick around as zombies,

If they’re zombies, that means nobody called waitpid(2). Presumably the
polling code would need to do that.

So I suppose ‘check-for-dead-services’ should do something like:

(when (integer? running)
(catch 'system-error
(lambda ()
(match (waitpid* running WNOHANG)
(#f #f) ;uh, not a PID?
((0 . _) #f) ;ditto?
((pid . _)
(local-output "PID ~a (~a) is dead" running (canonical-name service))
(respawn-service service))))
(lambda args
(or (= ECHILD (system-error-errno args)) ;wrong PID?
(= EPERM (system-error-errno args)) ;not a child
(apply throw args)))))

Does that make sense? Please check waitpid(2) carefully though, because
it’s pretty gnarly and I might have forgotten or misinterpreted
something here.

Toggle quote (14 lines)
>> We want to set the “reaper” of child processes to Shepherd itself,
>> so we must do that in child processes. The shepherd process cannot
>> be its own reaper I suppose.
>
> Reading the manpage, and then running some code, I think you're wrong
> about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling
> process as a child subreaper. That means that any processes that are
> orphaned below the current process get reparented under the current
> process (or a closer child subreaper, if there's one further down). If
> there are no processes marked as child subreapers, then the orphaned
> process is reparented under pid 1. We should only need to call prctl
> in two places: when shepherd initially starts, or when we fork for
> daemonize.

Oh you’re right, sorry for the confusion!

Toggle quote (15 lines)
> From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sat, 3 Mar 2018 17:26:05 +1100
> Subject: [PATCH 1/2] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
> process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
> subreaper attribute of newly forked shepherd process.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
> and export it.
> (prctl): Add new procedure and export it.

[...]

Toggle quote (9 lines)
> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -50,6 +50,8 @@
> ;; Main program.
> (define (main . args)
> (initialize-cli)
> + (when (not (= 1 (getpid)))
> + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))

I think it’s a good idea to add a comment, like:

;; Register ourselves to get SIGCHLD when child processes terminate.
;; This is necessary for daemons for which we’d otherwise never get
;; SIGCHLD.

Toggle quote (3 lines)
> +(define prctl
> + (let ((proc (syscall->procedure long "prctl" (list int int))))

On GNU/Hurd libc doesn’t have a “prctl” symbol. So you need something
like:

(if (dynamic-func "prctl" (dynamic-link))
(let ((proc …)) …)
(lambda (process operation)
;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd.
(throw 'system-error "prctl" "~A" (list (strerror ENOSYS))
(list ENOSYS))))

Toggle quote (2 lines)
> +function cleanup() {

You need either () or “function” but not both (shells other than Bash
might complain…).

Toggle quote (2 lines)
> +shepherd_pid="$(cat $pid)"

Likewise, we should use `foo` instead of $(foo) to suppose non-Bash
shells (there are several occurrences of $(foo) here.)

Toggle quote (17 lines)
> From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
> non-blocking.
> (main): Use select with a timeout. If prctl failed when shepherd started
> then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
> signal handler.
> (respawn-service): Separate logic for respawning services from handling
> SIGCHLD.
> (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
> symbols.

[...]

Toggle quote (10 lines)
> + (define poll-services
> + (if (= 1 (getpid))
> + (lambda () #f)
> + (catch 'system-error
> + (lambda ()
> + (prctl PR_SET_CHILD_SUBREAPER 1)
> + (lambda () #f))
> + (lambda (key . args)
> + check-for-dead-services))))

Please add a comment in the handler saying that we resort to polling on
OSes that do not support ‘prctl’.

However, perhaps we should do:

(lambda args
(let ((errno (system-error-errno args)))
(if (= ENOSYS errno)
check-for-dead-services
(apply throw args))))

so that important/unexpected errors are not silently ignored.

Toggle quote (3 lines)
> +(define (respawn-service serv)
> + (when serv

Please add a docstring and move ‘when’ to the caller.

Toggle quote (2 lines)
> +(define (check-for-dead-services)

Docstring as well :-), and also a comment explaining that this is a last
resort for prctl-less OSes.

Toggle quote (6 lines)
> (register-services
> (make <service>
> #:provides '(test-loaded)
> - #:start (const 42)
> + #:start (const 'abc)

Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined
above we can even keep 42 here?

If not, we should add in shepherd.texi, under “Slots of services”, a few
words saying that when ‘running’ is an integer it is assumed to be a
PID.

Could you send an updated patch?

Thanks for working on this!

Ludo’.
C
C
Carlo Zancanaro wrote on 3 Mar 2018 21:49
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
87woys9961.fsf@zancanaro.id.au
Hey Ludo,

On Sat, Mar 03 2018, Ludovic Courtès wrote:
Toggle quote (13 lines)
> If they’re zombies, that means nobody called waitpid(2).
> Presumably the
> polling code would need to do that.
>
> So I suppose ‘check-for-dead-services’ should do something like:
>
> [ ... ]
>
> Does that make sense? Please check waitpid(2) carefully though,
> because
> it’s pretty gnarly and I might have forgotten or misinterpreted
> something here.

Unfortunately we can't do that. We fall back to the polling
approach to handle the fact that the processes that we care about
aren't our children (hence we don't get SIGCHLD). The waitpid
system call only waits for processes which are children of the
calling process.

I looked into the zombie problem a bit more, and I found what the
problem actually is. In the build environment a guile process is
running as pid 1 (the *-guile-builder script for that job). This
guile process never handles SIGCHLD, and never calls wait/waitpid,
so any orphaned processes become zombies. I tried modifying
derivations.scm, but it wanted to rebuild a lot of things, so I
gave up. I think we need to add something like this to the
*-guile-builder script:

(sigaction SIGCHLD
(lambda ()
(let loop ()
(match (waitpid WAIT_ANY WNOHANG)
((0 . _) #f)
((pid . _) (loop))
(_ #f))))
SA_NOCLDSTOP)

I've attached the output of `ps axjf` inside the build container,
so you can see why I think that this is the problem. It's a bit of
a shame that this is different to `guix environment --container`,
where /bin/sh is pid 1, because it meant that it would build
successfully in my container, but would fail in the build
container (which is a confusing experience).
Attachment: process-tree.txt
Toggle quote (14 lines)
> Please add a comment in the handler saying that we resort to
> polling on
> OSes that do not support ‘prctl’.
>
> However, perhaps we should do:
>
> (lambda args
> (let ((errno (system-error-errno args)))
> (if (= ENOSYS errno)
> check-for-dead-services
> (apply throw args))))
>
> so that important/unexpected errors are not silently ignored.

I had quite liked the idea that it would just ignore any error and
do the fallback, because really all we care about is "prctl
failed" when deciding on our fallback logic. I've decided to just
handle ENOSYS (prctl not available) and EINVAL (which is returned
when PR_SET_CHILD_SUBREAPER not available), and throw for
everything else. I'd love to be able to test this on platforms
where prctl will actually fail, though, because I don't like the
idea of committing code that I haven't actually been able to run.

Toggle quote (6 lines)
> If not, we should add in shepherd.texi, under “Slots of
> services”, a few
> words saying that when ‘running’ is an integer it is assumed to
> be a
> PID.

I've done this, but while doing it I realised that this has always
been true. The SIGCHLD handler has always assumed that a number
indicates a running process, my modifications haven't changed the
assumption, they've just widened its scope.

Carlo
From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:01:30 +1100
Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
PR_SET_CHILD_SUBREAPER.

* configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
* modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
and export it.
(prctl): Add new procedure and export it.
---
configure.ac | 4 ++++
modules/shepherd/system.scm.in | 21 ++++++++++++++++++++-
2 files changed, 24 insertions(+), 1 deletion(-)

Toggle diff (57 lines)
diff --git a/configure.ac b/configure.ac
index bb5058d..fbe16f4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -72,7 +72,11 @@ esac
AC_SUBST([RB_AUTOBOOT])
AC_SUBST([RB_HALT_SYSTEM])
AC_SUBST([RB_POWER_OFF])
+AC_MSG_RESULT([done])
+AC_MSG_CHECKING([<sys/prctl.h> constants])
+AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#include <sys/prctl.h>])
+AC_SUBST([PR_SET_CHILD_SUBREAPER])
AC_MSG_RESULT([done])
dnl Manual pages.
diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in
index a54dca7..09d45bf 100644
--- a/modules/shepherd/system.scm.in
+++ b/modules/shepherd/system.scm.in
@@ -23,7 +23,9 @@
#:export (reboot
halt
power-off
- max-file-descriptors))
+ max-file-descriptors
+ prctl
+ PR_SET_CHILD_SUBREAPER))
;; The <sys/reboot.h> constants.
(define RB_AUTOBOOT @RB_AUTOBOOT@)
@@ -130,6 +132,23 @@ the returned procedure is called."
(list err))
result)))))
+(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@)
+
+(define prctl
+ (if (dynamic-func "prctl" (dynamic-link))
+ (let ((proc (syscall->procedure long "prctl" (list int int))))
+ (lambda (process operation)
+ "Perform an operation on the given process"
+ (let-values (((result err) (proc process operation)))
+ (if (= -1 result)
+ (throw 'system-error "prctl" "~A: ~S"
+ (list (strerror err) name)
+ (list err))
+ result))))
+ (lambda (process operation)
+ (throw 'system-error "prctl" "~A" (list strerror ENOSYS)
+ (list ENOSYS)))))
+
(define (max-file-descriptors)
"Return the maximum number of open file descriptors allowed."
(sysconf _SC_OPEN_MAX))
--
2.16.1
From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
From: Carlo Zancanaro <carlo@zancanaro.id.au>
Date: Sun, 4 Mar 2018 07:46:13 +1100
Subject: [PATCH 2/3] Handle forked process SIGCHLD signals

* Makefile.am (TESTS): Add tests/forking-service.sh.
* modules/shepherd.scm: Set the child subreaper attribute of main shepherd
process (as long as we're not pid 1).
* modules/shepherd/service.scm (root-service)[daemonize]: Set the child
subreaper attribute of newly forked shepherd process.
---
Makefile.am | 1 +
modules/shepherd.scm | 7 +++++++
modules/shepherd/service.scm | 4 +++-
3 files changed, 11 insertions(+), 1 deletion(-)

Toggle diff (47 lines)
diff --git a/Makefile.am b/Makefile.am
index eafa308..8dad006 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -190,6 +190,7 @@ TESTS = \
tests/no-home.sh \
tests/pid-file.sh \
tests/status-sexp.sh \
+ tests/forking-service.sh \
tests/signals.sh
TEST_EXTENSIONS = .sh
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index df5420f..faa1e47 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -51,6 +51,13 @@
(define (main . args)
(initialize-cli)
+ (when (not (= 1 (getpid)))
+ ;; Register for orphaned processes to be reparented onto us when their
+ ;; original parent dies. This lets us handle SIGCHLD from daemon processes
+ ;; that would otherwise have been reparented under pid 1. This is
+ ;; unnecessary when we are pid 1.
+ (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)))
+
(let ((config-file #f)
(socket-file default-socket-file)
(pid-file #f)
diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 2224932..b6394f2 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -1274,7 +1274,9 @@ we want to receive these signals."
(local-output "Running as PID 1, so not daemonizing."))
(else
(if (zero? (primitive-fork))
- #t
+ (begin
+ (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))
+ #t)
(primitive-exit 0))))))
(persistency
"Safe the current state of running and non-running services.
--
2.16.1
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqbClYACgkQqdyPv9aw
Ibx21Q//fnbHM9kvJxr+wOxD5Lgqta9xifp0H6XSIxXziEKcqW0Lcvc6fqlZ8Nue
hv+nRY/UZCcPOO9NYEaI2ut0jsebnEcuVskgB8H0bmzOntlLQVhhsYpYDusIH8ZI
KdQHQd1gr3lDbPm7ABuz5SJqgNYbgAr9qyEPrdbC8BOmkMNUy5XQXzk2c65Ftxlp
iQUJYdJ7i4YOgDTi9Mu24f1OZLqaenOpJ36y0S+j0rGS0Wt01c7dPBz+eIfaY1Us
VwrxhSPlnvVsv8wpqy66zkvdsFz1nnO5xIr7TvteNuii2bXE5T5bKpN2OqS+Qf5Y
txLrXmqz2AGykQsvs1xjsC7ygk0qjTfB682eXQ5bNCdgs88w0V+vTrEpEGAskeok
EehE0h/m/PX9cDxAANuiJmOI1wA1tAAHrLLA0QQL1ZKDCwSJpemqPVSpiJXK/1GZ
GBN9z6GLUR/Nt3M8iQCUiu9VnsenEq6QZ9LnklJ7YsbyElPbIGJppsc1kEVO9WQS
xJ8dlCeq1NOP83kxfTEwPCiz84JRnuXPZmuZmtjMnxbMhsmyM8mOxj5W+gnm7kdX
VifbAG4EBxRGv/u9kftrB0uN36r/ckVcXfry8hinwRmKi4p+bJhY9HgpRU06xK1A
GigSQ4Efnnm0L6vsrUyX1i9Hju9YQgjTLh/maZkBj3sMJe2uCd0=
=j13L
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 4 Mar 2018 23:11
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87371f4hkf.fsf@gnu.org
Hello,

Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (19 lines)
> On Sat, Mar 03 2018, Ludovic Courtès wrote:
>> If they’re zombies, that means nobody called waitpid(2). Presumably
>> the
>> polling code would need to do that.
>>
>> So I suppose ‘check-for-dead-services’ should do something like:
>>
>> [ ... ]
>>
>> Does that make sense? Please check waitpid(2) carefully though,
>> because
>> it’s pretty gnarly and I might have forgotten or misinterpreted
>> something here.
>
> Unfortunately we can't do that. We fall back to the polling approach
> to handle the fact that the processes that we care about aren't our
> children (hence we don't get SIGCHLD). The waitpid system call only
> waits for processes which are children of the calling process.

Oh right!

Toggle quote (17 lines)
> I looked into the zombie problem a bit more, and I found what the
> problem actually is. In the build environment a guile process is
> running as pid 1 (the *-guile-builder script for that job). This guile
> process never handles SIGCHLD, and never calls wait/waitpid, so any
> orphaned processes become zombies. I tried modifying derivations.scm,
> but it wanted to rebuild a lot of things, so I gave up. I think we
> need to add something like this to the *-guile-builder script:
>
> (sigaction SIGCHLD
> (lambda ()
> (let loop ()
> (match (waitpid WAIT_ANY WNOHANG)
> ((0 . _) #f)
> ((pid . _) (loop))
> (_ #f))))
> SA_NOCLDSTOP)

Good catch. We could add this in gnu-build-system.scm in core-updates,
though it’s no big deal anyway since these are throw-away environments.

Thoughts?

Toggle quote (11 lines)
> From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sun, 4 Mar 2018 07:01:30 +1100
> Subject: [PATCH 1/3] Add prctl syscall wrapper along with with
> PR_SET_CHILD_SUBREAPER.
>
> * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER.
> * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable
> and export it.
> (prctl): Add new procedure and export it.

Applied with a copyright line for you.

Toggle quote (11 lines)
> From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Sun, 4 Mar 2018 07:46:13 +1100
> Subject: [PATCH 2/3] Handle forked process SIGCHLD signals
>
> * Makefile.am (TESTS): Add tests/forking-service.sh.
> * modules/shepherd.scm: Set the child subreaper attribute of main shepherd
> process (as long as we're not pid 1).
> * modules/shepherd/service.scm (root-service)[daemonize]: Set the child
> subreaper attribute of newly forked shepherd process.

Applied. However tests/forking-service.sh was missing, so I took it
from the previous version of this patch and added it. I also changed
the “Bashishms” that were in that file, as discussed earlier. Let me
know if anything’s wrong!

Toggle quote (19 lines)
> From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
> non-blocking.
> (main): Use select with a timeout. If prctl failed when shepherd started
> then call check-for-dead-services between connections/timeouts.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
> signal handler.
> (respawn-service): Separate logic for respawning services from handling
> SIGCHLD.
> (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
> symbols.
> * doc/shepherd.texi (Slots of services): Add note about service running slot
> being a process id.

[...]

Toggle quote (21 lines)
> --- a/modules/shepherd.scm
> +++ b/modules/shepherd.scm
> @@ -42,6 +42,8 @@
> (with-fluids ((%default-port-encoding "UTF-8"))
> (let ((sock (socket PF_UNIX SOCK_STREAM 0))
> (address (make-socket-address AF_UNIX file-name)))
> + (fcntl sock F_SETFL (logior O_NONBLOCK
> + (fcntl sock F_GETFL)))
> (bind sock address)
> (listen sock 10)
> sock)))
> @@ -49,14 +51,28 @@
>
> ;; Main program.
> (define (main . args)
> - (initialize-cli)
> + (define poll-services
> + (if (= 1 (getpid))
> + (lambda () #f) ;; If we're pid 1 then we don't need to set
> + ;; PR_SET_CHILD_SUBREAPER

[...]

Toggle quote (7 lines)
> + (match (select (list sock) (list) (list) 0.5)
> + (((sock) _ _)
> + (read-from sock))
> + (_
> + #f))
> + (poll-services)

Here everyone ends up paying some overhead (the 0.5 second timeout),
which isn’t great.

How about something like:

(define poll-services
(and (not (= 1 (getpid)))
…))

(match (select (list sock) '() '() (if poll-services 0.5 0))
…)

?

Thanks,
Ludo’.
C
C
Carlo Zancanaro wrote on 4 Mar 2018 23:35
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
871sgzpiy9.fsf@zancanaro.id.au
On Sun, Mar 04 2018, Ludovic Courtès wrote:
Toggle quote (6 lines)
> Good catch. We could add this in gnu-build-system.scm in
> core-updates, though it’s no big deal anyway since these are
> throw-away environments.
>
> Thoughts?

The current forking-service.sh test fails in that environment, so
we won't be able to build shepherd on Hurd, or systems with Linux
pre 3.4. This is already the case without my third commit, though,
because the prctl fallback logic isn't in place yet.

I think we should add it in core-updates. It does affect the
behaviour of processes within the build environment, and can lead
to test failures if people rely on pid 1 to reap zombie processes
(which, from what I understand, they should be able to). This
could even be leading to test failures in other packages which we
have just disabled.

Toggle quote (20 lines)
>> + (match (select (list sock) (list) (list) 0.5)
>> + (((sock) _ _)
>> + (read-from sock))
>> + (_
>> + #f))
>> + (poll-services)
>
> Here everyone ends up paying some overhead (the 0.5 second
> timeout),
> which isn’t great.
>
> How about something like:
>
> (define poll-services
> (and (not (= 1 (getpid)))
> …))
>
> (match (select (list sock) '() '() (if poll-services 0.5 0))
> …)

The wait for 0.5 seconds is only an upper-bound for the timeout.
Changing it to a 0 would actually be worse, because it would spend
longer polling for running services. The `select` procedure waits
for `sock` to be ready to read from. When it's ready it returns
immediately, but if `sock` takes more than 0.5 seconds to be ready
then it will return anyway (and take the second branch in the
match, which does nothing).

This should incur no (or extremely minuscule) overhead in how long
it takes to respond to a socket, but provides an opportunity every
half a second (at most) for shepherd to poll the running services.

On reflection, we should also change the commit message for this
commit. I have attached a patch with a more accurate commit
message.

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqcdM4ACgkQqdyPv9aw
Ibwy9BAAqFvXHBS+P7tUIBC+Espr9Sk3pc3M4Aw4jTxSmtCeE9W6RJnBEu/gRpf8
q8f7mHaWKDaJ4yUvrfudO/OqVVgbuacuqHxaZg5/hVV9O7vw45wnVE/OkgdJNILT
hrMGiETzy8nOperMcGgo9Al5zxxO19vKUtXWCi8J11+y57xhlcgvGb+xlNX/rcNu
hpnR0leS7zmepkQ6BzNEe5Byn+YqqnUxk7MM4pE/hUlTzQiJW4RR/6lUvEkke1ya
KBZp8GRCkpwNSs6OdnU3i0XYhicVZWoLjBHS43vE8xLtdgJcEh8iQQgAqLF6nPvx
lYInJKjAyUYuN4eBcdcZ7tMY19brp1aud5X/AW9YQmJZSpHyVMNOP8FTwKfYJ//N
04h1JTqE9YP5Uu8asGeGcllSvxF4po2TCELAj6e2DZ2+IX+4WQ4LWGo+pF1Uifr4
ut5nREw5tvWA9Q6Bmexkvw5jvUpVjk/xYXtp4+m2mXWN9NBlnW6Z/f1sTWQucnbN
u7YCET2Jf8oDy7VF5w1iL8OXfQIunpy+1Qf4GVADV9zgVvvOa7Eugd5oG+WGeLYA
44MpMqFRjl9ByjF1O5BhqDa9m/JaDwIG3pJyiYq0Ynn/Py6yPzsHkXIDs5y/HLGU
AOtVMrpQX0ngETtsDeq8ZGuJpd44F6Xn2pV70ZTlnb5ciUZBj7k=
=zUD8
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 4 Mar 2018 23:49
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637@debbugs.gnu.org)
87o9k33199.fsf@gnu.org
Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (18 lines)
> On Sun, Mar 04 2018, Ludovic Courtès wrote:
>> Good catch. We could add this in gnu-build-system.scm in
>> core-updates, though it’s no big deal anyway since these are
>> throw-away environments.
>>
>> Thoughts?
>
> The current forking-service.sh test fails in that environment, so we
> won't be able to build shepherd on Hurd, or systems with Linux pre
> 3.4. This is already the case without my third commit, though, because
> the prctl fallback logic isn't in place yet.
>
> I think we should add it in core-updates. It does affect the behaviour
> of processes within the build environment, and can lead to test
> failures if people rely on pid 1 to reap zombie processes (which, from
> what I understand, they should be able to). This could even be leading
> to test failures in other packages which we have just disabled.

Yeah, makes sense.

Toggle quote (27 lines)
>>> + (match (select (list sock) (list) (list) 0.5)
>>> + (((sock) _ _)
>>> + (read-from sock))
>>> + (_
>>> + #f))
>>> + (poll-services)
>>
>> Here everyone ends up paying some overhead (the 0.5 second timeout),
>> which isn’t great.
>>
>> How about something like:
>>
>> (define poll-services
>> (and (not (= 1 (getpid)))
>> …))
>>
>> (match (select (list sock) '() '() (if poll-services 0.5 0))
>> …)
>
> The wait for 0.5 seconds is only an upper-bound for the
> timeout. Changing it to a 0 would actually be worse, because it would
> spend longer polling for running services. The `select` procedure
> waits for `sock` to be ready to read from. When it's ready it returns
> immediately, but if `sock` takes more than 0.5 seconds to be ready
> then it will return anyway (and take the second branch in the match,
> which does nothing).

Sorry, I didn’t mean 0 but rather #f (indefinite wait).

My point is: we shouldn’t wake up every 0.5 seconds for no reason. IOW,
we should wake up periodically only in the non-pid-1-no-prctl case.

Does that make sense?

Ludo’.
C
C
Carlo Zancanaro wrote on 5 Mar 2018 00:08
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30637@debbugs.gnu.org)
87y3j7o2wd.fsf@zancanaro.id.au
On Sun, Mar 04 2018, Ludovic Courtès wrote:
Toggle quote (9 lines)
> Sorry, I didn’t mean 0 but rather #f (indefinite wait).
>
> My point is: we shouldn’t wake up every 0.5 seconds for no
> reason. IOW,
> we should wake up periodically only in the non-pid-1-no-prctl
> case.
>
> Does that make sense?

Yep! That makes a lot more sense. Patch attached.

Carlo
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqcfFIACgkQqdyPv9aw
IbxffQ//SgthHJ/omxGO/YFzMtxFsg7tfrovt/8+coizpTKmnZKUTfmzh9WXxM1J
zZ0efczqW6/0guxFBWZjtHKDkP7LSmXWxNuUq144R7aiubDF+mCBL3KnvaYKqR8E
jnJoq1z3XUOKkaTsmo4SJqajkeANCPhX+1Qdb8A0GqDlrBJwElqm1cS1R91b8Joy
nHVMc0LmQp7HXM+5x63hHFm3GChgFFN9l49azdmnSDIDGOQXXSRHnLcBEUVqBTVI
CokCqXKYJlVs1NXoHXz2d6GBLKLtMw3BPTlbRCtMmQW8imGTeRbrJbNgMa3ao9fJ
MJ2wAWLO6cLgoyx8TcZwheari985J/wQ0+s6oZCQPquEiU+2ealJLN14jMRz3Z+2
n4RWBLhW3+viHrZn84UL69+Q+KdK6g8KWICR2qcWUOiVQ7HyINf9V35mXjRj1X6D
ARCQFDDsa8mXJQyP7df0yA8rgmBMPYYdvwozNeeo6fIzgRJp0/3qYv6AuA8eTFPL
veikvZiI6/DTZNvSE5Z5VDKxBRnZjd3wAPweL0l4Mb5U3KsR/qk6TfQyImF1b0KN
Le6QHQ0frj6BVBSRpT31KFeIFpUlO+OhbJUj+oDnLUrAO+zLvEEh+BeaOjchAW/Y
FtgOcgPO5PJF1DJvqW8ZBtGYknNFlaveHZJoG9Yg/gSUvD9Q8Ic=
=gmH2
-----END PGP SIGNATURE-----

L
L
Ludovic Courtès wrote on 5 Mar 2018 15:15
(name . Carlo Zancanaro)(address . carlo@zancanaro.id.au)(address . 30637-done@debbugs.gnu.org)
87d10ia9sh.fsf@gnu.org
Carlo Zancanaro <carlo@zancanaro.id.au> skribis:

Toggle quote (19 lines)
> From be442ea64e4fd8e235378a5f04d38296c0af9cf3 Mon Sep 17 00:00:00 2001
> From: Carlo Zancanaro <carlo@zancanaro.id.au>
> Date: Wed, 21 Feb 2018 22:57:59 +1100
> Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fails.
>
> * modules/shepherd.scm (open-server-socket): Set socket to be
> non-blocking.
> (main): If we are unable to use prctl/PR_SET_CHILD_SUBREAPER, then poll for
> service processes between client connections, or every 0.5 seconds.
> * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as
> signal handler.
> (respawn-service): Separate logic for respawning services from handling
> SIGCHLD.
> (handle-SIGCHLD, check-for-dead-services): New exported procedures.
> * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with
> symbols.
> * doc/shepherd.texi (Slots of services): Add note about service running slot
> being a process id.

Awesome. Applied with minor cosmetic changes (see below).

Thank you!

Ludo’.
Toggle diff (33 lines)
diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index 9d94881..2b4a7b5 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -52,7 +52,8 @@
;; Main program.
(define (main . args)
(define poll-services?
- (and (not (= 1 (getpid))) ;; if we're pid 1 we don't need to do anything
+ ;; Do we need polling to find out whether services died?
+ (and (not (= 1 (getpid))) ;if we're pid 1, we don't
(catch 'system-error
(lambda ()
;; Register for orphaned processes to be reparented onto us when
@@ -60,14 +61,13 @@
;; daemon processes that would otherwise have been reparented
;; under pid 1. Obviously this is unnecessary when we are pid 1.
(prctl PR_SET_CHILD_SUBREAPER 1)
- #f) ;; don't poll
+ #f) ;don't poll
(lambda args
;; We fall back to polling for services on systems that don't
- ;; support prctl/PR_SET_CHILD_SUBREAPER
+ ;; support prctl/PR_SET_CHILD_SUBREAPER.
(let ((errno (system-error-errno args)))
- (if (or (= ENOSYS errno) ;; prctl not available
- (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available
- #t ;; poll
+ (or (= ENOSYS errno) ;prctl unavailable
+ (= EINVAL errno) ;PR_SET_CHILD_SUBREAPER unavailable
(apply throw args)))))))
(initialize-cli)
Closed
?