Shepherd deadlocks

OpenSubmitted by Mathieu Othacehe.
Details
3 participants
  • Ludovic Courtès
  • Michael Rohleder
  • Mathieu Othacehe
Owner
unassigned
Severity
important
M
M
Mathieu Othacehe wrote on 19 Jun 10:41 +0200
(address . bug-guix@gnu.org)
87h7v75txx.fsf@gnu.org
Hello,
When running "gui-installed-desktop-os-encrypted" test, Shepherd seemsto deadlock when restarting "guix-daemon". This can happen at differentstages:
* In "umount-cow-store" procedure, just before finishing the install.
* During "set-http-proxy" tests inside the marionette. This is not always reproducible. In order to gather some information, Icreated a Shepherd "strace" service that logs what's happening inShepherd itself (patch attached).
It seems that, just after blocking signals, in "fork+exec-command", Iguess, Shepherd is taking a lock:
Toggle snippet (23 lines)183553:1 chdir("/") = 0183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>183566:1 <... write resumed>) = 86183575:1 getpgid(169) = 169183576:1 getpgid(0) = 0183577:1 kill(-169, SIGTERM <unfinished ...>183579:1 <... kill resumed>) = 0183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>183585:1 <... write resumed>) = 51183594:1 getuid() = 0183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0183596:1 write(6, "\1", 1 <unfinished ...>183598:1 <... write resumed>) = 1183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
and is then blocking forever.
I cannot find any trace of locking in Shepherd, except for"without-automatic-finalization" procedure.
Thanks,
Mathieu
From 13ac475ad953509cbbf88547e4f72efc08d31055 Mon Sep 17 00:00:00 2001From: Mathieu Othacehe <othacehe@gnu.org>Date: Thu, 18 Jun 2020 14:13:15 +0200Subject: [PATCH] strace service
--- gnu/services/base.scm | 3 +++ gnu/services/linux.scm | 25 ++++++++++++++++++++++++- gnu/system/install.scm | 3 +++ 3 files changed, 30 insertions(+), 1 deletion(-)
Toggle diff (84 lines)diff --git a/gnu/services/base.scm b/gnu/services/base.scmindex 6ea7ef8e7e..d80c6cca6d 100644--- a/gnu/services/base.scm+++ b/gnu/services/base.scm@@ -34,6 +34,7 @@ #:use-module (guix deprecation) #:use-module (gnu services) #:use-module (gnu services admin)+ #:use-module (gnu services linux) #:use-module (gnu services shepherd) #:use-module (gnu system pam) #:use-module (gnu system shadow) ; 'user-account', etc.@@ -2386,6 +2387,8 @@ to handle." (service guix-service-type) (service nscd-service-type) + (service strace-service-type 1)+ (service rottlog-service-type) ;; The LVM2 rules are needed as soon as LVM2 or the device-mapper isdiff --git a/gnu/services/linux.scm b/gnu/services/linux.scmindex 12934c2084..96889dfb7f 100644--- a/gnu/services/linux.scm+++ b/gnu/services/linux.scm@@ -42,7 +42,9 @@ earlyoom-configuration-send-notification-command earlyoom-service-type - kernel-module-loader-service-type))+ kernel-module-loader-service-type++ strace-service-type)) ;;;@@ -177,3 +179,24 @@ representation." (compose concatenate) (extend append) (default-value '())))++ +;;;+;;; Strace service.+;;;++(define strace-service-type+ (shepherd-service-type+ 'strace+ (lambda (pid)+ (shepherd-service+ (requirement '(root-file-system user-processes))+ (provision (list (string->symbol+ (string-append "strace-" (number->string pid)))))+ (documentation "Strace a running process.")+ (start #~(make-forkexec-constructor+ (list #$(file-append strace "/bin/strace")+ "-f" "-p" (number->string #$pid)+ "-o" (string-append "/var/log/strace-"+ (number->string #$pid) ".log"))))+ (stop #~(make-kill-destructor))))))diff --git a/gnu/system/install.scm b/gnu/system/install.scmindex d0ff2e7c52..f2f1e9414c 100644--- a/gnu/system/install.scm+++ b/gnu/system/install.scm@@ -34,6 +34,7 @@ #:use-module (gnu installer) #:use-module (gnu system locale) #:use-module (gnu services dbus)+ #:use-module (gnu services linux) #:use-module (gnu services networking) #:use-module (gnu services shepherd) #:use-module (gnu services ssh)@@ -378,6 +379,8 @@ Access documentation at any time by pressing Alt-F2.\x1b[0m ;; regulations-compliant WiFi access. (udev-service #:rules (list lvm2 crda)) + (service strace-service-type 1)+ ;; Add the 'cow-store' service, which users have to start manually ;; since it takes the installation directory as an argument. (cow-store-service)-- 2.26.2
Attachment: strace-1.log.xz
M
M
Mathieu Othacehe wrote on 19 Jun 14:10 +0200
(address . 41948@debbugs.gnu.org)(address . ludo@gnu.org)
87d05v5k8n.fsf@gnu.org
Toggle quote (2 lines)> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
I think this is caused by a "pthread_join", most probably the one in"stop_finalization_thread" that is called right before forking a newprocess. The fact that we hang here probably means that the finalizerthread itself is hanging, not sure why.
It looks like what was reported by Ludo here:https://issues.guix.info/31925.
Thanks,
Mathieu
M
M
Michael Rohleder wrote on 20 Jun 02:16 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87y2oilhgm.fsf@rohleder.de
Hi Mathieu,
Mathieu Othacehe <othacehe@gnu.org> writes:
Toggle quote (20 lines)> This is not always reproducible. In order to gather some information, I> created a Shepherd "strace" service that logs what's happening in> Shepherd itself (patch attached).
> +(define strace-service-type> + (shepherd-service-type> + 'strace> + (lambda (pid)> + (shepherd-service> + (requirement '(root-file-system user-processes))> + (provision (list (string->symbol> + (string-append "strace-" (number->string pid)))))> + (documentation "Strace a running process.")> + (start #~(make-forkexec-constructor> + (list #$(file-append strace "/bin/strace")> + "-f" "-p" (number->string #$pid)> + "-o" (string-append "/var/log/strace-"> + (number->string #$pid) ".log"))))> + (stop #~(make-kill-destructor))))))
Thank you very much for this nice strace service!
I'am debugging a "shepherd hangs" problem in the context of lvm andlvmcache...This service helps me a lot, ty!

--
Toggle quote (1 lines)>Ever heard of .cshrc?
That's a city in Bosnia. Right?(Discussion in comp.os.linux.misc on the intuitiveness of commands.)
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCAAdFiEEdV4t5dDVhcUueCgwfHr/vv7yyyUFAl7tVVkACgkQfHr/vv7yyyWfjQf+JiRafW3ojpVRpu3tojSjO969qSreYbHyL9aqCG9RWmlDAyVTHnIqsk8H+D1uZ11aGBLoYpbW+24JLrCyH0MPa019BXKzweuz15do/vkkqcHyCzsJT8sfGv5e1KTcaJyaEctDQlD3Rbh4rZr58xiKOxfGG9x6AFCg44ugbbEMDXKv08bQO+bxi3MxqOx0RlB40uj4jMvdLtjkEWEJYXyu0BXxCBr8Nw2DSeAL4c4ptTZdXZe+lS0IevhpHTzlrNd3euAGGqy+w6oi9dxJZealzl+F1OMyPlJRVUnnN+mSeyWhMhHsCnEWD71QxxoNH2CzC7dgvTMD63PplpKv46V0fQ===he4k-----END PGP SIGNATURE-----
L
L
Ludovic Courtès wrote on 20 Jun 12:31 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 41948@debbugs.gnu.org)
87a70yc9kj.fsf@gnu.org
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (12 lines)> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems> to deadlock when restarting "guix-daemon". This can happen at different> stages:>> * In "umount-cow-store" procedure, just before finishing the install.>> * During "set-http-proxy" tests inside the marionette.> > This is not always reproducible. In order to gather some information, I> created a Shepherd "strace" service that logs what's happening in> Shepherd itself (patch attached).
We should be able to reproduce it with much simpler tests then, right?Like maybe “while : ; do herd restart guix-daemon ; done” or similar?
Toggle quote (27 lines)> It seems that, just after blocking signals, in "fork+exec-command", I> guess, Shepherd is taking a lock:>> 183553:1 chdir("/") = 0> 183554:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0> 183555:1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0> 183560:1 madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0> 183561:1 clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0> 183562:1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271> 183563:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0> 183564:1 write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>> 183566:1 <... write resumed>) = 86> 183575:1 getpgid(169) = 169> 183576:1 getpgid(0) = 0> 183577:1 kill(-169, SIGTERM <unfinished ...>> 183579:1 <... kill resumed>) = 0> 183582:1 stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0> 183583:1 write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>> 183585:1 <... write resumed>) = 51> 183594:1 getuid() = 0> 183595:1 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0> 183596:1 write(6, "\1", 1 <unfinished ...>> 183598:1 <... write resumed>) = 1> 183605:1 futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>>> and is then blocking forever.
When that happens, we should check how many threads exist in PID 1.There should be the finalization thread and the main thread, plus thesignal thread (because there are still ‘sigaction’ calls in the ‘main’procedure), plus the GC marker threads.
In https://issues.guix.gnu.org/31925#6, Andy suggests that the signalthread is not properly handled; indeed it takes locks and we don’t tryto shut it down upon fork. However, when using signalfd, the signalthread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so Idon’t see how it could cause problems.
The GC threads are presumably taken care of by the atfork handler inlibgc.
Thoughts?
Ludo’.
L
L
Ludovic Courtès wrote on 23 Jun 11:28 +0200
control message for bug #41948
(address . control@debbugs.gnu.org)
871rm63zd7.fsf@gnu.org
severity 41948 importantquit
M
M
Mathieu Othacehe wrote on 16 Aug 11:56 +0200
Re: bug#41948: Shepherd deadlocks
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 41948@debbugs.gnu.org)
87k0xyhq22.fsf@gnu.org
Hey Ludo,
Toggle quote (3 lines)> We should be able to reproduce it with much simpler tests then, right?> Like maybe “while : ; do herd restart guix-daemon ; done” or similar?
Well I tried that without success. Then I had a closer look to thestrace log.
Turns out there are two concurrent "finalizer" threads:
Toggle snippet (3 lines)1 clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
and this one,
Toggle snippet (3 lines)217 <... clone resumed>, parent_tid=[253], tls=0x7f1799309700, child_tidptr=0x7f17993099d0) = 253
The first one is spawned from Shepherd directly. The other one isspawned from the forked process in "marionette-shepherd-service".
Those two finalizer threads share the same pipe. When we try tostop the finalizer thread in Shepherd, right before forking a newprocess, we send a '\1' byte to the finalizer pipe.
Toggle snippet (3 lines)1 write(6, "\1", 1 <unfinished ...>
which is received by (line 183597):
Toggle snippet (3 lines)253 <... read resumed>"\1", 1) = 1
the marionette finalizer thread. Then, we pthread_join the Shepherdfinalizer thread, which never stops! Quite unfortunate.
Here's a small reproducer attached. So unless I'm wrong this is a Guileissue, that will cause any program that uses at least two primitive-forkcalls to possibly hang.
I'm quite convinced that those two bugs are directly related:
* https://issues.guix.info/31925* https://issues.guix.gnu.org/42353
Now regarding the fix of this issue, I guess that a process forked with"primitive-fork" in Guile should close it's parent finalizer pipe andopen a new one.
WDYT?
Thanks,
Mathieu
Attachment: t.scm
?