Multiple client 'build-paths' RPCs can lead to daemon deadlock

OpenSubmitted by Ludovic Courtès.
Details
One participant
  • Ludovic Courtès
Owner
unassigned
Severity
important
L
L
Ludovic Courtès wrote on 11 Jun 2018 16:06
(address . bug-guix@gnu.org)
87602ph0yv.fsf@gnu.org
Hello,
I tried running this:
guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort) & \ guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -r)
… also in parallel with this (for good measure):
guix build --max-jobs=200 $(guix gc -R $(guix build -d inkscape --no-grafts) | sort -R)
Since we have 3 clients, that leads to 3 guix-daemon processes, andthose are stuck in a deadlock:
Toggle snippet (153 lines)$ ps aux | grep guix-daemon[…]root 20501 0.6 0.0 39292 12012 ? Ss 15:51 0:01 guix-daemon 20455root 20503 4.3 0.0 39420 12096 ? Ss 15:51 0:12 guix-daemon 20491root 22154 0.0 0.0 39028 11016 ? Ss 15:52 0:00 guix-daemon 22148$ sudo strace -p 22154 -tstrace: Process 22154 attached15:57:03 select(0, 0x7ffdca30b610, NULL, NULL, {tv_sec=1, tv_usec=314189}) = 0 (Timeout)15:57:04 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 815:57:04 fcntl(8, F_GETFD) = 015:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:57:04 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:57:04 close(8) = 015:57:04 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 8^C$ sudo strace -p 20503 -tstrace: Process 20503 attached15:58:23 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=1, tv_usec=595302}) = 0 (Timeout)15:58:25 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 1415:58:25 fcntl(14, F_GETFD) = 015:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:25 close(14) = 015:58:25 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 1415:58:25 fcntl(14, F_GETFD) = 015:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:25 close(14) = 015:58:25 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 1415:58:25 fcntl(14, F_GETFD) = 015:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:25 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:25 close(14) = 015:58:25 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}) = 0 (Timeout)15:58:30 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 1415:58:30 fcntl(14, F_GETFD) = 015:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:30 close(14) = 015:58:30 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lock", O_RDWR|O_CREAT, 0600) = 1415:58:30 fcntl(14, F_GETFD) = 015:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:30 close(14) = 015:58:30 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) = 1415:58:30 fcntl(14, F_GETFD) = 015:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) = 015:58:30 fcntl(14, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:58:30 close(14) = 015:58:30 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20503 detached$ sudo strace -p 20501 -tstrace: Process 20501 attached15:59:09 select(121, [20 30 120], NULL, NULL, {tv_sec=0, tv_usec=535316}) = 0 (Timeout)15:59:10 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock", O_RDWR|O_CREAT, 0600) = 815:59:10 fcntl(8, F_GETFD) = 015:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) = 015:59:10 fcntl(8, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Risurco dumtempe ne disponeblas)15:59:10 close(8) = 015:59:10 select(121, [20 30 120], NULL, NULL, {tv_sec=5, tv_usec=0}^Cstrace: Process 20501 detached$ sudo lsof | grep '/gnu/store/.*\.lock'guix-daem 20501 root 14uW REG 253,0 0 3150879 /gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-compiled.lockguix-daem 20501 root 26uW REG 253,0 0 3212618 /gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-doc.lockguix-daem 20501 root 116uW REG 253,0 0 3212684 /gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lockguix-daem 20503 root 8uW REG 253,0 0 3212520 /gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.lockguix-daem 20503 root 50uW REG 253,0 0 3212514 /gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstrap-0.lockguix-daem 20503 root 155uW REG 253,0 0 3212547 /gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0.lockguix-daem 20503 root 173uW REG 253,0 0 3212552 /gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc.lockguix-daem 20503 root 317uW REG 253,0 0 3212602 /gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.lockguix-daem 20503 root 407uW REG 253,0 0 3212655 /gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-debug.lockguix-daem 20503 root 410uW REG 253,0 0 3212659 /gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lockguix-daem 20503 root 440uW REG 253,0 0 3212686 /gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock
So for example:
• 20501 tries to acquire aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock, which is held by 20503;
• 20503 tries to acquire amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by 20501.
This comes from the fact that ‘LocalStore::buildPaths’ takes theuser-supplied derivation list as is, without sorting it, and thenacquires locks in that order in ‘Worker::run’.
A topological sort (or maybe an alphanumeric sort?) should allow us toguarantee that guix-daemon processes take locks in the same order, andthen don’t end up in a deadlock.
I discovered this bug while monitoring Cuirass on berlin: severalsessions submit batches of 200 derivations in ‘build-paths’ RPCs, andsometimes most of the corresponding guix-daemon processes would end upbeing stuck in a lock-acquiring loop.
Ludo’.
L
L
Ludovic Courtès wrote on 11 Jun 2018 17:42
control message for bug #31785
(address . control@debbugs.gnu.org)
87zi01cost.fsf@gnu.org
severity 31785 important
L
L
Ludovic Courtès wrote on 6 Sep 2019 11:04
Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock
(address . 31785@debbugs.gnu.org)
87lfv1eqxg.fsf@gnu.org
Here’s another example from berlin today:
Toggle snippet (38 lines)ludo@berlin$ sudo guix processesSessionPID: 38649ClientPID: 38611ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux nmoldyn r-msnid r-yamss netcdf-fortran python-netcdf4LockHeld: /gnu/store/v6hcm1gvv3gdfbnq6cd4kn1i6ip3y74s-netcdf-fortran-4.4.4.lockLockHeld: /gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lockChildProcess: 33181: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000
[...]
SessionPID: 44043ClientPID: 44023ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guile \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --max-silent-time=36000 --timeout=216000 -s armhf-linux netcdf-fortran r-msnid nmoldyn r-yamssLockHeld: /gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lockLockHeld: /gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lockChildProcess: 48997: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902458/bin/.guix-real offload x86_64-linux 36000 1 216000
ludo@berlin ~$ sudo strace -p 38649 -e openat,fcntlstrace: Process 38649 attachedopenat(AT_FDCWD, "/gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock", O_RDWR|O_CREAT, 0600) = 10fcntl(10, F_GETFD) = 0fcntl(10, F_SETFD, FD_CLOEXEC) = 0fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)openat(AT_FDCWD, "/gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock", O_RDWR|O_CREAT, 0600) = 10fcntl(10, F_GETFD) = 0fcntl(10, F_SETFD, FD_CLOEXEC) = 0fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)^Cstrace: Process 38649 detached
ludo@berlin ~$ sudo strace -p 44043 -e openat,fcntlstrace: Process 44043 attachedopenat(AT_FDCWD, "/gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock", O_RDWR|O_CREAT, 0600) = 10fcntl(10, F_GETFD) = 0fcntl(10, F_SETFD, FD_CLOEXEC) = 0fcntl(10, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = -1 EAGAIN (Resource temporarily unavailable)^Cstrace: Process 44043 detached
A very palpable deadlock.
Ludo’.
?