Cuirass runs out of build users

  • Done
  • quality assurance status badge
Details
5 participants
  • Andreas Enge
  • Danny Milosavljevic
  • Efraim Flashner
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Submitted by
Andreas Enge
Severity
normal
A
A
Andreas Enge wrote on 28 Feb 2018 10:07
(address . bug-guix@gnu.org)
20180228090714.GA1845@jurong
Hello,

in trying out the new bayfront machine, I raised the number of concurrent
build jobs to 10; the number of build accounts is programmatically set to
twice this number.

Looking at the build directories in /tmp, 9 build jobs were started on
February 26 at around 21:50, a 10th one at 22:24.

Later, I have been getting messages such as these:
2018-02-27T18:31:13 building 14491 derivations in batches of 200
2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

So it looks as if build jobs get stalled, but somehow do not free their
build users.

Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
10 is the default value, but it should have been overwritten.

And in /etc/passwd, I see 20 users. But apparently, they are not used!

Andreas
L
L
Ludovic Courtès wrote on 1 Mar 2018 15:01
(name . Andreas Enge)(address . andreas@enge.fr)(address . 30644@debbugs.gnu.org)
87bmg76gjp.fsf@gnu.org
Hello,

Andreas Enge <andreas@enge.fr> skribis:

Toggle quote (4 lines)
> in trying out the new bayfront machine, I raised the number of concurrent
> build jobs to 10; the number of build accounts is programmatically set to
> twice this number.

Indeed, (sysadmin services) in guix-maintenance.git does that.

Toggle quote (2 lines)
> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

[...]

Toggle quote (5 lines)
> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> 10 is the default value, but it should have been overwritten.
>
> And in /etc/passwd, I see 20 users. But apparently, they are not used!

There’s something fishy here. What does “getent group guixbuild”
return?

Thanks,
Ludo’.
A
A
Andreas Enge wrote on 1 Mar 2018 21:35
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30644@debbugs.gnu.org)
20180301203543.GA1549@jurong
On Thu, Mar 01, 2018 at 03:01:30PM +0100, Ludovic Courtès wrote:
Toggle quote (8 lines)
> > Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> > 10 is the default value, but it should have been overwritten.
> >
> > And in /etc/passwd, I see 20 users. But apparently, they are not used!
>
> There’s something fishy here. What does “getent group guixbuild”
> return?

All 20 build users:
# getent group guixbuild
guixbuild:x:30000:guixbuilder01,guixbuilder02,guixbuilder03,guixbuilder04,guixbuilder05,guixbuilder06,guixbuilder07,guixbuilder08,guixbuilder09,guixbuilder10,guixbuilder11,guixbuilder13,guixbuilder14,guixbuilder15,guixbuilder16,guixbuilder17,guixbuilder18,guixbuilder20

I will deploy the new cuirass with the other bug fixed, and report back if
the problem described here is still present.

Andreas
E
E
Efraim Flashner wrote on 2 Mar 2018 15:08
(name . Andreas Enge)(address . andreas@enge.fr)(address . 30644@debbugs.gnu.org)
20180302140833.GB3213@macbook41
On Wed, Feb 28, 2018 at 10:07:14AM +0100, Andreas Enge wrote:
Toggle quote (44 lines)
> Hello,
>
> in trying out the new bayfront machine, I raised the number of concurrent
> build jobs to 10; the number of build accounts is programmatically set to
> twice this number.
>
> Looking at the build directories in /tmp, 9 build jobs were started on
> February 26 at around 21:50, a 10th one at 22:24.
>
> Later, I have been getting messages such as these:
> 2018-02-27T18:31:13 building 14491 derivations in batches of 200
> 2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
> 2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
> 2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)
>
> So it looks as if build jobs get stalled, but somehow do not free their
> build users.
>
> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> 10 is the default value, but it should have been overwritten.
>
> And in /etc/passwd, I see 20 users. But apparently, they are not used!
>
> Andreas
>

When I use a shell script to build all the packages, if a package hangs
and gets eaten by the timout-limit then it holds the build user and
space in /tmp until all 10 builders are taken and building stops or all
the builds finish and then it releases them.

--
Efraim Flashner <efraim@flashner.co.il> ????? ?????
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEoov0DD5VE3JmLRT3Qarn3Mo9g1EFAlqZWuEACgkQQarn3Mo9
g1Gx3w//aDK7ncgEDjVA2yljOjGVNKmK9Fb4yn/jVxwArN4AsfIe9wOobGZOKy3Q
c2F4rhmCpy7uUKmz7qfkLqbkiurcQ2X/tuZ6f4Rfmu8BCGUyg2IqAqcFC08CpaVQ
VyalkG1NOqF96rUcGvnqL+O1kYs0b37JKZLFxx+VYrSiy4KO0dflVXVhegwDNpoy
nQNeD3EdxEgG+pBmxoqtY6PeLEw1zuzmcB8ieJCfDdFNz5DXSWF6iwAHrEWCuNup
DPktozvp5XQZd4BkZfWYBY7RQ6VT5xHdFANwCh5bYAgSFff+NJWZ1YeIeEXj3W2i
lGLyaaiSPjV+xfvWEMZKn0AmWhNOOQ032ABz6lWXGjDJbRzw3DpIofMpITMI2cm/
TmQVjE4ElsVTSPvKN76+CXk5RC/XFUTOp/Yln0wTUnzG+8TOU97CkcggoGfu2973
G0OZjP8dNS2haqyI8rpuuErKJPT5CnWv40oGPpYxC4jLnDSQz5Kjr0I6BftzNkhZ
+7gULzhvH211yKB1DZe1kP7WtMKPHkbNlLuS0wTokecrCx75vfLW9VUorEExkZwd
APayq08z1a8fb8YNQESztz58rBGLabYKNcYnOX3dvTXzZZtDW/l75tRePlgBABHz
MpQ6onf80o28uugtHFvdEhP3sDSvWt5fEjgUOgY87TaTMNmqf0k=
=vgq9
-----END PGP SIGNATURE-----


A
A
Andreas Enge wrote on 5 Mar 2018 10:02
(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)
20180305090256.GA9171@jurong
On Fri, Mar 02, 2018 at 04:08:33PM +0200, Efraim Flashner wrote:
Toggle quote (5 lines)
> When I use a shell script to build all the packages, if a package hangs
> and gets eaten by the timout-limit then it holds the build user and
> space in /tmp until all 10 builders are taken and building stops or all
> the builds finish and then it releases them.

Right now, I observe something else: Only one package is built at a time,
taking space in /tmp. But it is not the same build user; I saw one package
with guixbuilder06, and now it is building one package with guixbuilder01.

Is there a way to check which builders are considered taken? A lock file
somewhere?

Andreas
A
A
Andreas Enge wrote on 5 Mar 2018 16:53
(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)
20180305155351.GA6054@jurong
Right now, I do not seem to be running out of build users, but I think
that cuirass nevertheless behaves strangely; so I am asking here instead
of opening a new bug (maybe there is no bug, actually).

I reconfigured bayfront this morning with the latest git checkout.
Right now, there are only two packages built in /tmp, if at all:
$ ll /tmp/
total 8
drwx------ 4 guixbuilder02 guixbuild 4096 Mar 5 16:06 guix-build-guile-emacs-20150512.41120e0.drv-0/
drwx------ 5 guixbuilder01 guixbuild 4096 Mar 5 15:05 guix-build-r-mgcv-1.8-23.drv-0/
(It is now 16:54, but "top" shows emacs and r activity, so my guess
is these packages are being built.)
But in any case, we are far from the 10 jobs I configured.

I am attaching the result of
$ grep -v evaluate cuirass.log|grep -v ^\`/gnu/store|grep -v spec|grep -v "next evaluation"|grep -v heap > ~/purged.log
which contains the non-trivial lines of /var/log/cuirass.log.

There is still one occurrence of this:
In ice-9/boot-9.scm:
829:9 7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
In cuirass/base.scm:
616:26 6 (_)
548:2 5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
426:10 4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
In srfi/srfi-1.scm:
640:9 3 (for-each #<procedure update! (drv)> _)
In cuirass/database.scm:
69:18 2 (%sqlite-exec _ _ . _)
In ice-9/eval.scm:
293:34 1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
619:8 0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

There is this, which looks like a problem, but not a fatal one:
;;; Failed to autoload make-session in (gnutls):
;;; missing interface for module (gnutls)
;;; Failed to autoload connection-end/client in (gnutls):
;;; missing interface for module (gnutls)

And what strikes me most is that between two lines of the kind
building batch of 200 jobs
there are not 200 lines of successful or failed packages.

And my impression is that too many packages fail. What are these
"status 100" and "status 101"?

Do things look like this on berlin? (Of course, one difference is that
right now the machine builds everything itself without offloading.)

Andreas
Attachment: purged.log
A
A
Andreas Enge wrote on 5 Mar 2018 20:55
(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)
20180305195520.GA2349@jurong
Okay, this evening, I see 10 packages being built and lots of messages
"build started"/"build succeeded"; so maybe there is no bug at work here.

Andreas
L
L
Ludovic Courtès wrote on 7 Mar 2018 23:35
(name . Andreas Enge)(address . andreas@enge.fr)
87efkvpl8f.fsf@gnu.org
Andreas Enge <andreas@enge.fr> skribis:

Toggle quote (4 lines)
> Right now, I do not seem to be running out of build users, but I think
> that cuirass nevertheless behaves strangely; so I am asking here instead
> of opening a new bug (maybe there is no bug, actually).

Good. :-)

Toggle quote (16 lines)
> There is still one occurrence of this:
> In ice-9/boot-9.scm:
> 829:9 7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
> In cuirass/base.scm:
> 616:26 6 (_)
> 548:2 5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
> 426:10 4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
> In srfi/srfi-1.scm:
> 640:9 3 (for-each #<procedure update! (drv)> _)
> In cuirass/database.scm:
> 69:18 2 (%sqlite-exec _ _ . _)
> In ice-9/eval.scm:
> 293:34 1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
> 619:8 0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

Bah. :-/

Toggle quote (6 lines)
> There is this, which looks like a problem, but not a fatal one:
> ;;; Failed to autoload make-session in (gnutls):
> ;;; missing interface for module (gnutls)
> ;;; Failed to autoload connection-end/client in (gnutls):
> ;;; missing interface for module (gnutls)

That happens during the evaluation, probably when compiling (guix build
download). Nothing to worry about.

Toggle quote (35 lines)
> 2018-03-05T12:41:34 building batch of 200 jobs (0/14675)
> 2018-03-05T12:41:34 build failed: '/gnu/store/0cfs95nasmfgs5nhylm1fcrfwpwh81pc-gcompris-qt-0.70.drv'
> 2018-03-05T12:41:34 build failed: '/gnu/store/065zrkz0zmfwiwajrb9nanzwvdhmyrrd-python2-stem-1.6.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:36 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:37 build failed: '/gnu/store/9jjifxnxbv7z54qlcik3j8gcdj6ndd3k-python2-astroid-1.5.3.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/agrdbmvswi81q54riyin4355zcpy9yrr-python2-reno-2.7.0.drv'
> 2018-03-05T12:41:41 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:43 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'

The fact that all of these failed more or less instantaneously probably
means that these were cached failures. See ‘guix gc --list-failures’.

Why they failed in the first place, I don’t know. Are these x86_64/i686
builds? (You can see that in the .drv file.) If not, no wonder that
they fail.

Thanks,
Ludo’.
D
D
Danny Milosavljevic wrote on 8 Mar 2018 00:01
(name . Ludovic Courtès)(address . ludo@gnu.org)
20180308000111.77cce7d6@scratchpost.org
Hi Ludo,

Toggle quote (8 lines)
> > 69:18 2 (%sqlite-exec _ _ . _)
> > In ice-9/eval.scm:
> > 293:34 1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
> > 619:8 0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
> > ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.
>
> Bah. :-/

sqlite3.h contains quite nice documentation about this case:

/*
** CAPI3REF: Unlock Notification
** METHOD: sqlite3
**
** ^When running in shared-cache mode, a database operation may fail with
** an [SQLITE_LOCKED] error if the required locks on the shared-cache or
** individual tables within the shared-cache cannot be obtained. See
** [SQLite Shared-Cache Mode] for a description of shared-cache locking.
** ^This API may be used to register a callback that SQLite will invoke
** when the connection currently holding the required lock relinquishes it.
** ^This API is only available if the library was compiled with the
** [SQLITE_ENABLE_UNLOCK_NOTIFY] C-preprocessor symbol defined.
**
** See Also: [Using the SQLite Unlock Notification Feature].
**
** ^Shared-cache locks are released when a database connection concludes
** its current transaction, either by committing it or rolling it back.
**
** ^When a connection (known as the blocked connection) fails to obtain a
** shared-cache lock and SQLITE_LOCKED is returned to the caller, the
** identity of the database connection (the blocking connection) that
** has locked the required resource is stored internally. ^After an
** application receives an SQLITE_LOCKED error, it may call the
** sqlite3_unlock_notify() method with the blocked connection handle as
** the first argument to register for a callback that will be invoked
** when the blocking connections current transaction is concluded. ^The
** callback is invoked from within the [sqlite3_step] or [sqlite3_close]
** call that concludes the blocking connections transaction.
**
** ^(If sqlite3_unlock_notify() is called in a multi-threaded application,
** there is a chance that the blocking connection will have already
** concluded its transaction by the time sqlite3_unlock_notify() is invoked.
** If this happens, then the specified callback is invoked immediately,
** from within the call to sqlite3_unlock_notify().)^
**
** ^If the blocked connection is attempting to obtain a write-lock on a
** shared-cache table, and more than one other connection currently holds
** a read-lock on the same table, then SQLite arbitrarily selects one of
** the other connections to use as the blocking connection.
**
** ^(There may be at most one unlock-notify callback registered by a
** blocked connection. If sqlite3_unlock_notify() is called when the
** blocked connection already has a registered unlock-notify callback,
** then the new callback replaces the old.)^ ^If sqlite3_unlock_notify() is
** called with a NULL pointer as its second argument, then any existing
** unlock-notify callback is canceled. ^The blocked connections
** unlock-notify callback may also be canceled by closing the blocked
** connection using [sqlite3_close()].
**
** The unlock-notify callback is not reentrant. If an application invokes
** any sqlite3_xxx API functions from within an unlock-notify callback, a
** crash or deadlock may be the result.
**
** ^Unless deadlock is detected (see below), sqlite3_unlock_notify() always
** returns SQLITE_OK.
**
...
*/

...
** When a call to [sqlite3_step()] returns SQLITE_LOCKED, it is almost
** always appropriate to call sqlite3_unlock_notify(). There is however,
** one exception. When executing a "DROP TABLE" or "DROP INDEX" statement,
** SQLite checks if there are any currently executing SELECT statements
** that belong to the same connection. If there are, SQLITE_LOCKED is
** returned. In this case there is no "blocking connection", so invoking
** sqlite3_unlock_notify() results in the unlock-notify callback being
** invoked immediately. If the application then re-attempts the "DROP TABLE"
** or "DROP INDEX" query, an infinite loop might be the result.
**
** One way around this problem is to check the extended error code returned
** by an sqlite3_step() call. ^(If there is a blocking connection, then the
** extended error code is set to SQLITE_LOCKED_SHAREDCACHE. Otherwise, in
** the special "DROP TABLE/INDEX" case, the extended error code is just
** SQLITE_LOCKED.)^
...
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqgbzcACgkQ5xo1VCww
uqUqvwf+K8RcWUlgI69jHswsWcdxop5DZLO+fmrrke5oucbVwwUK/EHOnIOP1mCF
SiipcVmrQiZEaRqztlo4UVqxgCmzsmnj01pkceQjp3zB/v/inIRdTQtXyUOslC83
DmVfkbHJc0rfpK9kH16/XTqjOgwPZeS3EULuEMaUc0RwwX5lBbQWQAL2UXBoJs7x
5Fa4E8dsdK9nV6g5dgyM/0mMtuk1PFzc75HLErmCaYsuq9+dtB1BpKWHNRIpgWzQ
BFsWbGBxT7VuIq1/7kE+IGrpY9R2LKD9iE06RVr6MpUOGdw/VMd+8JDPnrzl//LW
xf9eN4pSohqeNKS8z4hy79OBeovaZQ==
=k6Xf
-----END PGP SIGNATURE-----


D
D
Danny Milosavljevic wrote on 8 Mar 2018 01:25
(name . Ludovic Courtès)(address . ludo@gnu.org)
20180308012541.517dc4dc@scratchpost.org
As for what's causing this, cuirass's bin/evaluate seems to connect to the sqlite database and then write stuff into it.

If a query is being done by the http server at the same time, you are going to be really careful with the sqlite open mode.

There would be ways to still be able to do this, but it's some seriously unnecessary locking complexity for something small like this.

Is that really necessary? Can't the invoker of bin/evaluate write the result?

It even seems that src/cuirass/base.scm already does the db-add-derivation.

Why is bin/evaluate doing it again?

Same for db-add-evaluation...
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqggwUACgkQ5xo1VCww
uqUbRgf7Bo/weqJyId4Csihojubu9WgWIK3LBFirE8ireF0dFUVetGZNOEZTz9Vx
4K8F+ZfI0u/bMf6Ij8qjB8/qHycjRtn1qugEuHp0LFNZ8GX8E8Yjaie+Sxm2pHa5
0SgF9NSH7A0ydNCP8IPWtOYoqqreoe70J+V62aAFMgoY81xqL8b9rZ7wQfVU4AsT
NXifRIuREO5Rhrvt1MIuiyiXvSSd6SyH734aVF5EAGa1jW+GJI6rmquYI64PiU1n
BI+aSUr1MtlmGZrSb2tOiLcoK0rZ9o5tDN7/4/rz3C2gh+vAPtFMFSrgnpfmiSGe
bZ3TbQpUw5h8/+lxOWgGVhO+R7v8sw==
=m4JA
-----END PGP SIGNATURE-----


D
D
Danny Milosavljevic wrote on 8 Mar 2018 01:29
(name . Ludovic Courtès)(address . ludo@gnu.org)
20180308012923.525342f6@scratchpost.org
Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore. My bad...

So please disregard.

Anyway, sqlite contains some cool stuff to find out who's locking it.

But as things are it seems that sqlite access is now done only by one thread. How can it then be locked? Weird...
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqgg+MACgkQ5xo1VCww
uqWkRwf/Xt2wnMO5jh6ePZIx30l7On4i+ZUhT5g79xA75xZpmlESb2Bi6ZgwaEIN
qBIqHZIXUPlQISdgRl09XiGYIAtqH/lK/5GhTLWJn+hOYxSmPxcyEQkWP9leR+YQ
hEO4ZV1dATMjt6NeNCqIR4/DxY5BRanm6hm6LCeqZvU80utVVEKgvqhfnhpMfEk9
D2ZtqrWarYN/B3dDaO7bIsKVpsYW5fwTEsvONUhKgCqB0KtuMvARbUjKvpKvAQuZ
l/4B8e30rH+0Ba4rHAXUiArWG7LsQZ31FiJBGKkutwGPBlhs5tpJaRVdWKOYF7ee
uQ9fRkvUHW5sCN4Y7oMv6IWu/KCBqA==
=anV4
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 8 Mar 2018 13:01
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87efkuojya.fsf@gnu.org
Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (2 lines)
> Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore. My bad...

Toggle quote (4 lines)
> Anyway, sqlite contains some cool stuff to find out who's locking it.
>
> But as things are it seems that sqlite access is now done only by one thread. How can it then be locked? Weird...

It’s accessed possibly from several threads because Fibers can be
thought of as an N×M threading library.

So what’s your take on this?

Ludo’.
D
D
Danny Milosavljevic wrote on 11 Mar 2018 07:06
(name . Ludovic Courtès)(address . ludo@gnu.org)
20180311070610.788df5b9@scratchpost.org
Hi Ludo,

On Thu, 08 Mar 2018 13:01:01 +0100
ludo@gnu.org (Ludovic Courtès) wrote:

Toggle quote (2 lines)
> So what’s your take on this?

sqlite has a sqlite-unlock-notify. We should wrap it and use it when we get the
"locked" error code.

We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html.

See also https://sqlite.org/unlock_notify.htmlfor a blocking version of _step (to the right).

Do we want cuirass to block until the lock is available?
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqkx1IACgkQ5xo1VCww
uqXb5Af/byq0gTeZ30WYi0G09HwFggbSpQSpZEozM0vHH1xXbcV2D3VnJvnDnm1t
iLaVe3M+HvRVstf/kGj2J9Pl0bqQ1KkLbBcWs5YkcKhCo9Gr0uDiJiO0SvpQfMNa
Y61bIFy06yUgkK/Ft1JrTKUryJFdPDWozYRmIl7gf6SKqmy+a3RWXi/roSurJabm
OmcJKKfSag6STI7L14jNJGzwQF+ddQu7konRdeLLBmNFyl8W3jDy2FnmDNXCnqp/
vm16PUDc5iqAyZ5cKJz++s23f/pf711yEfh8wXG/sa8L9yrCxWZ5FjCxCHk7AE2q
Ft+/5EP7ko/GKtC/VvcZBn4vfVQJtA==
=nUHF
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 11 Mar 2018 22:43
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87zi3ethij.fsf@gnu.org
Hi,

Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (12 lines)
> On Thu, 08 Mar 2018 13:01:01 +0100
> ludo@gnu.org (Ludovic Courtès) wrote:
>
>> So what’s your take on this?
>
> sqlite has a sqlite-unlock-notify. We should wrap it and use it when we get the
> "locked" error code.
>
> We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html .
>
> See also https://sqlite.org/unlock_notify.html for a blocking version of _step (to the right).

Interesting.

Toggle quote (2 lines)
> Do we want cuirass to block until the lock is available?

The POSIX thread itself should not block. Rather the fiber should yield
to other fibers.

So we could do the blocking-step within ‘non-blocking’ block as defined
in (cuirass utils), though ‘non-blocking’ as currently implemented is a
bit expensive.

Or we could simply poll: catch those exceptions and try again later. If
this is rare enough, as seems to be the case, that might be an
acceptable approach.

Thoughts?

Ludo’.
D
D
Danny Milosavljevic wrote on 24 Mar 2018 14:15
(name . Ludovic Courtès)(address . ludo@gnu.org)
20180324141547.109dc91a@scratchpost.org
Hi Ludo,

I've checked it some more. The shared cache uses unlock_notify, and the actual
database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.

The 5 in

ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.
^^

means SQLITE_BUSY, so that's the actual database.

So the easiest way to work around it would be to call

sqlite3_busy_timeout(db, 10); // ms

which would automatically repeatedly retry on busy until 10 ms have accumulated,
and only then give up.

There's also sqlite3_busy_handler which can be used to install a busy handler,
and of course in the future we could also handle SQLITE_BUSY ourselves and yield.

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlq2T4MACgkQ5xo1VCww
uqWRqQf+LFBG3nX4LDadOXErzJphQ51SV/z4UCEGWedLVhc58mlOnw7FDAcHyplU
L2knIMYOUYoA5c+3muFaLfn43fsrwgUbQEYFBwVEDytyp57nCzr0iYv318A9YCgc
rdIp5WruadE3nieHgBconum9aIw7/f6XNVKP6dULkv3gehdoo8VoVdLoW9+KUyig
Fog72vQzblzkw+l0qqc1I64hYFalk6gmQLDyQAYrgYKzCQfP5TtO7W+QH7tPDKJs
/po89CJG9LfUCQFxkZ6vH/upg4/RQaflzJxLrQK7YPGeAQshdJ9QNluvyvGXxSOe
H8KDJ/jCiszWyNCo01BtRlidJNN2vg==
=xJpK
-----END PGP SIGNATURE-----


L
L
Ludovic Courtès wrote on 25 Mar 2018 14:40
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87tvt4ib0r.fsf@gnu.org
Hi Danny,

Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (17 lines)
> I've checked it some more. The shared cache uses unlock_notify, and the actual
> database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.
>
> The 5 in
>
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.
> ^^
>
> means SQLITE_BUSY, so that's the actual database.
>
> So the easiest way to work around it would be to call
>
> sqlite3_busy_timeout(db, 10); // ms
>
> which would automatically repeatedly retry on busy until 10 ms have accumulated,
> and only then give up.

Indeed, that’s actually what nix/libstore does.

Toggle quote (5 lines)
> There's also sqlite3_busy_handler which can be used to install a busy handler,
> and of course in the future we could also handle SQLITE_BUSY ourselves and yield.
>
> See also https://notabug.org/civodul/guile-sqlite3/pulls/4 for the former.

Awesome.

I’ve merged these guile-sqlite3 bits and added a “PRAGMA busy_timeout”
in Cuirass (the pragma allows us to set the timeout even when using an
older guile-sqlite3.) I’ve updated the ‘cuirass’ package and will test
soonish on berlin.

Thank you!

Ludo’.
M
M
Mathieu Othacehe wrote on 25 Mar 2021 14:17
(address . 30644-done@debbugs.gnu.org)
87sg4j4bkj.fsf@gnu.org
Hello,

This should no longer be an issue with the new remote building mechanism
in Cuirass.

Thanks,

Mathieu
Closed
?