Cuirass runs out of build users

Done

Details

5 participants

Andreas Enge
Danny Milosavljevic
Efraim Flashner
Ludovic Courtès
Mathieu Othacehe

Owner: unassigned

Submitted by: Andreas Enge

Severity: normal

Andreas Enge wrote on 28 Feb 2018 10:07

Recipients:(address . bug-guix@gnu.org)

Message-ID:20180228090714.GA1845@jurong

Hello,

in trying out the new bayfront machine, I raised the number of concurrent
build jobs to 10; the number of build accounts is programmatically set to
twice this number.

Looking at the build directories in /tmp, 9 build jobs were started on
February 26 at around 21:50, a 10th one at 22:24.

Later, I have been getting messages such as these:
2018-02-27T18:31:13 building 14491 derivations in batches of 200
2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

So it looks as if build jobs get stalled, but somehow do not free their
build users.

Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
10 is the default value, but it should have been overwritten.

And in /etc/passwd, I see 20 users. But apparently, they are not used!

Andreas

Ludovic Courtès wrote on 1 Mar 2018 15:01

Recipients:(name . Andreas Enge)(address . andreas@enge.fr)(address . 30644@debbugs.gnu.org)

Message-ID:87bmg76gjp.fsf@gnu.org

Hello,

Andreas Enge <andreas@enge.fr> skribis:

Toggle quote (4 lines)

> in trying out the new bayfront machine, I raised the number of concurrent

> build jobs to 10; the number of build accounts is programmatically set to

> twice this number.

Indeed, (sysadmin services) in guix-maintenance.git does that.

Toggle quote (2 lines)

> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)

[...]

Toggle quote (5 lines)

> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.

> 10 is the default value, but it should have been overwritten.

> And in /etc/passwd, I see 20 users. But apparently, they are not used!

There’s something fishy here. What does “getent group guixbuild”

return?

Thanks,

Ludo’.

Andreas Enge wrote on 1 Mar 2018 21:35

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 30644@debbugs.gnu.org)

Message-ID:20180301203543.GA1549@jurong

On Thu, Mar 01, 2018 at 03:01:30PM +0100, Ludovic Courtès wrote:

Toggle quote (8 lines)

> > Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.

> > 10 is the default value, but it should have been overwritten.

> >

> > And in /etc/passwd, I see 20 users. But apparently, they are not used!

> There’s something fishy here. What does “getent group guixbuild”

> return?

All 20 build users:

# getent group guixbuild

guixbuild:x:30000:guixbuilder01,guixbuilder02,guixbuilder03,guixbuilder04,guixbuilder05,guixbuilder06,guixbuilder07,guixbuilder08,guixbuilder09,guixbuilder10,guixbuilder11,guixbuilder13,guixbuilder14,guixbuilder15,guixbuilder16,guixbuilder17,guixbuilder18,guixbuilder20

I will deploy the new cuirass with the other bug fixed, and report back if

the problem described here is still present.

Andreas

Efraim Flashner wrote on 2 Mar 2018 15:08

Recipients:(name . Andreas Enge)(address . andreas@enge.fr)(address . 30644@debbugs.gnu.org)

Message-ID:20180302140833.GB3213@macbook41

On Wed, Feb 28, 2018 at 10:07:14AM +0100, Andreas Enge wrote:

Toggle quote (44 lines)> Hello,
> 
> in trying out the new bayfront machine, I raised the number of concurrent
> build jobs to 10; the number of build accounts is programmatically set to
> twice this number.
> 
> Looking at the build directories in /tmp, 9 build jobs were started on
> February 26 at around 21:50, a 10th one at 22:24.
> 
> Later, I have been getting messages such as these:
> 2018-02-27T18:31:13 building 14491 derivations in batches of 200
> 2018-02-27T18:31:13 building batch of 200 jobs (0/14491)
> 2018-02-27T18:31:13 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:13 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/w56wpx4jar88hfabipvf6gd6hz81p6wk-python2-gevent-1.2.2.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:14 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/08pjxws1pd04syqf28d3hhsfrb946bpc-dzen-0.9.5-1.488ab66.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/01r4xm1nqixcxjbbn6xakpg3m5wkr12v-po4a-0.47.drv'
> 2018-02-27T18:31:15 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/02chd3k5qc8p8fc46j4w955cfld77cw1-spinsim-0.75-1.66915a7ad.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/81b2340psschvvk7ybhnz6nx9jkpdlwa-Net-SMTP-SSL-1.04.tar.gz.drv'
> 2018-02-27T18:31:15 build started: '/gnu/store/rkkacywdj7yya3wf51i141x5khzld4zl-clx-0.0.0-1.tar.xz.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/286gkv0z6g9fb4frcx6yp2fg9v6iwk3b-ecl-alexandria-0.0.0-1.926a066.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/hmy3ds71gbv1mspc9p5yvvhz5pibs8wl-elixir-1.5.3.tar.xz.drv'
> 2018-02-27T18:31:16 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-02-27T18:31:16 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-02-27T18:31:16 batch of builds (partially) failed:all build users are currently in use; consider creating additional users and adding them to the `guixbuild' group (status: 1)
> 
> So it looks as if build jobs get stalled, but somehow do not free their
> build users.
> 
> Hm, the directories in /tmp are created by guixbuilder01 to guixbuilder10.
> 10 is the default value, but it should have been overwritten.
> 
> And in /etc/passwd, I see 20 users. But apparently, they are not used!
> 
> Andreas
> 

When I use a shell script to build all the packages, if a package hangs
and gets eaten by the timout-limit then it holds the build user and
space in /tmp until all 10 builders are taken and building stops or all
the builds finish and then it releases them.

-- 
Efraim Flashner   <efraim@flashner.co.il>   ????? ?????
GPG key = A28B F40C 3E55 1372 662D  14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted

Andreas Enge wrote on 5 Mar 2018 10:02

Recipients:(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)

Message-ID:20180305090256.GA9171@jurong

On Fri, Mar 02, 2018 at 04:08:33PM +0200, Efraim Flashner wrote:

Toggle quote (5 lines)

> When I use a shell script to build all the packages, if a package hangs

> and gets eaten by the timout-limit then it holds the build user and

> space in /tmp until all 10 builders are taken and building stops or all

> the builds finish and then it releases them.

Right now, I observe something else: Only one package is built at a time,

taking space in /tmp. But it is not the same build user; I saw one package

with guixbuilder06, and now it is building one package with guixbuilder01.

Is there a way to check which builders are considered taken? A lock file

somewhere?

Andreas

Andreas Enge wrote on 5 Mar 2018 16:53

Recipients:(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)

Message-ID:20180305155351.GA6054@jurong

Right now, I do not seem to be running out of build users, but I think
that cuirass nevertheless behaves strangely; so I am asking here instead
of opening a new bug (maybe there is no bug, actually).

I reconfigured bayfront this morning with the latest git checkout.
Right now, there are only two packages built in /tmp, if at all:
$ ll /tmp/
total 8
drwx------ 4 guixbuilder02 guixbuild 4096 Mar  5 16:06 guix-build-guile-emacs-20150512.41120e0.drv-0/
drwx------ 5 guixbuilder01 guixbuild 4096 Mar  5 15:05 guix-build-r-mgcv-1.8-23.drv-0/
(It is now 16:54, but "top" shows emacs and r activity, so my guess
is these packages are being built.)
But in any case, we are far from the 10 jobs I configured.

I am attaching the result of
$ grep -v evaluate cuirass.log|grep -v ^\`/gnu/store|grep -v spec|grep -v "next evaluation"|grep -v heap > ~/purged.log
which contains the non-trivial lines of /var/log/cuirass.log.

There is still one occurrence of this:
In ice-9/boot-9.scm:
    829:9  7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
In cuirass/base.scm:
   616:26  6 (_)
    548:2  5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
   426:10  4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
In srfi/srfi-1.scm:
    640:9  3 (for-each #<procedure update! (drv)> _)
In cuirass/database.scm:
    69:18  2 (%sqlite-exec _ _ . _)
In ice-9/eval.scm:
   293:34  1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
    619:8  0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

There is this, which looks like a problem, but not a fatal one:
;;; Failed to autoload make-session in (gnutls):
;;; missing interface for module (gnutls)
;;; Failed to autoload connection-end/client in (gnutls):
;;; missing interface for module (gnutls)

And what strikes me most is that between two lines of the kind
   building batch of 200 jobs
there are not 200 lines of successful or failed packages.

And my impression is that too many packages fail. What are these
"status 100" and "status 101"?

Do things look like this on berlin? (Of course, one difference is that
right now the machine builds everything itself without offloading.)

Andreas

Attachment: purged.log

Andreas Enge wrote on 5 Mar 2018 20:55

Recipients:(name . Efraim Flashner)(address . efraim@flashner.co.il)(address . 30644@debbugs.gnu.org)

Message-ID:20180305195520.GA2349@jurong

Okay, this evening, I see 10 packages being built and lots of messages

"build started"/"build succeeded"; so maybe there is no bug at work here.

Andreas

Ludovic Courtès wrote on 7 Mar 2018 23:35

Recipients:(name . Andreas Enge)(address . andreas@enge.fr)

Message-ID:87efkvpl8f.fsf@gnu.org

Andreas Enge <andreas@enge.fr> skribis:

Toggle quote (4 lines)

> Right now, I do not seem to be running out of build users, but I think

> that cuirass nevertheless behaves strangely; so I am asking here instead

> of opening a new bug (maybe there is no bug, actually).

Good. :-)

Toggle quote (16 lines)> There is still one occurrence of this:
> In ice-9/boot-9.scm:
>     829:9  7 (catch _ _ #<procedure 7f49ffb6c9e0 at cuirass/base.sc?> ?)
> In cuirass/base.scm:
>    616:26  6 (_)
>     548:2  5 (build-packages _ #<<sqlite-db> pointer: #<pointer 0x4?> ?)
>    426:10  4 (spawn-builds #<build-daemon 256.97 42d2320> #<<sqlite?> ?)
> In srfi/srfi-1.scm:
>     640:9  3 (for-each #<procedure update! (drv)> _)
> In cuirass/database.scm:
>     69:18  2 (%sqlite-exec _ _ . _)
> In ice-9/eval.scm:
>    293:34  1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))
>     619:8  0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

Bah. :-/

Toggle quote (6 lines)

> There is this, which looks like a problem, but not a fatal one:

> ;;; Failed to autoload make-session in (gnutls):

> ;;; missing interface for module (gnutls)

> ;;; Failed to autoload connection-end/client in (gnutls):

> ;;; missing interface for module (gnutls)

That happens during the evaluation, probably when compiling (guix build

download). Nothing to worry about.

Toggle quote (35 lines)> 2018-03-05T12:41:34 building batch of 200 jobs (0/14675)
> 2018-03-05T12:41:34 build failed: '/gnu/store/0cfs95nasmfgs5nhylm1fcrfwpwh81pc-gcompris-qt-0.70.drv'
> 2018-03-05T12:41:34 build failed: '/gnu/store/065zrkz0zmfwiwajrb9nanzwvdhmyrrd-python2-stem-1.6.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:35 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:36 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:37 build failed: '/gnu/store/9jjifxnxbv7z54qlcik3j8gcdj6ndd3k-python2-astroid-1.5.3.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/i2zmgbsi599pxskqh15s38fpms359jp7-gcc-cross-sans-libc-arm-linux-gnueabihf-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/qd59gqza3ly4rk2xjf40q2539rjgzh5r-gcc-cross-sans-libc-powerpc-linux-gnu-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:38 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:39 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:40 build failed: '/gnu/store/agrdbmvswi81q54riyin4355zcpy9yrr-python2-reno-2.7.0.drv'
> 2018-03-05T12:41:41 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/3qy3s1dy3nnb108ifz9yiscs687g0xkh-gfortran-5.5.0.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/z7f4imd95xw5w86pv59lnqgq2apgg0rc-ghc-hslua-0.9.5.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:42 build failed: '/gnu/store/b7jfj61snk90pkg7dlhyd7jjqbpi2rni-htslib-1.6.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/d545d4jzw0mfa61d9ml3vygchr93xp51-python-gevent-1.2.2.drv'
> 2018-03-05T12:41:43 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:43 build started: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:44 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/014v4q1xykd2gl04ildgg3nf30dp51pw-gcc-5.5.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/9l7f6zfbm3ckziy9wqddwy0jl9r9xj1q-gcc-7.3.0.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/1zdxbjmcw1qnmzh2qcbafsrbr8i2j1x3-glib-2.54.2.drv'
> 2018-03-05T12:41:45 build failed: '/gnu/store/y0p2myl0al8x3rlbxphgyjk83zl3kmb5-idris-mode-0.9.19.tar.drv'

The fact that all of these failed more or less instantaneously probably

means that these were cached failures. See ‘guix gc --list-failures’.

Why they failed in the first place, I don’t know. Are these x86_64/i686

builds? (You can see that in the .drv file.) If not, no wonder that

they fail.

Thanks,

Ludo’.

Danny Milosavljevic wrote on 8 Mar 2018 00:01

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:20180308000111.77cce7d6@scratchpost.org

Hi Ludo,

Toggle quote (8 lines)

> > 69:18 2 (%sqlite-exec _ _ . _)

> > In ice-9/eval.scm:

> > 293:34 1 (_ #(#(#<directory (sqlite3) 1b91820> #<<sqlite-stm?> ?)))

> > 619:8 0 (_ #(#(#(#(#<directory (sqlite3) 1b91820> #<?>) #) 5) 5))

> > ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

> Bah. :-/

sqlite3.h contains quite nice documentation about this case:

/*
** CAPI3REF: Unlock Notification
** METHOD: sqlite3
**
** ^When running in shared-cache mode, a database operation may fail with
** an [SQLITE_LOCKED] error if the required locks on the shared-cache or
** individual tables within the shared-cache cannot be obtained. See
** [SQLite Shared-Cache Mode] for a description of shared-cache locking. 
** ^This API may be used to register a callback that SQLite will invoke 
** when the connection currently holding the required lock relinquishes it.
** ^This API is only available if the library was compiled with the
** [SQLITE_ENABLE_UNLOCK_NOTIFY] C-preprocessor symbol defined.
**
** See Also: [Using the SQLite Unlock Notification Feature].
**
** ^Shared-cache locks are released when a database connection concludes
** its current transaction, either by committing it or rolling it back. 
**
** ^When a connection (known as the blocked connection) fails to obtain a
** shared-cache lock and SQLITE_LOCKED is returned to the caller, the
** identity of the database connection (the blocking connection) that
** has locked the required resource is stored internally. ^After an 
** application receives an SQLITE_LOCKED error, it may call the
** sqlite3_unlock_notify() method with the blocked connection handle as 
** the first argument to register for a callback that will be invoked
** when the blocking connections current transaction is concluded. ^The
** callback is invoked from within the [sqlite3_step] or [sqlite3_close]
** call that concludes the blocking connections transaction.
**
** ^(If sqlite3_unlock_notify() is called in a multi-threaded application,
** there is a chance that the blocking connection will have already
** concluded its transaction by the time sqlite3_unlock_notify() is invoked.
** If this happens, then the specified callback is invoked immediately,
** from within the call to sqlite3_unlock_notify().)^
**
** ^If the blocked connection is attempting to obtain a write-lock on a
** shared-cache table, and more than one other connection currently holds
** a read-lock on the same table, then SQLite arbitrarily selects one of 
** the other connections to use as the blocking connection.
**
** ^(There may be at most one unlock-notify callback registered by a 
** blocked connection. If sqlite3_unlock_notify() is called when the
** blocked connection already has a registered unlock-notify callback,
** then the new callback replaces the old.)^ ^If sqlite3_unlock_notify() is
** called with a NULL pointer as its second argument, then any existing
** unlock-notify callback is canceled. ^The blocked connections 
** unlock-notify callback may also be canceled by closing the blocked
** connection using [sqlite3_close()].
**
** The unlock-notify callback is not reentrant. If an application invokes
** any sqlite3_xxx API functions from within an unlock-notify callback, a
** crash or deadlock may be the result.
**
** ^Unless deadlock is detected (see below), sqlite3_unlock_notify() always
** returns SQLITE_OK.
**
...
*/

...
** When a call to [sqlite3_step()] returns SQLITE_LOCKED, it is almost 
** always appropriate to call sqlite3_unlock_notify(). There is however,
** one exception. When executing a "DROP TABLE" or "DROP INDEX" statement,
** SQLite checks if there are any currently executing SELECT statements
** that belong to the same connection. If there are, SQLITE_LOCKED is
** returned. In this case there is no "blocking connection", so invoking
** sqlite3_unlock_notify() results in the unlock-notify callback being
** invoked immediately. If the application then re-attempts the "DROP TABLE"
** or "DROP INDEX" query, an infinite loop might be the result.
**
** One way around this problem is to check the extended error code returned
** by an sqlite3_step() call. ^(If there is a blocking connection, then the
** extended error code is set to SQLITE_LOCKED_SHAREDCACHE. Otherwise, in
** the special "DROP TABLE/INDEX" case, the extended error code is just 
** SQLITE_LOCKED.)^
...

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqgbzcACgkQ5xo1VCww

uqUqvwf+K8RcWUlgI69jHswsWcdxop5DZLO+fmrrke5oucbVwwUK/EHOnIOP1mCF

SiipcVmrQiZEaRqztlo4UVqxgCmzsmnj01pkceQjp3zB/v/inIRdTQtXyUOslC83

DmVfkbHJc0rfpK9kH16/XTqjOgwPZeS3EULuEMaUc0RwwX5lBbQWQAL2UXBoJs7x

5Fa4E8dsdK9nV6g5dgyM/0mMtuk1PFzc75HLErmCaYsuq9+dtB1BpKWHNRIpgWzQ

BFsWbGBxT7VuIq1/7kE+IGrpY9R2LKD9iE06RVr6MpUOGdw/VMd+8JDPnrzl//LW

xf9eN4pSohqeNKS8z4hy79OBeovaZQ==

=k6Xf

-----END PGP SIGNATURE-----

Danny Milosavljevic wrote on 8 Mar 2018 01:25

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:20180308012541.517dc4dc@scratchpost.org

As for what's causing this, cuirass's bin/evaluate seems to connect to the sqlite database and then write stuff into it.

If a query is being done by the http server at the same time, you are going to be really careful with the sqlite open mode.

There would be ways to still be able to do this, but it's some seriously unnecessary locking complexity for something small like this.

Is that really necessary? Can't the invoker of bin/evaluate write the result?

It even seems that src/cuirass/base.scm already does the db-add-derivation.

Why is bin/evaluate doing it again?

Same for db-add-evaluation...

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqggwUACgkQ5xo1VCww

uqUbRgf7Bo/weqJyId4Csihojubu9WgWIK3LBFirE8ireF0dFUVetGZNOEZTz9Vx

4K8F+ZfI0u/bMf6Ij8qjB8/qHycjRtn1qugEuHp0LFNZ8GX8E8Yjaie+Sxm2pHa5

0SgF9NSH7A0ydNCP8IPWtOYoqqreoe70J+V62aAFMgoY81xqL8b9rZ7wQfVU4AsT

NXifRIuREO5Rhrvt1MIuiyiXvSSd6SyH734aVF5EAGa1jW+GJI6rmquYI64PiU1n

BI+aSUr1MtlmGZrSb2tOiLcoK0rZ9o5tDN7/4/rz3C2gh+vAPtFMFSrgnpfmiSGe

bZ3TbQpUw5h8/+lxOWgGVhO+R7v8sw==

=m4JA

-----END PGP SIGNATURE-----

Danny Milosavljevic wrote on 8 Mar 2018 01:29

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:20180308012923.525342f6@scratchpost.org

Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore. My bad...

So please disregard.

Anyway, sqlite contains some cool stuff to find out who's locking it.

But as things are it seems that sqlite access is now done only by one thread. How can it then be locked? Weird...

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqgg+MACgkQ5xo1VCww

uqWkRwf/Xt2wnMO5jh6ePZIx30l7On4i+ZUhT5g79xA75xZpmlESb2Bi6ZgwaEIN

qBIqHZIXUPlQISdgRl09XiGYIAtqH/lK/5GhTLWJn+hOYxSmPxcyEQkWP9leR+YQ

hEO4ZV1dATMjt6NeNCqIR4/DxY5BRanm6hm6LCeqZvU80utVVEKgvqhfnhpMfEk9

D2ZtqrWarYN/B3dDaO7bIsKVpsYW5fwTEsvONUhKgCqB0KtuMvARbUjKvpKvAQuZ

l/4B8e30rH+0Ba4rHAXUiArWG7LsQZ31FiJBGKkutwGPBlhs5tpJaRVdWKOYF7ee

uQ9fRkvUHW5sCN4Y7oMv6IWu/KCBqA==

=anV4

-----END PGP SIGNATURE-----

Ludovic Courtès wrote on 8 Mar 2018 13:01

Recipients:(name . Danny Milosavljevic)(address . dannym@scratchpost.org)

Message-ID:87efkuojya.fsf@gnu.org

Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (2 lines)

> Oops, I had an old bin/evaluate . bin/evaluate.in doesn't contain it anymore. My bad...

Yes, see https://bugs.gnu.org/30618.

Toggle quote (4 lines)

> Anyway, sqlite contains some cool stuff to find out who's locking it.

> But as things are it seems that sqlite access is now done only by one thread. How can it then be locked? Weird...

It’s accessed possibly from several threads because Fibers can be

thought of as an N×M threading library.

So what’s your take on this?

Ludo’.

Danny Milosavljevic wrote on 11 Mar 2018 07:06

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:20180311070610.788df5b9@scratchpost.org

Hi Ludo,

On Thu, 08 Mar 2018 13:01:01 +0100

ludo@gnu.org (Ludovic Courtès) wrote:

Toggle quote (2 lines)

> So what’s your take on this?

sqlite has a sqlite-unlock-notify. We should wrap it and use it when we get the

"locked" error code.

We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html.

See also https://sqlite.org/unlock_notify.htmlfor a blocking version of _step (to the right).

Do we want cuirass to block until the lock is available?

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlqkx1IACgkQ5xo1VCww

uqXb5Af/byq0gTeZ30WYi0G09HwFggbSpQSpZEozM0vHH1xXbcV2D3VnJvnDnm1t

iLaVe3M+HvRVstf/kGj2J9Pl0bqQ1KkLbBcWs5YkcKhCo9Gr0uDiJiO0SvpQfMNa

Y61bIFy06yUgkK/Ft1JrTKUryJFdPDWozYRmIl7gf6SKqmy+a3RWXi/roSurJabm

OmcJKKfSag6STI7L14jNJGzwQF+ddQu7konRdeLLBmNFyl8W3jDy2FnmDNXCnqp/

vm16PUDc5iqAyZ5cKJz++s23f/pf711yEfh8wXG/sa8L9yrCxWZ5FjCxCHk7AE2q

Ft+/5EP7ko/GKtC/VvcZBn4vfVQJtA==

=nUHF

-----END PGP SIGNATURE-----

Ludovic Courtès wrote on 11 Mar 2018 22:43

Recipients:(name . Danny Milosavljevic)(address . dannym@scratchpost.org)

Message-ID:87zi3ethij.fsf@gnu.org

Hi,

Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (12 lines)> On Thu, 08 Mar 2018 13:01:01 +0100
> ludo@gnu.org (Ludovic Courtès) wrote:
>
>> So what’s your take on this?
>
> sqlite has a sqlite-unlock-notify.  We should wrap it and use it when we get the
> "locked" error code.
>
> We should also enable extended error codes https://sqlite.org/c3ref/extended_result_codes.html .
>
> See also https://sqlite.org/unlock_notify.html for a blocking version of _step (to the right).

Interesting.

Toggle quote (2 lines)

> Do we want cuirass to block until the lock is available?

The POSIX thread itself should not block. Rather the fiber should yield

to other fibers.

So we could do the blocking-step within ‘non-blocking’ block as defined

in (cuirass utils), though ‘non-blocking’ as currently implemented is a

bit expensive.

Or we could simply poll: catch those exceptions and try again later. If

this is rare enough, as seems to be the case, that might be an

acceptable approach.

Thoughts?

Ludo’.

Danny Milosavljevic wrote on 24 Mar 2018 14:15

Recipients:(name . Ludovic Courtès)(address . ludo@gnu.org)

Message-ID:20180324141547.109dc91a@scratchpost.org

Hi Ludo,

I've checked it some more. The shared cache uses unlock_notify, and the actual

database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.

The 5 in

ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.

means SQLITE_BUSY, so that's the actual database.

So the easiest way to work around it would be to call

sqlite3_busy_timeout(db, 10); // ms

which would automatically repeatedly retry on busy until 10 ms have accumulated,

and only then give up.

There's also sqlite3_busy_handler which can be used to install a busy handler,

and of course in the future we could also handle SQLITE_BUSY ourselves and yield.

See also https://notabug.org/civodul/guile-sqlite3/pulls/4for the former.

-----BEGIN PGP SIGNATURE-----

iQEzBAEBCAAdFiEEds7GsXJ0tGXALbPZ5xo1VCwwuqUFAlq2T4MACgkQ5xo1VCww

uqWRqQf+LFBG3nX4LDadOXErzJphQ51SV/z4UCEGWedLVhc58mlOnw7FDAcHyplU

L2knIMYOUYoA5c+3muFaLfn43fsrwgUbQEYFBwVEDytyp57nCzr0iYv318A9YCgc

rdIp5WruadE3nieHgBconum9aIw7/f6XNVKP6dULkv3gehdoo8VoVdLoW9+KUyig

Fog72vQzblzkw+l0qqc1I64hYFalk6gmQLDyQAYrgYKzCQfP5TtO7W+QH7tPDKJs

/po89CJG9LfUCQFxkZ6vH/upg4/RQaflzJxLrQK7YPGeAQshdJ9QNluvyvGXxSOe

H8KDJ/jCiszWyNCo01BtRlidJNN2vg==

=xJpK

-----END PGP SIGNATURE-----

Ludovic Courtès wrote on 25 Mar 2018 14:40

Recipients:(name . Danny Milosavljevic)(address . dannym@scratchpost.org)

Message-ID:87tvt4ib0r.fsf@gnu.org

Hi Danny,

Danny Milosavljevic <dannym@scratchpost.org> skribis:

Toggle quote (17 lines)> I've checked it some more.  The shared cache uses unlock_notify, and the actual
> database uses sqlite3_busy_handler (sqlite3_busy_timeout) when BUSY is returned.
>
> The 5 in
>
> ice-9/eval.scm:619:8: Throw to key `sqlite-error' with args `(#f 5 "database is locked")'.  
>                                                                  ^^
>
> means SQLITE_BUSY, so that's the actual database.
>
> So the easiest way to work around it would be to call
>
>   sqlite3_busy_timeout(db, 10); // ms
>
> which would automatically repeatedly retry on busy until 10 ms have accumulated,
> and only then give up.

Indeed, that’s actually what nix/libstore does.

Toggle quote (5 lines)

> There's also sqlite3_busy_handler which can be used to install a busy handler,

> and of course in the future we could also handle SQLITE_BUSY ourselves and yield.

> See also https://notabug.org/civodul/guile-sqlite3/pulls/4 for the former.

Awesome.

I’ve merged these guile-sqlite3 bits and added a “PRAGMA busy_timeout”

in Cuirass (the pragma allows us to set the timeout even when using an

older guile-sqlite3.) I’ve updated the ‘cuirass’ package and will test

soonish on berlin.

Thank you!

Ludo’.

Mathieu Othacehe wrote on 25 Mar 2021 14:17

Recipients:(address . 30644-done@debbugs.gnu.org)

Message-ID:87sg4j4bkj.fsf@gnu.org

Hello,

This should no longer be an issue with the new remote building mechanism

in Cuirass.

Thanks,

Mathieu

Closed

Your comment

This issue is archived.

To comment on this conversation send an email to 30644@debbugs.gnu.org

is:open	open issues
is:done	closed issues
submitter:<who>	search issue submitter
author:<who>	search by message author
date:yesterday..now	search by issue date
mdate:3m..2d	search by message date