“SQLite database is busy”: contention on the store database

OpenSubmitted by Ludovic Courtès.
Details
2 participants
  • Ludovic Courtès
  • Mathieu Othacehe
Owner
unassigned
Severity
important
L
L
Ludovic Courtès wrote on 22 Jun 13:49 +0200
“SQLite database is busy” : contention on the store database
(address . bug-guix@gnu.org)
877dvz7228.fsf@gnu.org
Hi,
On berlin.guix.gnu.org, we’ve seen a lot of contention on the storedatabase for a month or so, with messages like:
warning: SQLite database is busy
More often than not, everything slows down to a halt, and builds don’tproceed.
We initially discussed it here:
https://issues.guix.gnu.org/41119#4
When that happens we typically see 30+ processes that have opened thedatabase:
Toggle snippet (37 lines)ludo@berlin ~$ sudo fuser -v /var/guix/db/db.sqlite-shm USER PID ACCESS COMMAND/var/guix/db/db.sqlite-shm: root 1211 F...m guix offload root 1845 F...m guix-daemon root 6070 F...m guix-daemon root 31378 F...m guix-daemon root 31380 F...m guix-daemon root 47381 F...m guix-daemon root 47382 F...m guix-daemon root 47383 F...m guix-daemon root 47385 F...m guix-daemon root 47681 F...m guix-daemon root 48506 F...m guix offload root 48542 F...m guix offload root 49417 F...m guix-daemon root 49472 F...m guix-daemon root 49971 F...m guix-daemon root 50027 F...m guix-daemon root 50847 F...m guix offload root 54625 F...m guix-daemon root 54723 F...m guix-daemon root 55470 F...m guix-daemon root 55614 F...m guix-daemon root 55617 F...m guix-daemon root 56225 F...m guix-daemon root 57732 F...m guix-daemon root 57733 F...m guix-daemon root 57738 F...m guix-daemon root 57739 F...m guix-daemon root 57740 F...m guix-daemon root 58772 F...m guix-daemon root 59056 F...m guix-daemon root 99055 F...m guix-daemon root 127690 F...m guix offload
In this case, the first process seems to be in the deduplication phasein ‘finalize-store-file’:
Toggle snippet (16 lines)ludo@berlin ~$ sudo strace -p 1211strace: Process 1211 attachedutimensat(AT_FDCWD, "/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", [{tv_sec=1592814617, tv_nsec=396083395} /* 2020-06-22T10:30:17.396083395+0200 */, {tv_sec=1, tv_nsec=0} /* 1970-01-01T01:00:01+0100 */], 0) = 0chmod("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", 040555) = 0getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1306, tv_usec=334062}, ru_stime={tv_sec=143, tv_usec=819675}, ...}) = 0clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1450, tv_nsec=153748295}) = 0 lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", {st_mode=S_IFLNK|0777, st_size=125, ...}) = 0readlink("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", "/gnu/store/x8d1hmv17jawp4a7nwwb4"..., 100) = 100readlink("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", "/gnu/store/x8d1hmv17jawp4a7nwwb4"..., 200) = 125mkdir("/gnu", 0777) = -1 EEXIST (File exists)mkdir("/gnu/store", 0777) = -1 EEXIST (File exists)mkdir("/gnu/store/.links", 0777) = -1 EEXIST (File exists)lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", {st_mode=S_IFLNK|0777, st_size=125, ...}) = 0stat("/gnu/store/.links/05p6zpj56f00g07c3621747kz3bjc6zbr8bgg8xysdajq18qgvlv", {st_mode=S_IFREG|0444, st_size=6880, ...}) = 0
At that point it has the database opened 3 times: once from‘with-temporary-store-file’, once ‘with-database’ in‘finalize-store-file’, and once from ‘register-items’ in‘finalize-store-file’. Commit d7fb5538013710288e91657499f0e04207115776removed the last one.
Toggle snippet (27 lines)ludo@berlin ~$ sudo ls -l /proc/1211/fdtotal 0lr-x------ 1 root root 64 Jun 22 11:49 0 -> 'pipe:[10239344]'l-wx------ 1 root root 64 Jun 22 11:49 1 -> 'pipe:[10239343]'lr-x------ 1 root root 64 Jun 22 11:49 10 -> 'pipe:[10129708]'l-wx------ 1 root root 64 Jun 22 11:49 11 -> 'pipe:[10129708]'lr-x------ 1 root root 64 Jun 22 11:49 12 -> 'pipe:[10214031]'l-wx------ 1 root root 64 Jun 22 11:49 13 -> 'pipe:[10214031]'l-wx------ 1 root root 64 Jun 22 11:49 14 -> /var/guix/offload/141.80.167.168:22/1lrwx------ 1 root root 64 Jun 22 11:49 15 -> 'socket:[8056368]'lrwx------ 1 root root 64 Jun 22 11:49 16 -> 'socket:[10433466]'lrwx------ 1 root root 64 Jun 22 11:49 17 -> /var/guix/db/db.sqlitelrwx------ 1 root root 64 Jun 22 11:49 18 -> /var/guix/db/db.sqlite-wallrwx------ 1 root root 64 Jun 22 11:49 19 -> /var/guix/db/db.sqlite-shml-wx------ 1 root root 64 Jun 22 11:49 2 -> 'pipe:[10239343]'lrwx------ 1 root root 64 Jun 22 11:49 20 -> /var/guix/db/db.sqlitelrwx------ 1 root root 64 Jun 22 11:49 21 -> /var/guix/db/db.sqlite-wallr-x------ 1 root root 64 Jun 22 11:49 3 -> 'pipe:[10021329]'l-wx------ 1 root root 64 Jun 22 11:49 4 -> 'pipe:[10239345]'lr-x------ 1 root root 64 Jun 22 11:49 40 -> /dev/urandoml-wx------ 1 root root 64 Jun 22 11:49 5 -> 'pipe:[10021329]'lr-x------ 1 root root 64 Jun 22 11:49 6 -> 'pipe:[10222125]'l-wx------ 1 root root 64 Jun 22 11:49 7 -> 'pipe:[10222125]'lr-x------ 1 root root 64 Jun 22 11:49 8 -> 'pipe:[10021332]'l-wx------ 1 root root 64 Jun 22 11:49 9 -> 'pipe:[10021332]'
The are several possible causes of contention: (1) database opened toomany times, (2) database kept opened for too long, and (3) sqlitemishandling contention somehow.
I think we can work on #1 and #2. As for #3, I wonder if there’s achange in sqlite that could have made contention more contentious (ahha!), given that all this happened around the time of the last‘core-updates’ merge, roughly corresponding to an upgrade from 3.24.0 to3.31.0:
http://data.guix.gnu.org/repository/1/branch/master/package/sqlite
Maybe the most likely change that caused contention is that we increasedthe number of machines (or rather the sum of build jobs that can behandled by the build machines), and thus the number of parallel buildprocesses.
Ludo’.
M
M
Mathieu Othacehe wrote on 22 Jun 18:01 +0200
Re: bug#42001: “SQLite database is busy”: contention on the store database
(name . Ludovic Courtès)(address . ludo@gnu.org)(address . 42001@debbugs.gnu.org)
87pn9rdr8r.fsf@gnu.org
Hey Ludo,
Toggle quote (6 lines)> utimensat(AT_FDCWD, "/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", [{tv_sec=1592814617, tv_nsec=396083395} /* 2020-06-22T10:30:17.396083395+0200 */, {tv_sec=1, tv_nsec=0} /* 1970-01-01T01:00:01+0100 */], 0) = 0> chmod("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", 040555) = 0> getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1306, tv_usec=334062}, ru_stime={tv_sec=143, tv_usec=819675}, ...}) = 0> clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1450, tv_nsec=153748295}) = 0 > lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", {st_mode=S_IFLNK|0777, st_size=125, ...}) = 0
As discussed on IRC, those "image-root" derivations do exercise storededuplication a lot, are could participate to the observed databasecontention.
Turns out I used those derivations as "staging" directories that arepassed to GNU Xorriso when producing ISO images and "mke2fs" and"mkdosfs" when producing raw disk-images. Using the store was alsoconvenient to ensure root ownership of the image files.
With 1cb9effc3e875ec7bae31bb06069d16ac89f7e1d and7f75a7ec08975eb6d6e01db61bd6b91f447f655e, I remove those "image-root"derivations, and proceed to image creation in a single derivation.
The later commit is a bit more tricky because "mke2fs" is not able tooverride file permissions. I used "fakeroot" to make sure that the rawdisk-image files are owned by "root" and not "guixbuilder".
Hoping it will help,
Mathieu
L
L
Ludovic Courtès wrote on 23 Jun 09:42 +0200
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 42001@debbugs.gnu.org)
87y2oe44ad.fsf@gnu.org
Hi,
Mathieu Othacehe <othacehe@gnu.org> skribis:
Toggle quote (19 lines)>> utimensat(AT_FDCWD, "/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", [{tv_sec=1592814617, tv_nsec=396083395} /* 2020-06-22T10:30:17.396083395+0200 */, {tv_sec=1, tv_nsec=0} /* 1970-01-01T01:00:01+0100 */], 0) = 0>> chmod("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter", 040555) = 0>> getrusage(RUSAGE_SELF, {ru_utime={tv_sec=1306, tv_usec=334062}, ru_stime={tv_sec=143, tv_usec=819675}, ...}) = 0>> clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1450, tv_nsec=153748295}) = 0 >> lstat("/gnu/store/whhc80cx1zpi9cj3bil3sp8wwzcf8p9p-image-root/gnu/store/n6l40ygclxfysnc2nxwrslnn5ab6wv3v-profile/lib/modules/5.4.47-gnu/kernel/net/netfilter/xt_connlimit.ko", {st_mode=S_IFLNK|0777, st_size=125, ...}) = 0>> As discussed on IRC, those "image-root" derivations do exercise store> deduplication a lot, are could participate to the observed database> contention.>> Turns out I used those derivations as "staging" directories that are> passed to GNU Xorriso when producing ISO images and "mke2fs" and> "mkdosfs" when producing raw disk-images. Using the store was also> convenient to ensure root ownership of the image files.>> With 1cb9effc3e875ec7bae31bb06069d16ac89f7e1d and> 7f75a7ec08975eb6d6e01db61bd6b91f447f655e, I remove those "image-root"> derivations, and proceed to image creation in a single derivation.
Nice, thanks for the prompt change!
I’ve been monitoring berlin a bit and I’ve become quite convinced thatthese big directories are what triggered contention. The good thing isthat this has allowed us to find weaknesses in (guix store database) and(guix store deduplicate).
Toggle quote (4 lines)> The later commit is a bit more tricky because "mke2fs" is not able to> override file permissions. I used "fakeroot" to make sure that the raw> disk-image files are owned by "root" and not "guixbuilder".
OK, makes sense.
Maybe at some point we’ll have to use libext2fs directly or patch mke2fsto get the desired flexibility.
Thank you!
Ludo’.
L
L
Ludovic Courtès wrote on 23 Jun 11:31 +0200
control message for bug #42001
(address . control@debbugs.gnu.org)
87zh8u2knm.fsf@gnu.org
severity 42001 importantquit
L
L
Ludovic Courtès wrote on 23 Jun 11:33 +0200
Re: bug#42001: “SQLite database is busy”: contention on the store database
(address . 42001@debbugs.gnu.org)
87sgem2kki.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:
Toggle quote (4 lines)> The are several possible causes of contention: (1) database opened too> many times, (2) database kept opened for too long, and (3) sqlite> mishandling contention somehow.
Commit fe5de925aa0f2854a679cebdea02b03cac561c8f should makededuplication more efficient (it was an embarrassing bug…), partlyaddressing #2.
Ludo’.
?