Hi Ludovic,

Ludovic Courtès writes:

> Hi,
>
> Ludovic Courtès skribis:
>
> [...]
>
> So for the medium term, and perhaps for the future, a possible option
> would be to preserve tarball metadata so we can reconstruct them:
>
>   tarball = metadata + tree
>
> After all, tarballs are byproducts and should be no exception: we
> should build them from source.  :-)
>
> In , Stefano mentioned pristine-tar, which does almost that, but not
> quite: it stores a binary delta between a tarball and a tree:
>
>   https://manpages.debian.org/unstable/pristine-tar/pristine-tar.1.en.html
>
> I think we should have something more transparent than a binary delta.
>
> The code below can “disassemble” and “assemble” a tar.  When it
> disassembles it, it generates metadata like this:
>
>   (tar-source
>    (version 0)
>    (headers
>     (("guile-3.0.4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007723)
>       (chksum 3979)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/"
>       (mode 493)
>       (size 0)
>       (mtime 1593007720)
>       (chksum 4184)
>       (typeflag #\5))
>      ("guile-3.0.4/m4/pipe2.m4"
>       (mode 420)
>       (size 531)
>       (mtime 1536050419)
>       (chksum 4812)
>       (hash (sha256
>              "arx6n2rmtf66yjlwkgwp743glcpdsfzgjiqrqhfegutmcwvwvsza")))
>      ("guile-3.0.4/m4/time_h.m4"
>       (mode 420)
>       (size 5471)
>       (mtime 1536050419)
>       (chksum 4974)
>       (hash (sha256
>              "z4py26rmvsk4st7db6vwziwwhkrjjrwj7nra4al6ipqh2ms45kka")))
>      […]
>
> The ’assemble-archive’ procedure consumes that, looks up file contents
> by hash on SWH, and reconstructs the original tarball…
>
> … at least in theory, because in practice we hit the SWH rate limit
> after looking up a few files:
>
>   https://archive.softwareheritage.org/api/#rate-limiting
>
> So it’s a bit ridiculous, but we may have to store a SWH “dir”
> identifier for the whole extracted tree—a Git-tree hash—since that
> would allow us to retrieve the whole thing in a single HTTP request.
>
> Besides, we’ll also have to handle compression: storing gzip/xz
> headers and compression levels.

This jumped out at me because I have been working with compression and
tarballs for the bootstrapping effort.  I started pulling some threads
and doing some research, and ended up prototyping an end-to-end
solution for decomposing a Gzip’d tarball into Gzip metadata, tarball
metadata, and an SWH directory ID.  It can even put them back
together!  :)  There are still a bunch of problems, but I think this
project is doable in the short term.  I’ve tested 100 arbitrary Gzip’d
tarballs from Guix and found and fixed a bunch of little gaffes.
There’s a ton of work to do, of course, but here’s another small step.

I call the thing “Disarchive”, as in “disassemble a source code
archive”.  You can find it at .  It has a simple command-line
interface, so you can do

    $ disarchive save software-1.0.tar.gz

which serializes a disassembled version of “software-1.0.tar.gz” to the
database (which is just a directory) specified by the “DISARCHIVE_DB”
environment variable.  Next, you can run

    $ disarchive load hash-of-something-in-the-db

which recovers the original file from its metadata (stored in the
database) and data retrieved from the SWH archive or taken from a cache
(again, just a directory) specified by “DISARCHIVE_DIRCACHE”.

Now for some implementation details.  The way I’ve set it up is that
all of the assembly happens through Guix.  Each step in recreating a
compressed tarball is a fixed-output derivation: the download from SWH,
the creation of the tarball, and the compression.  I wanted an easy way
to build and verify things according to a dependency graph without
writing any code.  Hi, Guix Daemon!
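(For context: a fixed-output derivation declares its output hash up
front, so the builder is allowed to reach the network and the daemon
still verifies the result.  Each Disarchive step boils down to
something like the sketch below; the names are made up for
illustration and this is not Disarchive’s actual code, just the
#:hash/#:hash-algo mechanism it relies on.)

    ;; Sketch only: a fixed-output derivation step with invented names.
    (use-modules (guix gexp)
                 (guix base32))

    (define (verified-step name builder expected-sha256)
      ;; BUILDER is a gexp that writes its result to #$output.  Because
      ;; EXPECTED-SHA256 (a nix-base32 string) is declared up front, the
      ;; daemon permits network access during the build and checks the
      ;; output against that hash afterwards.
      (gexp->derivation name builder
                        #:hash-algo 'sha256
                        #:hash (nix-base32-string->bytevector
                                expected-sha256)))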
I’m not sure if this is a good long-term approach, though.  It could
work well for reproducibility, but it might be easier to let some
external service drive my code as a Guix package.  Either way, it was
an easy way to get started.

For disassembly, Disarchive takes a Gzip file (containing a single
member) and breaks it down like this:

    (gzip-member
      (version 0)
      (name "hungrycat-0.4.1.tar.gz")
      (input (sha256
              "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1"))
      (header
        (mtime 0)
        (extra-flags 2)
        (os 3))
      (footer
        (crc 3863610951)
        (isize 194560))
      (compressor gnu-best)
      (digest
        (sha256
         "03fc1zsrf99lvxa7b4ps6pbi43304wbxh1f6ci4q0vkal370yfwh")))

The header and footer are read directly from the file.  Finding the
compressor is harder.  I followed the approach taken by the
pristine-tar project: try a bunch of compressors and hope for a match.
Currently, I have:

    • gnu-best
    • gnu-best-rsync
    • gnu
    • gnu-rsync
    • gnu-fast
    • gnu-fast-rsync
    • zlib-best
    • zlib
    • zlib-fast
    • zlib-best-perl
    • zlib-perl
    • zlib-fast-perl
    • gnu-best-rsync-1.4
    • gnu-rsync-1.4
    • gnu-fast-rsync-1.4

This list is inspired by pristine-tar.  The first few (“gnu…”)
compressors use the modern Gzip from Guix; the zlib and rsync-1.4 ones
use “zgz”, the Gzip and zlib wrapper from pristine-tar.  The 100 Gzip
files I looked at use “gnu”, “gnu-best”, “gnu-best-rsync-1.4”, “zlib”,
“zlib-best”, and “zlib-fast-perl”.

(As an aside, I had a way to decompose multi-member Gzip files, but it
was much, much slower.  Since I doubt they exist in the wild, I removed
that code.)

The “input” field likely points to a tarball, which looks like this:

    (tarball
      (version 0)
      (name "hungrycat-0.4.1.tar")
      (input (sha256
              "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r"))
      (default-header)
      (headers
        ((name "hungrycat-0.4.1/")
         (mode 493)
         (mtime 1513360022)
         (chksum 5058)
         (typeflag 53))
        ((name "hungrycat-0.4.1/configure")
         (mode 493)
         (size 130263)
         (mtime 1513360022)
         (chksum 6043))
        ...)
      (padding 3584)
      (digest
        (sha256
         "1ifzck1b97kjm567qb0prnqag2d01x0v8lghx98w1h2gzwsmxgi1")))

Originally, I used your code, but I ran into some problems.  Namely,
real tarballs are not well behaved.  I wrote new code to keep track of
subtle things like the formatting of the octal values.  Even though
tarballs are not well behaved, they are usually self-consistent, so I
introduced the “default-header” field to set default values for all
headers.  Any field omitted from a header takes its value from the
default header, and the default header in turn takes its defaults from
a “default default header” defined in the code.  Here’s a default
header from a different tarball:

    (default-header
      (uid 1199)
      (gid 30)
      (magic "ustar ")
      (version " \x00")
      (uname "cagordon")
      (gname "lhea")
      (devmajor-format (width 0))
      (devminor-format (width 0)))

These default values are computed to minimize the noise in the
serialized form.  Here we see, for example, that each header should
have UID 1199 unless otherwise specified.  We also see that the device
fields should be null strings instead of octal zeros.  Another good
example is the magic field, which has a space after “ustar” (not what
modern POSIX says to do).
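(In code terms, the lookup is a simple two-level fallback.  The snippet
below is only a sketch with invented names, assuming headers are kept
as association lists; it is not necessarily how Disarchive represents
them internally.)

    ;; Sketch: resolve a header field with a two-level fallback.
    (define default-default-header
      ;; Built-in defaults, used when a tarball’s default header is
      ;; silent about a field.
      '((uid . 0)
        (gid . 0)
        (uname . "")
        (gname . "")))

    (define (header-field header default-header field)
      ;; Prefer the header’s own value, then the tarball-wide default
      ;; header, then the built-in “default default header”.
      (or (assq-ref header field)
          (assq-ref default-header field)
          (assq-ref default-default-header field)))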
My tarball reader has minimal support for extended headers, but they
are not serialized cleanly: they survive the round trip, but they are
not human-readable.

Finally, the “input” field here points to an “swh-directory” object.
It looks like this:

    (swh-directory
      (version 0)
      (name "hungrycat-0.4.1")
      (id "0496abd5a2e9e05c9fe20ae7684f48130ef6124a")
      (digest
        (sha256
         "02qg3z5cvq6dkdc0mxz4sami1ys668lddggf7bjhszk23xpfjm5r")))

I have a little module for computing the directory hash like SWH does
(which is, in turn, like what Git does).  I did not verify that the 100
packages were in the SWH archive.  I did verify a couple of packages,
but I hit the rate limit and decided to avoid it for now.  To avoid
hitting the SWH archive at all, I introduced a directory cache so that
I can store the directories locally.  If the directory cache is
available, directories are stored in and retrieved from it.

> How would we put that in practice?  Good question.  :-)
>
> I think we’d have to maintain a database that maps tarball hashes to
> metadata (!).  A simple version of it could be a Git repo where, say,
> ‘sha256/0mq9fc0ig0if5x9zjrs78zz8gfzczbvykj2iwqqd6salcqdgdwhk’ would
> contain the metadata above.  The nice thing is that the Git repo
> itself could be archived by SWH.  :-)

You mean like ?  :)

This was generated by a little script built on top of “fold-packages”.
It downloads Gzip’d tarballs used by Guix packages and passes them on
to Disarchive for disassembly.  I limited the number to 100 because
it’s slow and because I’m sure there is a long tail of weird software
archives that are going to be hard to process.  The metadata directory
ended up being 13M and the directory cache 2G.

> Thus, if a tarball vanishes, we’d look it up in the database and
> reconstruct it from its metadata plus content stored in SWH.
>
> Thoughts?

Obviously I like the idea.  ;)

Even with the code I have so far, I have a lot of questions.  Mainly
I’m worried about keeping everything working into the future.  It would
be easy to make incompatible changes, so a lot of care would have to be
taken.  Of course, keeping a Guix commit and a Disarchive commit might
be enough to make any assembly reproducible, but there’s a
chicken-and-egg problem there: what if a tarball from the closure of
one of the derivations is missing?  I guess you could work around it,
but it would be tricky.

> Anyhow, we should team up with fellow NixOS and SWH hackers to address
> this, and with developers of other distros as well—this problem is not
> just that of the functional deployment geeks, is it?

I could remove most of the Guix stuff so that it would be easy to
package in Guix, Nix, Debian, etc.  Then, someone™ could write a
service that consumes a “sources.json” file, adds the sources to a
Disarchive database, and pushes everything to a Git repo.  I guess
everyone who cares has to produce a “sources.json” file anyway, so it
would be very little extra work.  Other stuff, like changing the
serialization format to JSON, would be pretty easy, too.  I’m not well
connected to these other projects, mind you, so I’m not really sure how
to reach out.

Sorry about the big mess of code and ideas – I realize I may have taken
the “do-ocracy” approach a little far here.  :)  Even if this is not
“the” solution, hopefully it’s useful for discussion!

-- Tim