Hi Ludovic,

Ludovic Courtès writes:

> Wooohoo! Is it that time of the year when people give presents to one
> another? I can’t believe it. :-)

Not to be too cynical, but I think it’s just the time of year that I
get frustrated with what I should be working on, and start fantasizing
about green-field projects. :p

> Timothy Sample skribis:
>
>> The header and footer are read directly from the file. Finding the
>> compressor is harder. I followed the approach taken by the
>> pristine-tar project. That is, try a bunch of compressors and hope
>> for a match. Currently, I have:
>>
>> • gnu-best
>> • gnu-best-rsync
>> • gnu
>> • gnu-rsync
>> • gnu-fast
>> • gnu-fast-rsync
>> • zlib-best
>> • zlib
>> • zlib-fast
>> • zlib-best-perl
>> • zlib-perl
>> • zlib-fast-perl
>> • gnu-best-rsync-1.4
>> • gnu-rsync-1.4
>> • gnu-fast-rsync-1.4
>
> I would have used the integers that zlib supports, but I guess that
> doesn’t capture this whole gamut of compression setups. And yeah,
> it’s not great that we actually have to try and find the right
> compression levels, but there’s no way around it it seems, and as you
> write, we can expect a couple of variants to be the most commonly
> used ones.

My first instinct was “this is impossible – a DEFLATE compressor can
do just about whatever it wants!” Then I looked at pristine-tar and
realized that their hack probably works pretty well. If I had
infinite time, I would think about some kind of fully general,
parameterized LZ77 algorithm that could describe any implementation.
If I had a lot of time, I would peel back the curtain on Gzip and zlib
and expose their tuning parameters. That would be nicer, but keep in
mind we will have to cover XZ, bzip2, and ZIP, too! There’s a bit of
a balance between quality and coverage. Any improvement to the
representation of the compression algorithm could be implemented
easily: just replace the names with their improved representation.

One thing pristine-tar does is reorder the compressor list based on
the input metadata. A Gzip member usually stores its compression
level, so it makes sense to try everything at that level first before
moving on.

>> Originally, I used your code, but I ran into some problems. Namely,
>> real tarballs are not well-behaved. I wrote new code to keep track
>> of subtle things like the formatting of the octal values.
>
> Yeah I guess I was too optimistic. :-) I wanted to have the
> serialization/deserialization code automatically generated by that
> macro, but yeah, it doesn’t capture enough details for real-world
> tarballs.

I enjoyed your implementation! I might even bring back its style. It
was a little stiff for figuring out exactly what I needed to reproduce
the tarballs.

> Do you know how frequently you get “weird” tarballs? I was thinking
> about having something that works for plain GNU tar, but it’s even
> better to have something that works with “unusual” tarballs!

I don’t have hard numbers, but I would say that a good handful (5–10%)
have “X-format” fields, meaning their octal formatting is unusual.
(I’m looking at “grep -A 10 default-header” over all the S-Exp files.)
The most charming thing is the “uname” and “gname” fields. For
example, “rtmidi-4.0.0” was made by “gary” from “staff”. :)

> (BTW the code I posted or the one in Disarchive could perhaps replace
> the one in Gash-Utils. I was frustrated to not see a ‘fold-archive’
> procedure there, notably.)

I really like “fold-archive”. One of the reasons I started doing this
is to possibly share code with Gash-Utils. It’s not as easy as I was
hoping, but I’m planning on improving things there based on my
experience here. I’ve now worked with four Scheme tar implementations;
maybe if I write a really good one, I could cap that number at five!
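Just to show the kind of interface I mean, here is a rough sketch of
how a “fold-archive”-style procedure might be used to list the members
of a tarball. The signature and helper names are made up for
illustration; they are not the exact ones from your code or from
Disarchive.

    ;; Hypothetical: (fold-archive proc seed port) calls PROC with each
    ;; member's header (here, assumed to be an alist), a port positioned
    ;; at its data, and the result accumulated so far.
    (define (list-member-names file)
      (call-with-input-file file
        (lambda (port)
          (fold-archive (lambda (header data-port result)
                          (cons (assq-ref header 'name) result))
                        '()
                        port))
        #:binary #t))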
>> To avoid hitting the SWH archive at all, I introduced a directory
>> cache so that I can store the directories locally. If the directory
>> cache is available, directories are stored and retrieved from it.
>
> I guess we can get back to them eventually to estimate our coverage
> ratio.

It would be nice to know, but pretty hard to find out with the rate
limit. I guess it will improve immensely when we set up a
“sources.json” file.

>> You mean like ? :)
>
> Woow. :-)
>
> We could actually have a CI job to create the database: it would
> basically do ‘disarchive save’ for each tarball and store that using
> a layout like the one you used. Then we could have a job somewhere
> that periodically fetches that and adds it to the database. WDYT?

Maybe.... I assume that Disarchive would fail for a few of them. We
would need a plan for monitoring those failures so that Disarchive can
be improved. Also, unless I’m misunderstanding something, this means
building the whole database at every commit, no? That would take a
lot of time and space. On the other hand, it would be easy enough to
try. If it works, it’s a lot easier than setting up a whole other
service.

> I think we should leave room for other hash algorithms (in the sexps
> above too).

It works for different hash algorithms, but not for different
directory hashing methods (like you mention below).

>> This was generated by a little script built on top of
>> “fold-packages”. It downloads Gzip’d tarballs used by Guix packages
>> and passes them on to Disarchive for disassembly. I limited the
>> number to 100 because it’s slow and because I’m sure there is a long
>> tail of weird software archives that are going to be hard to
>> process. The metadata directory ended up being 13M and the
>> directory cache 2G.
>
> Neat.
>
> So it does mean that we could pretty much right away add a fall-back
> in (guix download) that looks up tarballs in your database and uses
> Disarchive to reconstruct it, right? I love solved problems. :-)
>
> Of course we could improve Disarchive and the database, but it seems
> to me that we already have enough to improve the situation. WDYT?

I would say that we are darn close! In theory it would work. It
would be much more practical if we had better coverage in the SWH
archive (i.e., “sources.json”) and a way to get metadata for a source
archive without downloading the entire Disarchive database. It’s 13M
now, but it will likely be 500M with all the Gzip’d tarballs from a
recent commit of Guix. It will only grow after that, too. Of course
those are not hard blockers, so ‘(guix download)’ could start using
Disarchive as soon as we package it. I’ve started looking into it,
but I’m confused about getting access to Disarchive from the
“out-of-band” download system. Would it have to become a dependency
of Guix?
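For what it’s worth, here is the rough shape of the fallback I have in
mind for ‘(guix download)’. All of the procedure names below are made
up for illustration; this is not an existing Guix or Disarchive API,
just a sketch of the control flow under those assumptions.

    (use-modules (srfi srfi-1))   ; for `any'

    ;; Sketch only: try the upstream URLs first, then the SWH archive,
    ;; then Disarchive.  `fetch', `swh-fetch', `fetch-disarchive-spec',
    ;; and `disarchive-assemble' are hypothetical helpers.
    (define (download-with-fallbacks urls expected-hash output)
      (or (any (lambda (url)
                 (false-if-exception (fetch url output)))
               urls)
          (false-if-exception (swh-fetch expected-hash output))
          ;; Look up the tarball's metadata by its expected hash, then
          ;; rebuild it bit-for-bit from the archived directory.
          (let ((spec (fetch-disarchive-spec expected-hash)))
            (and spec (disarchive-assemble spec output)))))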
>> Even with the code I have so far, I have a lot of questions. Mainly
>> I’m worried about keeping everything working into the future. It
>> would be easy to make incompatible changes. A lot of care would
>> have to be taken. Of course, keeping a Guix commit and a Disarchive
>> commit might be enough to make any assembling reproducible, but
>> there’s a chicken-and-egg problem there.
>
> The way I see it, Guix would always look up tarballs in the HEAD of
> the database (no need to pick a specific commit). Worst that could
> happen is we reconstruct a tarball that doesn’t match, and so the
> daemon errors out.

I was imagining an escape hatch beyond this, where one could look up a
provenance record from when Disarchive ingested and verified a source
code archive. The provenance record would tell you which version of
Guix was used when saving the archive, so you could try your luck with
using “guix time-machine” to reproduce Disarchive’s original
computation. If we perform database migrations, you would need to
travel back in time in the database, too. The idea is that you could
work around breakages in Disarchive automatically using the Power of
Guix™. Just a stray thought, really.

> Regarding future-proofness, I think we must be super careful about
> the file formats (the sexps). You did pay attention to not having
> implicit defaults, which is perfect. Perhaps one thing to change (or
> perhaps it’s already there) is support for other hashes in those
> sexps: both hash algorithms and directory hash methods (SWH dir/Git
> tree, nar, Git tree with different hash algorithm, IPFS CID, etc.).
> Also the ability to specify several hashes.
>
> That way we could “refresh” the database anytime by adding the hash
> du jour for already-present tarballs.

The hash algorithm is already configurable, but the directory hash
method is not. You’re right that it should be, and that there should
be support for multiple digests. (I sketch what I have in mind below,
after the rest of my replies.)

>> What if a tarball from the closure of one of the derivations is
>> missing? I guess you could work around it, but it would be tricky.
>
> Well, more generally, we’ll have to monitor archive coverage. But I
> don’t think the issue is specific to this method.

Again, I’m thinking about the case where I want to travel back in time
to reproduce a Disarchive computation. It’s really an unlikely
scenario, I’m just trying to think of everything that could go wrong.

>>> Anyhow, we should team up with fellow NixOS and SWH hackers to
>>> address this, and with developers of other distros as well—this
>>> problem is not just that of the functional deployment geeks, is it?
>>
>> I could remove most of the Guix stuff so that it would be easy to
>> package in Guix, Nix, Debian, etc. Then, someone™ could write a
>> service that consumes a “sources.json” file, adds the sources to a
>> Disarchive database, and pushes everything to a Git repo. I guess
>> everyone who cares has to produce a “sources.json” file anyway, so
>> it will be very little extra work. Other stuff like changing the
>> serialization format to JSON would be pretty easy, too. I’m not
>> well connected to these other projects, mind you, so I’m not really
>> sure how to reach out.
>
> If you feel like it, you’re welcome to point them to your work in the
> discussion at . There’s one person from NixOS (lewo) participating
> in the discussion and I’m sure they’d be interested. Perhaps they’ll
> tell whether they care about having it available as JSON.

Good idea. I will work out a few more kinks and then bring it up
there. I’ve already rewritten the parts that used the Guix daemon.
Disarchive now only needs a handful of Guix modules (‘base32’,
‘serialization’, and ‘swh’ are the ones that would be hard to remove).
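As promised above, here is a made-up example of how a directory entry
could name its hashing method and carry several digests. The field
names are purely illustrative; this is not the current Disarchive
format, just the kind of shape I am thinking of.

    ;; Illustrative only.  Each digest records both the directory
    ;; hashing method and the hash algorithm, and several may be
    ;; listed, so new ones can be added without breaking old readers.
    (directory
      (name "foo-1.2.3")
      (digests
       (digest (method swh-dir) (algorithm sha1) (value "..."))
       (digest (method nar) (algorithm sha256) (value "..."))
       (digest (method git-tree) (algorithm sha256) (value "..."))))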
>> Sorry about the big mess of code and ideas – I realize I may have
>> taken the “do-ocracy” approach a little far here. :) Even if this
>> is not “the” solution, hopefully it’s useful for discussion!
>
> You did great! I had a very rough sketch and you did the real thing,
> that’s just awesome. :-)
>
> Thanks a lot!

My pleasure! Thanks for the feedback so far.

--
Tim