Store references in SBCL-compiled code are "invisible"

Open. Submitted by Ludovic Courtès.
Details
4 participants
  • Danny Milosavljevic
  • Ludovic Courtès
  • Pierre Neidhardt
  • Mark H Weaver
Owner
unassigned
Severity
important
Ludovic Courtès wrote on 23 Dec 2018 15:19
(name . Bug Guix)(address . bug-guix@gnu.org)
87r2e8jpfx.fsf@gnu.org
Hello,
As discussed with Pierre at the R-B Summit, ‘sbcl-next’ lacks a reference to ‘next-gtk-webkit’ even though it invokes it:
  $ guix gc --references $(type -P next) | grep next-
  /gnu/store/9d66xb8wvggsp0x9pxj61mzqy007978f-sbcl-next-1.1.0
  /gnu/store/pqy064fw3vkfld6lw95vi0zavj19zvrc-sbcl-next-1.1.0-lib
  $ ./pre-inst-env guix run next

  WARNING: Setting locale failed.
  Check the following variables for correct values:
  LANG=en_US.utf8
  Unhandled SIMPLE-ERROR in thread #<SB-THREAD:THREAD "main thread" RUNNING {10005885B3}>:
  Couldn't execute "/gnu/store/7p6pbcmdgr53dff6033gcfl2jq0d762h-next-gtk-webkit-1.1.0/bin/next-gtk-webkit": No such file or directory
(Here ‘guix run’ runs ‘next’ in a container with exactly the closure of ‘next’, nothing more, and the ‘next’ binary is grafted.)
So the problem looks a lot like the GCC issue we fixed a while back: https://bugs.gnu.org/24703.
Looking at the ‘sbcl-next’ package, the reference to ‘next-gtk-webkit’ is inserted in gtk-webkit.lisp:
  (defvar *gtk-webkit-command* "next-gtk-webkit"
    "Path to the GTK-Webkit platform port executable.")
Through hexl-mode on the ‘next’ binary, we can find that reference:
  01d0bac0: 2f00 0000 6700 0000 6e00 0000 7500 0000 /...g...n...u...
  01d0bad0: 2f00 0000 7300 0000 7400 0000 6f00 0000 /...s...t...o...
  01d0bae0: 7200 0000 6500 0000 2f00 0000 3700 0000 r...e.../...7...
  01d0baf0: 7000 0000 3600 0000 7000 0000 6200 0000 p...6...p...b...
  01d0bb00: 6300 0000 6d00 0000 6400 0000 6700 0000 c...m...d...g...
  01d0bb10: 7200 0000 3500 0000 3300 0000 6400 0000 r...5...3...d...
  01d0bb20: 6600 0000 6600 0000 3600 0000 3000 0000 f...f...6...0...
  01d0bb30: 3300 0000 3300 0000 6700 0000 6300 0000 3...3...g...c...
  01d0bb40: 6600 0000 6c00 0000 3200 0000 6a00 0000 f...l...2...j...
  01d0bb50: 7100 0000 3000 0000 6400 0000 3700 0000 q...0...d...7...
  01d0bb60: 3600 0000 3200 0000 6800 0000 2d00 0000 6...2...h...-...
  01d0bb70: 6e00 0000 6500 0000 7800 0000 7400 0000 n...e...x...t...
  01d0bb80: 2d00 0000 6700 0000 7400 0000 6b00 0000 -...g...t...k...
  01d0bb90: 2d00 0000 7700 0000 6500 0000 6200 0000 -...w...e...b...
  01d0bba0: 6b00 0000 6900 0000 7400 0000 2d00 0000 k...i...t...-...
  01d0bbb0: 3100 0000 2e00 0000 3100 0000 2e00 0000 1.......1.......
  01d0bbc0: 3000 0000 2f00 0000 6200 0000 6900 0000 0.../...b...i...
  01d0bbd0: 6e00 0000 2f00 0000 6e00 0000 6500 0000 n.../...n...e...
  01d0bbe0: 7800 0000 7400 0000 2d00 0000 6700 0000 x...t...-...g...
  01d0bbf0: 7400 0000 6b00 0000 2d00 0000 7700 0000 t...k...-...w...
  01d0bc00: 6500 0000 6200 0000 6b00 0000 6900 0000 e...b...k...i...
  01d0bc10: 7400 0000 0000 0000 0000 0000 0000 0000 t...............
  01d0bc20: e100 0100 0000 0000 2800 0000 0000 0000 ........(.......
  01d0bc30: 2a47 544b 2d57 4542 4b49 542d 434f 4d4d *GTK-WEBKIT-COMM
  01d0bc40: 414e 442a 0000 0000 0000 0000 0000 0000 AND*............
Apparently this string literal is stored as UTF-32 (UCS-4) or similar, which prevents the reference scanner and the grafting code from finding it, and problems ensue. :-)
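To make the failure mode concrete, here is a minimal Python sketch. It is not Guix's actual scanner: `contains_reference` is a hypothetical stand-in that, like a plain byte-wise scan, looks for the ASCII bytes of the store hash. Little-endian UTF-32 is assumed because that is what the dump above shows.

```python
# Illustrative only: a naive byte-wise scan, not Guix's real reference scanner.
STORE_PATH = ("/gnu/store/7p6pbcmdgr53dff6033gcfl2jq0d762h"
              "-next-gtk-webkit-1.1.0/bin/next-gtk-webkit")
HASH = b"7p6pbcmdgr53dff6033gcfl2jq0d762h"


def contains_reference(blob: bytes, store_hash: bytes) -> bool:
    """Return True if the ASCII bytes of the hash occur in the blob."""
    return store_hash in blob


# The same file name as it would appear on disk in two encodings.
as_utf8 = STORE_PATH.encode("utf-8")       # one byte per character
as_utf32 = STORE_PATH.encode("utf-32-le")  # four bytes per character, no BOM

found_in_utf8 = contains_reference(as_utf8, HASH)
found_in_utf32 = contains_reference(as_utf32, HASH)

print("found in UTF-8 blob: ", found_in_utf8)
print("found in UTF-32 blob:", found_in_utf32)
```

In the UTF-32 blob every character is followed by three zero bytes, so the 32 hash characters are never contiguous and the byte-wise search misses them.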
Pierre, Andy: is there any way to tell SBCL to store this literal as ASCII/UTF-8? That would be an easy fix, though we should discuss the pros and cons and whether to enable that globally.
Thanks in advance!
Ludo’.
Pierre Neidhardt wrote on 23 Dec 2018 16:05
(name . Ludovic Courtès)(address . ludo@gnu.org)
87a7kwjnai.fsf@ambrevar.xyz
Thanks for looking into this, Ludo.
At first glance, I'd say that this is not a compilation option but the way strings are encoded by default. It seems that multibyte encoding is used all over the place by a few compilers including SBCL (and CCL I think).
One way I know around this (I'm by no means a Common Lisp expert) is the flexi-streams package for re-encoding.
More generally, shouldn't we make the reference scanner a bit smarter? In particular, how does it handle non-ASCII references? Maybe it would not be unreasonable to handle UTF-8 and UCS-4 for instance?
--
Pierre Neidhardt
https://ambrevar.xyz/
Mark H Weaver wrote on 23 Dec 2018 17:45
Re: bug#33848: Store references in SBCL-compiled code are "invisible"
(name . Ludovic Courtès)(address . ludo@gnu.org)
877eg0i43j.fsf@netris.org
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
> [...]
>
> Apparently this string literal is stored as UTF-32 (UCS-4) or similar,
> which prevents the reference scanner and the grafting code from finding
> it, and problems ensue. :-)
IMO, we should consider modifying Guix to search for store references encoded in UTF-32 and/or UTF-16. I wouldn't be surprised if some other programs use those encodings. I'd be willing to work on it.
What do you think?
Mark
Ludovic Courtès wrote on 23 Dec 2018 18:32
(name . Mark H Weaver)(address . mhw@netris.org)
87d0psi1xo.fsf@gnu.org
Hi Mark,
Mark H Weaver <mhw@netris.org> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
[...]
>> Apparently this string literal is stored as UTF-32 (UCS-4) or similar,
>> which prevents the reference scanner and the grafting code from finding
>> it, and problems ensue. :-)
>
> IMO, we should consider modifying Guix to search for store references
> encoded in UTF-32 and/or UTF-16. I wouldn't be surprised if some other
> programs use those encodings. I'd be willing to work on it.
I don’t think we’ve encountered the problem before. This would require fixing both the scanner and the grafting code (though eventually that might be a single code base when the Scheme-implemented daemon is merged) in non-trivial ways.
One issue is that users of an old daemon would get a different behavior than users of a new daemon. It would be the first time we introduce such a significant change in the daemon since Guix was started.
For now I lean towards looking for a way to address the issue specifically for SBCL. I’d be tempted to generalize if and only if we find other occurrences of the problem that would make the benefits outweigh the development and maintenance costs.
WDYT?
I remember discussing in the past some sort of “pluggable” reference scanning mechanism that could also work for compressed archives, etc. That also looks like the right thing, but it has a development and maintenance cost that’s pretty high whereas we might be able to address the same problems in much simpler ways.
Thanks,
Ludo’.
Pierre Neidhardt wrote on 23 Dec 2018 23:01
(name . Ludovic Courtès)(address . ludo@gnu.org)
874lb3kin6.fsf@ambrevar.xyz
> I don’t think we’ve encountered the problem before.
Actually it does ring a bell for me. Didn't we have a similar issue with Fish, or some dependency?
> For now I lean towards looking for a way to address the issue
> specifically for SBCL.
Don't forget that we currently have 5 Lisp compilers. Besides, it's not clear that this can be fixed on the compiler's side, it could very well be that patches will be required on a per-project basis.
--
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès wrote on 24 Dec 2018 15:55
control message for bug #33848
(address . control@debbugs.gnu.org)
87a7kvgek5.fsf@gnu.org
severity 33848 important
Ludovic Courtès wrote on 24 Dec 2018 15:57
Re: bug#33848: Store references in SBCL-compiled code are "invisible"
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
875zvjgefl.fsf@gnu.org
Hi!
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
> Thanks for looking into this, Ludo.
>
> At first glance, I'd say that this is not a compilation option but the way
> strings are encoded by default. It seems that multibyte encoding is used all
> over the place by a few compilers including SBCL (and CCL I think).
>
> One way I know around this (I'm by no mean a Common Lisp expert) is the
> flexi-streams package for re-encoding.
OK, we need to investigate.
> More generally, shouldn't we make the reference scanner a bit smarter? In
> particular, how does it handle non-ASCII references? Maybe it would not be
> unreasonable to handle UTF-8 and UCS-4 for instance?
Store file names are always ASCII so problems arise when they are stored as UTF-16 or UTF-32/UCS-4.
Ludo’.
Ludovic Courtès wrote on 24 Dec 2018 16:06
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87sgynezha.fsf@gnu.org
Hi Pierre,
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
>> I don’t think we’ve encountered the problem before.
>
> Actually it does ring a bell for me. Didn't we have a similar issue with Fish,
> or some dependency?
We did have a problem with Fish but I can no longer find it. Do you remember what it was? Something with C++, no?
>> For now I lean towards looking for a way to address the issue
>> specifically for SBCL.
>
> Don't forget that we currently have 5 Lisp compilers.
> Besides, it's not clear that this can be fixed on the compiler's side, it could
> very well be that patches will be required on a per-project basis.
I know little about CL but maybe we can find a solution that works for all five compilers. At least that would be the first approach I would suggest following.
Thanks,
Ludo’.
Pierre Neidhardt wrote on 24 Dec 2018 18:08
(name . Ludovic Courtès)(address . ludo@gnu.org)
87r2e6j1hw.fsf@ambrevar.xyz
> Store file names are always ASCII so problems arise when they are stored
> as UTF-16 or UTF-32/UCS-4.
I understand that most programs stick to ASCII filenames, but what about the odd one using non-English, special characters?
> We did have a problem with Fish but I can no longer find it. Do you
> remember what it was? Something with C++, no?
I think bug #30265.
--
Pierre Neidhardt
https://ambrevar.xyz/
Mark H Weaver wrote on 24 Dec 2018 19:12
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tvj2yesd.fsf@netris.org
Hi Ludovic,
Ludovic Courtès <ludo@gnu.org> writes:
> Pierre Neidhardt <mail@ambrevar.xyz> skribis:
>
>>> For now I lean towards looking for a way to address the issue
>>> specifically for SBCL.
>>
>> Don't forget that we currently have 5 Lisp compilers.
>> Besides, it's not clear that this can be fixed on the compiler's side, it could
>> very well be that patches will be required on a per-project basis.
>
> I know little about CL but maybe we can find a solution that works for
> all five compilers. At least that would be the first approach I would
> suggest following.
I can't imagine a solution that would work for all five compilers, but perhaps that's a failure of imagination on my part. Of course, you're welcome to search for such a solution. Can you give me a rough outline of what you have in mind?
Of course, the usual reason to choose UTF-32 is to support non-ASCII characters while retaining fixed-width code points, so that string lookups are straightforward and efficient. Using UTF-8 improves space efficiency, but at the cost of extra code complexity. That extra complexity is what I guess we would need to add to each program that currently uses UTF-32. Alternatively, we could extend the on-disk format to support UTF-8 and then add some kind of "load hook" that converts the string to UTF-32 at load time. Either way, it's likely to be a can of worms.
Consider the case of Guile. Years ago we agreed to switch to UTF-8 as its sole internal string encoding, but it hasn't yet been done because it's a big job, even for those of us already intimately familiar with the code.
Now imagine how hard it would be for someone who barely uses Guile, but nevertheless felt compelled to change our internal string representation to use UTF-8. Moreover, imagine that they hoped to find a single solution that would work for several different Scheme implementations.
What would you say to them if they proposed to find a general solution to convert several Scheme implementations to use UTF-8 as their string representation, to save themselves the trouble of having to understand each implementation individually?
I really think it would be a mistake to try to force every program and language implementation to use our preferred string representation. I suspect it would be vastly easier to compromise and support a few other popular string representations in Guix, namely UTF-16 and UTF-32.
If you don't want to change the daemon, it could be worked around in our build-side code as follows: we could add a new phase to certain build systems (or possibly gnu-build-system) that scans each output for UTF-16/32 encoded store references that are never referenced in UTF-8. If such references exist, a file with an unobtrusive name would be added to that output containing those references encoded in UTF-8. This would enable our daemon's existing reference scanner to find all of the references.
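A rough sketch of what such a phase might do, written in Python for brevity (a real phase would be build-side Scheme). Everything here is hypothetical: the function names, the `.wide-references` file name, and the assumption that the phase is handed the candidate hashes of the build inputs. Symlink targets and compressed files are ignored.

```python
import os

# Wide encodings to probe; plain ASCII is what the existing scanner sees.
WIDE_ENCODINGS = ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def hidden_references(output_dir, candidates):
    """Return the candidate hashes that occur somewhere in OUTPUT_DIR in a
    wide encoding but nowhere as plain ASCII/UTF-8 bytes."""
    seen_ascii, seen_wide = set(), set()
    for root, _dirs, files in os.walk(output_dir):
        for fname in files:
            path = os.path.join(root, fname)
            if os.path.islink(path):
                continue  # symlink targets are out of scope for this sketch
            with open(path, "rb") as f:
                blob = f.read()
            for h in candidates:
                if h.encode("ascii") in blob:
                    seen_ascii.add(h)
                elif any(h.encode(enc) in blob for enc in WIDE_ENCODINGS):
                    seen_wide.add(h)
    return seen_wide - seen_ascii


def write_sidecar(output_dir, candidates, name=".wide-references"):
    """Write the hidden hashes, one per line, to a small UTF-8 file so that
    a byte-wise scanner can see them. Returns the sorted list written."""
    hidden = sorted(hidden_references(output_dir, candidates))
    if hidden:
        with open(os.path.join(output_dir, name), "w", encoding="utf-8") as f:
            for h in hidden:
                f.write(h + "\n")
    return hidden
```

Calling `write_sidecar(out, hashes)` a second time writes nothing new, since the sidecar itself now contains the hashes as ASCII. The same `hidden_references` check could also be used on its own to fail the build and report affected packages.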
Our grafting code would then need to be extended to recognize and transform store references encoded in UTF-16/32 as well as UTF-8.
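As an illustration of the grafting side, the following Python sketch rewrites one hash into another in each encoding. It is not Guix's grafting code: `graft` is a hypothetical helper, and it assumes that a graft replaces a hash with another of the same length so that the file size and all offsets stay unchanged.

```python
# Encodings in which a store hash might be embedded in a binary.
ENCODINGS = ("ascii", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def graft(blob: bytes, old_hash: str, new_hash: str) -> bytes:
    """Replace every occurrence of OLD_HASH by NEW_HASH in BLOB, whichever
    of ENCODINGS it is stored in. Both hashes must have the same length."""
    if len(old_hash) != len(new_hash):
        raise ValueError("grafting must preserve length")
    for enc in ENCODINGS:
        # For ASCII-only text each encoding yields a fixed-width pattern,
        # so the replacement never changes the size of the blob.
        blob = blob.replace(old_hash.encode(enc), new_hash.encode(enc))
    return blob
```

For example, `graft(data, "7p6pbcmdgr53dff6033gcfl2jq0d762h", "9d66xb8wvggsp0x9pxj61mzqy007978f")` would rewrite the reference from the dump earlier in this thread whether it was stored as UTF-32 or as plain ASCII.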
What do you think?
Regards, Mark
Pierre Neidhardt wrote on 25 Dec 2018 00:58
(name . Mark H Weaver)(address . mhw@netris.org)
87pntqiijc.fsf@ambrevar.xyz
I find Mark's points reasonable, although to be honest I have very little knowledge of the daemon.
--
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès wrote on 26 Dec 2018 17:07
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87r2e4e0fu.fsf@gnu.org
Pierre Neidhardt <mail@ambrevar.xyz> skribis:
>> Store file names are always ASCII so problems arise when they are stored
>> as UTF-16 or UTF-32/UCS-4.
>
> I understand that most programs stick to ASCII filenames, but what about the odd
> one using non-English, special characters?
That’s a separate debate. :-) Essentially this restriction on store file names has always been there in Guix (and Nix before that). If we were to change it, that would raise compatibility issues.
>> We did have a problem with Fish but I can no longer find it. Do you
>> remember what it was? Something with C++, no?
>
> I think bug #30265.
Oh I see, UCS-4 as well. (I can’t believe this bug is still open given the relatively simple solutions outlined at https://issues.guix.info/issue/30265#8. :-))
Thanks,
Ludo’.
Ludovic Courtès wrote on 26 Dec 2018 17:14
(name . Mark H Weaver)(address . mhw@netris.org)
877efwe04u.fsf@gnu.org
Hello!
Mark H Weaver <mhw@netris.org> skribis:
> Ludovic Courtès <ludo@gnu.org> writes:
>
>> Pierre Neidhardt <mail@ambrevar.xyz> skribis:
>>
>>>> For now I lean towards looking for a way to address the issue
>>>> specifically for SBCL.
>>>
>>> Don't forget that we currently have 5 Lisp compilers.
>>> Besides, it's not clear that this can be fixed on the compiler's side, it could
>>> very well be that patches will be required on a per-project basis.
>>
>> I know little about CL but maybe we can find a solution that works for
>> all five compilers. At least that would be the first approach I would
>> suggest following.
>
> I can't imagine a solution that would work for all five compilers, but
> perhaps that's a failure of imagination on my part. Of course, you're
> welcome to search for such a solution. Can you give me a rough outline
> of what you have in mind?
I have nothing specific in mind, I’m just brainstorming with everyone here. :-)
For a similar situation in C++, there’s a fairly simple and local workaround:
https://issues.guix.info/issue/30265#8
I’m not familiar with CL but I thought that if we could achieve something similar, that would be great; I’m not suggesting to change the CL compilers in any non-trivial way.
For example I guess we could always store the file name as a literal byte vector/list and add a call to turn that into a string.
Does that make sense?
Thanks,
Ludo’.
Pierre Neidhardt wrote on 27 Dec 2018 11:37
(name . Ludovic Courtès)(address . ludo@gnu.org)
8736qji7c1.fsf@ambrevar.xyz
> : > Store file names are always ASCII so problems arise when they are stored
> : > as UTF-16 or UTF-32/UCS-4.
> : >
> : I understand that most programs stick to ASCII filenames, but what about the odd
> : one using non-English, special characters?
>
> That’s a separate debate. :-) Essentially this restriction on store
> file names has always been there in Guix (and Nix before that). If we
> were to change it, that would raise compatibility issues.
But what happens if we attempt to store "á" in the store?
> For example I guess we could always store the file name as a literal
> byte vector/list and add a call to turn that into a string.
In the case of Next, that would be a simple patch, but other programs could get much more complicated. In the end, this approach requires a linear amount of work. Conversely, adding UCS-* support to the scanner would fix this issue once and for all.
> : > We did have a problem with Fish but I can no longer find it. Do you
> : > remember what it was? Something with C++, no?
> : >
> : I think bug #30265.
>
> Oh I see, UCS-4 as well. (I can’t believe this bug is still open given
> the relatively simple solutions outlined at
> <https://issues.guix.info/issue/30265#8>. :-))
Well, if currently only two packages out of 8500+ suffer from this, then I think it's easier to go with Ludo's suggestion of patching the code to use ASCII strings.
Does anyone know about more packages with this issue? It could also be that more packages suffer from this, unbeknownst to us.
--
Pierre Neidhardt
https://ambrevar.xyz/
Danny Milosavljevic wrote on 27 Dec 2018 14:52
(name . Mark H Weaver)(address . mhw@netris.org)
20181227145258.0c420eac@scratchpost.org
Hi Mark,
On Mon, 24 Dec 2018 13:12:23 -0500
Mark H Weaver <mhw@netris.org> wrote:
> Of course, the usual reason to choose UTF-32 is to support non-ASCII
> characters while retaining fixed-width code points, so that string
> lookups are straightforward and efficient.
This kind of lookup is almost never what is necessary. There are many people who assume character is the same as codepoint and to those people UTF-32 brings something to the table, but it's really not useful if people do text processing correctly, see below.
(Of course whether packages actually do this remains to be seen)
> Using UTF-8 improves space efficiency, but at the cost of extra code
> complexity.
I agree.
> That extra
> complexity is what I guess we would need to add to each program that
> currently uses UTF-32.
Yes, but they usually have to do stream processing even with UTF-32 (because a character can be composed of a possibly infinite number of codepoints), so the infrastructure should be already there and the effort should be minimal.
> Alternatively, we could extend the on-disk
> format to support UTF-8 and then add some kind of "load hook" that
> converts the string to UTF-32 at load time. Either way, it's likely to
> be a can of worms.
If it ever came to that, a pluggable reference scanner would be preferable. But really, it would irk me to have so much complexity in something so basic (the reference scanner) for no end-user gain (as a distribution we could just mandate UTF-8 for references and the problem would be gone for the user with no loss of functionality).
It's always easy to add special cases - but more code means more bugs and I think if possible it's best to have only the simple case implemented in the core - because it's less complicated which means more likely to be correct (for the case it does handle). In the end it depends on what would be more code, and more widely used.
Also, if we wanted to debug reference errors, we couldn't use grep anymore because it can't handle UTF-32 either (neither can any of the other UNIX tools).
Also, I really don't want to return to the time where I had to call iconv once every three commands to be able to do anything useful on UNIX.
Also, the build daemon is written in C++ and C++ strings are widely known to have very very bad codepoint awareness (to say nothing about the horrible conversion facilities).
Also, if both UTF-32 and UTF-8 are used on disk, care needs to be taken not to misdetect a UTF-8 sequence as a UTF-32 sequence of different text - or the other way around -, but that's unlikely for ASCII strings.
> I really think it would be a mistake to try to force every program and
> language implementation to use our preferred string representation. I
> suspect it would be vastly easier to compromise and support a few other
> popular string representations in Guix, namely UTF-16 and UTF-32.
In 1992, UTF-8 was invented. Subsequently, most of the Internet, all new GNU Linux distributions etc, all UNIX GUI frameworks, Subversion etc standardized on UTF-8, with the eventual goal of standardizing all network transfer and storage to UTF-8. I think that by now the outliers are the ones who need to change, otherwise these senseless encoding conversions will never cease. It's not like different encodings allow for better expression of writings or anything useful to the end user.
As a distribution we can't force upstream to change, but just filing bug reports upstream would make us see where they stand on this.
> If you don't want to change the daemon, it could be worked around in our
> build-side code as follows: we could add a new phase to certain build
> systems (or possibly gnu-build-system) that scans each output for
> UTF-16/32 encoded store references that are never referenced in UTF-8.
> If such references exist, a file with an unobtrusive name would be added
> to that output containing those references encoded in UTF-8. This would
> enable our daemon's existing reference scanner to find all of the
> references.
I agree that that would be nice. As a first step, even just detecting problems like that and erroring out would be okay - in order to find them in the first place. Right now, it's difficult to detect and so also difficult to say how wide-spread the problem is. If the problem is wide-spread enough my tune could change very quickly.
What you propose is similar to what I did in Java in Guix, only it gives us even more advantages in the Java case (faster class loading and eventual non-propagated inputs).
Mark H Weaver wrote on 27 Dec 2018 15:03
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87tvizvzgk.fsf@netris.org
Pierre Neidhardt <mail@ambrevar.xyz> writes:
>> : > Store file names are always ASCII so problems arise when they are stored
>> : > as UTF-16 or UTF-32/UCS-4.
>> : >
>> : I understand that most programs stick to ASCII filenames, but what about the odd
>> : one using non-English, special characters?
>>
>> That’s a separate debate. :-) Essentially this restriction on store
>> file names has always been there in Guix (and Nix before that). If we
>> were to change it, that would raise compatibility issues.
>
> But what happens if we attempt to store "á" in the store?
Indeed. Although we might restrict the immediate entries within /gnu/store to ASCII characters, file names deeper within those directories may have non-ASCII characters. More generally, store references may occur within larger strings which might include non-ASCII characters.
Mark
Mark H Weaver wrote on 27 Dec 2018 15:29
(name . Danny Milosavljevic)(address . dannym@scratchpost.org)
87pntnvy8e.fsf@netris.org
Hi Danny,
Danny Milosavljevic <dannym@scratchpost.org> writes:
> On Mon, 24 Dec 2018 13:12:23 -0500
> Mark H Weaver <mhw@netris.org> wrote:
>
>> Of course, the usual reason to choose UTF-32 is to support non-ASCII
>> characters while retaining fixed-width code points, so that string
>> lookups are straightforward and efficient.
>
> This kind of lookup is almost never what is necessary. There are many
> people who assume character is the same as codepoint and to those people
> UTF-32 brings something to the table, but it's really not useful if people
> do text processing correctly, see below.
>
> (Of course whether packages actually do this remains to be seen)
>
>> That extra
>> complexity is what I guess we would need to add to each program that
>> currently uses UTF-32.
>
> Yes, but they usually have to do stream processing even with UTF-32 (because
> a character can be composed of possibly infinite number of codepoints),
I agree with you. However, as silly as it might be, the fact remains that almost every modern programming language and string library uses code points as the base units by which to index strings.
> so the infrastructure should be already there and the effort should be
> minimal.
The infrastructure might or might not be there, depending on the sophistication of the program's unicode support, but even if it _is_ there, it will most likely be a layer that expects to iterate over strings indexed by code point to find graphemes, etc.
Anyway, if you truly believe the effort should be minimal, feel free to investigate and propose patches to fix our 5 common lisp compilers and Fish to avoid storing UTF-32 in the object code.
> Also, if both UTF-32 and UTF-8 are used on disk, care needs to not misdetect
> an UTF-8 sequence as an UTF-32 sequence of different text - or the other way
> around -, but that's unlikely for ASCII strings.
This is not an issue because the substrings that the reference scanner and grafter are looking for are ASCII-only, even if they are part of a larger non-ASCII string. Specifically, they only need to look for the nix hashes.
Toggle quote (11 lines)>> I really think it would be a mistake to try to force every program and>> language implementation to use our preferred string representation. I>> suspect it would be vastly easier to compromise and support a few other>> popular string representations in Guix, namely UTF-16 and UTF-32.>> In 1992, UTF-8 was invented. Subsequently, most of the Internet,> all new GNU Linux distributions etc, all UNIX GUI frameworks, Subversion> etc standardized on UTF-8, with the eventual goal of standardizing all> network transfer and storage to UTF-8. I think that by now the outliers> are the ones who need to change,
I agree that we need to standardize on Unicode. However, given the perhaps unfortunate fact that almost everyone has standardized on code points as the units by which to index strings, choosing UTF-32 as an internal representation is a very reasonable choice, IMO.
Anyway, feel free to engage with the developers of the Common Lisp implementations that use UTF-32 and try to convince them to change.
The remaining question is: what to do if upstream refuses to change? Do we exclude that software in Guix, or do we maintain our own patches to override upstream's decision?
>> If you don't want to change the daemon, it could be worked around in our
>> build-side code as follows: we could add a new phase to certain build
>> systems (or possibly gnu-build-system) that scans each output for
>> UTF-16/32 encoded store references that are never referenced in UTF-8.
>> If such references exist, a file with an unobtrusive name would be added
>> to that output containing those references encoded in UTF-8. This would
>> enable our daemon's existing reference scanner to find all of the
>> references.
>
> I agree that that would be nice. As a first step, even just detecting
> problems like that and erroring out would be okay - in order to find them
> in the first place. Right now, it's difficult to detect and so also difficult
> to say how wide-spread the problem is. If the problem is wide-spread enough
> my tune could change very quickly.
Sure, it would be useful to have more data on what packages are currently affected by this issue.
Regards, Mark
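The build-side workaround quoted above (scan each output for store references that appear only in UTF-16/32, then record them in UTF-8) could look roughly like the following sketch. Guix build phases are written in Scheme; Python is used here purely for illustration. The function names, the `.hidden-references` file name, and the idea of passing in the candidate hashes of the build inputs are all assumptions, not existing Guix code. It also assumes that writing the bare hash parts is enough for the daemon's scanner to pick them up.

```python
import os

# Encodings in which an ASCII hash would be "invisible" to a plain byte scan.
WIDE_ENCODINGS = ("utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def hidden_references(output_dir: str, candidate_hashes: set[str]) -> set[str]:
    """Return the hashes that occur under output_dir only in UTF-16/32 form.

    candidate_hashes is assumed to hold the ASCII hash parts of the
    build inputs (hypothetical interface, for illustration only).
    """
    seen_utf8, seen_wide = set(), set()
    for dirpath, _dirnames, filenames in os.walk(output_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlink targets are ignored in this sketch
            with open(path, "rb") as f:
                data = f.read()
            for h in candidate_hashes:
                # ASCII and UTF-8 encodings of the hash are byte-identical.
                if h.encode("ascii") in data:
                    seen_utf8.add(h)
                if any(h.encode(enc) in data for enc in WIDE_ENCODINGS):
                    seen_wide.add(h)
    return seen_wide - seen_utf8


def write_reference_file(output_dir: str, hashes: set[str],
                         name: str = ".hidden-references") -> str:
    """Write the given hashes, one per line, in UTF-8; return the file path."""
    path = os.path.join(output_dir, name)
    with open(path, "w", encoding="utf-8") as f:
        for h in sorted(hashes):
            f.write(h + "\n")
    return path
```

Running only `hidden_references` and failing the build when it returns a non-empty set would correspond to the "detect and error out" first step suggested in the quoted reply.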
Ludovic Courtès wrote on 27 Dec 2018 15:45
(name . Mark H Weaver)(address . mhw@netris.org)
87o9979gfn.fsf@gnu.org
Hello,
Mark H Weaver <mhw@netris.org> wrote:
> Pierre Neidhardt <mail@ambrevar.xyz> writes:
>
>>> : > Store file names are always ASCII so problems arise when they are stored
>>> : > as UTF-16 or UTF-32/UCS-4.
>>> :
>>> : I understand that most programs stick to ASCII filenames, but what about the odd
>>> : one using non-English, special characters?
>>>
>>> That’s a separate debate. :-) Essentially this restriction on store
>>> file names has always been there in Guix (and Nix before that). If we
>>> were to change it, that would raise compatibility issues.
>>
>> But what happens if we attempt to store "á" in the store?
>
> Indeed. Although we might restrict the immediate entries within
> /gnu/store to ASCII characters, file names deeper within those
> directories may have non-ASCII characters. More generally, store
> references may occur within larger strings which might include non-ASCII
> characters.
Right. For example ‘nss-certs’ contains non-ASCII, UTF-8-encoded file names.
For “top-level” store file names, the restriction is enforced by ‘checkStoreName’ in libstore/store-api.cc.
Ludo’.
Pierre Neidhardt wrote on 27 Dec 2018 16:02
(name . Ludovic Courtès)(address . ludo@gnu.org)
87tvizgghs.fsf@ambrevar.xyz
Just to be sure I understand: non-toplevel, non-ASCII file names will not be scanned properly, right?
-- 
Pierre Neidhardt
https://ambrevar.xyz/
Pierre Neidhardt wrote on 27 Dec 2018 17:15
(name . Ludovic Courtès)(address . ludo@gnu.org)
87r2e3gd3c.fsf@ambrevar.xyz
Danny Milosavljevic <dannym@scratchpost.org> writes:
> In 1992, UTF-8 was invented. Subsequently, most of the Internet,
> all new GNU Linux distributions etc, all UNIX GUI frameworks, Subversion
> etc standardized on UTF-8, with the eventual goal of standardizing all
> network transfer and storage to UTF-8. I think that by now the outliers
> are the ones who need to change, otherwise these senseless encoding
> conversions will never cease. It's not like different encodings allow for
> better expression of writings or anything useful to the end user.
>
> As a distribution we can't force upstream to change, but just filing
> bug reports upstream would make us see where they stand on this.
I agree with this. Reporting upstream should be a first step.
-- 
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès wrote on 27 Dec 2018 18:03
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87k1juaomo.fsf@gnu.org
Pierre Neidhardt <mail@ambrevar.xyz> wrote:
> Just to be sure I understand: non-toplevel, non-ASCII file names will
> not be scanned properly, right?
Every file in the store is properly scanned for references. It’s just that users cannot create top-level items with a non-ASCII file name.
I hope this clarifies things!
Ludo’.
Pierre Neidhardt wrote on 27 Dec 2018 19:57
(name . Ludovic Courtès)(address . ludo@gnu.org)
87muoqhk62.fsf@ambrevar.xyz
> Every file in the store is properly scanned for references. It’s just
> that users cannot create top-level items with a non-ASCII file name.
So if '/gnu/store/...-foo/á' is stored as UTF-8 in a binary, then it will be found? Is it because the filesystem encoding is also UTF-8 and Guix scans over byte arrays?
Sorry for dragging on this, I guess I should look at the code at this point but I have very little time these days.
-- 
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès wrote on 27 Dec 2018 22:54
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87zhsq8wkj.fsf@gnu.org
Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>> Every file in the store is properly scanned for references. It’s just
>> that users cannot create top-level items with a non-ASCII file name.
>
> So if '/gnu/store/...-foo/á' is stored as UTF-8 in a binary, then it will be
> found? Is it because the filesystem encoding is also UTF-8 and Guix scans over
> byte arrays?
The reference scanner, currently written in C++, traverses whole directory trees. Being C++ it treats file names as byte arrays so it doesn’t matter what the file name encoding is.
Note also that the reference scanner only looks for “xyz…-foo”; what comes before and after doesn’t matter. So for example if you have “/gnu/store/xyz…-foo/à”, what’s important is the “xyz…-foo” bit.
This is all happening in libstore/references.cc (which is surprisingly small) and in (guix build graft) for the grafting part, which Mark wrote a while back.
HTH,
Ludo’.
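The byte-level behaviour described in this message can be illustrated with a small Python sketch. It is not the code in libstore/references.cc; the hash value, the `scan_references` helper, and the choice to match on the hash part alone (as Mark suggests earlier in the thread) are assumptions made for the example.

```python
# Hypothetical store item; the 32-character hash part is made up.
HASH = b"0123456789abcdfghijklmnpqrsvwxyz"
ITEM = "/gnu/store/" + HASH.decode("ascii") + "-foo"


def scan_references(data: bytes, candidates: dict[bytes, str]) -> set[str]:
    """Return the store items whose hash part occurs anywhere in data.

    The search works on raw bytes, so whatever surrounds the hash
    (directory prefix, trailing path, non-ASCII bytes) is ignored.
    """
    return {item for hash_part, item in candidates.items() if hash_part in data}


candidates = {HASH: ITEM}
path = ITEM + "/à"

# The non-ASCII component may be encoded either way: the ASCII hash
# bytes are identical in both, so the reference is found.
for encoding in ("utf-8", "latin-1"):
    print(encoding, scan_references(path.encode(encoding), candidates))

# The UTF-32 case from the original report: each character is padded
# with zero bytes, so the same byte scan finds nothing.
print("utf-32-le", scan_references(path.encode("utf-32-le"), candidates))
```

The last call shows why the reference in the SBCL image goes unnoticed, while the first two show why the encoding of the trailing “à” is irrelevant.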
Pierre Neidhardt wrote on 27 Dec 2018 23:05
(name . Ludovic Courtès)(address . ludo@gnu.org)
87d0pmhbgn.fsf@ambrevar.xyz
> The reference scanner, currently written in C++, traverses whole
> directory trees. Being C++ it treats file names as byte arrays so it
> doesn’t matter what the file name encoding is.
But what matters then is that the filename encodings on the filesystem and in the binary match, right?
> Note also that the reference scanner only looks for “xyz…-foo”; what
> comes before and after doesn’t matter. So for example if you have
> “/gnu/store/xyz…-foo/à”, what’s important is the “xyz…-foo” bit.
OK, makes sense, then my main worry is just moot :)
-- 
Pierre Neidhardt
https://ambrevar.xyz/
Ludovic Courtès wrote on 27 Dec 2018 23:59
(name . Pierre Neidhardt)(address . mail@ambrevar.xyz)
87r2e28tkv.fsf@gnu.org
Pierre Neidhardt <mail@ambrevar.xyz> wrote:
>> The reference scanner, currently written in C++, traverses whole
>> directory trees. Being C++ it treats file names as byte arrays so it
>> doesn’t matter what the file name encoding is.
>
> But what matters then is that the filename encodings on the filesystem and in the
> binary match, right?
I’m not sure what you call “the binary”. Do you mean the nar?
Ludo’.
Pierre Neidhardt wrote on 28 Dec 2018 08:47
(name . Ludovic Courtès)(address . ludo@gnu.org)
874laygkiy.fsf@ambrevar.xyz
> I’m not sure what you call “the binary”. Do you mean the nar?
No, in this case I referred to "/bin/next" in sbcl-next. So any file in the nar passed to the reference scanner.
-- 
Pierre Neidhardt
https://ambrevar.xyz/