From: Mark H Weaver
To: Ludovic Courtès
Cc: 35350@debbugs.gnu.org
Subject: Re: bug#35350: Some compile output still leaks through with --verbosity=1
Date: Tue, 30 Apr 2019 16:26:32 -0400

Hi Ludovic,

Ludovic Courtès writes:

> Mark H Weaver skribis:
>
>> Ludovic Courtès writes:
>>
>>> The third read(2) call here ends on a partial UTF-8 sequence for LEFT
>>> SINGLE QUOTATION MARK (we get the first two bytes of a
>>> three byte sequence.)
>>>
>>> What happens is that ‘process-stderr’ in (guix store) gets that byte
>>> string from the daemon, passes it through ‘read-maybe-utf8-string’,
>>> which replaces the last two bytes with REPLACEMENT CHARACTER, which is
>>> itself a 3-byte sequence.
>>
>> It seems to me that what's needed here is to save the UTF-8 decoder
>> state between calls to 'process-stderr'.
>
> So there are two things.  To fix the issue you reported (build output
> that goes through), I think we must simply turn off UTF-8 decoding from
> ‘process-stderr’ and leave that entirely to ‘build-event-output-port’.

Can we assume that UTF-8 is the appropriate encoding for
(current-build-output-port)?  My interpretation of the Guix manual entry
for 'current-build-output-port' suggests that the answer should be "no".

Also, in your previous message you wrote:

  The problem is the first layer of UTF-8 decoding that happens in
  ‘process-stderr’, in the ‘%stderr-next’ case.  We would need to
  disable it, but only if the build output port is
  ‘build-event-output-port’ (i.e., it’s capable of interpreting
  “multiplexed build output” correctly.)

It sounds like you're suggesting that 'process-stderr' should look to
see if (current-build-output-port) is a 'build-event-output-port', and
in that case it should use binary I/O primitives to write raw binary
data to it, otherwise it should use text I/O primitives and write
characters to it.  Do I understand correctly?

IMO, it would be cleaner to treat 'build-event-output-port' uniformly,
and specifically as a textual port of unknown encoding.  What do you
think?

> However, ‘build-event-output-port’ would still fail to properly decode
> split UTF-8 sequences, and for that we’d need to preserve decoder
> state as you describe.

I would suggest changing 'build-event-output-port' to create an R6RS
custom *textual* output port, so that it wouldn't have to worry about
encodings at all, and it would only be given whole characters.
Internally, it would be doing exactly what you suggest above, but those
details would be encapsulated within the custom textual port.

However, I don't think we can use Guile's current implementation of R6RS
custom textual output ports, which are currently built on Guile's legacy
soft ports, which I suspect have a similar bug with multibyte characters
sometimes being split (see 'soft_port_write' in vports.c).

Having said all of this, my suggestions would ultimately entail having
two separate places along the stderr pipeline where 'utf8->string!'
would be used, and maybe that's too much until we have a more optimized
C implementation of it.

Thoughts?

      Mark
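
The split-sequence problem discussed in this thread, and the effect of
keeping decoder state between reads, can be illustrated with a small
sketch.  It is written in Python rather than Guile purely for
illustration and does not reproduce any Guix code: the chunk boundary
and the helper names `decode_stateless` and `decode_stateful` are
assumptions made up for this example.

```python
import codecs

# U+2018 LEFT SINGLE QUOTATION MARK encodes to three bytes in UTF-8.
data = "\u2018guix\u2019".encode("utf-8")

# Hypothetical read boundary: the first chunk ends after only two of
# the three bytes of the opening quotation mark.
chunks = [data[:2], data[2:]]


def decode_stateless(chunks):
    """Decode each chunk on its own, forgetting partial sequences."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)


def decode_stateful(chunks):
    """Decode with one incremental decoder that buffers partial
    sequences until the following chunk completes them."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(c) for c in chunks]
    # Flush: a sequence still incomplete at end of input is replaced.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


if __name__ == "__main__":
    print(repr(decode_stateless(chunks)))  # contains U+FFFD
    print(repr(decode_stateful(chunks)))   # original text recovered
```

The stateless variant corrupts the character that straddles the chunk
boundary, while the stateful one returns the original text, which is the
behaviour one would want from whichever layer ends up doing the decoding.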