From debbugs-submit-bounces@debbugs.gnu.org Fri Apr 26 20:47:14 2019 Received: (at 35350) by debbugs.gnu.org; 27 Apr 2019 00:47:14 +0000 Received: from localhost ([127.0.0.1]:33943 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hKBUo-00053I-0L for submit@debbugs.gnu.org; Fri, 26 Apr 2019 20:47:14 -0400 Received: from world.peace.net ([64.112.178.59]:48904) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hKBUl-000535-9J for 35350@debbugs.gnu.org; Fri, 26 Apr 2019 20:47:12 -0400 Received: from mhw by world.peace.net with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.89) (envelope-from ) id 1hKBUe-0003Bw-D1; Fri, 26 Apr 2019 20:47:04 -0400 From: Mark H Weaver To: Ludovic =?utf-8?Q?Court=C3=A8s?= Subject: Re: bug#35350: Some compile output still leaks through with --verbosity=1 References: <87mukkfd2j.fsf@netris.org> <87r29v2jz2.fsf@gnu.org> <87ftq9silk.fsf@netris.org> <87imv5jai5.fsf@gnu.org> <87k1fgh9c0.fsf@netris.org> Date: Fri, 26 Apr 2019 20:45:24 -0400 In-Reply-To: <87k1fgh9c0.fsf@netris.org> (Mark H. Weaver's message of "Fri, 26 Apr 2019 15:09:24 -0400") Message-ID: <87bm0sgts0.fsf@netris.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 35350 Cc: 35350@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --=-=-= Content-Type: text/plain Here's an improved version of the code with doc strings. It also properly handles the case of (target-source >= target-end) in 'utf8->string!'. Mark --=-=-= Content-Type: text/plain; charset=utf-8 Content-Disposition: inline; filename=utf8-decoder.scm Content-Transfer-Encoding: quoted-printable Content-Description: UTF-8 decoder, v2 ;;; Copyright =C2=A9 2019 Mark H Weaver ;;; ;;; This program is free software: you can redistribute it and/or modify ;;; it under the terms of the GNU General Public License as published by ;;; the Free Software Foundation, either version 3 of the License, or ;;; (at your option) any later version. ;;; ;;; This program is distributed in the hope that it will be useful, ;;; but WITHOUT ANY WARRANTY; without even the implied warranty of ;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the ;;; GNU General Public License for more details. ;;; ;;; You should have received a copy of the GNU General Public License ;;; along with this program. If not, see . (use-modules (rnrs bytevectors) ;; the following modules are only needed for the test. ;;(srfi srfi-1) ;;(ice-9 iconv) ) ;; Well-formed UTF-8 sequences ;; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D ;; 00..7F ;; C2..DF 80..BF ;; E0 *A0..BF 80..BF ;; E1..EC 80..BF 80..BF ;; ED 80..9F* 80..BF ;; EE..EF 80..BF 80..BF ;; F0 *90..BF 80..BF 80..BF ;; F1..F3 80..BF 80..BF 80..BF ;; F4 80..8F* 80..BF 80..BF ;; UTF-8 Decoder states ;; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D ;; 0 start state ;; C2 .. DF got 1/2 bytes ;; E0 .. EF got 1/3 bytes ;; F0 .. F4 got 1/4 bytes ;; E0A0 .. ED9F got 2/3 bytes (range 1) ;; EE80 .. EFBF got 2/3 bytes (range 2) ;; F090 .. F48F got 2/4 bytes ;; F09080 .. F48FBF got 3/4 bytes (define-syntax-rule (utf8-decode ((j init-expr) ...) (i continue) (output (code-point) e1 e1* ...) (error (maximal-subpart) e2 e2* ...) state-expr bv-expr start-expr end-expr) "Decode part of a UTF-8 byte stream in the bytevector BV-EXPR with indices in the interval from START-EXPR (inclusive) to END-EXPR (exclusive). STATE-EXPR is the initial decoder state, which must be an incomplete prefix of a valid UTF-8 byte sequence. The start state is 0. =20=20 When a valid UTF-8 byte sequence is found, the output expressions (E1 E1* ...) are evaluated, with the following bindings available in the lexical environment: =20=20 CODE-POINT the decoded code point, as an exact integer. I bytevector index immediately after the decoded sequence. J ... the user-provided seeds. CONTINUE procedure with arguments (i j ...) to continue decoding. OUTPUT procedure with arguments (code-point i j ...), whose body consists of (E1 E1* ...), provided by the user. ERROR procedure with arguments (maximal-subpart i j ...), whose body consists of (E2 E2* ...), provided by the user. =20=20 If you wish for decoding to continue, (E1 E1* ...) should end by tail-calling (CONTINUE I J^ ...), where (J^ ...) are the new seeds. Alternatively, if you wish to terminate decoding early, simply return one or more values, which will be returned to the caller of 'utf8-decode'. Normally, (values 0 I J^ ...) should be returned. =20=20 In case of a decoding error, the expressions (E2 E2* ...) will be called with the same bindings listed above, except CODE-POINT is omitted, and MAXIMAL-SUBPART is bound to the 'maximal subpart of an ill-formed subsequence' as defined in section 3.9 of The Unicode Standard 12.0, i.e. the longest code unit subsequence starting at an inconvertible offset that is either (a) the initial subsequence of a well-formed code unit sequence, or (b) a subsequence of length one. MAXIMAL-SUBPART is represented as an exact integer containing the bytes in big-endian order, e.g. #xF48FBF represents the bytes (F4 8F BF). =20=20 The bindings OUTPUT and ERROR are provided for convenience, in case the error expressions (E2 E2* ...) wish to call the user-provided output procedure (e.g. to output a substitution character), or the output expressions (E1 E1* ...) wish to call the user-provided error procedure. =20=20 If the provided bytes in BV-EXPR end with a non-empty but incomplete prefix of a well-formed UTF-8 byte sequence, then the following values are returned: (NEW-STATE BV-POS J ...). =20=20 When decoding finishes, every byte in the input (including in the initial STATE-EXPR) will have been reported in exactly one of the following ways: =20=20 (1) as part of a well-formed UTF-8 byte sequence, via the output expressions (E1 E1* ...), or =20=20 (2) as part of a 'maximal subpart of an ill-formed subsequence', via the error expressions (E2 E2* ...), or =20=20 (3) as part of the new state. =20=20 (4) in the unexamined indices of BV-EXPR starting with BV-POS." (let ((bv bv-expr) (end end-expr)) (define (output code-point i j ...) e1 e1* ...) (define (error maximal-subpart i j ...) e2 e2* ...) =20=20=20=20 (define (continue i j ...) (if (< i end) (let ((byte (bytevector-u8-ref bv i))) (cond ((<=3D byte #x7F) (output byte (+ i 1) j ...)) ((<=3D #xC2 byte #xF4) (got-1 byte (+ i 1) j ...)) (else (error byte (+ i 1) j ...)))) (values 0 i j ...))) (define (got-1 state i j ...) (if (< i end) (let ((byte (bytevector-u8-ref bv i))) (cond ((not (<=3D #x80 byte #xBF)) (error state i j ...)) ((<=3D state #xDF) (output (logior (ash (logand state #x1F) 6) (logand byte #x3F)) (+ i 1) j ...)) (else (let ((state^ (logior (ash state 8) byte))) (cond ((or (<=3D #xE0A0 state^ #xED9F) (<=3D #xEE80 state^ #xEFBF)) (got-2/3 state^ (+ i 1) j ...)) ((<=3D #xF090 state^ #xF48F) (got-2/4 state^ (+ i 1) j ...)) (else (error state i j ...))))))) (values state i j ...))) (define (got-2/3 state i j ...) (if (< i end) (let ((byte (bytevector-u8-ref bv i))) (if (<=3D #x80 byte #xBF) (output (logior (ash (logand state #xF00) 4) (ash (logand state #x3F) 6) (logand byte #x3F)) (+ i 1) j ...) (error state i j ...))) (values state i j ...))) (define (got-2/4 state i j ...) (if (< i end) (let ((byte (bytevector-u8-ref bv i))) (if (<=3D #x80 byte #xBF) (got-3/4 (logior (ash state 8) byte) (+ i 1) j ...) (error state i j ...))) (values state i j ...))) (define (got-3/4 state i j ...) (if (< i end) (let ((byte (bytevector-u8-ref bv i))) (if (<=3D #x80 byte #xBF) (output (logior (ash (logand state #x70000) 2) (ash (logand state #x3F00) 4) (ash (logand state #x3F) 6) (logand byte #x3F)) (+ i 1) j ...) (error state i j ...))) (values state i j ...))) (define (enter state i j ...) (cond ((zero? state) (continue i j ...)) ((<=3D state #xF4) (got-1 state i j ...)) ((<=3D state #xEFBF) (got-2/3 state i j ...)) ((<=3D state #xF48F) (got-2/4 state i j ...)) (else (got-3/4 state i j ...)))) (enter state-expr start-expr init-expr ...))) (define (utf8->string! state source source-start source-end target target-start target-end) "Decode part of a UTF-8 byte stream from the bytevector SOURCE starting at index SOURCE-START and up to (but not including) index SOURCE-END, and writing into the string TARGET starting at index TARGET-START and up to (but not including) index TARGET-END. Returns three values: NEW-STATE, SOURCE-POS, and TARGET-POS. STATE is either 0 (the start state) or the value of NEW-STATE returned by the previous call. In case of errors, each 'maximal subpart of an ill-formed subsequence', as defined in section 3.9 of The Unicode Standard 12.0, is replaced with a Unicode replacement character (U+FFFD)." (if (< target-start target-end) (utf8-decode ((j target-start)) (i continue) (output (code-point) (string-set! target j (integer->char code-point)) (if (< (+ j 1) target-end) (continue i (+ j 1)) (values 0 i (+ j 1)))) (error (maximal-subpart) (output #xFFFD i j)) ;TODO: support other error h= andlers state source source-start source-end) (values state source-start target-start))) ;; Another experimental primitive, slower than the ones above. (define* (utf8-fold* out err seed state bv #:optional (start 0) (end (bytevector-length bv))) "Decode part of a UTF-8 byte stream from the bytevector SOURCE starting at index SOURCE-START and up to (but not including) index SOURCE-END. Returns three values: NEW-STATE, SOURCE-POS, and FINAL-SEED. STATE is either 0 (the start state) or the value of NEW-STATE returned by the previous call. For each valid code point, call (OUT CODE-POINT SOURCE-POS SEED K), which should either call (K SOURCE-POS NEW-SEED) to continue decoding, or return three values (0 SOURCE-POS NEW-SEED) which will terminate decoding and immediately exit. In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED K) where MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed subsequence', as defined in section 3.9 of The Unicode Standard 12.0. Similarly, ERR should either call (K SOURCE-POS NEW-SEED) to continue decoding, or return to exit immediately." (utf8-decode ((j seed)) (i continue) (output (code-point) (out code-point i j continue)) (error (maximal-subpart) (err maximal-subpart i j continue)) state bv start end)) ;; Another experimental primitive, slower than the ones above. (define* (utf8-fold out err seed state bv #:optional (start 0) (end (bytevector-length bv))) "Decode part of a UTF-8 byte stream from the bytevector SOURCE starting at index SOURCE-START and up to (but not including) index SOURCE-END. Returns three values: NEW-STATE, SOURCE-POS, and FINAL-SEED. STATE is either 0 (the start state) or the value of NEW-STATE returned by the previous call. For each valid code point, call (OUT CODE-POINT SOURCE-POS SEED), which should return a new SEED. In case of errors, call (ERR MAXIMAL-SUBPART SOURCE-POS SEED) where MAXIMAL-SUBPART is a 'maximal subpart of an ill-formed subsequence', as defined in section 3.9 of The Unicode Standard 12.0. ERR should return two values: a boolean specifying whether to continue decoding, and a new seed." (utf8-fold* (lambda (code-point i j continue) (continue i (out code-point i j))) (lambda (maximal-subpart i j continue) (call-with-values (lambda () (err maximal-subpart i j)) (lambda (continue? j^) (if continue? (continue i j^) (values 0 i j^))))) seed state bv start end)) ;; A not-so-quick test of all valid characters. ;; TODO: Tests of strictness and error handling. #; (let () (define ss (string-tabulate (lambda (i) (if (< i #xD800) (integer->char i) (integer->char (+ i #x800)))) (- #x110000 #x800))) (define bv (string->utf8 ss)) (define bv-len (bytevector-length bv)) (define slen (* 2 (string-length ss))) (define s (make-string slen)) (every (lambda (incr) (string-fill! s #\a) (call-with-values (lambda () (let loop ((state 0) (i 0) (j 0)) (if (< i bv-len) (call-with-values (lambda () (utf8->string! state bv i (min bv-len (+ i incr)) s j slen)) loop) (values state i j)))) (lambda (state i j) (and (zero? state) (=3D i bv-len) (=3D j (string-length ss)) (string=3D? ss (substring s 0 j)))))) (iota 5 1))) --=-=-=--