From: zimoun
Date: Tue, 3 Mar 2020 20:21:46 +0100
Subject: Re: [PATCH 4/4] gnu: Use xapian index for package search.
To: Arun Isaac
Cc: Ludovic Courtès, 39258@debbugs.gnu.org

Hi Arun,

On Thu, 27 Feb 2020 at 21:42, Arun Isaac wrote:
>
> * gnu/packages.scm (search-package-index): New function.
> * guix/scripts/package.scm (find-packages-by-description): Search using the
> xapian package index if search patterns are literal strings. Else, search
> using fold-packages.
> ---
>  gnu/packages.scm         | 17 +++++++++++-
>  guix/scripts/package.scm | 57 +++++++++++++++++++++++-----------------
>  2 files changed, 49 insertions(+), 25 deletions(-)
>
> diff --git a/gnu/packages.scm b/gnu/packages.scm
> index e91753e2a8..5b5b29bf84 100644
> --- a/gnu/packages.scm
> +++ b/gnu/packages.scm
> @@ -67,7 +67,8 @@
>              specifications->manifest
>
>              generate-package-cache
> -            generate-package-search-index))
> +            generate-package-search-index
> +            search-package-index))
>
>  ;;; Commentary:
>  ;;;
> @@ -453,6 +454,20 @@ reducing the memory footprint."
>
>      db-path)
>
> +(define (search-package-index profile querystring)
> +  (let ((offset 0)
> +        (pagesize 10))

Why this value of 10? It fixes the number of packages returned. Hmm?
I tried replacing it with 100 and got 100 packages. :-)

> +    (call-with-database (string-append profile %package-search-index)
> +      (lambda (db)
> +        (let ((query (parse-query querystring #:stemmer (make-stem "en"))))
> +          (mset-fold (lambda (item result)

I do not know what the naming convention is for the bindings. But there
is 'fold-packages', so I would lean toward 'fold-msets' or something
along those lines.

> +                       (match (find-packages-by-name
> +                               (document-data (mset-item-document item)))
> +                         ((package _ ...)
> +                          (append result `((,package . ,(mset-item-weight item)))))))
> +                     '()
> +                     (enquire-mset (enquire db query) offset pagesize)))))))
> +
>
>  (define %sigint-prompt
>    ;; The prompt to jump to upon SIGINT.
> diff --git a/guix/scripts/package.scm b/guix/scripts/package.scm
> index 1cb0d382bf..6a3b9002dd 100644
> --- a/guix/scripts/package.scm
> +++ b/guix/scripts/package.scm
> @@ -7,6 +7,7 @@
>  ;;; Copyright © 2016 Benz Schenk
>  ;;; Copyright © 2016 Chris Marusich
>  ;;; Copyright © 2019 Tobias Geerinckx-Rice
> +;;; Copyright © 2020 Arun Isaac
>  ;;;
>  ;;; This file is part of GNU Guix.
>  ;;;
> @@ -178,31 +179,40 @@ hooks\" run when building the profile."
>  ;;; Package specifications.
>  ;;;
>
> -(define (find-packages-by-description regexps)
> +(define (find-packages-by-description patterns)
>    "Return a list of pairs: packages whose name, synopsis, description,
>  or output matches at least one of REGEXPS sorted by relevance, and its
>  non-zero relevance score."
> -  (let ((matches (fold-packages (lambda (package result)
> -                                  (if (package-superseded package)
> -                                      result
> -                                      (match (package-relevance package
> -                                                                regexps)
> -                                        ((? zero?)
> -                                         result)
> -                                        (score
> -                                         (cons (cons package score)
> -                                               result)))))
> -                                '())))
> -    (sort matches
> -          (lambda (m1 m2)
> -            (match m1
> -              ((package1 . score1)
> -               (match m2
> -                 ((package2 . score2)
> -                  (if (= score1 score2)
> -                      (string>? (package-full-name package1)
> -                                (package-full-name package2))
> -                      (> score1 score2))))))))))
> +  (define (regexp? str)
> +    (string-any
> +     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
> +     str))

Instead of reverting this, I would leave the current
'find-packages-by-description' as is and add
'find-packages-by-description-indexed', which would just do
'(search-package-index (current-profile) (string-join patterns " "))'.
And maybe refactor the sorting of scores.

Then I would put the test branch in 'guix/scripts/package.scm'...

> +  (if (and (current-profile)
> +           (not (any regexp? patterns)))
> +      (search-package-index (current-profile) (string-join patterns " "))
> +      (let* ((regexps (map (cut make-regexp* <> regexp/icase) patterns))
> +             (matches (fold-packages (lambda (package result)
> +                                       (if (package-superseded package)
> +                                           result
> +                                           (match (package-relevance package

Note that I am in the process of implementing the BM25 weights as
'package-relevance', or at least seriously thinking about it! :-)

I have already talked about TF-IDF as relevance, for example here [1].
And from reading the Xapian documentation [2], it seems doable. Or
not ;-) because of the regexps... It needs some thought, hence "in the
process". ;-)

And in this case, it is almost a drop-in replacement of 'fold-packages'
by 'mset-fold'; it should add some flexibility and make the code more
unified.

(Aside from searching, IMHO 'package-relevance' should also help with
linting badly written descriptions, but that is another story. ;-)

[1] https://lists.gnu.org/archive/html/guix-devel/2019-07/msg00252.html
[2] https://xapian.org/docs/bm25.html

> +                                                                     regexps)
> +                                             ((? zero?)
> +                                              result)
> +                                             (score
> +                                              (cons (cons package score)
> +                                                    result)))))
> +                                     '())))
> +        (sort matches
> +              (lambda (m1 m2)
> +                (match m1
> +                  ((package1 . score1)
> +                   (match m2
> +                     ((package2 . score2)
> +                      (if (= score1 score2)
> +                          (string>? (package-full-name package1)
> +                                    (package-full-name package2))
> +                          (> score1 score2)))))))))))
>
>  (define (transaction-upgrade-entry store entry transaction)
>    "Return a variant of TRANSACTION that accounts for the upgrade of ENTRY, a
> @@ -777,8 +787,7 @@ processed, #f otherwise."

...here.

+  (define (regexp? str)
+    (string-any
+     (char-set #\. #\[ #\{ #\} #\( #\) #\\ #\* #\+ #\? #\| #\^ #\$)
+     str))

>                                        (('query 'search rx) rx)
>                                        (_                   #f))
>                                      opts))
>
> -                (regexps  (map (cut make-regexp* <> regexp/icase) patterns))
> -                (matches  (find-packages-by-description regexps)))

+                (matches  (if (any regexp? patterns)
+                              (find-packages-by-description regexps)
+                              (find-packages-by-description-indexed patterns))))

I mean something like that.

>            (leave-on-EPIPE
>             (display-search-results matches (current-output-port)))
>            #t))
> --
> 2.23.0

All the best,
simon