From: zimoun
Date: Mon, 9 Mar 2020 13:28:08 +0100
Subject: Re: [PATCH v2 0/3] Xapian for Guix package search
To: Arun Isaac
Cc: Ludovic Courtès, Pierre Neidhardt, 39258@debbugs.gnu.org

Hi,

On Sat, 7 Mar 2020 at 14:31, Arun Isaac wrote:

> --8<---------------cut here---------------start------------->8---
> With a warm cache,
> $ time guix search inkscape
>
> real 0m1.787s
> user 0m1.745s
> sys 0m0.111s
> --8<---------------cut here---------------end--------------->8---
>
> --8<---------------cut here---------------start------------->8---
> $ time /tmp/test/bin/guix search inkscape
>
> real 0m0.199s
> user 0m0.182s
> sys 0m0.024s
> --8<---------------cut here---------------end--------------->8---

IMHO, it would be interesting to compare the lists of results and their
order for both queries, as I did with Emacs. Speed is one thing, and it
was the initial motivation. But accuracy is maybe more important.

> - The package cache would grow in size, and lookup would be slowed down
> because we need to load the entire cache into memory. Xapian, on the other
> hand, need only look up the specific packages that match the search query.

I agree that 'fold-packages' could soon become a bottleneck.
IMHO, 'mset-fold' should be a drop-in replacement for 'fold-packages' in
the search function.

> - Xapian can provide superior search results due to it stemming and language
> models.
> - Xapian can provide spelling correction and query expansion -- that is,
> suggest search terms to improve search results. Note that I haven't
> implemented this yet and is out of scope in this patchset.

I agree too that Xapian should improve the user experience when searching.

> * Simplify our package search results
>
> Why not use a simpler package search results format like Arch Linux or Debian
> does? We could just display the package name, version and synopsis like so.
>
> inkscape 0.92.4
> Vector graphics editor
> inklingreader 0.8
> Wacom Inkling sketch format conversion and manipulation2
>
> Why do we need the entire recutils format? If the user is interested, they can
> always use `guix package --show` to get the full recutils formatted
> info. Having shorter search results will make everything even faster and much
> more readable. WDYT?

I disagree. What I proposed some time ago was to offer different flavours
of the search output, as with 'git log --pretty=oneline', etc. By default,
the output would be what you suggest. Then "guix search --format=full"
would print the current output. And we could imagine mimicking the Git log
strategy: "guix search --format="%name %version\n%license"", etc. WDYT?

> > Is (make-stem "en") for the locale?
>
> I still have English hard-coded. I haven't yet figured out how to detect the
> locale and stem accordingly. But, there is a larger problem. Since we cannot
> anticipate what locale the user will run guix search with, should we build the
> Xapian index for all locales? That is, should we index not only the English
> versions of the packages but also all other translations as well?

I understand. Let's consider that for the next round.

> > package-search-index and package-cache-file could be refactored
> > because they share all the same code.
>
> Yes, they could be. However, I'll postpone to the next iteration of the
> patchset.

Ok.

> > I do not know what is the convention for the bindings.
> > But there is 'fold-packages' so I would be inclined to 'fold-msets' or
> > something in this flavour.
>
> Well, everywhere else in guile we have such things as vhash-fold, string-fold,
> hash-fold, stream-fold, etc. That's why I went with mset-fold. Also, we are
> folding over a single mset (match-set). So, mset should be in the singular.

I understand.

> > And more importantly, 'make as-derivations' to avoid a "guix pull" breakage,
> > Ah do not forget to adapt some tests.
>
> Will do this once we have consensus about the other features of this patchset.

And we should test that on different machines and in different states.

> > Xapian does not return the package 'emacs' itself as the first. And worse,
> > it is not returned at all.
>
> In this patchset, since we're indexing the package name as well, emacs is
> returned but it is still far from the beginning.

This is an issue. IMHO, it is because of the BM25 score: it is too rough
and some weights should be applied. But that is another story. The fix is:

 a- provide a scoring function to Xapian, as the doc explains
 b- adapt 'fold-packages' to 'mset-fold' in 'find-packages-by-description',
    implement our own version of BM25, and then use it in 'relevance'

> > I propose the value of 4294967295 for pagesize.
>
> In this patchset, I pass (database-document-count db) as the #:maximum-items
> keyword argument to enquire-mset. This is the upstream recommended way to get
> all search results. I hadn't done this earlier since I hadn't yet wrapped
> database-document-count in guile-xapian.

Cool!

> My laptop is quite old with a particularly slow HDD. Hence my motivation to
> improve guix search performance!

I agree. But performance is not everything. Accuracy counts more!
:-)

> > I think we should weigh the pros and cons on all these aspects: speed,
> > complexity and maintenance cost, search result quality, search features,
> > etc.
>
> I agree.

I agree too. We should write a benchmark, for example using Emacs as the
query, or something more complex that we could think of.

All the best,
simon
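
PS: a minimal sketch of what such a benchmark could measure, written in
Python only for brevity. It compares two ranked result lists for the same
query: where the expected package lands in each, and how much the top
results overlap. The helper names (rank_of, top_k_overlap, compare) and
both result lists are invented for illustration; they are not real
'guix search' output and not part of the patchset.

```python
def rank_of(results, package):
    """Return the 1-based rank of `package` in `results`, or None if absent."""
    try:
        return results.index(package) + 1
    except ValueError:
        return None


def top_k_overlap(first, second, k=10):
    """Jaccard overlap between the top-k entries of two ranked lists."""
    top_first, top_second = set(first[:k]), set(second[:k])
    union = top_first | top_second
    # Two empty lists are treated as identical.
    return len(top_first & top_second) / len(union) if union else 1.0


def compare(expected, current, candidate, k=10):
    """Summarise how two search backends rank the same query."""
    return {
        "rank_current": rank_of(current, expected),
        "rank_candidate": rank_of(candidate, expected),
        "overlap": top_k_overlap(current, candidate, k),
    }


# Hypothetical result lists for the query "emacs" (made up, not measured).
current = ["emacs", "emacs-minimal", "emacs-no-x", "emacs-xwidgets"]
candidate = ["emacs-minimal", "emacs-no-x", "emacs-evil", "emacs"]

report = compare("emacs", current, candidate, k=3)
print(report)
```

Running the same comparison over a handful of queries, on different
machines, would give numbers for accuracy next to the timings.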