Bug 278424 - deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer
Summary: deskutils/py-paperless-ngx: man page doesn't mention NLTK's Snowball Stemmer
Status: Closed FIXED
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Some People
Assignee: Michael Gmelin
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-18 07:21 UTC by freebsd.bugzilla
Modified: 2024-04-27 10:10 UTC (History)
0 users

See Also:
bugzilla: maintainer-feedback? (grembo)


Description freebsd.bugzilla 2024-04-18 07:21:48 UTC
The man page states, about setting up NLTK:

> NLTK DATA
>     In order to process scanned documents using machine learning, paperless-
>     ngx requires NLTK (natural language toolkit) data.  The required files
>     can be downloaded by using these commands:
>
>           /usr/local/bin/python3.9 -m nltk.downloader \
>             stopwords punkt -d /var/db/paperless/nltkdata

The command does not download the "snowball_data" package. The project's documentation refers to it (https://docs.paperless-ngx.com/setup/#bare_metal):

> Optional: If using the NLTK machine learning processing (see PAPERLESS_ENABLE_NLTK for details), download the NLTK data for the Snowball Stemmer, Stopwords and Punkt tokenizer to your PAPERLESS_DATA_DIR/nltk. Refer to the NLTK instructions for details on how to download the data.

I can't say how much NLTK benefits from it in practice, but its description makes it sound very useful (https://github.com/snowballstem/snowball?tab=readme-ov-file#what-is-stemming):

> What is Stemming?
> Stemming maps different forms of the same word to a common "stem" - for example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
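
To make the quoted idea concrete, here is a toy illustration only. It is not the Snowball algorithm: the function name `toy_stem` and its hand-picked suffix list are assumptions chosen so that the README's five example words collapse to the same stem.

```sh
#!/bin/sh
# Toy suffix stripper, NOT the Snowball algorithm: the suffix list is
# hand-picked so that the README's example words share one stem.
toy_stem() {
    printf '%s\n' "$1" | sed -E 's/(ions|ion|ive|ed|ing)$//'
}

# Print each example word next to the stem the toy rule produces.
for word in connection connections connective connected connecting; do
    printf '%s -> %s\n' "$word" "$(toy_stem "$word")"
done
```

A real stemmer applies many more rules and conditions than this single substitution, which is why the data and code shipped with Snowball and NLTK are used instead.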

I suggest adding "snowball_data" to the man page's sample NLTK download command so that it matches the project's docs and is more useful to users of this port (thanks for it, btw!).
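
As a sketch of what the suggested command could look like (not the committed man page text): the interpreter path and data directory below are copied from the man page excerpt quoted above and may differ on your system, and the helper name `nltk_download_cmd` is hypothetical. The script only prints the command so it can be reviewed first.

```sh
#!/bin/sh
# Assumed values, taken from the man page excerpt quoted in this report;
# adjust them to the Python version and data directory on your system.
PYTHON="/usr/local/bin/python3.9"
NLTK_DATA_DIR="/var/db/paperless/nltkdata"

# Print the proposed download command, with snowball_data added to the
# stopwords and punkt packages that the man page already lists.
nltk_download_cmd() {
    echo "$PYTHON -m nltk.downloader stopwords punkt snowball_data -d $NLTK_DATA_DIR"
}

nltk_download_cmd
```

Running the printed command performs the actual download, which needs network access and write permission on the data directory.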
Comment 1 Michael Gmelin freebsd_committer freebsd_triage 2024-04-18 18:05:09 UTC
Thanks, that's a good addition, will be in the next port revision.
Comment 2 commit-hook freebsd_committer freebsd_triage 2024-04-24 22:00:12 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=d716cbc8e98a9410782bbad8139f7b6457920e5b

commit d716cbc8e98a9410782bbad8139f7b6457920e5b
Author:     Michael Gmelin <grembo@FreeBSD.org>
AuthorDate: 2024-04-24 12:39:17 +0000
Commit:     Michael Gmelin <grembo@FreeBSD.org>
CommitDate: 2024-04-24 21:56:50 +0000

    deskutils/py-paperless-ngx: Fixes and improvements

    - Fix NLTK configuration in paperless.conf.sample
    - Mention snowball_data in NLTK instructions
    - Fix sad defect that prevented rc scripts from picking
      up configuration data from rc.conf
    - Improve startup script documentation
    - Document required ACL when using a remote redis instance
    - Document how to enable JBIG2 and its potential pitfalls

    PR:             278424, 278425, 278426
    Reported by:    anonymous bugzilla user, netchild
    MFH:            2024Q2

 deskutils/py-paperless-ngx/Makefile                |  1 +
 deskutils/py-paperless-ngx/files/paperless-beat.in | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-consumer.in   |  8 +++++---
 .../py-paperless-ngx/files/paperless-flower.in     | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-migrate.in    | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-ngx.7.in      | 24 +++++++++++++++++++---
 .../py-paperless-ngx/files/paperless-webui.in      | 20 +++++++++++-------
 .../py-paperless-ngx/files/paperless-worker.in     | 12 ++++++-----
 .../files/patch-paperless.conf.example             | 10 ++++-----
 deskutils/py-paperless-ngx/files/pkg-message.in    | 21 +++++++++++++++++++
 10 files changed, 93 insertions(+), 39 deletions(-)
Comment 3 commit-hook freebsd_committer freebsd_triage 2024-04-24 23:11:26 UTC
A commit in branch 2024Q2 references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=27a97f780c56cf51b482aa828ad0e20e24f23fcc

commit 27a97f780c56cf51b482aa828ad0e20e24f23fcc
Author:     Michael Gmelin <grembo@FreeBSD.org>
AuthorDate: 2024-04-24 12:39:17 +0000
Commit:     Michael Gmelin <grembo@FreeBSD.org>
CommitDate: 2024-04-24 23:09:18 +0000

    deskutils/py-paperless-ngx: Fixes and improvements

    - Fix NLTK configuration in paperless.conf.sample
    - Mention snowball_data in NLTK instructions
    - Fix sad defect that prevented rc scripts from picking
      up configuration data from rc.conf
    - Improve startup script documentation
    - Document required ACL when using a remote redis instance
    - Document how to enable JBIG2 and its potential pitfalls

    PR:             278424, 278425, 278426
    Reported by:    anonymous bugzilla user, netchild
    MFH:            2024Q2

    (cherry picked from commit d716cbc8e98a9410782bbad8139f7b6457920e5b)

 deskutils/py-paperless-ngx/Makefile                |  1 +
 deskutils/py-paperless-ngx/files/paperless-beat.in | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-consumer.in   |  8 +++++---
 .../py-paperless-ngx/files/paperless-flower.in     | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-migrate.in    | 12 ++++++-----
 .../py-paperless-ngx/files/paperless-ngx.7.in      | 24 +++++++++++++++++++---
 .../py-paperless-ngx/files/paperless-webui.in      | 20 +++++++++++-------
 .../py-paperless-ngx/files/paperless-worker.in     | 12 ++++++-----
 .../files/patch-paperless.conf.example             | 10 ++++-----
 deskutils/py-paperless-ngx/files/pkg-message.in    | 21 +++++++++++++++++++
 10 files changed, 93 insertions(+), 39 deletions(-)