Latest update seems to have similar issue as before, resources not found
-
@girish Weird. Adding that to the conf, while still on v1.5.1, causes it to fail again.
The contents of that directory are: corpora, stemmers, tokenizers
Despite the error, which reads as follows:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/usr/local/share/ntlk_data'
**********************************************************************
there is /usr/local/share/ntlk_data/corpora/stopwords, which includes a stopwords.zip file.
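In case it helps, the downloader named in the error can be pointed at that exact directory to (re)fetch the corpus. A minimal sketch, using the path verbatim from the error above (it assumes that directory is writable, which may not hold here):

    import nltk

    # fetch the stopwords corpus into the directory the error says is searched
    # (path copied from the error above; requires write access)
    nltk.download("stopwords", download_dir="/usr/local/share/ntlk_data")

    # confirm NLTK can now resolve it from that directory
    nltk.data.path.insert(0, "/usr/local/share/ntlk_data")
    print(nltk.data.find("corpora/stopwords"))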
I noticed that in the conf file, the env
PAPERLESS_NLTK_DIR=/usr/local/share/ntlk_data
has the NLTK highlighted in brown, which must mean that the syntax is off, right? Finally, I also noticed at the bottom of the env file that the following are all commented out:
# Binaries
#PAPERLESS_CONVERT_BINARY=/usr/bin/convert
#PAPERLESS_GS_BINARY=/usr/bin/gs
#PAPERLESS_OPTIPNG_BINARY=/usr/bin/optipng
Commenting out the line you suggested I add (still running v1.5.1) gets the app functioning properly again.
I guess I will now try to update again, then add that line to the conf file and see what happens.
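Before and after the update, a quick way to check what NLTK will actually search (a sketch, assuming a Python shell inside the app container; it mimics the single-entry search path the logs below show):

    import os
    import nltk

    # value set in the conf file (assumes this shell sees the same environment)
    nltk_dir = os.environ.get("PAPERLESS_NLTK_DIR", "/usr/local/share/ntlk_data")

    # search only that directory, mirroring what the logs show paperless doing
    nltk.data.path = [nltk_dir]
    print(nltk.data.find("corpora/stopwords"))  # LookupError reproduces the bug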
-
@scooke I've updated to v1.5.2, and it fails both with and without that conf setting. It still says it can't find corpora/stopwords, even though it's there.
Logs:
Jan 31 22:31:41 [2023-01-31 21:31:41,532] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] received
Jan 31 22:31:41 [2023-01-31 21:31:41,652] [INFO] [paperless.consumer] Consuming Doc - May 25, 2014, 11-08 AM.pdf
Jan 31 22:31:44 [2023-01-31 21:31:44,150] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 4.01 - no change
Jan 31 22:31:53 [2023-01-31 21:31:53,935] [INFO] [ocrmypdf._sync] Postprocessing...
Jan 31 22:31:54 [2023-01-31 21:31:54,867] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.40 savings: 28.3%
Jan 31 22:31:54 [2023-01-31 21:31:54,872] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.078 * 10 changes in 300 seconds. Saving...
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.079 * Background saving started by pid 1303
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.083 * DB saved on disk
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.084 * RDB: 0 MB of memory used by copy-on-write
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.180 * Background saving terminated with success
Jan 31 22:32:00 [2023-01-31 21:32:00,895] [ERROR] [paperless.consumer] The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00 [2023-01-31 21:32:00,915] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] raised unexpected: ConsumerError("Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf: \n**********************************************************************\n Resource \x1b[93mstopwords\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mcorpora/stopwords\x1b[0m\n\n Searched in:\n - '/usr/local/share/ntlk_data'\n**********************************************************************\n")
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/asgiref/sync.py", line 302, in main_wrap
Jan 31 22:32:00 raise exc_info[1]
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 The above exception was the direct cause of the following exception:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 451, in trace_task
Jan 31 22:32:00 R = retval = fun(*args, **kwargs)
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 734, in __protected_call__
Jan 31 22:32:00 return self.run(*args, **kwargs)
Jan 31 22:32:00 File "/app/code/src/documents/tasks.py", line 192, in consume_file
Jan 31 22:32:00 document = Consumer().try_consume_file(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 468, in try_consume_file
Jan 31 22:32:00 self._fail(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 93, in _fail
Jan 31 22:32:00 raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
Jan 31 22:32:00 documents.consumer.ConsumerError: Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
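For what it's worth, the failing call can be reproduced outside of paperless. A minimal sketch mirroring the preprocess_content frame in the traceback, with the search path taken verbatim from the logs:

    import nltk
    from nltk.corpus import stopwords

    # restrict the search path to the single directory shown in the logs
    nltk.data.path = ["/usr/local/share/ntlk_data"]

    # this mirrors classifier.preprocess_content(); it raises the same
    # LookupError if the corpus cannot be resolved from that directory
    print(stopwords.words("english")[:10])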
-
@scooke said in Latest update seems to have similar issue as before, resources not found:
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
So, this is wrong. It should be corpora/stopwords/, I think.
root@07a7fcf3-f9ff-4051-9d89-dec6ed4a4777:/usr/local/share/nltk_data/corpora/stopwords# ls
README       basque   chinese  english  german  hinglish    italian  norwegian   russian  swedish
arabic       bengali  danish   finnish  greek   hungarian   kazakh   portuguese  slovene  tajik
azerbaijani  catalan  dutch    french   hebrew  indonesian  nepali   romanian    spanish  turkish
-
@girish Although there is definitely a stopwords.zip in the corpora/stopwords directory, there is also a stopwords folder (unzipped, I suppose). That is one thing to correct then.
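For reference, NLTK's corpus loader tries both layouts, first corpora/stopwords.zip and then the unzipped corpora/stopwords/ directory (which is why the tracebacks above show two attempts), so either form on disk should resolve. A quick check, assuming the unzipped data under the path from girish's listing:

    import nltk

    # point NLTK at the directory in question (path as listed above,
    # not the 'ntlk_data' path that appears in the logs)
    nltk.data.path = ["/usr/local/share/nltk_data"]

    # resolves via corpora/stopwords.zip or the unzipped corpora/stopwords/
    print(nltk.data.find("corpora/stopwords"))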
I found this link: https://aur.archlinux.org/packages/paperless-ngx. Six comments from the bottom, ammo wrote that executing
ln -s /usr/share/nltk_data /usr/local/share/nltk_data
fixed it for him. I tried it, but of course these are read-only directories. So maybe this is a second thing to fix?
-
@girish said in Latest update seems to have similar issue as before, resources not found:
/usr/local/share/nltk_data/corpora/stopwords
Yep
root@mypaperlessimagelongnumbernamethingy:/usr/local/share/nltk_data/corpora/stopwords# ls -al
total 160
drwxr-xr-x 2 root root  4096 Jan 26 09:05 .
drwxr-xr-x 3 root root  4096 Jan 26 09:05 ..
-rw-r--r-- 1 root root   909 Jan 26 09:05 README
-rw-r--r-- 1 root root  6348 Jan 26 09:05 arabic
-rw-r--r-- 1 root root   967 Jan 26 09:05 azerbaijani
-rw-r--r-- 1 root root  2202 Jan 26 09:05 basque
-rw-r--r-- 1 root root  5443 Jan 26 09:05 bengali
-rw-r--r-- 1 root root  1558 Jan 26 09:05 catalan
-rw-r--r-- 1 root root  5560 Jan 26 09:05 chinese
-rw-r--r-- 1 root root   424 Jan 26 09:05 danish
-rw-r--r-- 1 root root   453 Jan 26 09:05 dutch
-rw-r--r-- 1 root root   936 Jan 26 09:05 english
-rw-r--r-- 1 root root  1579 Jan 26 09:05 finnish
-rw-r--r-- 1 root root   813 Jan 26 09:05 french
-rw-r--r-- 1 root root  1362 Jan 26 09:05 german
-rw-r--r-- 1 root root  2167 Jan 26 09:05 greek
-rw-r--r-- 1 root root  1836 Jan 26 09:05 hebrew
-rw-r--r-- 1 root root  5958 Jan 26 09:05 hinglish
-rw-r--r-- 1 root root  1227 Jan 26 09:05 hungarian
-rw-r--r-- 1 root root  6446 Jan 26 09:05 indonesian
-rw-r--r-- 1 root root  1654 Jan 26 09:05 italian
-rw-r--r-- 1 root root  3880 Jan 26 09:05 kazakh
-rw-r--r-- 1 root root  3610 Jan 26 09:05 nepali
-rw-r--r-- 1 root root   851 Jan 26 09:05 norwegian
-rw-r--r-- 1 root root  1286 Jan 26 09:05 portuguese
-rw-r--r-- 1 root root  1910 Jan 26 09:05 romanian
-rw-r--r-- 1 root root  1235 Jan 26 09:05 russian
-rw-r--r-- 1 root root 15980 Jan 26 09:05 slovene
-rw-r--r-- 1 root root  2176 Jan 26 09:05 spanish
-rw-r--r-- 1 root root   559 Jan 26 09:05 swedish
-rw-r--r-- 1 root root  1818 Jan 26 09:05 tajik
-rw-r--r-- 1 root root   260 Jan 26 09:05 turkish
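Since the files are demonstrably present, one remaining thing worth ruling out is whether the application user can actually read them. A small sketch over the same directory:

    import os

    base = "/usr/local/share/nltk_data/corpora/stopwords"

    # check that the directory can be traversed and a corpus file read
    print(os.access(base, os.R_OK | os.X_OK))
    with open(os.path.join(base, "english")) as f:
        print(f.read().splitlines()[:5])  # first few English stopwords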
-
@scooke Can you try with 1.5.4 please?
I think I found the issue. First, the classifier needs to be created with document_create_classifier. One also has to add some tags, categories, etc. to documents. Once all of that is set up, the NLTK code kicks in. Without a classifier, NLTK is simply skipped.
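A hypothetical sketch of that control flow, with names borrowed from the traceback above rather than the verbatim paperless source:

    def match_tags(document, classifier):
        # without a trained classifier there is nothing to predict, so the
        # NLTK-backed preprocessing (and its stopwords lookup) never runs
        if classifier is None:
            return []
        # once a classifier exists, predict_tags() preprocesses the content,
        # which is the call chain that tries to load the stopwords corpus
        return classifier.predict_tags(document.content)

-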