Latest update seems to have similar issue as before, resources not found
-
Package v1.5.2 has a problem; fortunately I had a working backup from v1.5.1.
The error says:
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('stopwords')
PDFs would upload but wouldn't get processed.
-
@scooke Actually, I can't reproduce this. Where do you see the message "Please use the NLTK Downloader to obtain the resource:"?
NLTK data has been included in the image since package v1.5.0.
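For context, NLTK assembles its data search path from the NLTK_DATA environment variable plus a set of per-user and system-wide defaults. The sketch below is a simplified illustration of that lookup order (the function name `nltk_search_paths` is mine, not NLTK API; see `nltk.data.path` for the authoritative list):

```python
import os

def nltk_search_paths(environ=None):
    """Rough sketch of how NLTK builds its data search path:
    directories from NLTK_DATA come first, then the defaults."""
    environ = os.environ if environ is None else environ
    paths = []
    if environ.get("NLTK_DATA"):
        # NLTK_DATA may hold several directories, separated like PATH
        paths.extend(environ["NLTK_DATA"].split(os.pathsep))
    paths.extend([
        os.path.expanduser("~/nltk_data"),   # per-user data
        "/usr/share/nltk_data",              # system-wide defaults (Unix)
        "/usr/local/share/nltk_data",
        "/usr/lib/nltk_data",
        "/usr/local/lib/nltk_data",
    ])
    return paths

# With nothing set, only the defaults are candidates
print(nltk_search_paths({}))
```

This is why the traceback above matters: it shows the app searching only '/usr/share/nltk_data', so wherever the image actually ships the corpora has to match that path (or be added via the environment).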
-
wrote on Jan 31, 2023, 6:30 PM
@girish It had worked when I restored to v1.5.1, but just today the problem returned.
In the web app's Files dashboard, under the "Failed" tab, is this error:
Drop Everything and Read—but How_.pdf: The following error occurred while consuming Drop Everything and Read—but How_.pdf:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/usr/share/nltk_data'
**********************************************************************
Here are logs from the Cloudron app dashboard:
Jan 31 19:27:48 Searched in:
Jan 31 19:27:48 - '/usr/share/nltk_data'
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48
Jan 31 19:27:48 [2023-01-31 18:27:48,579] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[fa64d1a1-7726-4390-ad33-1094d0600517] raised unexpected: ConsumerError("Englishlanguagelearners.pdf: The following error occurred while consuming Englishlanguagelearners.pdf: \n**********************************************************************\n Resource \x1b[93mstopwords\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mcorpora/stopwords\x1b[0m\n\n Searched in:\n - '/usr/share/nltk_data'\n**********************************************************************\n")
Jan 31 19:27:48 Traceback (most recent call last):
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 19:27:48 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 19:27:48 raise LookupError(resource_not_found)
Jan 31 19:27:48 LookupError:
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48 Resource stopwords not found.
Jan 31 19:27:48 Please use the NLTK Downloader to obtain the resource:
Jan 31 19:27:48
Jan 31 19:27:48 >>> import nltk
Jan 31 19:27:48 >>> nltk.download('stopwords')
Jan 31 19:27:48
Jan 31 19:27:48 For more information see: https://www.nltk.org/data.html
Jan 31 19:27:48
Jan 31 19:27:48 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 19:27:48
Jan 31 19:27:48 Searched in:
Jan 31 19:27:48 - '/usr/share/nltk_data'
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48
Jan 31 19:27:48
Jan 31 19:27:48 During handling of the above exception, another exception occurred:
Jan 31 19:27:48
Jan 31 19:27:48 Traceback (most recent call last):
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/asgiref/sync.py", line 302, in main_wrap
Jan 31 19:27:48 raise exc_info[1]
Jan 31 19:27:48 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 19:27:48 document_consumption_finished.send(
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 19:27:48 return [
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 19:27:48 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 19:27:48 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 19:27:48 matched_tags = matching.match_tags(document, classifier)
Jan 31 19:27:48 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 19:27:48 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 19:27:48 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 19:27:48 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 19:27:48 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 19:27:48 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 19:27:48 self.__load()
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 19:27:48 raise e
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 19:27:48 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 19:27:48 raise LookupError(resource_not_found)
Jan 31 19:27:48 LookupError:
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48 Resource stopwords not found.
Jan 31 19:27:48 Please use the NLTK Downloader to obtain the resource:
Jan 31 19:27:48
Jan 31 19:27:48 >>> import nltk
Jan 31 19:27:48 >>> nltk.download('stopwords')
Jan 31 19:27:48
Jan 31 19:27:48 For more information see: https://www.nltk.org/data.html
Jan 31 19:27:48
Jan 31 19:27:48 Attempted to load corpora/stopwords
Jan 31 19:27:48
Jan 31 19:27:48 Searched in:
Jan 31 19:27:48 - '/usr/share/nltk_data'
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48
Jan 31 19:27:48
Jan 31 19:27:48 The above exception was the direct cause of the following exception:
Jan 31 19:27:48
Jan 31 19:27:48 Traceback (most recent call last):
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 451, in trace_task
Jan 31 19:27:48 R = retval = fun(*args, **kwargs)
Jan 31 19:27:48 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 734, in __protected_call__
Jan 31 19:27:48 return self.run(*args, **kwargs)
Jan 31 19:27:48 File "/app/code/src/documents/tasks.py", line 192, in consume_file
Jan 31 19:27:48 document = Consumer().try_consume_file(
Jan 31 19:27:48 File "/app/code/src/documents/consumer.py", line 468, in try_consume_file
Jan 31 19:27:48 self._fail(
Jan 31 19:27:48 File "/app/code/src/documents/consumer.py", line 93, in _fail
Jan 31 19:27:48 raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
Jan 31 19:27:48 documents.consumer.ConsumerError: Englishlanguagelearners.pdf: The following error occurred while consuming Englishlanguagelearners.pdf:
Jan 31 19:27:48 **********************************************************************
Jan 31 19:27:48 Resource stopwords not found.
Jan 31 19:27:48 Please use the NLTK Downloader to obtain the resource:
Jan 31 19:27:48
Jan 31 19:27:48 >>> import nltk
Jan 31 19:27:48 >>> nltk.download('stopwords')
Jan 31 19:27:48
Jan 31 19:27:48 For more information see: https://www.nltk.org/data.html
Jan 31 19:27:48
Jan 31 19:27:48 Attempted to load corpora/stopwords
Jan 31 19:27:48
Jan 31 19:27:48 Searched in:
Jan 31 19:27:48 - '/usr/share/nltk_data'
Jan 31 19:27:48 **********************************************************************
-
wrote on Jan 31, 2023, 6:55 PM
Hello,
Same problem on our sites.
Package version: com.paperlessng.cloudronapp@1.5.2. The last step, "saving the document", failed.
Thanks,
Axel

abc.pdf: The following error occurred while consuming abc.pdf:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/usr/share/nltk_data'
**********************************************************************
-
@axel0681 Actually, can you set
PAPERLESS_NLTK_DIR=/usr/local/share/ntlk_data
in paperless.conf and restart the app?
-
wrote on Jan 31, 2023, 9:21 PM
@girish Weird. Adding that to the conf, while still on v1.5.1, causes it to fail again.
The contents of that directory are: corpora, stemmers, tokenizers
Despite the error, which reads thusly:
**********************************************************************
  Resource stopwords not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('stopwords')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/stopwords

  Searched in:
    - '/usr/local/share/ntlk_data'
**********************************************************************
there is /usr/local/share/ntlk_data/corpora/stopwords, which includes a stopwords.zip file.
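One detail worth checking here: as far as I understand NLTK's lookup, it expects a corpus either unzipped at `corpora/stopwords/` or zipped as `corpora/stopwords.zip` directly under the data directory; a `stopwords.zip` nested one level deeper may not be found. A small self-contained sketch of that layout check (the helper `has_corpus` is illustrative, not NLTK API, and simplifies what `nltk.data.find` really does):

```python
import os
import tempfile
import zipfile

def has_corpus(nltk_dir, corpus="stopwords"):
    """Return True if `corpus` looks loadable from `nltk_dir`:
    either unzipped at corpora/<corpus>/, or zipped as
    corpora/<corpus>.zip containing a top-level <corpus>/ folder
    (the layout NLTK's zip lookup expects)."""
    unzipped = os.path.join(nltk_dir, "corpora", corpus)
    zipped = os.path.join(nltk_dir, "corpora", corpus + ".zip")
    if os.path.isdir(unzipped):
        return True
    if os.path.isfile(zipped):
        with zipfile.ZipFile(zipped) as zf:
            return any(name.startswith(corpus + "/") for name in zf.namelist())
    return False

# Demo against a throwaway directory mimicking the expected layout
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "corpora"))
with zipfile.ZipFile(os.path.join(root, "corpora", "stopwords.zip"), "w") as zf:
    zf.writestr("stopwords/english", "a\nan\nthe\n")
print(has_corpus(root))  # True: the zip sits directly under corpora/
```

Note the zip is expected directly under `corpora/`, not inside `corpora/stopwords/`, which might explain why the lookup can still fail even though the file exists on disk.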
I noticed that in the conf file, the env
PAPERLESS_NLTK_DIR=/usr/local/share/ntlk_data
has the NLTK highlighted in brown, which must mean that the syntax is off, right?
Finally, I also noticed at the bottom of the env file that the following are all commented out:
# Binaries
#PAPERLESS_CONVERT_BINARY=/usr/bin/convert
#PAPERLESS_GS_BINARY=/usr/bin/gs
#PAPERLESS_OPTIPNG_BINARY=/usr/bin/optipng
After commenting out the line you suggested I add (still running v1.5.1), the app functions properly.
I guess I will now try to update again, then add that line to the conf file and see what happens.
-
wrote on Jan 31, 2023, 9:34 PM
@scooke I've updated to v1.5.2, and it doesn't work either with or without that conf setting. It still says it can't find corpora/stopwords, even though it's there.
Logs:
Jan 31 22:31:41 [2023-01-31 21:31:41,532] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] received
Jan 31 22:31:41 [2023-01-31 21:31:41,652] [INFO] [paperless.consumer] Consuming Doc - May 25, 2014, 11-08 AM.pdf
Jan 31 22:31:44 [2023-01-31 21:31:44,150] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 4.01 - no change
Jan 31 22:31:53 [2023-01-31 21:31:53,935] [INFO] [ocrmypdf._sync] Postprocessing...
Jan 31 22:31:54 [2023-01-31 21:31:54,867] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.40 savings: 28.3%
Jan 31 22:31:54 [2023-01-31 21:31:54,872] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.078 * 10 changes in 300 seconds. Saving...
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.079 * Background saving started by pid 1303
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.083 * DB saved on disk
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.084 * RDB: 0 MB of memory used by copy-on-write
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.180 * Background saving terminated with success
Jan 31 22:32:00 [2023-01-31 21:32:00,895] [ERROR] [paperless.consumer] The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00 [2023-01-31 21:32:00,915] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] raised unexpected: ConsumerError("Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf: \n**********************************************************************\n Resource \x1b[93mstopwords\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mcorpora/stopwords\x1b[0m\n\n Searched in:\n - '/usr/local/share/ntlk_data'\n**********************************************************************\n")
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/asgiref/sync.py", line 302, in main_wrap
Jan 31 22:32:00 raise exc_info[1]
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 The above exception was the direct cause of the following exception:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 451, in trace_task
Jan 31 22:32:00 R = retval = fun(*args, **kwargs)
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 734, in __protected_call__
Jan 31 22:32:00 return self.run(*args, **kwargs)
Jan 31 22:32:00 File "/app/code/src/documents/tasks.py", line 192, in consume_file
Jan 31 22:32:00 document = Consumer().try_consume_file(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 468, in try_consume_file
Jan 31 22:32:00 self._fail(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 93, in _fail
Jan 31 22:32:00 raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
Jan 31 22:32:00 documents.consumer.ConsumerError: Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
-
@scooke I've updated to v1.5.2, and neither with or without that conf setting does it work. Still saying it can't find corpora/stopwords, even though it's there.
Logs:
Jan 31 22:31:41 [2023-01-31 21:31:41,532] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] received
Jan 31 22:31:41 [2023-01-31 21:31:41,652] [INFO] [paperless.consumer] Consuming Doc - May 25, 2014, 11-08 AM.pdf
Jan 31 22:31:44 [2023-01-31 21:31:44,150] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 4.01 - no change
Jan 31 22:31:53 [2023-01-31 21:31:53,935] [INFO] [ocrmypdf._sync] Postprocessing...
Jan 31 22:31:54 [2023-01-31 21:31:54,867] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.40 savings: 28.3%
Jan 31 22:31:54 [2023-01-31 21:31:54,872] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.078 * 10 changes in 300 seconds. Saving...
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.079 * Background saving started by pid 1303
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.083 * DB saved on disk
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.084 * RDB: 0 MB of memory used by copy-on-write
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.180 * Background saving terminated with success
Jan 31 22:32:00 [2023-01-31 21:32:00,895] [ERROR] [paperless.consumer] The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00 [2023-01-31 21:32:00,915] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] raised unexpected: ConsumerError("Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf: \n**********************************************************************\n Resource \x1b[93mstopwords\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mcorpora/stopwords\x1b[0m\n\n Searched in:\n - '/usr/local/share/ntlk_data'\n**********************************************************************\n")
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/asgiref/sync.py", line 302, in main_wrap
Jan 31 22:32:00 raise exc_info[1]
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 The above exception was the direct cause of the following exception:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 451, in trace_task
Jan 31 22:32:00 R = retval = fun(*args, **kwargs)
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 734, in __protected_call__
Jan 31 22:32:00 return self.run(*args, **kwargs)
Jan 31 22:32:00 File "/app/code/src/documents/tasks.py", line 192, in consume_file
Jan 31 22:32:00 document = Consumer().try_consume_file(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 468, in try_consume_file
Jan 31 22:32:00 self._fail(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 93, in _fail
Jan 31 22:32:00 raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
Jan 31 22:32:00 documents.consumer.ConsumerError: Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
@scooke said in Latest update seems to have similar issue as before, resources not found:
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
So, this is wrong. It should be corpora/stopwords/, I think.
root@07a7fcf3-f9ff-4051-9d89-dec6ed4a4777:/usr/local/share/nltk_data/corpora/stopwords# ls
README  basque  chinese  english  german  hinglish  italian  norwegian  russian  swedish  arabic  bengali  danish  finnish  greek  hungarian  kazakh  portuguese  slovene  tajik  azerbaijani  catalan  dutch  french  hebrew  indonesian  nepali  romanian  spanish  turkish
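For anyone debugging this, a quick cross-check is to compare the path in the error ('/usr/local/share/ntlk_data') against what is actually on disk. A small stdlib-only sketch (the helper name and the candidate paths are mine, taken from the logs in this thread):

```python
import os

def find_stopwords_root(search_paths):
    """Return the first data root that actually contains corpora/stopwords, else None."""
    for root in search_paths:
        if os.path.isdir(os.path.join(root, "corpora", "stopwords")):
            return root
    return None

# Paths seen in the tracebacks above; note the 'ntlk_data' spelling in the searched path.
print(find_stopwords_root([
    "/usr/local/share/ntlk_data",
    "/usr/local/share/nltk_data",
    "/usr/share/nltk_data",
]))
```

If the path the app searches is misspelled while the data lives under the correctly spelled directory, this kind of check makes the mismatch obvious.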
-
wrote on Jan 31, 2023, 9:41 PM
@girish Although there is definitely a stopwords.zip in the corpora/stopwords directory, there is also a stopwords folder (unzipped, I suppose). That is one thing to correct then.
I found this link: https://aur.archlinux.org/packages/paperless-ngx. Six comments from the bottom, ammo wrote that executing
ln -s /usr/share/nltk_data /usr/local/share/nltk_data
fixed it for him. I tried it but, of course, these are read-only directories. So maybe this is a second thing to fix?
-
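A note on the symlink workaround above: NLTK also honors the NLTK_DATA environment variable (an os.pathsep-separated list of data roots consulted before the built-in defaults), which avoids needing to write into a read-only filesystem. A hedged sketch, assuming the app lets you set environment variables at all (the helper name is mine; the paths are the ones from this thread):

```python
import os

def nltk_data_value(*roots):
    """Join data roots into an NLTK_DATA value (':'-separated on Linux)."""
    return os.pathsep.join(roots)

# Must be set before nltk is imported for the search path to pick it up.
os.environ["NLTK_DATA"] = nltk_data_value(
    "/usr/local/share/nltk_data",
    "/usr/share/nltk_data",
)
print(os.environ["NLTK_DATA"])
```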
wrote on Jan 31, 2023, 9:42 PM
@girish said in Latest update seems to have similar issue as before, resources not found:
/usr/local/share/nltk_data/corpora/stopwords
Yep
root@mypaperlessimagelongnumbernamethingy:/usr/local/share/nltk_data/corpora/stopwords# ls -al
total 160
drwxr-xr-x 2 root root  4096 Jan 26 09:05 .
drwxr-xr-x 3 root root  4096 Jan 26 09:05 ..
-rw-r--r-- 1 root root   909 Jan 26 09:05 README
-rw-r--r-- 1 root root  6348 Jan 26 09:05 arabic
-rw-r--r-- 1 root root   967 Jan 26 09:05 azerbaijani
-rw-r--r-- 1 root root  2202 Jan 26 09:05 basque
-rw-r--r-- 1 root root  5443 Jan 26 09:05 bengali
-rw-r--r-- 1 root root  1558 Jan 26 09:05 catalan
-rw-r--r-- 1 root root  5560 Jan 26 09:05 chinese
-rw-r--r-- 1 root root   424 Jan 26 09:05 danish
-rw-r--r-- 1 root root   453 Jan 26 09:05 dutch
-rw-r--r-- 1 root root   936 Jan 26 09:05 english
-rw-r--r-- 1 root root  1579 Jan 26 09:05 finnish
-rw-r--r-- 1 root root   813 Jan 26 09:05 french
-rw-r--r-- 1 root root  1362 Jan 26 09:05 german
-rw-r--r-- 1 root root  2167 Jan 26 09:05 greek
-rw-r--r-- 1 root root  1836 Jan 26 09:05 hebrew
-rw-r--r-- 1 root root  5958 Jan 26 09:05 hinglish
-rw-r--r-- 1 root root  1227 Jan 26 09:05 hungarian
-rw-r--r-- 1 root root  6446 Jan 26 09:05 indonesian
-rw-r--r-- 1 root root  1654 Jan 26 09:05 italian
-rw-r--r-- 1 root root  3880 Jan 26 09:05 kazakh
-rw-r--r-- 1 root root  3610 Jan 26 09:05 nepali
-rw-r--r-- 1 root root   851 Jan 26 09:05 norwegian
-rw-r--r-- 1 root root  1286 Jan 26 09:05 portuguese
-rw-r--r-- 1 root root  1910 Jan 26 09:05 romanian
-rw-r--r-- 1 root root  1235 Jan 26 09:05 russian
-rw-r--r-- 1 root root 15980 Jan 26 09:05 slovene
-rw-r--r-- 1 root root  2176 Jan 26 09:05 spanish
-rw-r--r-- 1 root root   559 Jan 26 09:05 swedish
-rw-r--r-- 1 root root  1818 Jan 26 09:05 tajik
-rw-r--r-- 1 root root   260 Jan 26 09:05 turkish
-
wrote on Jan 31, 2023, 9:43 PM
Thank you for troubleshooting this with me. Weird that I'm the only one!
-
@scooke I think I am not testing this correctly, because, as per the code at least, it should fail for me, but it doesn't.
Should I enable some flag inside paperless for nltk handling? There is a complete absence of nltk in my logs.
-
@scooke Can you try with 1.5.4 please?
I think I found the issue. First, the classifier needs to be created with
document_create_classifier
. One also has to put some tags, categories, etc. for documents. Once all that is set up, the nltk stuff kicks in. Without the classifier, nltk is just skipped.
-
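This also explains why the error only hits some installs: per the classifier.py frames in the tracebacks above, NLTK is first touched when the classifier strips stop words during content preprocessing. A simplified, hypothetical stand-in for that step (the real code pulls the set from nltk.corpus.stopwords.words(settings.NLTK_LANGUAGE), which is exactly the call that raises the LookupError):

```python
def strip_stopwords(tokens, stop_words):
    """Drop stop words from a token list (simplified stand-in for preprocess_content)."""
    return [t for t in tokens if t.lower() not in stop_words]

# With a working corpus this set would come from the NLTK stopwords data.
english_stop_words = {"the", "a", "of", "and"}
print(strip_stopwords(["The", "scan", "of", "a", "receipt"], english_stop_words))
```

No classifier, no preprocessing, no stopwords lookup — hence no error on installs without one.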
wrote on Jan 31, 2023, 11:47 PM
Latest update 1.5.4 is working for me.
Thanks
-
@scooke said in Latest update seems to have similar issue as before, resources not found:
@girish I'm not seeing v.1.5.4... is there a way to force it?
Strange, it's published normally. Did you check for updates manually?
-
@girish Do I need to be upgraded to v1.5.2 to see it? Or do I need to have auto-updates turned on? I've turned them off and am checking manually, as I'd rather have an older version that works than a newer one that doesn't.