@scooke I've updated to v1.5.2, and it doesn't work either with or without that conf setting. It still says it can't find corpora/stopwords, even though it's there.
Logs:
Jan 31 22:31:41 [2023-01-31 21:31:41,532] [INFO] [celery.worker.strategy] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] received
Jan 31 22:31:41 [2023-01-31 21:31:41,652] [INFO] [paperless.consumer] Consuming Doc - May 25, 2014, 11-08 AM.pdf
Jan 31 22:31:44 [2023-01-31 21:31:44,150] [INFO] [ocrmypdf._pipeline] page is facing ⇧, confidence 4.01 - no change
Jan 31 22:31:53 [2023-01-31 21:31:53,935] [INFO] [ocrmypdf._sync] Postprocessing...
Jan 31 22:31:54 [2023-01-31 21:31:54,867] [INFO] [ocrmypdf._pipeline] Optimize ratio: 1.40 savings: 28.3%
Jan 31 22:31:54 [2023-01-31 21:31:54,872] [INFO] [ocrmypdf._sync] Output file is a PDF/A-2B (as expected)
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.078 * 10 changes in 300 seconds. Saving...
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.079 * Background saving started by pid 1303
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.083 * DB saved on disk
Jan 31 22:31:57 1303:C 31 Jan 2023 21:31:57.084 * RDB: 0 MB of memory used by copy-on-write
Jan 31 22:31:57 1263:M 31 Jan 2023 21:31:57.180 * Background saving terminated with success
Jan 31 22:32:00 [2023-01-31 21:32:00,895] [ERROR] [paperless.consumer] The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00 [2023-01-31 21:32:00,915] [ERROR] [celery.app.trace] Task documents.tasks.consume_file[3e572bd2-9ccc-4ee4-9201-fb350b47cfd9] raised unexpected: ConsumerError("Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf: \n**********************************************************************\n Resource \x1b[93mstopwords\x1b[0m not found.\n Please use the NLTK Downloader to obtain the resource:\n\n \x1b[31m>>> import nltk\n >>> nltk.download('stopwords')\n \x1b[0m\n For more information see: https://www.nltk.org/data.html\n\n Attempted to load \x1b[93mcorpora/stopwords\x1b[0m\n\n Searched in:\n - '/usr/local/share/ntlk_data'\n**********************************************************************\n")
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 84, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{zip_name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords.zip/stopwords/
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 During handling of the above exception, another exception occurred:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/asgiref/sync.py", line 302, in main_wrap
Jan 31 22:32:00 raise exc_info[1]
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 410, in try_consume_file
Jan 31 22:32:00 document_consumption_finished.send(
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 176, in send
Jan 31 22:32:00 return [
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
Jan 31 22:32:00 (receiver, receiver(signal=self, sender=sender, **named))
Jan 31 22:32:00 File "/app/code/src/documents/signals/handlers.py", line 194, in set_tags
Jan 31 22:32:00 matched_tags = matching.match_tags(document, classifier)
Jan 31 22:32:00 File "/app/code/src/documents/matching.py", line 50, in match_tags
Jan 31 22:32:00 predicted_tag_ids = classifier.predict_tags(document.content)
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 370, in predict_tags
Jan 31 22:32:00 X = self.data_vectorizer.transform([self.preprocess_content(content)])
Jan 31 22:32:00 File "/app/code/src/documents/classifier.py", line 331, in preprocess_content
Jan 31 22:32:00 self._stop_words = set(stopwords.words(settings.NLTK_LANGUAGE))
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 121, in __getattr__
Jan 31 22:32:00 self.__load()
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 86, in __load
Jan 31 22:32:00 raise e
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/corpus/util.py", line 81, in __load
Jan 31 22:32:00 root = nltk.data.find(f"{self.subdir}/{self.__name}")
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/nltk/data.py", line 583, in find
Jan 31 22:32:00 raise LookupError(resource_not_found)
Jan 31 22:32:00 LookupError:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
Jan 31 22:32:00
Jan 31 22:32:00 The above exception was the direct cause of the following exception:
Jan 31 22:32:00
Jan 31 22:32:00 Traceback (most recent call last):
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 451, in trace_task
Jan 31 22:32:00 R = retval = fun(*args, **kwargs)
Jan 31 22:32:00 File "/usr/local/lib/python3.10/dist-packages/celery/app/trace.py", line 734, in __protected_call__
Jan 31 22:32:00 return self.run(*args, **kwargs)
Jan 31 22:32:00 File "/app/code/src/documents/tasks.py", line 192, in consume_file
Jan 31 22:32:00 document = Consumer().try_consume_file(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 468, in try_consume_file
Jan 31 22:32:00 self._fail(
Jan 31 22:32:00 File "/app/code/src/documents/consumer.py", line 93, in _fail
Jan 31 22:32:00 raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
Jan 31 22:32:00 documents.consumer.ConsumerError: Doc - May 25, 2014, 11-08 AM.pdf: The following error occurred while consuming Doc - May 25, 2014, 11-08 AM.pdf:
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00 Resource stopwords not found.
Jan 31 22:32:00 Please use the NLTK Downloader to obtain the resource:
Jan 31 22:32:00
Jan 31 22:32:00 >>> import nltk
Jan 31 22:32:00 >>> nltk.download('stopwords')
Jan 31 22:32:00
Jan 31 22:32:00 For more information see: https://www.nltk.org/data.html
Jan 31 22:32:00
Jan 31 22:32:00 Attempted to load corpora/stopwords
Jan 31 22:32:00
Jan 31 22:32:00 Searched in:
Jan 31 22:32:00 - '/usr/local/share/ntlk_data'
Jan 31 22:32:00 **********************************************************************
Jan 31 22:32:00
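
In case it's useful, this is roughly how I'd sanity-check what NLTK actually sees from inside the container (a minimal sketch; the `NLTK_DATA` env var and the `english` stopword list are assumptions on my part, adjust to whatever the config really sets):

```python
# Minimal sketch: check what NLTK sees inside the container.
# NLTK_DATA and the 'english' stopword list are assumptions;
# adjust to whatever the paperless config actually sets.
import os
import nltk

print("NLTK_DATA env var:", os.environ.get("NLTK_DATA"))
print("NLTK search paths:", nltk.data.path)

try:
    # Same lookup the traceback above is failing on.
    located = nltk.data.find("corpora/stopwords")
    print("stopwords found at:", located)
    from nltk.corpus import stopwords
    print("sample words:", stopwords.words("english")[:5])
except LookupError as exc:
    print("still not found:", exc)
```

If `nltk.data.path` doesn't include the directory the stopwords were actually unpacked into, that would explain why the lookup keeps failing even though the files exist on disk.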