Multi Language OCR Support
-
Is it possible to enable multi-language OCR support for paperless-ng?
As far as I understand, docker has variables for this. And I also found this modification.
-
@wisemetalhead according to this, we can enable additional languages under PAPERLESS_OCR_LANGUAGE in the OCR settings options. This can be found in the paperless.conf file when you click the file manager icon of the paperless-ng app's settings in Cloudron.
-
@nebulon @WiseMetalhead there is some more information here. Looks like there are two options, PAPERLESS_OCR_LANGUAGES and PAPERLESS_OCR_LANGUAGE. The first one determines which packages to install and the second one which to use. The first option is missing in the paperless.conf file. Maybe we can add this manually and restart the app?
-
@wisemetalhead according to the docs, the PAPERLESS_OCR_LANGUAGES option should be configured in docker-compose.env and not in paperless.conf. Perhaps @nebulon can help here.
-
@neurokrish I have to try this myself. Since the Cloudron app package has nothing to do with the upstream Docker image, the default self-hosting config docs would apply instead: https://paperless-ng.readthedocs.io/en/latest/configuration.html?highlight=languages#ocr-settings
-
@nebulon then maybe a standard apt-get install for additional language packs is the way to go?
Maybe we can install all OCR languages by default using sudo apt-get install tesseract-ocr-all as mentioned here. This way, the app has all languages installed by default and users can choose a specific language by modifying the PAPERLESS_OCR_LANGUAGE flag in paperless.conf.
-
@neurokrish thanks for the suggestion, at least it solved the issue for me with a deu+eng setting. The just-updated package v0.7.0 includes those changes.
-
@nebulon great! @WiseMetalhead, can you confirm that the latest update solves your issue with OCR?
-
@neurokrish @nebulon Now it works perfectly! Thanks for the help.
-
@neurokrish tesseract has a Docker file and it would be nice to support Tesseract on Cloudron.
-
-
I have the same issue. I need to integrate Tamil OCR (tam). I have already installed Tamil and followed all the steps described above, but it is still not supported. It throws this error on docker up:
"?: The selected ocr language tam is not installed. Paperless cannot OCR your documents without it. Please fix PAPERLESS_OCR_LANGUAGE."
@nebulon @neurokrish @WiseMetalhead @LoudLemur
-
@vyshnavR said in Multi Language OCR Support:
I have the same issue. I need to integrate Tamil OCR (tam). I have already installed Tamil and followed all the steps described above, but it is still not supported. It throws this error on docker up
The thread here (and this forum) is about the Cloudron package of paperless-ngx. It looks like you are using a Docker installation, so you will have to take this up with the upstream project. Cloudron also uses Docker, but it does not use the upstream Dockerfiles.
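For anyone debugging a "language is not installed" error on their own setup, here is a minimal sketch that compares a PAPERLESS_OCR_LANGUAGE-style value against what Tesseract reports as installed. The missing_langs helper is hypothetical (it is not part of paperless or Tesseract), and the usage part assumes the tesseract binary is on the PATH inside the environment where paperless runs:

```shell
#!/bin/sh
# missing_langs WANTED INSTALLED
#   WANTED:    a PAPERLESS_OCR_LANGUAGE-style value, e.g. "tam+eng"
#   INSTALLED: whitespace-separated Tesseract language codes
# Prints the wanted codes that are not in the installed list.
# Hypothetical helper for illustration only.
missing_langs() {
    wanted=$1
    installed=$2
    missing=""
    # Split the wanted value on "+" and look up each code.
    for lang in $(printf '%s' "$wanted" | tr '+' ' '); do
        found=no
        for have in $installed; do
            if [ "$have" = "$lang" ]; then
                found=yes
            fi
        done
        if [ "$found" = no ]; then
            missing="$missing $lang"
        fi
    done
    # Drop the leading space before printing.
    printf '%s\n' "${missing# }"
}

# Usage, only attempted when tesseract is available:
if command -v tesseract >/dev/null 2>&1; then
    # --list-langs prints a header line first, so skip it with tail.
    installed=$(tesseract --list-langs 2>&1 | tail -n +2)
    missing_langs "${PAPERLESS_OCR_LANGUAGE:-eng}" "$installed"
fi
```

If this prints tam, the Tamil data is not visible to the Tesseract that paperless is actually using, which would match the error quoted above.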