indexing of office documents?
-
@doodlemania2 Thanks for packaging it. Does your container also include the indexing of office documents? As far as I remember, additional software was required for that.
-
@fbartels not sure - I am using it at the moment for about a bazillion PDFs. Haven't tried Office Files. Let me know, I'm sure we can tweak if it requires a bit more software.
-
@doodlemania2 maybe we can discuss further package shortcomings in the paperless-ng forum section then
-
girish
-
I moved this to a new topic.
-
@doodlemania2 I was looking for some further information on this.
Paperless-ng uses Tika (to extract data from other file types) and Gotenberg (for PDF conversion) for this.
More at https://paperless-ng.readthedocs.io/en/latest/configuration.html#tika-settings
-
This is something I need as well. I am working on rolling out paperless-ngx to replace an existing storage system, and I need to be able to store office documents too, so that all documents related to an entity are kept together.
It looks like Tika and Gotenberg both have docker containers available.
Do we need separate Tika and Gotenberg apps, the way OnlyOffice and Collabora Online are separate apps, even though they are not usable on their own and are only used by NextCloud for handling office documents?
-
For anyone wanting to get this up and running quickly: if you have Docker running on another system, you can run the following:
docker run -d --restart unless-stopped -p 3000:3000 gotenberg/gotenberg
docker run -d --restart unless-stopped -p 9998:9998 apache/tika
And then add the following to your paperless.conf:
# Tika
PAPERLESS_TIKA_ENABLED=true
PAPERLESS_TIKA_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:9998
PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:3000
After this you can upload xlsx, docx, etc. to paperless-ngx.
In my testing, if the Docker host running the Tika and Gotenberg containers goes down, paperless-ngx keeps working fine, but you won't be able to upload additional xlsx/docx/etc. documents until you restart the containers. That works out fine for us, as keeping paperless-ngx itself accessible is far more important than this one feature.
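In case it helps anyone scripting this, here is a rough Python sketch (not part of paperless-ngx or the Cloudron package) that builds the three settings above for a given Docker host and checks whether a URL answers at all. The function names tika_settings and is_reachable are made up for illustration. The ports are the ones from the docker run commands above. Which URL to probe is an assumption on my side: as far as I know Tika server answers on /version and recent Gotenberg releases on /health, but check that against the versions you run.

```python
from http.client import HTTPException
from urllib.error import URLError
from urllib.request import urlopen


def tika_settings(host, tika_port=9998, gotenberg_port=3000):
    """Return the paperless.conf lines from the post for one Docker host.

    The default ports match the `-p` mappings in the docker run commands.
    """
    return [
        "PAPERLESS_TIKA_ENABLED=true",
        f"PAPERLESS_TIKA_ENDPOINT=http://{host}:{tika_port}",
        f"PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://{host}:{gotenberg_port}",
    ]


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP GET to `url` succeeds with a 2xx status.

    Any connection error, timeout, HTTP error status, or malformed URL
    is reported as False instead of raising.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, ValueError, HTTPException):
        return False


if __name__ == "__main__":
    # Placeholder host name; replace it with your Docker host or IP.
    print("\n".join(tika_settings("docker-host.example")))
```

To probe the containers you could call something like is_reachable("http://<DockerHostnameOrIPGoesHere>:9998/version") and is_reachable("http://<DockerHostnameOrIPGoesHere>:3000/health") before uploading office documents, keeping in mind that those two paths are the assumptions mentioned above.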
-
@ChristopherMag thanks for sharing!
@staff can we get this into the package somehow?
-
nebulon
-
@jdaviescoates yes, I will put this down as a point to investigate for the next release. I think we have to investigate what nextcloud needs for FTS as well as apps like these and design accordingly. (Maybe they have to become addons, or alternatively we can just put them in the app itself.)
-
@girish said in indexing of office documents?:
I think we have to investigate what nextcloud needs for FTS as well
On that, I spotted recently that they now at least mention Solr under Platform Apps over on https://github.com/nextcloud/fulltextsearch. But as far as I can tell from the linked wiki, https://github.com/nextcloud/fulltextsearch/wiki, to date there is still only an Elastic Search Platform App.