Cloudron Forum

indexing of office documents?

Paperless-ngx (Unsolved)
    • doodlemania2:

      @neurokrish I may be able to assist here - I'll submit my package to the app store and I think that will fix custom to app store update (but maybe not hehe)

      fbartels (App Dev) wrote (#1):

      @doodlemania2 Thanks for packaging it. Does your container also include the indexing of office documents? As far as I remember, additional software is required for that.

        doodlemania2 (App Dev) wrote (#2):

        @fbartels not sure - I am using it at the moment for about a bazillion PDFs. Haven't tried Office Files. Let me know, I'm sure we can tweak if it requires a bit more software.

          nebulon (Staff) wrote (#3):

          @doodlemania2 maybe we can discuss further package shortcomings in the paperless-ng forum section then

          • girish moved this topic from Off-topic.

            girish (Staff) wrote (#4):

            I moved this to a new topic.

              fbartels (App Dev) wrote (#5):

              @doodlemania2 I was looking for some further information on this.

              Paperless-ng uses Tika (to extract data from other file types) and Gotenberg (for PDF conversion) for this.

              More at https://paperless-ng.readthedocs.io/en/latest/configuration.html#tika-settings

                ChristopherMag wrote (#6, last edited by ChristopherMag):
                This is something I need as well. I am working on rolling out paperless-ngx to replace an existing storage system and need to be able to store office documents too, so that all documents related to an entity are kept together.

                It looks like Tika and Gotenberg both have Docker containers available.

                Do we need separate Tika and Gotenberg apps, the way OnlyOffice and Collabora Online are separate apps even though they are not usable on their own and are only used by Nextcloud for handling office documents?

                  ChristopherMag wrote (#7):
                  For anyone wanting to get this up and running quickly: if you have Docker running on another system, you can run the following:

                  docker run -d --restart unless-stopped -p 3000:3000 gotenberg/gotenberg
                  docker run -d --restart unless-stopped -p 9998:9998 apache/tika
                  

                  And then add the following to your paperless.conf:

                  # Tika
                  PAPERLESS_TIKA_ENABLED=true
                  PAPERLESS_TIKA_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:9998
                  PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:3000
                  

                  After this you can upload xlsx, docx, etc. to paperless-ngx.

                  In my testing, if the Docker host running the Tika and Gotenberg containers goes down, paperless-ngx keeps working fine, but you won't be able to upload additional xlsx/docx/etc. documents until you restart the containers. That works out fine for us, as paperless-ngx staying accessible is far more important than this one feature.
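
As a minimal sketch of the config step in this post, the helper below prints the three settings for a given Docker host so the placeholder does not have to be edited by hand. The function name `tika_conf` and the example address are illustrative assumptions; the ports are the ones from the `docker run` commands in the post.

```shell
#!/bin/sh
# Print the Tika/Gotenberg settings for paperless.conf for a given Docker host.
# Ports 9998 (Tika) and 3000 (Gotenberg) match the docker run commands in the post.
# tika_conf is a hypothetical helper name, not part of paperless-ngx.
tika_conf() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: tika_conf <docker-host-or-ip>" >&2
        return 1
    fi
    printf '%s\n' \
        "# Tika" \
        "PAPERLESS_TIKA_ENABLED=true" \
        "PAPERLESS_TIKA_ENDPOINT=http://${host}:9998" \
        "PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://${host}:3000"
}

# Example call; 192.0.2.10 is a documentation-only placeholder address.
tika_conf 192.0.2.10
```

The output can be reviewed and then pasted into paperless.conf; the sketch deliberately does not write to the file itself.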

                    jdaviescoates wrote (#8):
                    @ChristopherMag thanks for sharing!

                    @staff can we get this into the package somehow?

                    I use Cloudron with Gandi & Hetzner

                    • nebulon marked this topic as a question.
                      girish (Staff) wrote (#9):

                      @jdaviescoates yes, I will put this down as a point to investigate for the next release. I think we have to investigate what Nextcloud needs for FTS, as well as apps like these, and design accordingly. (Maybe they have to become addons, or alternatively we can just put them in the app itself.)

                        jdaviescoates wrote (#10):
                        @girish said in indexing of office documents?:

                        I think we have to investigate what nextcloud needs for FTS as well

                        On that, I recently spotted that they now at least mention Solr under Platform Apps at https://github.com/nextcloud/fulltextsearch, but as far as I can tell from the linked wiki (https://github.com/nextcloud/fulltextsearch/wiki), there is still only an Elasticsearch Platform App to date.

                          timconsidine (App Dev) wrote (#11):

                          @ChristopherMag do you or anyone else have experience using this kind of setup with macOS document formats like Pages and Numbers?

                            ChristopherMag wrote (#12, last edited by ChristopherMag):
                            @timconsidine It looks like Apache Tika supports the document formats from the iWork suite, such as Pages.

                            I tried to upload a .pages file to paperless-ngx with Tika and Gotenberg configured, and paperless popped up a failure message with the error "File type application/zip not supported".

                            I believe the signals.py file in the paperless-ngx project would need to add support for the various iWork suite formats to resolve this error, assuming you already have Tika and Gotenberg set up and working with paperless-ngx.

                            You could probably open an issue in the paperless-ngx repository on GitHub and see if they can assist with adding support for this.
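
The error message suggests the upload was identified by its content as a generic ZIP archive rather than by its .pages extension. A rough way to check this locally, assuming the file is a ZIP container, is to look for the ZIP signature bytes at the start of the file. The function name `is_zip_container` and the temporary demo file are illustrative only.

```shell
#!/bin/sh
# Return 0 if the file starts with the ZIP local-file-header signature
# "PK\003\004" (hex 50 4b 03 04), and 1 otherwise.
# is_zip_container is a hypothetical helper name used for this sketch.
is_zip_container() {
    sig="$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')"
    [ "$sig" = "504b0304" ]
}

# Demo on a throwaway file that only carries the signature bytes.
tmp="$(mktemp)"
printf 'PK\003\004demo' > "$tmp"
if is_zip_container "$tmp"; then
    echo "detected as ZIP"
fi
rm -f "$tmp"
```

Running `is_zip_container` against a real .pages file would show whether content-based detection is likely to report it as application/zip, which is consistent with the failure described in this post.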

                              necrevistonnezr wrote (#13):
                              Any updates on this?

                                girish (Staff) wrote (#14):

                                @necrevistonnezr nothing yet...

                                  neurokrish wrote (#15, last edited by neurokrish):
                                  Hi, is there an update on this? I tried @ChristopherMag's suggestion, but I get "connection refused" for Tika. Does this have something to do with iptables? How can I allow the paperless app to connect to the container?

                                  EDIT: I should add that I installed the Docker containers for Tika and Gotenberg on the same system as Cloudron.
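
One way to narrow down a "connection refused" like this is to probe the two ports from the machine (or container) that needs to reach them. The sketch below is only a generic TCP check, not a Cloudron feature: `port_open` is a made-up helper name, it assumes `bash` and the coreutils `timeout` command are available, and the host name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed even though
# the function itself is plain sh. port_open is a hypothetical helper name.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage (docker-host.example is a placeholder; the ports are the ones
# used earlier in the thread):
#   port_open docker-host.example 9998 && echo "Tika port reachable"
#   port_open docker-host.example 3000 && echo "Gotenberg port reachable"
```

An immediate failure typically means nothing is listening on that address and port from the caller's point of view, while a failure only after the 3-second timeout more often points at a firewall or routing problem.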

                                    nebulon (Staff) wrote (#16):

                                    I don't think we have an update on this yet. Possibly your containers are not within the same Docker network on the system? Either way, adding Docker containers alongside Cloudron will break on Cloudron updates, so this is not very useful to investigate as such. Have you tried running the required services on a separate, isolated server instead?

                                      neurokrish wrote (#17):
                                      @nebulon, thanks for your reply. I tried both ways, with the containers outside and inside the Cloudron network. Good to know that doing the latter will break updates. I have removed those containers now. Is it difficult to pre-install these containers via the app itself? Alternatively, maybe provide them as separate Cloudron apps that can be linked to paperless?

                                        nebulon (Staff) wrote (#18):

                                        Unless Tika and Gotenberg are useful for other apps, it may make more sense to package them as part of paperless and pre-configure everything.

                                        Does anyone have experience with the memory requirements for those?

                                          ChristopherMag wrote (#19):
                                          FYI, Gotenberg publishes Cloudron-specific images now. I'm not sure of the history of how or why that started, but I would assume they are meant to be used as a Cloudron app, though I don't see any app for it in the App Store.

                                          PS: don't use these images for your own Gotenberg instance that you're integrating with paperless. They exist, hopefully, to make it easier one day to run Gotenberg on Cloudron directly.

                                            necrevistonnezr wrote (#20):
                                            @ChristopherMag It says ‘cloudrun’. Are you sure it’s just a typo, or does it mean something like ‘cloud-run’?
