    Cloudron Forum

    Unsolved indexing of office documents?

    Paperless-ngx
    • fbartels
      fbartels App Dev @doodlemania2 last edited by

      @doodlemania2 Thanks for packaging it. Does your container also include the indexing of office documents? As far as I remember, additional software was required for that.

      • doodlemania2
        doodlemania2 App Dev @fbartels last edited by

        @fbartels not sure - I am using it at the moment for about a bazillion PDFs. Haven't tried Office Files. Let me know, I'm sure we can tweak if it requires a bit more software.

        • nebulon
          nebulon Staff @doodlemania2 last edited by

          @doodlemania2 maybe we can discuss further package shortcomings in the paperless-ng forum section then

          • Moved from Off-topic by girish
          • girish
            girish Staff last edited by

            I moved this to a new topic.

            • fbartels
              fbartels App Dev @doodlemania2 last edited by

              @doodlemania2 I was looking for some further information on this.

              Paperless-ng uses Tika (to extract data from other file types) and Gotenberg (for PDF conversion) for this.

              More at https://paperless-ng.readthedocs.io/en/latest/configuration.html#tika-settings

              • ChristopherMag
                ChristopherMag last edited by ChristopherMag

                This is something I need as well. I'm working on rolling out paperless-ngx to replace an existing storage system and need to be able to store office documents too, so that all documents related to an entity are kept together.

                It looks like Tika and Gotenberg both have Docker images available.

                Do we need separate Tika and Gotenberg apps, the way OnlyOffice and Collabora Online are separate apps even though they are not usable on their own and are used by Nextcloud for handling office documents?

                • ChristopherMag
                  ChristopherMag @ChristopherMag last edited by

                  For anyone wanting to get this up and running quickly: if you have Docker running on another system, you can run the following:

                  docker run -d --restart unless-stopped -p 3000:3000 gotenberg/gotenberg
                  docker run -d --restart unless-stopped -p 9998:9998 apache/tika
                  

                  And then add the following to your paperless.conf:

                  # Tika
                  PAPERLESS_TIKA_ENABLED=true
                  PAPERLESS_TIKA_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:9998
                  PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://<DockerHostnameOrIPGoesHere>:3000
                  

                  After this you can upload xlsx, docx, etc. to paperless-ngx.
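
                  Before uploading, it may help to confirm that both services are reachable from the paperless-ngx host. The following is a minimal sketch, not part of paperless-ngx: `check_endpoints` is a hypothetical helper name, and it assumes the ports published above, Tika's `/version` endpoint, and the `/health` endpoint that Gotenberg 7.x and later expose.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with paperless-ngx): report whether Tika
# and Gotenberg answer on the ports published by the docker run commands.
# Usage: check_endpoints <DockerHostnameOrIP>
check_endpoints() {
    host="$1"
    rc=0

    # Tika server answers GET /version with its version string.
    if curl -fsS --max-time 5 "http://${host}:9998/version" >/dev/null; then
        echo "tika: ok"
    else
        echo "tika: unreachable"
        rc=1
    fi

    # Assumption: Gotenberg 7.x or later, which exposes GET /health.
    if curl -fsS --max-time 5 "http://${host}:3000/health" >/dev/null; then
        echo "gotenberg: ok"
    else
        echo "gotenberg: unreachable"
        rc=1
    fi

    # Non-zero exit status if either service did not respond.
    return "$rc"
}
```

                  Run it as `check_endpoints <DockerHostnameOrIPGoesHere>` after sourcing the file. If either line reports "unreachable", check the container status and any firewall between the two hosts before changing paperless.conf.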

                  In my testing, if the Docker host running the Tika and Gotenberg containers goes down, paperless-ngx keeps working fine, but you won't be able to upload additional xlsx/docx/etc. documents until you restart the containers. That works out fine for us, as the reliability of paperless-ngx being accessible is way more important than this one feature.

                  • jdaviescoates
                    jdaviescoates @ChristopherMag last edited by

                    @ChristopherMag thanks for sharing!

                    @staff can we get this into the package somehow?

                    I use Cloudron with Gandi & Hetzner

                    • Topic has been marked as a question by nebulon
                    • girish
                      girish Staff @jdaviescoates last edited by

                      @jdaviescoates yes, I will put this down as a point to investigate for the next release. I think we have to investigate what Nextcloud needs for FTS, as well as apps like these, and design accordingly. (Maybe they have to become addons, or alternatively we can just put them in the app itself.)

                      • jdaviescoates
                        jdaviescoates @girish last edited by

                        @girish said in indexing of office documents?:

                        I think we have to investigate what nextcloud needs for FTS as well

                        On that, I recently spotted that they now at least mention Solr under Platform Apps at https://github.com/nextcloud/fulltextsearch but, as far as I can tell from the linked wiki at https://github.com/nextcloud/fulltextsearch/wiki there is still only an Elasticsearch Platform App to date.

                        • timconsidine
                          timconsidine App Dev @ChristopherMag last edited by

                          @ChristopherMag do you or anyone else have experience using this kind of setup with macOS document formats like Pages and Numbers?

                          • ChristopherMag
                            ChristopherMag @timconsidine last edited by ChristopherMag

                            @timconsidine It looks like Apache Tika supports the document formats from the iWork suite, such as Pages.

                            I tried to upload a .pages file to paperless-ngx with Tika and Gotenberg configured, and paperless popped up a failure message with the error "File type application/zip not supported".

                            I believe this signals.py file in the paperless-ngx project would need to add support for the various iWork formats to resolve this error, assuming you already have Tika and Gotenberg set up and working with paperless-ngx.
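
                            One way to narrow this down is to ask Tika directly which type it detects for the file, independently of paperless-ngx. This is only a sketch: `tika_detect` is a hypothetical helper name, and it assumes the Tika container from earlier in the thread is reachable on port 9998 and serves the standard `/detect/stream` endpoint.

```shell
#!/bin/sh
# Hypothetical helper: ask a Tika server which MIME type it detects for a file.
# Usage: tika_detect <DockerHostnameOrIP> <path-to-file>
tika_detect() {
    host="$1"
    file="$2"

    # PUT /detect/stream returns the detected MIME type as plain text.
    # The Content-Disposition header passes the file name along as a hint.
    curl -fsS --max-time 30 \
        -T "$file" \
        -H "Content-Disposition: attachment; filename=$(basename "$file")" \
        "http://${host}:9998/detect/stream"
}
```

                            For example, `tika_detect <DockerHostnameOrIPGoesHere> document.pages`. If Tika reports something more specific than application/zip, that would suggest the rejection happens on the paperless-ngx side rather than in Tika, which is useful detail to include in a bug report.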

                            You could probably open an issue in the paperless-ngx repository on GitHub and see if they can assist with adding support for this.
