Installing Gotenberg and Tika was simpler than I expected (if you have spare VPSes)!
-
Someone HAS shared this info already, but I wanted to share it again.
I really wanted to upload Word files, but obviously they won't get processed unless you have Gotenberg and Tika installed. Thanks to the post above from @ChristopherMag, I thought I'd try to get those installed.
My challenges: I didn't want to install them on the VPS that runs Cloudron because I don't know whether that would mess it up. Conceptually, I also wasn't sure what address to use for an install on the same VPS as Cloudron, since they wouldn't be in the same Docker world (I don't know what it's called) as Cloudron. So, I looked at my other VPS.
VPS 2 runs CapRover, and CapRover offers Gotenberg. My first attempt failed with a 504 error, which is a gateway timeout. So I lengthened the access times and wait times for Gotenberg in the CapRover install form, and the second time it worked. I then copied and pasted the four lines that @ChristopherMag had in his post and edited them for my Gotenberg URL*.
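For anyone following along, here is a rough sketch of the kind of settings involved. These are not @ChristopherMag's exact four lines (see his post for those); they are the Tika-related settings documented by paperless-ngx, and the hostname and IP below are made-up placeholders. The Tika endpoint only gets a real value once Tika is running, which is the next step.

```shell
# Sketch of the Tika/Gotenberg settings in paperless.conf (paperless-ngx).
# The hostname and IP are hypothetical placeholders, not real servers.

# Turn on the Tika parser so Office files (.docx, .xlsx, ...) are consumed.
PAPERLESS_TIKA_ENABLED=true

# Gotenberg as exposed by the CapRover dashboard: plain domain, no port.
# Use https:// instead if HTTPS is enabled for the app.
PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://gotenberg.caprover.example.com

# Tika published with `docker run -p ...:9998:9998`: the port is part of the URL.
PAPERLESS_TIKA_ENDPOINT=http://203.0.113.10:9998
```

Paperless needs a restart after the file is edited before the new settings take effect.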
Then, I needed another VPS since the RAM on VPS 2 was close to max with that latest app. Fortunately, I have other VPSes! One of them is running a LAMP setup plus a Presearch install. I figured that adding Tika to that should be fine, and if something borked up, it wouldn't be so tragic a loss. I followed https://github.com/apache/tika-docker, running
docker run -d -p 127.0.0.1:9998:9998 apache/tika:<tag>
(replacing the localhost address 127.0.0.1 with my VPS' IP). It worked. Then I entered that URL in the paperless-ngx paperless.conf. Voila! It all works.
* However, I had earlier tried to install Gotenberg using Easypanel (on yet another VPS - yeah, I'm a LEB fan). The Easypanel dashboard made it seem like I needed port 3000 as part of the URL for paperless-ngx, so I entered the URL with :3000, but it never worked.
So when I found I could use CapRover for Gotenberg, I saw that its dashboard just gave the domain as the URL, minus the port. So, I thought, "OK, I will try that." Now, in my paperless.conf, the Tika URL includes the port 9998, but the Gotenberg URL doesn't, and it works. I wonder whether I actually needed the port when using Easypanel, but I am not going to try because right now it's all working.
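A likely explanation for the port difference, offered as a guess rather than something verified on these servers: CapRover puts its apps behind a reverse proxy on the normal web ports, so the bare domain reaches Gotenberg, while the Tika container's port is published directly and has to be named. One way to settle this kind of question without touching a working config is to probe each candidate URL from the Cloudron server. The sketch below assumes curl is installed; `probe` and `report` are helper names made up for illustration, and the URLs in the usage comments are placeholders. `/health` (Gotenberg) and `/version` (Tika server) are the status routes those projects expose.

```shell
#!/usr/bin/env bash
# probe URL: succeeds only if the URL answers without an HTTP error.
# curl --fail returns non-zero on connection errors, timeouts,
# and HTTP status codes >= 400.
probe() {
  curl --fail --silent --show-error --output /dev/null --max-time 10 "$1"
}

# report URL...: print one OK/FAIL line per candidate URL.
report() {
  local url
  for url in "$@"; do
    if probe "$url"; then
      echo "OK    $url"
    else
      echo "FAIL  $url"
    fi
  done
}

# Example usage (hypothetical hosts; swap in your own):
#   report "http://gotenberg.example.com/health" \
#          "http://gotenberg.example.com:3000/health" \
#          "http://203.0.113.10:9998/version"
```

Whichever form comes back OK is the one to put in paperless.conf. Running the Tika probe from a machine that is neither the Tika VPS nor the Cloudron server also shows whether port 9998 is reachable from the wider internet, which is what the warning on the Tika page is about.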
I guess I'm surprised because in the past, when I tried to install more than one thing using Docker by hand, it never worked. I always had to connect them somehow (I think it was more docker-compose at the time), and I could never figure it out. But maybe I understand it a bit better now, and I was pretty sure that plopping in Tika beside Presearch shouldn't mess anything up. I'm glad I could use CapRover for Gotenberg though, since it needs more working parts than Tika to function. My only concern is a warning from the Tika page:
In the example above, we recommend binding the server to localhost because Docker alters iptables and may expose your tika-server to the internet. If you are confident that your tika-server is on an isolated network you can simply run:
I need to do some reading to see if using the IP of the VPS might weaken that server somehow.
-
@scooke maybe you can clarify one thing which is not clear in my mind.
I put PDF and JPG/PNG into Paperless because these formats are usually not edited; they're semi-frozen.
XLS(X) and DOC(X) are often more living documents with edits, especially XLS(X). Does that mean you re-upload into Paperless when you make a local edit, and delete the old one? Or do you only upload MS documents which are "finished" and won't change?
I think Paperless is great and use it for "documents of record", invoices, agreements etc. I tend to think Nextcloud (or Seafile in my case) is more appropriate for living documents.
Interested in your and others' views.
-
@timconsidine Yes, definitely for "finished" documents.