Tabula - extracts table data from PDFs (when copy-paste often doesn't)

marcusquinn

"If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there's no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux."

https://tabula.technology
https://github.com/tabulapdf/tabula

Tend to use this a lot for transcribing long PDF invoices.

jdaviescoates

@marcusquinn sounds like a useful tool, but appears to just be a desktop app and not a web app? So not sure how relevant it is to Cloudron...

marcusquinn

@jdaviescoates Ahh, I thought there was a web app/service version. Probably needs moving to Discuss then, if the mods can?

Also for interest, it's pretty easy to send a scanned/image PDF to Google Vision using Integromat to OCR and extract text.

marcusquinn

Revisiting this, the app runs on a localhost web server, hence could be a useful additional utility for teams to have access to at tabula.example.com.

marcusquinn

It might seem unmaintained, but it still works well and remains the only open-source option for this that I know of.

It's also becoming more important as a library to use for LLM data analysis needs.

Dockerised, too, should be relatively simple:

marcusquinn

Python wrapper, too: https://github.com/chezou/tabula-py
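
For anyone wanting to try the wrapper, here is a minimal sketch of pulling every table out of a PDF and saving each one as a CSV. It assumes tabula-py and a Java runtime are installed, and that `tabula.read_pdf` returns a list of pandas DataFrames (which I believe is the default in recent versions). The function name `extract_tables_to_csv`, the `read_pdf` parameter and the output file naming are made up for illustration, not part of tabula-py.

```python
from pathlib import Path


def extract_tables_to_csv(pdf_path, out_dir, read_pdf=None):
    """Write each table found in pdf_path to its own CSV file in out_dir.

    read_pdf is a hypothetical hook so the reader can be swapped out;
    by default it falls back to tabula.read_pdf.
    """
    if read_pdf is None:
        # Imported lazily: needs `pip install tabula-py` and a Java runtime.
        import tabula

        read_pdf = tabula.read_pdf

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pages="all" asks for every page; assumed to return a list of DataFrames.
    tables = read_pdf(str(pdf_path), pages="all")

    written = []
    stem = Path(pdf_path).stem
    for index, table in enumerate(tables, start=1):
        # One CSV per detected table, e.g. invoice_table1.csv
        target = out_dir / f"{stem}_table{index}.csv"
        table.to_csv(target, index=False)
        written.append(target)
    return written
```

Calling `extract_tables_to_csv("invoice.pdf", "csv_out")` (with a hypothetical `invoice.pdf`) should leave one CSV per detected table in `csv_out/`. How well the tables come out will depend on the PDF, so treat it as a starting point rather than a finished tool.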

LoudLemur

@marcusquinn

If people are encouraged to use Free Software, like LibreOffice, for their document creation, they will benefit from being able to export their final draft as a PDF with an embedded .odf for easy data extraction. It can also archive according to ISO / archiving standards, where needed.


Cloudron Forum
