Tabula - extracts table data from PDFs (when copy-paste often doesn't)
-
"If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there's no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux."
https://tabula.technology
https://github.com/tabulapdf/tabula
Tend to use this a lot for transcribing long PDF invoices.
-
@marcusquinn sounds like a useful tool, but appears to just be a desktop app and not a web app? So not sure how relevant it is to Cloudron...
-
@jdaviescoates Ahh, I thought there was a web app/service version. Prob needs moving to Discuss then if mods can?
Also for interest, it's pretty easy to send a scanned/image PDF to Google Vision using Integromat to OCR and extract text.
-
Revisiting this: the app runs on a localhost web server, so it could be a useful additional utility for teams to have access to at tabula.example.com.
-
Might seem unmaintained, but it still works well, and remains the only open-source option for this that I know of.
It's also becoming more important as a library for LLM data analysis work.
Dockerised, too, should be relatively simple:
-
Python wrapper, too: https://github.com/chezou/tabula-py
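For anyone curious what the wrapper looks like in practice, here is a minimal sketch, assuming tabula-py is installed (`pip install tabula-py`) and a Java runtime is available. The function names `pdf_tables_to_csv` and `read_rows` and the file names are made up for illustration; the only tabula-py call used is `tabula.convert_into`, as shown in its README.

```python
import csv
from pathlib import Path


def pdf_tables_to_csv(pdf_path, csv_path, pages="all"):
    """Extract the tables in a PDF into a single CSV file via tabula-py."""
    # Imported lazily so read_rows() below still works without tabula-py.
    import tabula  # assumption: tabula-py and Java are installed

    tabula.convert_into(
        str(pdf_path), str(csv_path), output_format="csv", pages=pages
    )
    return Path(csv_path)


def read_rows(csv_path):
    """Read the extracted CSV back as a list of rows, skipping empty rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [
            row
            for row in csv.reader(handle)
            if any(cell.strip() for cell in row)  # drop blank or all-empty rows
        ]
```

Something like `read_rows(pdf_tables_to_csv("invoice.pdf", "invoice.csv"))` would then give the invoice lines as plain Python lists, where `invoice.pdf` is just a placeholder path. How clean the rows come out depends a lot on the PDF, so treat this as a starting point rather than a finished pipeline.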
-
If people are encouraged to use Free Software, like LibreOffice, for their document creation, they will benefit from being able to export their final draft as a PDF with an embedded .odf for easy data extraction. LibreOffice can also export to ISO archiving standards (PDF/A), where needed.