Tabula - extracts table data from PDFs (when copy-paste often doesn't)
-
"If you’ve ever tried to do anything with data provided to you in PDFs, you know how painful it is — there's no easy way to copy-and-paste rows of data out of PDF files. Tabula allows you to extract that data into a CSV or Microsoft Excel spreadsheet using a simple, easy-to-use interface. Tabula works on Mac, Windows and Linux."
https://tabula.technology
https://github.com/tabulapdf/tabula
Tend to use this a lot for transcribing long PDF invoices.
-
@marcusquinn sounds like a useful tool, but appears to just be a desktop app and not a web app? So not sure how relevant it is to Cloudron...
-
@jdaviescoates Ahh, I thought there was a web app/service version. Prob needs moving to Discuss then if mods can?
Also for interest, it's pretty easy to send a scanned/image PDF to Google Vision using Integromat to OCR and extract text.
-
Revisiting this, the app runs on a localhost web server, hence could be a useful additional utility for teams to have access to at tabula.example.com.
-
Might seem unmaintained, but it still works well, and remains the only open-source option for this that I know of.
It's becoming more important as a library to use for LLM data analysis needs.
Dockerised, too, should be relatively simple:
-
Python wrapper, too: https://github.com/chezou/tabula-py
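For anyone wanting to script the invoice use case with that wrapper, here is a rough sketch rather than anything official. It assumes tabula-py is installed (`pip install tabula-py`) and that a Java runtime is available, since the wrapper calls the Java library underneath. The function name `invoice_tables_to_csv`, the output file naming and the injectable `read_pdf` argument are all made up for illustration; only `tabula.read_pdf` with its `pages` and `multiple_tables` options comes from the wrapper itself.

```python
from pathlib import Path


def invoice_tables_to_csv(pdf_path, out_dir, read_pdf=None):
    """Write each table found in pdf_path to its own CSV file in out_dir.

    Hypothetical helper: read_pdf can be swapped out (e.g. for testing);
    by default tabula-py's read_pdf is used.
    """
    if read_pdf is None:
        # Imported lazily so this module loads even without tabula-py/Java.
        from tabula import read_pdf

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pages="all" scans every page; multiple_tables=True asks for a list
    # with one pandas DataFrame per detected table.
    tables = read_pdf(str(pdf_path), pages="all", multiple_tables=True)

    written = []
    for i, table in enumerate(tables, start=1):
        # Assumed naming scheme: <pdf name>_table<N>.csv
        target = out_dir / f"{Path(pdf_path).stem}_table{i}.csv"
        table.to_csv(target, index=False)
        written.append(target)
    return written
```

Usage would be something like `invoice_tables_to_csv("invoice.pdf", "csv_out")`. Detection quality depends on the PDF, and scanned/image PDFs still need OCR first.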
-
If people are encouraged to use Free Software, like LibreOffice, for their document creation, they benefit from being able to export their final draft as a PDF with an embedded .odf for easy data extraction. It can also export according to ISO archiving standards (PDF/A), where needed.