tabulapdf/tabula-java

Extract tables from PDF files

☆2,035

Alternatives and similar repositories for tabula-java

Users interested in tabula-java are comparing it to the libraries listed below.

tabulapdf / tabula
Tabula is a tool for liberating data tables trapped inside PDF files
☆7,446 · Mar 14, 2025 · Updated last year
chezou / tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
☆2,315 · Dec 5, 2024 · Updated last year
thoqbk / traprange
(Java) A Method to Extract Tabular Content from PDF Files
☆340 · Apr 22, 2023 · Updated 3 years ago
tabulapdf / tabula-extractor
Extract tables from PDF files
☆358 · May 17, 2016 · Updated 10 years ago
camelot-dev / excalibur
A web interface to extract tabular data from PDFs
☆1,810 · May 20, 2026 · Updated 2 months ago
atlanhq / camelot
Camelot: PDF Table Extraction for Humans
☆3,716 · Jan 5, 2023 · Updated 3 years ago
camelot-dev / camelot
A Python library to extract tabular data from PDFs
☆3,786 · Updated this week
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,575 · Updated this week
JonathanLink / PDFLayoutTextStripper
Converts a pdf file into a text file while keeping the layout of the original pdf. Useful to extract the content from a table in a pdf fi…
☆1,608 · Dec 17, 2023 · Updated 2 years ago
ashima / pdf-table-extract
Extract tables from PDF pages.
☆300 · Jun 25, 2020 · Updated 6 years ago
WZBSocialScienceCenter / pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
☆2,255 · Jun 24, 2022 · Updated 4 years ago
apache / pdfbox
Mirror of Apache PDFBox
☆3,093 · Updated this week
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,002 · Mar 13, 2026 · Updated 4 months ago
ropensci / tabulapdf
Bindings for Tabula PDF Table Extractor Library
☆565 · Jan 3, 2025 · Updated last year
pcodding / hadoop_ctakes
Hadoop integration code for working with Apache cTAKES
☆10 · Feb 11, 2014 · Updated 12 years ago
okfn / pdftables
A library for extracting tables from PDF files
☆89 · Sep 27, 2013 · Updated 12 years ago
OpenRefine / OpenRefine
OpenRefine is a free, open source power tool for working with messy data and improving it
☆11,917 · Updated this week
tesseract-ocr / tesseract
Tesseract Open Source OCR Engine (main repository)
☆75,484 · Updated this week
tfmorris / pdf2table
PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz
☆40 · Mar 15, 2024 · Updated 2 years ago
tamirhassan / pdfxtk
PDF Extraction Toolkit
☆43 · Nov 23, 2020 · Updated 5 years ago
euske / pdfminer
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
☆5,283 · Dec 7, 2022 · Updated 3 years ago
grobidOrg / grobid
A machine learning software for extracting information from scholarly documents
☆5,010 · Updated this week
doc-analysis / TableBank
TableBank: A Benchmark Dataset for Table Detection and Recognition
☆1,080 · Aug 12, 2024 · Updated last year
Layout-Parser / layout-parser
A Unified Toolkit for Deep Learning Based Document Image Analysis
☆5,764 · Aug 15, 2024 · Updated last year
LibrePDF / OpenPDF
OpenPDF is an open-source Java library for creating, editing, rendering, and encrypting PDF documents, as well as generating PDFs from HT…
☆4,322 · Jul 8, 2026 · Updated 2 weeks ago
johnkerl / miller
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
☆9,958 · Updated this week
PRImA-Research-Lab / prima-core-libs
Core libraries by the PRImA Research Lab
☆16 · Jul 30, 2024 · Updated last year
coolwanglu / pdf2htmlEX
Convert PDF to HTML without losing text or format.
☆10,606 · Jun 2, 2023 · Updated 3 years ago
cellsrg / tabbypdf
A tool for extracting arbitrary tables from untagged PDF documents
☆40 · Jan 8, 2021 · Updated 5 years ago
BurntSushi / xsv
A fast CSV command line toolkit written in Rust.
☆10,755 · Apr 24, 2025 · Updated last year
apache / superset
Apache Superset is a Data Visualization and Data Exploration Platform
☆73,912 · Updated this week
BobLd / tabula-sharp
Extract tables from PDF files (port of tabula-java)
☆213 · May 4, 2026 · Updated 2 months ago
wireservice / csvkit
A suite of utilities for converting to and working with CSV, the king of tabular file formats.
☆6,404 · Updated this week
drj11 / pdftables
A library for extracting tables from PDF files
☆93 · Aug 2, 2020 · Updated 5 years ago
apache / tika
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
☆3,885 · Updated this week
py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
☆10,121 · Jun 30, 2026 · Updated 3 weeks ago
CeON / CERMINE
Content ExtRactor and MINEr
☆512 · Jun 30, 2022 · Updated 4 years ago
explosion / spaCy
💫 Industrial-strength Natural Language Processing (NLP) in Python
☆33,757 · May 19, 2026 · Updated 2 months ago
deanmalmgren / textract
extract text from any document. no muss. no fuss.
☆4,670 · Jul 11, 2026 · Updated last week