jcushman/pdfquery

A fast and friendly PDF scraping library.

☆781

Alternatives and similar repositories for pdfquery

Users who are interested in pdfquery are comparing it to the libraries listed below.

timClicks / slate
The simplest way to extract text from PDFs in Python
☆427 · Jul 7, 2022 · Updated 4 years ago
alexbyrnes / Datapiece
Investigative tool for extracting relevant areas from many documents
☆14 · Nov 17, 2015 · Updated 10 years ago
pmaupin / pdfrw
pdfrw is a pure Python library that reads and writes PDFs
☆1,908 · Apr 29, 2024 · Updated 2 years ago
euske / pdfminer
Python PDF Parser (Not actively maintained). Check out pdfminer.six.
☆5,283 · Dec 7, 2022 · Updated 3 years ago
pdfminer / pdfminer.six
Community maintained fork of pdfminer - we fathom PDF
☆7,002 · Mar 13, 2026 · Updated 4 months ago
ecatkins / xpdf_python
Python wrapper for xpdf
☆19 · Nov 28, 2019 · Updated 6 years ago
chezou / tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
☆2,315 · Dec 5, 2024 · Updated last year
sunlightlabs / read_FEC
Turn raw electronic FEC filings into meaningful data
☆19 · May 20, 2016 · Updated 10 years ago
datanews / tables
Tables is a simple command-line tool and powerful library for importing data like a CSV or JSON file into relational tables
☆88 · Dec 10, 2022 · Updated 3 years ago
jsvine / pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
☆10,575 · Updated this week
tabulapdf / tabula
Tabula is a tool for liberating data tables trapped inside PDF files
☆7,446 · Mar 14, 2025 · Updated last year
kmunger / Topic_Models
Presentation for the NYU Data Lab December 2015
☆14 · Dec 2, 2015 · Updated 10 years ago
WZBSocialScienceCenter / pdftabextract
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
☆2,255 · Jun 24, 2022 · Updated 4 years ago
py-pdf / pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
☆10,121 · Jun 30, 2026 · Updated 3 weeks ago
The-Politico / politico-civic
POLITICO's system for managing civic data
☆20 · Dec 7, 2022 · Updated 3 years ago
datahoarder / fec_individual_donors
A how-to do a mass collection of FEC data using the command-line and regular expressions
☆29 · Mar 18, 2016 · Updated 10 years ago
newsdev / stevedore
search document dumps: ingest and explore in one extensible framework
☆123 · Jun 22, 2020 · Updated 6 years ago
HazyResearch / pdftotree
A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
☆460 · Aug 3, 2023 · Updated 2 years ago
ashima / pdf-table-extract
Extract tables from PDF pages.
☆300 · Jun 25, 2020 · Updated 6 years ago
newsdev / nyt-pyfec
A Python library for downloading, parsing and cleaning Federal Election Commission filings.
☆28 · Jan 30, 2024 · Updated 2 years ago
alexbyrnes / FCC-Political-Ads
Archive of political ad data from the Federal Communications Commission
☆21 · Oct 25, 2017 · Updated 8 years ago
dpapathanasiou / pdfminer-layout-scanner
A more complete example of programming with PDFMiner, which continues where the default documentation stops
☆216 · Dec 3, 2019 · Updated 6 years ago
deanmalmgren / textract
extract text from any document. no muss. no fuss.
☆4,670 · Jul 11, 2026 · Updated last week
lazarogamio / datadoc
A command-line tool that fetches data from google spreadsheets and saves it as json in the filesystem.
☆10 · May 12, 2015 · Updated 11 years ago
dssg / pgdedupe
A simple command line interface to the datamade/dedupe library.
☆43 · Dec 26, 2022 · Updated 3 years ago
atlanhq / camelot
Camelot: PDF Table Extraction for Humans
☆3,716 · Jan 5, 2023 · Updated 3 years ago
alexbyrnes / FCC-Political-Ads_The-Code
Code for extracting data from a large number of PDFs, particularly FCC political ad documents
☆15 · Oct 26, 2017 · Updated 8 years ago
vis4 / d3-geo-state-plane
Nice and simple US state projections for D3
☆27 · May 14, 2016 · Updated 10 years ago
bycoffe / fec-guide
☆25 · Mar 18, 2013 · Updated 13 years ago
jstockwin / py-pdf-parser
A Python tool to help extracting information from structured PDFs.
☆425 · Jul 13, 2026 · Updated last week
veltman / stakeout
For watching a set of URLs and notifying someone when something has changed.
☆32 · Jun 12, 2017 · Updated 9 years ago
fitnr / addfips
Add state and county fips codes to data
☆43 · Sep 4, 2025 · Updated 10 months ago
relet / python-remotestorage
An implementation of remotestorage for Python, using a git backend.
☆16 · Mar 26, 2015 · Updated 11 years ago
camelot-dev / excalibur
A web interface to extract tabular data from PDFs
☆1,810 · May 20, 2026 · Updated 2 months ago
anthonydb / pneumatic
pneumatic is a bulk-upload library for DocumentCloud.
☆22 · Sep 6, 2020 · Updated 5 years ago
wireservice / lookup
A repository of journalist's lookup tables.
☆107 · Apr 26, 2017 · Updated 9 years ago
alephdata / pdflib
Binary Python bindings for poppler utils for content extraction
☆42 · May 12, 2021 · Updated 5 years ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 · Oct 26, 2017 · Updated 8 years ago
OpenNewsLabs / datasmells
☆23 · Mar 7, 2015 · Updated 11 years ago