A fast and friendly PDF scraping library.
☆783 · Oct 17, 2023 · Updated 2 years ago
Alternatives and similar repositories for pdfquery
Users that are interested in pdfquery are comparing it to the libraries listed below.
- The simplest way to extract text from PDFs in Python ☆428 · Jul 7, 2022 · Updated 3 years ago
- Investigative tool for extracting relevant areas from many documents ☆14 · Nov 17, 2015 · Updated 10 years ago
- pdfrw is a pure Python library that reads and writes PDFs ☆1,911 · Apr 29, 2024 · Updated last year
- Python PDF Parser (Not actively maintained). Check out pdfminer.six. ☆5,302 · Dec 7, 2022 · Updated 3 years ago
- Community maintained fork of pdfminer - we fathom PDF ☆6,939 · Mar 13, 2026 · Updated last week
- Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame ☆2,314 · Dec 5, 2024 · Updated last year
- Python wrapper for xpdf ☆19 · Nov 28, 2019 · Updated 6 years ago
- Tables is a simple command-line tool and powerful library for importing data like a CSV or JSON file into relational tables ☆88 · Dec 10, 2022 · Updated 3 years ago
- Plumb a PDF for detailed information about each char, rectangle, line, et cetera, and easily extract text and tables. ☆9,929 · Jan 28, 2026 · Updated last month
- Tabula is a tool for liberating data tables trapped inside PDF files ☆7,362 · Mar 14, 2025 · Updated last year
- Presentation for the NYU Data Lab December 2015 ☆14 · Dec 2, 2015 · Updated 10 years ago
- ☆23 · Mar 7, 2015 · Updated 11 years ago
- A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents. ☆2,258 · Jun 24, 2022 · Updated 3 years ago
- POLITICO's system for managing civic data ☆20 · Dec 7, 2022 · Updated 3 years ago
- A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files ☆9,882 · Updated this week
- A how-to do a mass collection of FEC data using the command-line and regular expressions ☆29 · Mar 18, 2016 · Updated 10 years ago
- A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved. ☆459 · Aug 3, 2023 · Updated 2 years ago
- Extract tables from PDF pages. ☆300 · Jun 25, 2020 · Updated 5 years ago
- extract text from any document. no muss. no fuss. ☆4,483 · Feb 4, 2026 · Updated last month
- Archive of political ad data from the Federal Communications Commission ☆20 · Oct 25, 2017 · Updated 8 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆216 · Dec 3, 2019 · Updated 6 years ago
- Nice and simple US state projections for D3 ☆27 · May 14, 2016 · Updated 9 years ago
- Camelot: PDF Table Extraction for Humans ☆3,717 · Jan 5, 2023 · Updated 3 years ago
- Code for extracting data from a large number of PDFs, particularly FCC political ad documents ☆15 · Oct 26, 2017 · Updated 8 years ago
- Add state and county fips codes to data ☆43 · Sep 4, 2025 · Updated 6 months ago
- A simple command line interface to the datamade/dedupe library. ☆43 · Dec 26, 2022 · Updated 3 years ago
- An implementation of remotestorage for Python, using a git backend. ☆16 · Mar 26, 2015 · Updated 10 years ago
- A Python tool to help extracting information from structured PDFs. ☆429 · Updated this week
- For watching a set of URLs and notifying someone when something has changed. ☆32 · Jun 12, 2017 · Updated 8 years ago
- A web interface to extract tabular data from PDFs ☆1,791 · Jan 3, 2025 · Updated last year
- pneumatic is a bulk-upload library for DocumentCloud. ☆22 · Sep 6, 2020 · Updated 5 years ago
- A repository of journalist's lookup tables. ☆107 · Apr 26, 2017 · Updated 8 years ago
- Collecting various d3.js tricks ☆12 · Sep 23, 2015 · Updated 10 years ago
- NICAR 2016 talk about PDFs! ☆63 · Mar 12, 2016 · Updated 10 years ago
- Turn raw electronic FEC filings into meaningful data ☆19 · May 20, 2016 · Updated 9 years ago
- A command-line tool that fetches data from google spreadsheets and saves it as json in the filesystem. ☆10 · May 12, 2015 · Updated 10 years ago
- Binary Python bindings for poppler utils for content extraction ☆42 · May 12, 2021 · Updated 4 years ago
- Parses Google Documents formatted for annotated transcripts, with JavaScript ☆18 · Feb 14, 2022 · Updated 4 years ago
- The Poor Man's Web Components ☆14 · Oct 31, 2016 · Updated 9 years ago