izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

☆97

Alternatives and similar repositories for pdfparser

Users who are interested in pdfparser compare it to the libraries listed below.

ashima / pdf-table-extract
Extract tables from PDF pages.
☆298 Updated 5 years ago
drj11 / pdftables
A library for extracting tables from PDF files
☆92 Updated 5 years ago
okfn / pdftables
A library for extracting tables from PDF files
☆89 Updated 12 years ago
WZBSocialScienceCenter / pdf2xml-viewer
A simple viewer and inspection tool for text boxes in PDF documents
☆96 Updated 3 years ago
dpapathanasiou / pdfminer-layout-scanner
A more complete example of programming with PDFMiner, which continues where the default documentation stops
☆216 Updated 6 years ago
lanl / pyxDamerauLevenshtein
pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance.
☆250 Updated 4 months ago
axiak / fuzzyset
A simple fuzzy matching set for python strings
☆230 Updated last year
kororo / excelcy
Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG.
☆104 Updated 3 years ago
virantha / pypdfocr
Python script to do PDF OCR conversion using Tesseract
☆374 Updated 2 years ago
ocropus / hocr-tools
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
☆407 Updated last year
textpipe / textpipe
Textpipe: clean and extract metadata from text
☆302 Updated 4 years ago
tamirhassan / pdfxtk
PDF Extraction Toolkit
☆42 Updated 5 years ago
HazyResearch / pdftotree
A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
☆461 Updated 2 years ago
paperai / pdfanno
Linguistic Annotation and Visualization Tool for PDF Documents
☆200 Updated 6 years ago
explosion / displacy-ent
displaCy-ent.js: An open-source named entity visualiser for the modern web
☆200 Updated 7 years ago
doukremt / distance
Levenshtein and Hamming distance computation
☆117 Updated 6 years ago
danvk / oldnyc
Mapping photos of Old New York
☆293 Updated last year
pirate / spellchecker
A spell-checker extending Peter Norvig's with multi-typo correction, hamming distance weighting, and more.
☆98 Updated 5 years ago
DistrictDataLabs / baleen
An automated ingestion service for blogs to construct a corpus for NLP research.
☆86 Updated 7 years ago
btimby / fulltext
Python library for extracting text from various file formats (for indexing).
☆114 Updated 4 years ago
ines / spacy-graphql
🤹‍♀️ Query spaCy's linguistic annotations using GraphQL
☆86 Updated 7 years ago
marcolagi / quantulum
Python library for information extraction of quantities from unstructured text
☆118 Updated 2 years ago
yougov / fuzzy
☆52 Updated 2 years ago
usnistgov / ocr-pipeline
Convert a corpus of PDF to clean text files on a distributed architecture
☆38 Updated last year
andychase / reparse
Regular Expression based parsers for extracting data from natural languages
☆71 Updated 8 years ago
oubiwann / metaphone
A Python implementation of the Metaphone and Double Metaphone algorithms
☆83 Updated last year
grantjenks / python-wordsegment
English word segmentation, written in pure-Python, and based on a trillion-word corpus.
☆378 Updated 3 years ago
openpaperwork / libpillowfight
Small library containing various image processing algorithms (+ Python 3 bindings) that has almost no dependencies -- Moved to Gnome's Gi…
☆62 Updated 7 years ago
jcushman / pdfquery
A fast and friendly PDF scraping library.
☆783 Updated 2 years ago
pyhunspell / pyhunspell
(Official repo for pypi package) Python bindings for the Hunspell spellchecker engine
☆189 Updated 5 years ago