fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆31 Updated 6 months ago
Alternatives and similar repositories for PDFtranscript
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- Conversion and validation for JATS XML ☆54 Updated last year
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance … ☆82 Updated 7 years ago
- Web service to generate citations and bibliographies using citeproc-js ☆63 Updated last year
- Web based JavaScript GUI library for proofreading/editing hOCR ☆95 Updated 6 years ago
- PDF to XML ALTO file converter ☆240 Updated last week
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆214 Updated 5 years ago
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- Exploring extracting tables from a PDF to CSV using PDF.JS ☆104 Updated 8 years ago
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by … ☆28 Updated 12 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆392 Updated 9 months ago
- High-level build project for all LAPDF-Text submodules ☆103 Updated 9 years ago
- A library for extracting tables from PDF files ☆89 Updated 11 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆95 Updated 3 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆97 Updated 2 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure ☆50 Updated 6 years ago
- Extract tables from PDF pages. ☆291 Updated 4 years ago
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications ☆41 Updated 2 years ago
- MOVED TO https://gitlab.com/crossref/pdfmark ☆33 Updated 6 years ago
- A Python canonicalizer to disambiguate and recognize known names from a poor quality data entry list. ☆20 Updated 9 years ago
- An online annotation platform for teaching and learning in the humanities. ☆108 Updated 3 months ago
- gathering point for open source OCR scripts and diffs ☆43 Updated 10 years ago
- Github mirror of "search/highlighter" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access… ☆103 Updated 2 weeks ago
- code to remove "noise" from hOCR output of Tesseract OCR. ☆14 Updated 8 years ago
- Crop And Splice Segments (of scanned pages) ☆14 Updated 6 years ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 Updated last year
- A library for extracting tables from PDF files ☆89 Updated 4 years ago
- Tool for visualizing hOCR output from Tesseract (or other OCR engines that support hOCR). ☆24 Updated 10 years ago
- An index data structure for approximate string search. ☆23 Updated 6 years ago
- PDF to EPUB3 Fixed Layout converter ☆25 Updated 3 years ago