fmalina / PDFtranscript
Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.
☆28Updated 3 months ago
Alternatives and similar repositories for PDFtranscript:
Users who are interested in PDFtranscript are comparing it to the libraries listed below.
- Conversion and validation for JATS XML☆52Updated last year
- Ergonomic line-by-line transcription of scanned text.☆51Updated 4 years ago
- An online annotation platform for teaching and learning in the humanities.☆107Updated last month
- LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance …☆82Updated 7 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops☆214Updated 5 years ago
- Run pdf2htmlEX in a Docker container.☆24Updated last year
- yael (Yet Another EPUB Library) is a Python library for reading, manipulating, and writing EPUB 2/3 files☆17Updated 9 years ago
- Web service to generate citations and bibliographies using citeproc-js☆62Updated last year
- Named-Entity Recognition extension for Google Refine / OpenRefine☆72Updated 7 years ago
- A Python library to detect and extract listing data from HTML pages.☆108Updated 7 years ago
- High-level build project for all LAPDF-Text submodules☆103Updated 9 years ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri☆47Updated 3 years ago
- An index data structure for approximate string search.☆23Updated 5 years ago
- CaSSius: a CSS-regions-based PDF typesetter for scholarly communications☆41Updated 2 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation☆48Updated 2 years ago
- neonion is a user-centered collaborative semantic annotation webapp developed at the Human-Centered Computing group at Freie Universität …☆68Updated 6 years ago
- An expandable and scalable OCR pipeline☆87Updated 7 years ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz☆38Updated 11 months ago
- PDF to XML ALTO file converter☆232Updated last week
- Binary Python bindings for poppler utils for content extraction☆42Updated 3 years ago
- Process, enhance and evaluate multiple OCR outputs.☆22Updated 4 months ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF …☆66Updated 4 years ago
- Linguistic search for large annotated text corpora, based on Apache Lucene☆111Updated this week
- Free-for-all repository of TEI and plain text files for you (to do cool stuff) provided by the Digital Collections Services group at the …☆27Updated 7 years ago
- Python binding to libpoppler with focus on text extraction☆97Updated 3 years ago
- An open-source CRF Reference String Parsing Package☆158Updated 4 years ago
- ☆32Updated 2 years ago
- Execute OpenRefine JSON scripts without OpenRefine (or Java)☆29Updated 2 years ago
- View HOCR files with Mirador☆27Updated 7 years ago
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by …☆28Updated 11 years ago