fmalina / unilex-transcript

Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.

☆28

Related projects

Alternatives and complementary repositories for unilex-transcript

UB-Mannheim / ocr-gt-tools
Ergonomic line-by-line transcription of scanned text.
☆47 Updated 3 years ago
kermitt2 / pdfalto
PDF to XML ALTO file converter
☆215 Updated last month
ocropus / hocr-tools
Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.
☆369 Updated 3 months ago
OpenPhilology / nidaba
An expandable and scalable OCR pipeline
☆86 Updated 6 years ago
translate / amagama
Web service for implementing a large-scale translation memory
☆90 Updated 3 years ago
garysieling / pdf-js-csv
Exploring extracting tables from a PDF to CSV using PDF.JS
☆103 Updated 8 years ago
CrossRef / pdfmark
MOVED TO https://gitlab.com/crossref/pdfmark
☆33 Updated 5 years ago
cneud / ocr-conversion
Conversions between various OCR formats
☆71 Updated last year
WZBSocialScienceCenter / pdf2xml-viewer
A simple viewer and inspection tool for text boxes in PDF documents
☆92 Updated 2 years ago
BobLd / simple-docstrum
A step-by-step C# implementation of the Docstrum algorithm
☆23 Updated 3 years ago
ashima / pdf-table-extract
Extract tables from PDF pages.
☆276 Updated 4 years ago
not-implemented / hocr-proofreader
Web based JavaScript GUI library for proofreading/editing hOCR
☆92 Updated 6 years ago
BMKEG / lapdftext
LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance …
☆82 Updated 6 years ago
UB-Mannheim / ocr-fileformat
Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
☆181 Updated 3 weeks ago
BMKEG / lapdftextProject
High-level build project for all LAPDF-Text submodules
☆103 Updated 9 years ago
rajbot / autocrop
This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by …
☆28 Updated 11 years ago
kba / hocr-spec
The hOCR Embedded OCR Workflow and Output Format
☆73 Updated 2 months ago
MartinPaulEve / CaSSius
CaSSius: a CSS-regions-based PDF typesetter for scholarly communications
☆41 Updated 2 years ago
dpapathanasiou / pdfminer-layout-scanner
A more complete example of programming with PDFMiner, which continues where the default documentation stops
☆215 Updated 4 years ago
cisocrgroup / PoCoTo
The CIS OCR PostCorrectionTool
☆40 Updated 2 years ago
opensemanticsearch / open-semantic-search-apps
Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a…
☆94 Updated 2 years ago
PeerJ / jats-conversion
Conversion and validation for JATS XML
☆51 Updated 8 months ago
drj11 / pdftables
A library for extracting tables from PDF files
☆87 Updated 4 years ago
bitextor / pdf-extract
PDF parser and converter to HTML
☆83 Updated last month
mgedmin / pdf2html
Wrapper for pdftohtml that tries to extract paragraph structure
☆49 Updated 5 years ago
kba / hocrjs
Working with hOCR in JavaScript
☆120 Updated last year
zejn / pypdf2xml
Convert text from PDF to XML.
☆45 Updated 6 years ago
scrapinghub / mdr
A Python library to detect and extract listing data from HTML pages.
☆109 Updated 7 years ago
jbaiter / hocrviewer-mirador
View hOCR files with Mirador
☆26 Updated 7 years ago
elifesciences / sciencebeam-parser
A set of tools to allow PDF to XML conversion, utilising Apache Beam and other tools. The aim of this project is to bring multiple tools…
☆294 Updated 2 years ago