zejn / pypdf2xml
Convert text from PDF to XML.
☆45 Updated 6 years ago
Alternatives and similar repositories for pypdf2xml:
Users who are interested in pypdf2xml are comparing it to the libraries listed below.
- ☆37 Updated 9 years ago
- PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging - https://code.google.com/p/pdfssa4met/ ☆19 Updated 12 years ago
- Open Semantic Visual Linked Data Graph Explorer: Open Source tool (web app) and user interface (UI) for discovery, exploration and visuali… ☆84 Updated 5 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆187 Updated last month
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆47 Updated 3 years ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 Updated 4 years ago
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- Documentation and use cases for ALTO XML ☆41 Updated 6 years ago
- The CIS OCR PostCorrectionTool ☆41 Updated 2 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆97 Updated 2 years ago
- ALTO XML schema - latest and all former versions ☆52 Updated 8 months ago
- Convert between Tesseract hOCR and ALTO XML using XSL stylesheets ☆55 Updated 8 months ago
- Convert a corpus of PDF to clean text files on a distributed architecture ☆38 Updated last year
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi… ☆18 Updated 2 months ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆214 Updated 5 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- Analyze XML extracted from PDFs (e.g. from TET or PDFMiner) ☆20 Updated 7 years ago
- Stuff for the Text Mining course ☆28 Updated 2 months ago
- An expandable and scalable OCR pipeline ☆87 Updated 7 years ago
- Specification of NAF, the NLP annotation format ☆21 Updated 4 years ago
- Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images. ☆24 Updated 9 years ago
- Orange Data Mining Homepage ☆16 Updated 5 years ago
- Deep Knowledge Extraction from Text ☆37 Updated 3 years ago
- test ☆23 Updated 4 years ago
- PAGE XML format collection for document image page content and more ☆67 Updated 3 years ago
- Next generation OCR engine based on LSTMs. ☆52 Updated 6 years ago
- A powerful, tagset-independent and theory-neutral meta model and API for storing, manipulating, and representing nearly all types of ling… ☆15 Updated 2 years ago
- Semantic data wiki as well as Linked Data publishing engine ☆206 Updated 9 months ago
- 'ocr-evaluation-tools' from http://ancientgreekocr.org/. Tools to test OCR accuracy. ☆22 Updated 7 years ago
- LaMachine - A software distribution of our in-house as well as some 3rd party NLP software - Virtual Machine, Docker, or local compilatio… ☆68 Updated last year