zejn / pypdf2xml
Convert text from PDF to XML.
☆45 · Updated 5 years ago
Related projects:
- PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging - https://code.google.com/p/pdfssa4met/ ☆20 · Updated 11 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 · Updated 2 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆95 · Updated last year
- Specification of NAF, the NLP annotation format ☆21 · Updated 3 years ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆45 · Updated 2 years ago
- Statistical text analysis and semantic networks with Python ☆13 · Updated 6 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture ☆37 · Updated 6 months ago
- [archived] ☆18 · Updated 3 years ago
- A workflow system for Natural Language Processing. ☆21 · Updated 4 years ago
- 🍊 🎓 Educational widgets for machine learning and data mining in Orange 3. ☆27 · Updated 6 months ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 · Updated 7 years ago
- An efficient data structure for fast string similarity searches ☆23 · Updated 3 years ago
- Multi-Entity Extraction Framework for Academic Documents (with default extraction tools) ☆29 · Updated 11 months ago
- Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, uses Watson Studio. ☆46 · Updated 5 years ago
- Next generation OCR engine based on LSTMs. ☆51 · Updated 6 years ago
- Open Semantic Visual Linked Data Graph Explorer: Open Source tool (web app) and user interface (UI) for discovery, exploration and visuali… ☆77 · Updated 4 years ago
- Semantic data wiki as well as Linked Data publishing engine ☆204 · Updated 3 months ago
- PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz ☆38 · Updated 6 months ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 · Updated 7 months ago
- Framework for information extraction from tables ☆41 · Updated 5 years ago
- 'ocr-evaluation-tools' from http://ancientgreekocr.org/. Tools to test OCR accuracy. ☆22 · Updated 6 years ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆63 · Updated 3 years ago
- Information extraction and interactive visualization of textual datasets for investigative data-driven journalism and eDiscovery ☆53 · Updated 2 months ago
- PDF Extraction Toolkit ☆41 · Updated 3 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆91 · Updated 2 years ago
- Python bindings for Apache Tika ☆22 · Updated 4 years ago
- Ergonomic line-by-line transcription of scanned text. ☆47 · Updated 3 years ago