zejn / pypdf2xml
Convert text from PDF to XML.
☆45 Updated 6 years ago
Related projects
Alternatives and complementary repositories for pypdf2xml
- Convert a corpus of PDF to clean text files on a distributed architecture ☆37 Updated 8 months ago
- Ergonomic line-by-line transcription of scanned text. ☆48 Updated 3 years ago
- PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging - https://code.google.com/p/pdfssa4met/ ☆20 Updated 11 years ago
- Analyze XML extracted from PDFs (e.g. from TET or PDFMiner) ☆20 Updated 6 years ago
- Pre-built Scrapy spiders for AutoExtract ☆19 Updated 6 months ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆95 Updated 2 years ago
- An expandable and scalable OCR pipeline ☆86 Updated 7 years ago
- (Python) Execute tesseract OCR on a multi-page PDF. ☆18 Updated last year
- Specification of NAF, the NLP annotation format ☆21 Updated 3 years ago
- Editor of training sets for page segmentation and zone classification of scholarly PDFs ☆11 Updated 8 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆92 Updated 2 years ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆65 Updated 4 years ago
- Python based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & N… ☆262 Updated 2 years ago
- Functional and structural analysis of tables in research papers (Table disentangling) ☆20 Updated 7 years ago
- Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, uses Watson Studio. ☆46 Updated 5 years ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆46 Updated 2 years ago
- LocalCopy is a plugin that extends the popular reference manager JabRef. It provides an automatic download feature for preprints from the… ☆28 Updated 12 years ago
- [archived] ☆18 Updated 3 years ago
- python-timbl, originally developed by Sander Canisius, is a Python extension module wrapping the full TiMBL C++ programming interface. Wi… ☆18 Updated 3 weeks ago
- An HTML to JSON API webscraper for ResearchGate, adaptable for other sites ☆19 Updated 5 years ago
- Open Semantic Visual Linked Data Graph Explorer: Open Source tool (web app) and user interface (UI) for discovery, exploration and visuali… ☆78 Updated 4 years ago
- GROBID extension for identifying and normalizing physical quantities. ☆75 Updated 2 months ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆14 Updated 7 years ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- Installer for Thymeflow, a personal knowledge management system. ☆32 Updated 6 years ago
- Python/Flask-based website for text analysis workflow. Previous (stable) release is live at: ☆120 Updated 6 months ago
- ☆21 Updated 7 years ago
- Python bindings for Neo4j ☆26 Updated 10 years ago
- 🚀GUI for training spaCy models ☆53 Updated 3 years ago
- Code and models for our CLEF-HIPE (Named Entity Processing on Historical Newspapers) submissions ☆19 Updated last year