zejn / pypdf2xml
Convert text from PDF to XML.
☆45 · Updated 6 years ago
Alternatives and similar repositories for pypdf2xml:
Users who are interested in pypdf2xml are comparing it to the libraries listed below.
- PDF Structure and Syntactic Analysis for Metadata Extraction and Tagging - https://code.google.com/p/pdfssa4met/ ☆20 · Updated 11 years ago
- Python/Django based webapps and web user interfaces for search, structure (meta data management like thesaurus, ontologies, annotations a… ☆95 · Updated 2 years ago
- Ergonomic line-by-line transcription of scanned text. ☆50 · Updated 4 years ago
- LocalCopy is a plugin that extends the popular reference manager JabRef. It provides an automatic download feature for preprints from the… ☆28 · Updated 12 years ago
- Clustering a set of word/tags using K-Means with word2vec or wordnet distance ☆27 · Updated 5 years ago
- This is a REST Server endpoint built using Flask and Python. ☆24 · Updated 2 years ago
- Automatic tagging and analysis of documents in an Apache Solr index for faceted search by RDF(S) Ontologies & SKOS thesauri ☆46 · Updated 3 years ago
- Written in python, for checking reference lists in systematic reviews and literature reviews, helps with reference list searching both ba… ☆23 · Updated 4 years ago
- 🍊 🎓 Educational widgets for machine learning and data mining in Orange 3. ☆27 · Updated 10 months ago
- Editor of training sets for page segmentation and zone classification of scholarly PDFs ☆11 · Updated 8 years ago
- Augment IBM Watson Natural Language Understanding APIs with a configurable mechanism for text classification, uses Watson Studio. ☆46 · Updated 5 years ago
- PAGE XML format collection for document image page content and more ☆67 · Updated 3 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture ☆38 · Updated 10 months ago
- Named entity recognition for the legal domain ☆41 · Updated 3 years ago
- This repo is about the classification of rhetorical roles in Legal Documents such as: Citation, Findings of Fact, Evidence, Legal Rule, R… ☆14 · Updated 2 years ago
- Site Publish is a simple file-based content publishing system that runs on Google App Engine. It is designed to be a part of a dynamic w… ☆13 · Updated 11 years ago
- A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF … ☆66 · Updated 4 years ago
- Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader) ☆183 · Updated 3 months ago
- Next generation OCR engine based on LSTMs. ☆52 · Updated 6 years ago
- The CIS OCR PostCorrectionTool ☆40 · Updated 2 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure ☆50 · Updated 6 years ago
- Python interface for building rule-based expert systems over PyCLIPS ☆12 · Updated 2 years ago
- 'ocr-evaluation-tools' from http://ancientgreekocr.org/. Tools to test OCR accuracy. ☆22 · Updated 6 years ago
- [archived] ☆18 · Updated 3 years ago
- Specification of NAF, the NLP annotation format ☆21 · Updated 4 years ago
- Miscellaneous materials for teaching NLP using NLTK ☆37 · Updated 7 years ago
- A simple viewer and inspection tool for text boxes in PDF documents ☆94 · Updated 2 years ago
- Given a text, wrap it into phrases and send them to Yandex's search engine. If it yields a "did you mean:", substitute the original phras… ☆11 · Updated 6 years ago