Micka33 / content-extractor
Extract meaningful content from PDF and PSD files, such as text and images, both linked into a common JSON string
☆37 Updated 6 years ago
Alternatives and similar repositories for content-extractor:
Users that are interested in content-extractor are comparing it to the libraries listed below
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/… ☆130 Updated last year
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- A simple viewer and inspection tool for text boxes in PDF documents ☆94 Updated 2 years ago
- A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models. ☆10 Updated 7 years ago
- Extract tables from PDF pages. ☆283 Updated 4 years ago
- A Python library to detect and extract listing data from an HTML page. ☆109 Updated 7 years ago
- Attempts to isolate and remove translucent watermarks from a sample of images. ☆47 Updated 6 years ago
- Google word2vec tools built for Windows, compiled with Visual Studio 2017 and Dev-C++ on Windows 10 x64. ☆15 Updated 7 years ago
- Convert a corpus of PDFs to clean text files on a distributed architecture ☆38 Updated 10 months ago
- Distributed text analysis suite based on Celery ☆95 Updated 2 years ago
- Extract structured data from PDF invoices ☆13 Updated 3 years ago
- Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features. ☆108 Updated 9 months ago
- Python computer vision project that aims to automatically remove the watermarks of stock images. The algorithm is designed off of those o… ☆24 Updated 4 years ago
- Image pre-processing to improve OCR accuracy. ☆20 Updated 8 years ago
- Office Document Convertor (ODC) is an online converter for office documents that runs as a web service. Its aim is to provide the facilit… ☆43 Updated 8 years ago
- Adaptive crawler which uses Reinforcement Learning methods ☆169 Updated 6 years ago
- Recognition Models for Kraken and CLSTM ☆13 Updated 5 years ago
- Python binding to libpoppler with focus on text extraction ☆97 Updated 3 years ago
- Tools for web page segmentation. In development ☆17 Updated 6 years ago
- Linguistic Annotation and Visualization Tool for PDF Documents ☆200 Updated 5 years ago
- Training/test data for Dragnet ☆41 Updated 9 years ago
- Python API for Various DB-Backed Simhash Clusters ☆64 Updated 7 years ago
- Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format ☆46 Updated 9 months ago
- Automatic de-keystoning for single camera DIY book scanners ☆23 Updated 8 years ago
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk. ☆28 Updated last month
- Automatic Item List Extraction ☆87 Updated 8 years ago
- A RESTful web service adaptor for pdf2json, built with restify and nodejs. ☆34 Updated 2 years ago
- Plugin to use rich text in Annotator ☆30 Updated 10 years ago