Micka33 / content-extractor
Extract meaningful content, such as text and images, from PDF and PSD files, with both linked together in a common JSON string
☆36Updated 7 years ago
Alternatives and similar repositories for content-extractor
Users that are interested in content-extractor are comparing it to the libraries listed below
- A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR …☆66Updated last year
 - A simple viewer and inspection tool for text boxes in PDF documents☆95Updated 3 years ago
 - Suite of tools for detecting changes in web pages and their rendering☆55Updated last year
 - A library for extracting tables from PDF files☆92Updated 5 years ago
 - Similarity hashing☆49Updated 14 years ago
 - Python code to read text from a PDF file (OCR).☆70Updated 5 years ago
 - A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/…☆132Updated 2 years ago
 - Extract tables from PDF pages.☆298Updated 5 years ago
 - my take at a PDF text extraction utility☆25Updated 10 years ago
 - Wrapper for pdftohtml that tries to extract paragraph structure☆52Updated 6 years ago
 - Pretrained mixed models to be used with Calamari.☆65Updated last year
 - Recognition Models for Kraken and CLSTM☆16Updated 6 years ago
 - A bundle of html content extraction algorithms☆122Updated 10 years ago
 - Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.☆402Updated last year
 - Office Document Convertor (ODC) is an online convertor for office document which runs as a web service. Its aim is to provide the facilit…☆44Updated 8 years ago
 - PDF to XML ALTO file converter☆254Updated last month
 - A more complete example of programming with PDFMiner, which continues where the default documentation stops☆216Updated 5 years ago
 - LA-PDFText is a system for extracting accurate text from PDF-based research articles (and an interface to be able to improve performance …☆81Updated 7 years ago
 - OCR in Javascript via Emscripten☆99Updated 11 years ago
 - Type discovery for Python☆24Updated 9 years ago
 - An implementation of RESTful web service for tesseract-OCR using tornado☆136Updated 2 years ago
 - Chinese character component and stroke data☆15Updated 7 years ago
 - A vector similarity database☆230Updated 11 years ago
 - Extract tables from scanned image PDFs using Optical Character Recognition.☆276Updated 5 years ago
 - Image CAPTCHA solving using TensorFlow and a CNN model, with a self-labeled image dataset crawled from a website, free to download my Dataset …☆20Updated 2 years ago
 - Web Content Extraction Through Machine Learning☆184Updated 11 years ago
 - Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.☆108Updated 6 months ago
 - ☆61Updated last year
 - ☆43Updated 11 years ago
 - PDF parser and converter to HTML☆89Updated last year