Micka33 / content-extractor
Extract meaningful content from PDF and PSD files, such as text and images, both linked into a common JSON string
☆36 Updated 7 years ago
Alternatives and similar repositories for content-extractor
Users that are interested in content-extractor are comparing it to the libraries listed below
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/… ☆132 Updated 2 years ago
- Recognition Models for Kraken and CLSTM ☆16 Updated 6 years ago
- A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR … ☆66 Updated last year
- Python code to read text from a PDF file (OCR). ☆70 Updated 5 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆55 Updated last year
- Convert a corpus of PDF to clean text files on a distributed architecture ☆38 Updated last year
- A simple viewer and inspection tool for text boxes in PDF documents ☆95 Updated 3 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. ☆397 Updated last year
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by … ☆28 Updated 12 years ago
- Extract tables from PDF pages. ☆296 Updated 5 years ago
- Transliteration data and models ☆56 Updated 8 years ago
- Inspired by Machine Learning course on coursera.org. A helper tool for generating ocr features for Machine Learning algos... ☆77 Updated 5 years ago
- A library for extracting tables from PDF files ☆92 Updated 5 years ago
- Extract tables from scanned image PDFs using Optical Character Recognition. ☆276 Updated 5 years ago
- Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format ☆46 Updated 6 months ago
- Pre-Recognize Library - library with algorithms for improving OCR quality. ☆109 Updated 2 years ago
- Liberate all kinds of data from PDF and other unstructured formats and make the information machine-readable and visualizable for popul… ☆31 Updated 7 years ago
- Mapping photos of Old New York ☆291 Updated 10 months ago
- Attempts to determine the natural language of a selection of Unicode (utf-8) text (a clone of http://code.google.com/p/guess-language wit… ☆48 Updated 15 years ago
- Script for downloading and installing Tesseract OCR Engine on RedHat and CentOS ☆53 Updated 7 years ago
- LightSide Workbench ☆24 Updated 2 years ago
- A simple program to extract the text from an image before performing OCR ☆222 Updated 5 years ago
- Python script for building Google Chrome extension crx ☆34 Updated 11 years ago
- ☆129 Updated 8 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆216 Updated 5 years ago
- Convert a PDF via OCR to a TXT file in UTF-8 encoding ☆152 Updated 2 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure ☆51 Updated 6 years ago
- Pretrained mixed models to be used with Calamari. ☆65 Updated last year
- A Python library to detect and extract listing data from an HTML page. ☆108 Updated 8 years ago
- Tool that extracts the text from web article URLs and gets word frequencies, entity recognition, automatic summary and more ☆20 Updated 6 years ago