Micka33 / content-extractor
Extract meaningful content, such as text and images, from PDF and PSD files, with both linked in a common JSON string
☆36Updated 7 years ago
Alternatives and similar repositories for content-extractor
Users who are interested in content-extractor are comparing it to the libraries listed below
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/…☆132Updated 2 years ago
- A simple viewer and inspection tool for text boxes in PDF documents☆95Updated 3 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture☆38Updated last year
- Suite of tools for detecting changes in web pages and their rendering☆55Updated last year
- Wrapper for pdftohtml that tries to extract paragraph structure☆51Updated 6 years ago
- A source mirror of Skim, the OSX PDF viewer. The main project homepage is http://skim-app.sourceforge.net/☆44Updated 14 years ago
- Convert a PDF via OCR to a TXT file in UTF-8 encoding☆152Updated last year
- Recognition Models for Kraken and CLSTM☆16Updated 6 years ago
- Transliteration data and models☆56Updated 8 years ago
- Distributed text analysis suite based on Celery☆96Updated 2 years ago
- 📑 SQLite extension to add the Okapi BM25 ranking algorithm☆35Updated 10 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops☆216Updated 5 years ago
- my take at a PDF text extraction utility☆25Updated 10 years ago
- Convert PDF to HTML without losing text or format.☆21Updated 10 years ago
- This is a side project from 2008. This package contains a tool for automatically cropping and deskewing images of book pages captured by …☆28Updated 12 years ago
- An efficient data structure for fast string similarity searches☆22Updated 4 years ago
- Tesseract OCR Mac☆35Updated 7 years ago
- OCR in JavaScript via Emscripten☆99Updated 11 years ago
- Yet another Chinese word segmentation package based on character-based tagging heuristics and CRF algorithm☆245Updated 12 years ago
- Python library for manipulating Open Packaging Convention (OPC) files like .docx, .pptx, and .xlsx☆46Updated 8 years ago
- simple inverted index full text search engine written in python☆13Updated 11 years ago
- ☆61Updated last year
- Training/test data for Dragnet☆41Updated 10 years ago
- ☆23Updated last year
- clone of swftools git repository +mouse scrolling in the PDF viewer☆27Updated 8 years ago
- PDF to JPEG images + HTML with <img> alt text converter☆49Updated 11 years ago
- Similarity hashing☆49Updated 14 years ago
- Docker configuration for MateCat web cattool https://github.com/matecat/MateCat☆21Updated last month
- ☆26Updated 6 years ago
- LightSide Workbench☆24Updated last year