Micka33 / content-extractor
Extract meaningful content, such as text and images, from PDF and PSD files, linked together in a common JSON string
☆37Updated 7 years ago
Alternatives and similar repositories for content-extractor:
Users that are interested in content-extractor are comparing it to the libraries listed below
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/…☆130Updated last year
- ☆43Updated 10 years ago
- Suite of tools for detecting changes in web pages and their rendering☆54Updated last year
- Web/FileSystem Crawler Library☆29Updated this week
- A bundle of html content extraction algorithms☆121Updated 9 years ago
- Wrapper for pdftohtml that tries to extract paragraph structure☆50Updated 6 years ago
- Python code to read text from a PDF file (OCR).☆66Updated 4 years ago
- Get semantic HTML from PDFs, recover lost text, tables, data... in bulk.☆28Updated 2 months ago
- Google word2vec tools built for Windows, compiled with Visual Studio 2017 and Dev-C++ on Windows 10 x64.☆14Updated 7 years ago
- For the code walkthrough, please visit the blog: http://lan2720.github.io/ ☆33Updated 8 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture☆38Updated 11 months ago
- Tools for extracting figures, tables, text, ... from a PDF document.☆32Updated 4 years ago
- Image Pre-processing to improve OCR accuracy.☆20Updated 8 years ago
- Text Classification ToolKit☆22Updated 6 years ago
- A simple viewer and inspection tool for text boxes in PDF documents☆94Updated 2 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops☆214Updated 5 years ago
- Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.☆382Updated 6 months ago
- Python module that aims to crack basic CAPTCHA engines using OpenCV and Pytesser☆39Updated 10 years ago
- Open Collaborative AI Driven Parser builder for Web Scraping, Data Extraction and Crawling, Knowledge Graph Updated last month
- ☆128Updated 8 years ago
- A Simple Chinese OCR from tipdm contest☆63Updated 8 years ago
- Offline Isolated Handwritten Chinese Character Recognition☆16Updated 6 years ago
- Office Document Convertor (ODC) is an online convertor for office document which runs as a web service. Its aim is to provide the facilit…☆43Updated 8 years ago
- OCR for scanned Chinese receipts, based on OpenCV and Tesseract.☆91Updated 6 years ago
- node readability☆22Updated 6 years ago
- Python library for manipulating Open Packaging Convention (OPC) files like .docx, .pptx, and .xlsx☆43Updated 7 years ago
- This plugin provides a useful feature for multi-language☆14Updated 2 years ago
- Extensions for using Scrapy on Amazon AWS☆32Updated 12 years ago
- ☆14Updated 7 years ago
- Nodejs binding for fasttext representation and classification.☆42Updated 11 months ago