Micka33 / content-extractor
Extract meaningful content from PDF and PSD files, such as text and images, linked together in a common JSON string
☆37 Updated 7 years ago
Alternatives and similar repositories for content-extractor:
Users who are interested in content-extractor are comparing it to the libraries listed below:
- A selection of test lines of several early printed books as well as the corresponding individual OCRopus models and mixed models. ☆10 Updated 7 years ago
- A small framework taking over the manual training process described in the Tesseract3 Wiki: https://code.google.com/p/tesseract-ocr/wiki/… ☆131 Updated last year
- PDF to XML ALTO file converter ☆234 Updated this week
- A wrapper for tesseract / abbyyOCR11 ocr4linux finereader cli that can perform batch operations or monitor a directory and launch an OCR … ☆65 Updated last year
- Pretrained mixed models to be used with Calamari. ☆61 Updated 6 months ago
- Ergonomic line-by-line transcription of scanned text. ☆51 Updated 4 years ago
- HOCR Specification Python Parser ☆13 Updated 9 years ago
- Tool that does layout analysis and/or text recognition using tesseract and outputs the result in Page XML format ☆46 Updated this week
- The CIS OCR PostCorrectionTool ☆41 Updated 2 years ago
- A more complete example of programming with PDFMiner, which continues where the default documentation stops ☆214 Updated 5 years ago
- Convert a corpus of PDF to clean text files on a distributed architecture ☆38 Updated last year
- Liberate all kinds of data from PDF and other unstructured formats and make the information machine-readable and visualizable for popul… ☆31 Updated 6 years ago
- Training files produced for and by the Tesseract OCR engine for work on the Early Modern OCR Project (eMOP) ☆36 Updated 9 years ago
- Recognition Models for Kraken and CLSTM ☆14 Updated 5 years ago
- Layout Analysis Evaluator for the ICDAR 2017 competition on Layout Analysis for Challenging Medieval Manuscripts ☆22 Updated 5 years ago
- my take at a PDF text extraction utility ☆14 Updated 9 years ago
- Tool for visualizing hOCR output from Tesseract (or other OCR engines that support hOCR). ☆23 Updated 10 years ago
- files and code related to the Early Modern OCR Project (eMOP) at the IDHMC ☆16 Updated 10 years ago
- Crop And Splice Segments (of scanned pages) ☆14 Updated 6 years ago
- 'ocr-evaluation-tools' from http://ancientgreekocr.org/. Tools to test OCR accuracy. ☆22 Updated 7 years ago
- Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features. ☆108 Updated last year
- Inline annotation for the web in pure Javascript. Select text, images, or (nearly) anything else, and add your notes. ☆9 Updated 8 years ago
- Next generation OCR engine based on LSTMs. ☆52 Updated 6 years ago
- Suite of tools for detecting changes in web pages and their rendering ☆54 Updated last year
- Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s) ☆17 Updated last week
- An intelligent OCR to detect tables and pure text inside PDFs and obtain a CSV file and a TXT file from them ☆14 Updated 6 years ago
- A source mirror of Skim, the OSX PDF viewer. The main project homepage is http://skim-app.sourceforge.net/ ☆43 Updated 14 years ago
- OCR evaluation brought to you by University of Alicante ☆67 Updated 2 years ago
- A Python library to detect and extract listing data from HTML pages. ☆108 Updated 7 years ago
- Tools for extracting figures, tables, text, ... from a PDF document. ☆32 Updated 4 years ago