Tooling to extract data from scanned paper forms OCR-ed by Tesseract using the hOCR standard.
☆86 · Mar 1, 2016 · Updated 10 years ago
Alternatives and similar repositories for whatwordwhere
Users who are interested in whatwordwhere are comparing it to the libraries listed below.
- OCRopus model for Gothic print (Fraktur) ☆19 · Feb 16, 2020 · Updated 6 years ago
- A collection of stemmers in Clojure ☆21 · Jan 17, 2023 · Updated 3 years ago
- Course in Document and Content Analysis. ☆14 · Apr 18, 2020 · Updated 5 years ago
- Notes for my talk "Exploring the Radio Spectrum for News" ☆13 · Mar 6, 2020 · Updated 6 years ago
- ☆11 · Feb 13, 2026 · Updated last month
- Crop And Splice Segments (of scanned pages) ☆14 · Mar 11, 2019 · Updated 7 years ago
- Manuals, lexica, OCR test data for PoCoTo and the profiler ☆15 · Jul 2, 2021 · Updated 4 years ago
- R tools for journalists ☆18 · Mar 9, 2018 · Updated 8 years ago
- ☆25 · Mar 18, 2013 · Updated 13 years ago
- Code & supporting data behind Pioneer Press stories and interactives. ☆14 · Jan 16, 2018 · Updated 8 years ago
- A Ruby parser for electronic candidate, PAC and party campaign filings from the Federal Election Commission. ☆15 · Feb 3, 2024 · Updated 2 years ago
- Deutsch Language Tool Kit ☆12 · Aug 31, 2015 · Updated 10 years ago
- Stand-alone implementation of UCD's IIIF image re-formatting tool + plugin to integrate with Mirador IIIF-compliant image viewer ☆18 · Jul 31, 2017 · Updated 8 years ago
- Natural language generation with hidden markov models (using hmmlearn) ☆25 · Sep 24, 2016 · Updated 9 years ago
- An Editor for creating simple or complex OCR workflows ☆17 · Jun 13, 2024 · Updated last year
- fork of tesseract for emscripten ☆21 · Jul 21, 2015 · Updated 10 years ago
- GermaNER: Free Open German Named Entity Recognition Tool ☆36 · Dec 16, 2023 · Updated 2 years ago
- Guess a person's gender by their first name. Caveats apply. ☆18 · May 6, 2023 · Updated 2 years ago
- Efficient hOCR tooling ☆55 · Aug 18, 2025 · Updated 7 months ago
- Investigative tool for extracting relevant areas from many documents ☆14 · Nov 17, 2015 · Updated 10 years ago
- An unambiguous dialect of ArchieML ☆23 · Oct 27, 2023 · Updated 2 years ago
- ☆13 · Jul 18, 2018 · Updated 7 years ago
- ☆25 · Apr 22, 2018 · Updated 7 years ago
- Ergonomic line-by-line transcription of scanned text. ☆54 · Feb 2, 2026 · Updated last month
- Code for extracting data from a large number of PDFs, particularly FCC political ad documents ☆15 · Oct 26, 2017 · Updated 8 years ago
- Next generation OCR engine based on LSTMs. ☆51 · Apr 8, 2018 · Updated 7 years ago
- Some helpful bash profile functions for working with earth imagery ☆33 · Mar 8, 2020 · Updated 6 years ago
- A list of inspirational and thought-provoking reads about women who code. ☆10 · Nov 20, 2025 · Updated 4 months ago
- A provenance library for bioinformatics workflows 🧬 🔀 📝 ☆14 · Oct 5, 2021 · Updated 4 years ago
- Multi-dimensional LSTM implementation in TensorFlow ☆22 · Sep 25, 2017 · Updated 8 years ago
- nicar 17: advanced pdf manipulation ☆18 · Mar 4, 2017 · Updated 9 years ago
- Test using WebWorkers to run D3 geo projection ☆10 · Jul 2, 2018 · Updated 7 years ago
- An extensible viewer for OCR-D mets.xml files ☆23 · May 30, 2024 · Updated last year
- Rapidly scaffold out visual-vocabulary projects ☆11 · Jan 10, 2019 · Updated 7 years ago
- pneumatic is a bulk-upload library for DocumentCloud. ☆22 · Sep 6, 2020 · Updated 5 years ago
- A peptide string building for expanding chemical dataset combinations. ☆12 · Dec 8, 2024 · Updated last year
- Archive of political ad data from the Federal Communications Commission ☆20 · Oct 25, 2017 · Updated 8 years ago
- Part of eMOP: Franken+ tool for creating font training for Tesseract OCR engine from page images. ☆24 · Sep 24, 2015 · Updated 10 years ago
- My Library of R Helpers ☆13 · Aug 3, 2020 · Updated 5 years ago