caltechlibrary / documentarist
Process Caltech Archives' digital documents and photos, and annotate each page or image with information about its contents
☆12Updated 2 years ago
Related projects:
- A Datasette plugin providing an MLOps platform to train, eval and predict machine learning models☆15Updated last week
- Post-processing OCR errors with seq2seq models☆28Updated 4 years ago
- Python tools for Tesseract OCR training☆25Updated 2 years ago
- The Seshat audio annotation management platform☆13Updated 3 years ago
- ☆15Updated 3 years ago
- DFKI Layout Detection for OCR-D☆48Updated 4 months ago
- Finds linguistic patterns effortlessly☆31Updated last year
- Tools for evaluating OCR performance relative to ground truth.☆9Updated 8 months ago
- Uses Beautiful Soup to read Wiki pages, Gensim to summarize, NLTK to process, and extracts keywords based on entropy: everything in one b…☆9Updated 3 years ago
- Hybrid architecture media server, media service and Streamlit client app using FastAPI and Python☆12Updated 2 years ago
- A Python package to get useful information from documents using TopicRank Algorithm.☆16Updated last year
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t…☆31Updated last year
- Stylometric framework in Python☆13Updated 9 years ago
- ☆10Updated 5 years ago
- Run tesseract with the tesserocr bindings with @OCR-D's interfaces☆38Updated 3 weeks ago
- Batch processing using joblib including tqdm progress bars☆20Updated 2 years ago
- Deep learning-based reverse image search using the Annoy library☆16Updated 5 years ago
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values …☆15Updated 6 years ago
- ☆12Updated 10 months ago
- METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)☆52Updated last year
- Ergonomic line-by-line transcription of scanned text.☆47Updated 3 years ago
- Reproducing "Writing with Transformer" demo, using aitextgen/FastAPI in backend, Quill/React in frontend☆28Updated 3 years ago
- OCR-D post-correction module based on weighted finite-state transducers☆11Updated 8 months ago
- A library to create and load tfrecord files as tf.data.Dataset☆9Updated 4 months ago
- A web app built with Streamlit that summarizes input text☆13Updated 3 years ago
- Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.☆24Updated 2 years ago
- Fast and accurate natural language detection. Detector written in Python. Nito-ELD, ELD.☆11Updated 11 months ago
- Binary Python bindings for poppler utils for content extraction☆42Updated 3 years ago
- Dataiku DSS plugin to detect languages, correct misspellings, and clean text data 🧼☆23Updated 4 months ago
- 'ocr-evaluation-tools' from http://ancientgreekocr.org/. Tools to test OCR accuracy.☆22Updated 6 years ago