caltechlibrary / documentarist
Process Caltech Archives' digital documents and photos, and annotate each page or image with information about its contents
☆12Updated 2 years ago
Alternatives and similar repositories for documentarist:
Users who are interested in documentarist are comparing it to the libraries listed below
- Uses Beautiful Soup to read Wiki pages, Gensim to summarize, NLTK to process, and extracts keywords based on entropy: everything in one b…☆9Updated 4 years ago
- Given a text, wrap it into phrases and send them to Yandex's search engine. If it yields a "did you mean:", substitute the original phras…☆11Updated 6 years ago
- Tool for sentiment analysis annotation☆12Updated 5 months ago
- Post-processing OCR errors with seq2seq models☆28Updated 4 years ago
- A Datasette plugin providing an MLOps platform to train, eval and predict machine learning models☆15Updated this week
- This project lets you automatically extract glossary words and their definitions from a given piece of text using NLP techniques☆29Updated 4 years ago
- Example of building a working Spanish-to-English translation model with Marian NMT☆20Updated 4 years ago
- Embedding Visualizer (EmbedViz) data app made with Streamlit library☆22Updated 4 years ago
- Tools for using OpenAI Codex to do various useful things☆48Updated 3 years ago
- A system for reading scanned documents and grouping them into high level topics☆16Updated 4 years ago
- An autonomous LLM-based agent that generates code to extract structured information from web pages and extracts it.☆10Updated 4 months ago
- Text classification automl☆21Updated 3 years ago
- An implementation of Tiling and Corruption (TACo) Augmentations for OCR/HTR☆15Updated 3 years ago
- A demo project that compares two web scraping frameworks, Playwright and Selenium, using the new pipelining tool Dagster☆13Updated 3 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation☆11Updated 9 months ago
- OCR-D post-correction module based on weighted finite-state transducers☆11Updated last year
- An index data structure for approximate string search.☆23Updated 5 years ago
- Visualize large text collections with WebGL☆25Updated 6 months ago
- Finds linguistic patterns effortlessly☆35Updated last year
- Datamallet is a Python library that contains several helper functions and modules for the common tasks in a typical data science workflow…☆11Updated 2 years ago
- Automating Google Colab with JavaScript to run prescheduled and dynamic Python scripts☆20Updated 3 years ago
- Segmenting a given document using recursive xy-cut algorithm.☆12Updated 6 years ago
- Stylometric framework in Python☆17Updated 9 years ago
- Python tools for Tesseract OCR training☆25Updated 2 years ago
- DFKI Layout Detection for OCR-D☆47Updated 4 months ago
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values …☆15Updated 6 years ago
- Simple and clean Python implementation of TextRank as per seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot…☆11Updated 4 years ago
- ☆12Updated last year
- Generic Environment for Context-Aware Correction of Orthography☆22Updated 2 years ago
- Run tesseract with the tesserocr bindings with @OCR-D's interfaces☆39Updated this week