KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation are comparing it to the libraries listed below.
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆58 Updated last week
- Pre-train Static Word Embeddings ☆87 Updated last month
- A polite and user-friendly downloader for Common Crawl data ☆56 Updated 2 months ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆68 Updated 3 months ago
- Fast Text Classification with Compressors dictionary ☆149 Updated 2 years ago
- Contextualized per-token embeddings ☆29 Updated 5 months ago
- A general purpose processing framework for corpora of scientific documents ☆65 Updated this week
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆18 Updated last week
- NLP with Rust for Python 🦀🐍 ☆65 Updated 5 months ago
- Library for fast text representation and classification. ☆31 Updated last year
- Extracts plain text, language identification and more metadata from WARC records ☆23 Updated 2 weeks ago
- Libraries, Archives and Museums (LAM) ☆87 Updated 3 years ago
- Efficiently computing & storing token n-grams from large corpora ☆26 Updated last year
- The Institutional Data Initiative's pipeline for analyzing, refining, and publishing the Institutional Books 1.0 collection. ☆45 Updated this week
- Using embeddings compressed by Product Quantization, in JavaScript ☆31 Updated 2 years ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆21 Updated 3 months ago
- Small Python package to measure OCR quality and other related metrics. ☆25 Updated last year
- GGML implementation of BERT model with Python bindings and quantization. ☆55 Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆234 Updated last month
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 Updated 2 years ago
- ☆78 Updated 10 months ago
- Locality Sensitive Hashing ☆73 Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆26 Updated 10 months ago
- Modular Rust transformer/LLM library using Candle ☆36 Updated last year
- LLM plugin for clustering embeddings ☆82 Updated last year
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆210 Updated 2 weeks ago
- ☆157 Updated 2 years ago
- Like picoGPT but for BERT. ☆50 Updated 2 years ago
- fasttext with wheels and no external dependency, but only the predict method (<1MB) ☆18 Updated 10 months ago
- ☆43 Updated this week