KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters for German
☆12 Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation are comparing it to the libraries listed below.
- NLP with Rust for Python ☆63 Updated 2 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆46 Updated this week
- Pre-train Static Word Embeddings ☆84 Updated last month
- Efficiently computing & storing token n-grams from large corpora ☆24 Updated 9 months ago
- GGML implementation of BERT model with Python bindings and quantization. ☆56 Updated last year
- Tree-based indexes for neural-search ☆32 Updated last year
- Fast Text Classification with Compressors dictionary ☆150 Updated last year
- Library for fast text representation and classification. ☆30 Updated last year
- Small string compression using smaz compression algorithm. Fast, because it's in C. Supports Python 3+ ☆13 Updated last year
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 Updated 2 years ago
- Small python package to measure OCR quality and other related metrics. ☆24 Updated last year
- Layout Analysis Dataset with Segmonto (LADaS) ☆21 Updated this week
- The application performs real-time inference on audio from an ALSA capture device ☆33 Updated 3 weeks ago
- Generate a SQLite database from Wikipedia & Wikidata dumps. ☆35 Updated last year
- A CLI tool for managing OpenAI batch processing jobs with ease. ☆37 Updated 2 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 Updated 3 months ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆187 Updated 3 weeks ago
- Work with static vector models ☆28 Updated 2 months ago
- The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data. ☆18 Updated 8 months ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆65 Updated 3 weeks ago
- BlinkDL's RWKV-v4 running in the browser ☆47 Updated 2 years ago
- ☆54 Updated last month
- Inference engine for GLiNER models, in Rust ☆62 Updated last week
- Python library to use Pleias-RAG models ☆58 Updated 2 months ago
- Showcase how mxbai-embed-large-v1 can be used to produce binary embedding. Binary embeddings enabled 32x storage savings and 40x faster r… ☆18 Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆180 Updated last month
- A general purpose processing framework for corpora of scientific documents ☆64 Updated 2 weeks ago
- aiXplain enables python programmers to add AI functions to their software. ☆46 Updated this week
- Web browser version of StarCoder.cpp ☆45 Updated last year
- ☆57 Updated 10 months ago