KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 · Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users interested in Tokenizer-Evaluation compare it to the libraries listed below.
- A sentence segmentation library with wide language support, optimized for speed and utility. ☆73 · Updated 3 weeks ago
- A polite and user-friendly downloader for Common Crawl data ☆63 · Updated 3 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆65 · Updated this week
- Contextualized per-token embeddings ☆33 · Updated 7 months ago
- Pre-train static word embeddings ☆92 · Updated 3 months ago
- NLP with Rust for Python 🦀🐍 ☆70 · Updated 6 months ago
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated last year
- Library for fast text representation and classification. ☆31 · Updated last year
- Layout Analysis Dataset with Segmonto (LADaS) ☆23 · Updated 4 months ago
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 · Updated 2 years ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆217 · Updated 2 months ago
- Loads the OpenSubtitles v2018 dataset without having to load everything into memory at once. Works well with PyTorch. ☆13 · Updated 5 years ago
- Small string compression using the smaz compression algorithm. Fast, because it's in C. Supports Python 3+. ☆13 · Updated last month
- A Python module for retrieving script types of writing systems, including alphabets, abjads, abugidas, syllabaries, logographs, featurals … ☆15 · Updated last year
- Fast text classification with compressors dictionary ☆150 · Updated 2 years ago
- A new backend implementation of the ANNIS linguistic search and visualization system. ☆18 · Updated 2 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆67 · Updated last week
- Senna is an advanced AI-powered search engine designed to provide users with immediate answers to their queries by leveraging natural lan… ☆19 · Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆246 · Updated 3 months ago
- A general-purpose processing framework for corpora of scientific documents ☆65 · Updated this week
- ☆23 · Updated 10 months ago
- Extracts plain text, language identification, and more metadata from WARC records ☆23 · Updated 2 months ago
- BlinkDL's RWKV-v4 running in the browser ☆47 · Updated 2 years ago
- ☆157 · Updated 2 years ago
- The official repository for Toxic Commons and Celadon: toxicity classification for public-domain data. ☆22 · Updated last year
- 🔢 Work with static vector models ☆35 · Updated 7 months ago
- Utilities for loading and running text embeddings with ONNX ☆44 · Updated 3 months ago
- Python tools for processing the Stack Exchange data dumps into a text dataset for language models ☆85 · Updated 2 years ago
- Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and… ☆39 · Updated 2 months ago
- GGML implementation of the BERT model with Python bindings and quantization ☆58 · Updated last year
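One entry above is a high-performance Rust MinHash library for similarity estimation and deduplication. As a rough illustration of the underlying technique (not that library's API), here is a minimal pure-Python MinHash sketch; the shingle size (3) and signature length (64) are arbitrary choices for the example:

```python
import hashlib

def shingles(text, k=3):
    """Set of character k-grams of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value seen over all items. Seeding is done by prefixing the
    data with the seed bytes."""
    sig = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(2, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + item.encode(), digest_size=8).digest(),
                "big")
            for item in items
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots is an unbiased
    estimator of the Jaccard similarity of the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over the lazy dogs")
exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact Jaccard = {exact:.2f}, MinHash estimate = {approx:.2f}")
```

With 64 hash functions the estimate's standard error is about 0.06, which is why real deduplication pipelines use larger signatures and band the slots for locality-sensitive hashing.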
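Another entry covers fast text classification with compressors. A common formulation of that idea is nearest-neighbor classification under the normalized compression distance (NCD); the toy training sentences and labels below are made up for illustration:

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized compression distance approximated with gzip:
    similar strings compress better together than apart."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Tiny labeled "training set" (hypothetical example data)
train = [
    ("the match ended in a draw after extra time", "sports"),
    ("the striker scored two goals in the final", "sports"),
    ("the new processor doubles the cache bandwidth", "tech"),
    ("the compiler emits optimized machine code", "tech"),
]

def classify(text: str) -> str:
    """1-nearest-neighbor under NCD."""
    return min(train, key=lambda pair: ncd(text, pair[0]))[1]

print(classify("the goalkeeper saved a penalty in the match"))
```

This is parameter-free and needs no training, but compressing against every training example makes it slow on large corpora, which is what dictionary-based compressor classifiers aim to fix.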