KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters for German
★12 · Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation compare it to the libraries listed below.
- Pre-train Static Word Embeddings ★90 · Updated 2 months ago
- NLP with Rust for Python 🦀🐍 ★66 · Updated 6 months ago
- A sentence segmentation library with wide language support optimized for speed and utility. ★71 · Updated last week
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ★63 · Updated 2 weeks ago
- Contextualized per-token embeddings ★31 · Updated 6 months ago
- Extracts plain text, language identification and more metadata from WARC records ★23 · Updated last month
- Library for fast text representation and classification. ★31 · Updated last year
- A polite and user-friendly downloader for Common Crawl data ★59 · Updated 3 months ago
- Small python package to measure OCR quality and other related metrics. ★25 · Updated last year
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ★18 · Updated last month
- ★79 · Updated 2 weeks ago
- Fast Text Classification with Compressors dictionary ★150 · Updated 2 years ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ★215 · Updated last month
- Efficiently computing & storing token n-grams from large corpora ★26 · Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ★242 · Updated 2 months ago
- PyLate efficient inference engine ★67 · Updated 2 months ago
- Highly concurrent and fast content processing for Mighty Inference Server ★10 · Updated 2 years ago
- Python library to use Pleias-RAG models ★66 · Updated 6 months ago
- A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals … ★15 · Updated last year
- Libraries, Archives and Museums (LAM) ★88 · Updated 3 years ago
- Tree-based indexes for neural-search ★32 · Updated last year
- Modular Rust transformer/LLM library using Candle ★36 · Updated last year
- Trainable embedding transformation for confidence estimation, feature extraction, explainability and conversion from dense to sparse. ★26 · Updated 5 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ★67 · Updated 4 months ago
- Code for SaGe subword tokenizer (EACL 2023) ★27 · Updated 11 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ★110 · Updated last year
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ★17 · Updated last year
- Inference engine for GLiNER models, in Rust ★76 · Updated 2 weeks ago
- My NER Experiments with ModernBERT and Ettin ★22 · Updated 4 months ago
- Flask Interface to Thompson's Motif Index ★18 · Updated 6 years ago