KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 · Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users interested in Tokenizer-Evaluation compare it to the libraries listed below.
- A sentence segmentation library with wide language support, optimized for speed and utility. ☆73 · Updated 3 weeks ago
- A polite and user-friendly downloader for Common Crawl data ☆63 · Updated 3 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆65 · Updated this week
- Contextualized per-token embeddings ☆33 · Updated 7 months ago
- Pre-train static word embeddings ☆92 · Updated 3 months ago
- NLP with Rust for Python 🦀🐍 ☆70 · Updated 6 months ago
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated last year
- Library for fast text representation and classification. ☆31 · Updated last year
- Layout Analysis Dataset with Segmonto (LADaS) ☆23 · Updated 4 months ago
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 · Updated 2 years ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆217 · Updated 2 months ago
- Loads the OpenSubtitles v2018 dataset without having to load everything into memory at once. Works well with PyTorch. ☆13 · Updated 5 years ago
- Small string compression using the smaz compression algorithm. Fast, because it's in C. Supports Python 3+. ☆13 · Updated last month
- A Python module for retrieving script types of writing systems, including alphabets, abjads, abugidas, syllabaries, logographs, featurals … ☆15 · Updated last year
- Fast text classification with compressors dictionary ☆150 · Updated 2 years ago
- A new backend implementation of the ANNIS linguistic search and visualization system. ☆18 · Updated 2 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆67 · Updated last week
- Senna is an advanced AI-powered search engine designed to provide users with immediate answers to their queries by leveraging natural lan… ☆19 · Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆246 · Updated 3 months ago
- A general-purpose processing framework for corpora of scientific documents ☆65 · Updated this week
- ☆23 · Updated 10 months ago
- Extracts plain text, language identification, and more metadata from WARC records ☆23 · Updated 2 months ago
- BlinkDL's RWKV-v4 running in the browser ☆47 · Updated 2 years ago
- ☆157 · Updated 2 years ago
- The official repository for Toxic Commons and Celadon: toxicity classification for public-domain data. ☆22 · Updated last year
- 🔢 Work with static vector models ☆35 · Updated 7 months ago
- Utilities for loading and running text embeddings with ONNX ☆44 · Updated 3 months ago
- Python tools for processing the Stack Exchange data dumps into a text dataset for language models ☆85 · Updated 2 years ago
- Fast and versatile tokenizer for language models, compatible with SentencePiece, Tokenizers, Tiktoken and… ☆39 · Updated 2 months ago
- GGML implementation of the BERT model with Python bindings and quantization ☆58 · Updated last year
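One entry above is a high-performance Rust MinHash library for similarity estimation and deduplication. As a rough illustration of the underlying technique (not that library's API), here is a minimal pure-Python MinHash sketch; the shingle size (3) and signature length (64) are arbitrary choices for the example:

```python
import hashlib

def shingles(text, k=3):
    """Set of character k-grams of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def minhash_signature(items, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value seen over all items. Seeding is done by prefixing the
    data with the seed bytes."""
    sig = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(2, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + item.encode(), digest_size=8).digest(),
                "big")
            for item in items
        ))
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots is an unbiased
    estimator of the Jaccard similarity of the two sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over the lazy dogs")
exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact Jaccard = {exact:.2f}, MinHash estimate = {approx:.2f}")
```

With 64 hash functions the estimate's standard error is about 0.06, which is why real deduplication pipelines use larger signatures and band the slots for locality-sensitive hashing.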
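Another entry covers fast text classification with compressors. A common formulation of that idea is nearest-neighbor classification under the normalized compression distance (NCD); the toy training sentences and labels below are made up for illustration:

```python
import gzip

def ncd(x: str, y: str) -> float:
    """Normalized compression distance approximated with gzip:
    similar strings compress better together than apart."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Tiny labeled "training set" (hypothetical example data)
train = [
    ("the match ended in a draw after extra time", "sports"),
    ("the striker scored two goals in the final", "sports"),
    ("the new processor doubles the cache bandwidth", "tech"),
    ("the compiler emits optimized machine code", "tech"),
]

def classify(text: str) -> str:
    """1-nearest-neighbor under NCD."""
    return min(train, key=lambda pair: ncd(text, pair[0]))[1]

print(classify("the goalkeeper saved a penalty in the match"))
```

This is parameter-free and needs no training, but compressing against every training example makes it slow on large corpora, which is what dictionary-based compressor classifiers aim to fix.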