KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation compare it to the libraries listed below.
- Efficiently computing & storing token n-grams from large corpora ☆26 Updated 10 months ago
- A polite and user-friendly downloader for Common Crawl data ☆51 Updated last month
- NLP with Rust for Python 🦀🐍 ☆64 Updated 2 months ago
- Pre-train Static Word Embeddings ☆85 Updated 2 months ago
- Extracts plain text, language identification and more metadata from WARC records ☆23 Updated 2 weeks ago
- Contextualized per-token embeddings ☆27 Updated 3 months ago
- Library for fast text representation and classification. ☆31 Updated last year
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 Updated last week
- ☆76 Updated 7 months ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆50 Updated last week
- Small python package to measure OCR quality and other related metrics. ☆25 Updated last year
- Code for SaGe subword tokenizer (EACL 2023) ☆25 Updated 8 months ago
- Libraries, Archives and Museums (LAM) ☆85 Updated 2 years ago
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆18 Updated this week
- The pipeline for the OSCAR corpus ☆171 Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆216 Updated 2 weeks ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆66 Updated last month
- Using embeddings compressed by Product Quantization, in Javascript ☆31 Updated 2 years ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆21 Updated last month
- Fast Text Classification with Compressors dictionary ☆150 Updated last year
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 Updated 2 years ago
- A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals … ☆13 Updated last year
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 Updated last year
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆193 Updated 3 weeks ago
- Completion After Prompt Probability. Make your LLM make a choice ☆80 Updated 9 months ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆83 Updated last year
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆108 Updated last year
- Nadir: Cutting-edge PyTorch optimizers for simplicity & composability! 🔥🚀💻 ☆14 Updated last year
- Locality Sensitive Hashing ☆72 Updated 2 years ago
- lossily compress representation vectors using product quantization ☆58 Updated 3 months ago