KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
★12, updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation are comparing it to the libraries listed below.
- NLP with Rust for Python 🦀🐍 (★71, updated 8 months ago)
- Pre-train Static Word Embeddings (★94, updated 5 months ago)
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … (★69, updated last month)
- Efficiently computing & storing token n-grams from large corpora (★26, updated last year)
- A polite and user-friendly downloader for Common Crawl data (★67, updated 5 months ago)
- A sentence segmentation library with wide language support optimized for speed and utility. (★86, updated 3 weeks ago)
- Contextualized per-token embeddings (★34, updated 9 months ago)
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) (★17, updated last year)
- Extracts plain text, language identification and more metadata from WARC records (★23, updated 4 months ago)
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… (★68, updated 2 months ago)
- lossily compress representation vectors using product quantization (★59, updated 3 months ago)
- PyLate efficient inference engine (★71, updated last month)
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. (★81, updated 2 years ago)
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. (★111, updated last year)
- Libraries, Archives and Museums (LAM) (★88, updated 3 years ago)
- Faster, modernized fork of the language identification tool langid.py (★60, updated last year)
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… (★230, updated last month)
- Completion After Prompt Probability. Make your LLM make a choice (★82, updated last year)
- Code for SaGe subword tokenizer (EACL 2023) (★27, updated last year)
- A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals … (★15, updated last year)
- ★50, updated 3 months ago
- Tree-based indexes for neural-search (★31, updated last year)
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction (★84, updated last year)
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o… (★155, updated 6 months ago)
- Notebooks for training universal 0-shot classifiers on many different tasks (★140, updated last year)
- Library for fast text representation and classification. (★31, updated 2 years ago)
- Work with static vector models (★36, updated 9 months ago)
- Small python package to measure OCR quality and other related metrics. (★26, updated last year)
- This is a new backend implementation of the ANNIS linguistic search and visualization system. (★18, updated 3 weeks ago)
- A robust web archive analytics toolkit (★129, updated 3 months ago)