KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 · Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation compare it to the repositories listed below.
- NLP with Rust for Python 🦀🐍 ☆64 · Updated 3 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆52 · Updated this week
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated 10 months ago
- Pre-train Static Word Embeddings ☆84 · Updated 3 months ago
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆18 · Updated last week
- Modular Rust transformer/LLM library using Candle ☆36 · Updated last year
- Fast Text Classification with Compressors dictionary ☆150 · Updated 2 years ago
- Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing. ☆78 · Updated last year
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆110 · Updated last year
- Completion After Prompt Probability. Make your LLM make a choice ☆80 · Updated 10 months ago
- ☆42 · Updated 4 months ago
- GGML implementation of BERT model with Python bindings and quantization. ☆56 · Updated last year
- Loads OpenSubtitles v2018 dataset without having to load everything into memory at once. Works well with PyTorch. ☆13 · Updated 5 years ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆201 · Updated last month
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 · Updated 3 weeks ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆66 · Updated 2 months ago
- ☆156 · Updated 2 years ago
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ☆80 · Updated 2 years ago
- Small Python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- A Streamlit application to visualize sentence embeddings ☆19 · Updated 2 years ago
- Training code for Sparse Autoencoders on Embedding models ☆38 · Updated 6 months ago
- A polite and user-friendly downloader for Common Crawl data ☆53 · Updated 2 weeks ago
- ☆76 · Updated 8 months ago
- Contextualized per-token embeddings ☆28 · Updated 3 months ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆66 · Updated 2 months ago
- Libraries, Archives and Museums (LAM) ☆85 · Updated 2 years ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆21 · Updated last month
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 · Updated 2 years ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆222 · Updated last week
- Notebooks for training universal 0-shot classifiers on many different tasks ☆136 · Updated 8 months ago