KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters for German
☆12 · Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation are comparing it to the libraries listed below.
- A polite and user-friendly downloader for Common Crawl data ☆64 · Updated 4 months ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆68 · Updated this week
- Pre-train Static Word Embeddings ☆94 · Updated 4 months ago
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated last year
- NLP with Rust for Python 🦀🐍 ☆70 · Updated 7 months ago
- Extracts plain text, language identification and more metadata from WARC records ☆23 · Updated 3 months ago
- Layout Analysis Dataset with Segmonto (LADaS) ☆23 · Updated 5 months ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆77 · Updated last month
- Fast Text Classification with Compressors dictionary ☆150 · Updated 2 years ago
- Contextualized per-token embeddings ☆33 · Updated 7 months ago
- Library for fast text representation and classification. ☆31 · Updated 2 years ago
- A general purpose processing framework for corpora of scientific documents ☆65 · Updated last week
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆18 · Updated 3 weeks ago
- Optimus is a flexible and scalable framework built to train language models efficiently across diverse hardware configurations, including… ☆68 · Updated last month
- Google Colab Notebooks for Transcription with Whisper ☆24 · Updated 8 months ago
- Libraries, Archives and Museums (LAM) ☆88 · Updated 3 years ago
- Python Finite-State Toolkit ☆60 · Updated last week
- ☆62 · Updated last year
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 · Updated last year
- Small python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings. ☆57 · Updated 4 years ago
- This repository contains an easy and intuitive approach to use SetFit in combination with spaCy. ☆80 · Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆27 · Updated last year
- Highly concurrent and fast content processing for Mighty Inference Server ☆10 · Updated 2 years ago
- Flask Interface to Thompson's Motif Index ☆18 · Updated 6 years ago
- High-performance MinHash implementation in Rust with Python bindings for efficient similarity estimation and deduplication of large datas… ☆224 · Updated this week
- Code for collecting, processing, and preparing datasets for the Common Pile ☆248 · Updated 3 months ago
- ☆157 · Updated 2 years ago
- A client library for LAION's effort to filter CommonCrawl with CLIP, building a large scale image-text dataset. ☆32 · Updated 2 years ago
- GGML implementation of BERT model with Python bindings and quantization. ☆58 · Updated last year