KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆11Updated 2 years ago
Alternatives and similar repositories for Tokenizer-Evaluation
Users who are interested in Tokenizer-Evaluation are comparing it to the libraries listed below.
- Pre-train Static Word Embeddings☆79Updated 3 weeks ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …☆45Updated this week
- Library for fast text representation and classification.☆30Updated last year
- A Streamlit application to visualize sentence embeddings☆19Updated 2 years ago
- Modular Rust transformer/LLM library using Candle☆36Updated last year
- Extracts plain text, language identification and more metadata from WARC records☆22Updated 3 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.☆109Updated last year
- Code for SaGe subword tokenizer (EACL 2023)☆25Updated 6 months ago
- Highly concurrent and fast content processing for Mighty Inference Server☆10Updated 2 years ago
- A sentence segmentation library with wide language support optimized for speed and utility.☆65Updated this week
- Fast Neural Machine Translation in C++ - development repository☆19Updated last year
- Next-generation Punkt sentence boundary detection with zero dependencies☆17Updated 2 months ago
- A Python module for retrieving script types of writing systems including alphabets, abjads, abugidas, syllabaries, logographs, featurals …☆13Updated 11 months ago
- NLP with Rust for Python 🦀🐍☆62Updated last month
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages.☆31Updated 3 months ago
- LTG-Bert☆33Updated last year
- An approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction☆37Updated 3 months ago
- Utilities for loading and running text embeddings with ONNX☆44Updated 10 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆58Updated last month
- 🔢 Work with static vector models☆28Updated 2 months ago
- Python Finite-State Toolkit☆56Updated this week
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial)☆17Updated last year
- Python library to use Pleias-RAG models☆57Updated last month
- aiXplain enables python programmers to add AI functions to their software.☆44Updated this week
- Character-level conversion between Hebrew text and Latin transliteration using deep learning - a demonstration of seq2seq training.☆13Updated last year
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy☆63Updated last year
- Scripts to convert datasets from various sources to Hugging Face Datasets.☆57Updated 2 years ago
- Efficiently computing & storing token n-grams from large corpora☆24Updated 8 months ago
- Showcase how mxbai-embed-large-v1 can be used to produce binary embedding. Binary embeddings enabled 32x storage savings and 40x faster r…☆18Updated last year
- A CLI tool for managing OpenAI batch processing jobs with ease.☆37Updated last month