Pleias / toxic-commons
The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data.
☆17Updated 7 months ago
Alternatives and similar repositories for toxic-commons
Users who are interested in toxic-commons are comparing it to the libraries listed below.
- Small python package to measure OCR quality and other related metrics.☆23Updated last year
- ☆67Updated last year
- Code for SaGe subword tokenizer (EACL 2023)☆25Updated 6 months ago
- Library for fast text representation and classification.☆30Updated last year
- ☆22Updated 5 months ago
- BPE modification that implements removing of the intermediate tokens during tokenizer training.☆25Updated 7 months ago
- Python library to use Pleias-RAG models☆57Updated last month
- Pre-train Static Word Embeddings☆79Updated 3 weeks ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face☆32Updated last year
- ☆12Updated 6 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies☆17Updated 2 months ago
- Multilingual Entity Linking with the BELA model☆12Updated last year
- A BERT-based application for reusable text classification at scale☆38Updated last year
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆58Updated last month
- My NER Experiments with ModernBERT☆21Updated last month
- Using short models to classify long texts☆21Updated 2 years ago
- Documentation effort for the BookCorpus dataset☆34Updated 4 years ago
- ☆27Updated 4 months ago
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval"☆25Updated 3 months ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la…☆48Updated last year
- Model implementation for the contextual embeddings project☆33Updated 3 weeks ago
- Efficient few-shot learning with cross-encoders.☆53Updated last year
- Evaluate language models using multiple choice items☆13Updated last month
- ☆22Updated 3 years ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03.☆24Updated 11 months ago
- ☆10Updated 8 months ago
- ☆55Updated last year
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆59Updated 10 months ago
- ☆47Updated 4 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings☆21Updated 4 months ago