Pleias / toxic-commons
The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data.
☆18 Updated 9 months ago
Alternatives and similar repositories for toxic-commons
Users that are interested in toxic-commons are comparing it to the libraries listed below
- Small Python package to measure OCR quality and other related metrics. ☆25 Updated last year
- Code for SaGe subword tokenizer (EACL 2023) ☆25 Updated 8 months ago
- One-stop shop for running and fine-tuning transformer-based language models for retrieval ☆57 Updated this week
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ☆59 Updated last year
- Official Repository for "Hypencoder: Hypernetworks for Information Retrieval" ☆27 Updated 5 months ago
- ☆67 Updated last year
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ☆14 Updated 4 months ago
- Pre-train Static Word Embeddings ☆85 Updated 2 months ago
- Efficient few-shot learning with cross-encoders. ☆56 Updated last year
- ☆51 Updated 6 months ago
- 🔢 Work with static vector models ☆29 Updated 3 months ago
- A Python utility for indexing file lines. Best demo honourable mention at ECIR 2024. ☆23 Updated last year
- Semantically Structured Sentence Embeddings ☆66 Updated 9 months ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆147 Updated 2 months ago
- Multilingual Entity Linking model by BELA model ☆12 Updated 2 years ago
- ☆79 Updated 2 months ago
- ☆27 Updated 5 months ago
- Library for fast text representation and classification. ☆31 Updated last year
- Documentation effort for the BookCorpus dataset ☆34 Updated 4 years ago
- Python library to use Pleias-RAG models ☆61 Updated 3 months ago
- GLADIS: A General and Large Acronym Disambiguation Benchmark (EACL 23) ☆17 Updated last year
- Libraries, Archives and Museums (LAM) ☆85 Updated 2 years ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆82 Updated 11 months ago
- ☆22 Updated 6 months ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆21 Updated last month
- ParaNames: A multilingual resource for parallel names ☆34 Updated last year
- Augmenty is an augmentation library based on spaCy for augmenting texts. ☆156 Updated last year
- A BERT-based application for reusable text classification at scale ☆38 Updated 2 years ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆63 Updated 2 months ago
- Repository for the paper "MultiNERD: A Multilingual, Multi-Genre and Fine-Grained Dataset for Named Entity Recognition (and Disambiguatio… ☆44 Updated last year