Pleias / toxic-commons
The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data.
☆16Updated 6 months ago
Alternatives and similar repositories for toxic-commons
Users who are interested in toxic-commons are comparing it to the libraries listed below.
- Code for SaGe subword tokenizer (EACL 2023)☆25Updated 6 months ago
- ☆67Updated last year
- Small python package to measure OCR quality and other related metrics.☆22Updated last year
- ☆22Updated 4 months ago
- BPE modification that removes intermediate tokens during tokenizer training.☆25Updated 6 months ago
- Pre-train Static Word Embeddings☆76Updated this week
- Using short models to classify long texts☆21Updated 2 years ago
- My NER Experiments with ModernBERT☆21Updated 3 weeks ago
- Next-generation Punkt sentence boundary detection with zero dependencies☆17Updated 2 months ago
- Library for fast text representation and classification.☆28Updated last year
- Model implementation for the contextual embeddings project☆26Updated this week
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆56Updated 3 weeks ago
- A BERT-based application for reusable text classification at scale☆38Updated last year
- ☆27Updated 3 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face☆32Updated last year
- Efficient few-shot learning with cross-encoders.☆52Updated last year
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆59Updated 10 months ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial)☆17Updated last year
- Python library to use Pleias-RAG models☆53Updated last month
- Multilingual Entity Linking with the BELA model☆12Updated last year
- 🔢 Work with static vector models☆28Updated last month
- One-stop shop for running and fine-tuning transformer-based language models for retrieval☆56Updated this week
- Starbucks: Improved Training for 2D Matryoshka Embeddings☆20Updated 4 months ago
- ☆10Updated 8 months ago
- ☆12Updated 6 months ago
- An easy way to chunk spaCy docs.☆20Updated 9 months ago
- Tool to apply Legal Matter Specification Standard (LMSS) to documents☆13Updated 9 months ago
- A Python library aimed at dissecting and augmenting NER training data.☆58Updated 2 years ago
- Evaluate language models using multiple choice items☆13Updated 3 weeks ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03.☆24Updated 11 months ago