Pleias / toxic-commons
The official repository for Toxic Commons and Celadon. Toxicity Classification for public domain data.
☆22 Updated last year
Alternatives and similar repositories for toxic-commons
Users who are interested in toxic-commons are comparing it to the repositories listed below
- Small Python package to measure OCR quality and other related metrics. ☆25 Updated last year
- Code for SaGe subword tokenizer (EACL 2023) ☆27 Updated last year
- ☆67 Updated last year
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆111 Updated last year
- Efficient few-shot learning with cross-encoders. ☆60 Updated last year
- One-stop shop for running and fine-tuning transformer-based language models for retrieval ☆60 Updated 2 weeks ago
- ☆27 Updated 9 months ago
- Libraries, Archives and Museums (LAM) ☆88 Updated 3 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆22 Updated 5 months ago
- A BERT-based application for reusable text classification at scale ☆38 Updated 2 years ago
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ☆16 Updated 3 months ago
- ☆23 Updated 10 months ago
- My NER Experiments with ModernBERT and Ettin ☆25 Updated 4 months ago
- Python library to use Pleias-RAG models ☆67 Updated 6 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ☆63 Updated last year
- ParaNames: A multilingual resource for parallel names ☆37 Updated last year
- Pre-train Static Word Embeddings ☆91 Updated 2 months ago
- Truly flash implementation of DeBERTa disentangled attention mechanism. ☆67 Updated 2 months ago
- A Python package to run inference with HuggingFace language and vision-language checkpoints wrapping many convenient features. ☆28 Updated last year
- 💫 SpaCy wrapper for ConceptNet 💫 ☆95 Updated 2 years ago
- The robust European language model benchmark. ☆138 Updated this week
- 🔢 Work with static vector models ☆34 Updated 7 months ago
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets. ☆13 Updated 2 years ago
- Data for the HIPE 2022 shared task. ☆21 Updated 2 years ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. ☆85 Updated last year
- 🗺️ Data Cleaning and Textual Data Visualization 🗺️ ☆191 Updated 6 months ago
- Semantically Structured Sentence Embeddings ☆69 Updated last year
- Notebooks for training universal 0-shot classifiers on many different tasks ☆137 Updated 11 months ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆69 Updated 2 years ago
- A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy. ☆108 Updated last year