cisnlp / GlotLID
Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
★177 Updated last month
Alternatives and similar repositories for GlotLID
Users that are interested in GlotLID are comparing it to the libraries listed below
- ★119 Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ★106 Updated last year
- Datasets collection and preprocessings framework for NLP extreme multitask learning ★189 Updated 5 months ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ★213 Updated 3 months ago
- The pipeline for the OSCAR corpus ★174 Updated last month
- A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te… ★287 Updated 2 months ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★74 Updated 8 months ago
- NTREX -- News Test References for MT Evaluation ★86 Updated last year
- The FLORES+ Machine Translation Benchmark ★109 Updated last year
- Tools for managing datasets for governance and training. ★87 Updated 2 weeks ago
- ★56 Updated 11 months ago
- Pipeline for pulling and processing online language model pretraining data from the web ★179 Updated 2 years ago
- Official implementation of the paper "CoEdIT: Text Editing by Task-Specific Instruction Tuning" (EMNLP 2023) ★134 Updated last year
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ★64 Updated last year