facebookresearch / stopes
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB team.
☆251 Updated last month
Related projects
Alternatives and complementary repositories for stopes
- A tool that locates, downloads, and extracts machine translation corpora ☆147 Updated 5 months ago
- NTREX -- News Test References for MT Evaluation ☆75 Updated 5 months ago
- A neural word aligner based on multilingual BERT ☆328 Updated 2 years ago
- ☆193 Updated this week
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ☆150 Updated 5 months ago
- Official Implementation of "DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization." ☆137 Updated 2 years ago
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆90 Updated last month
- The FLORES+ Machine Translation Benchmark ☆99 Updated last week
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆96 Updated 7 months ago
- OpusFilter - Parallel corpus processing toolkit ☆102 Updated 3 months ago
- cLang-8 is a dataset for grammatical error correction. ☆103 Updated 2 years ago
- Obtain Word Alignments using Pretrained Language Models (e.g., mBERT) ☆351 Updated last year
- Bicleaner fork that uses neural networks ☆38 Updated 3 months ago
- SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. ☆338 Updated 3 weeks ago
- Improved Sentence Alignment in Linear Time and Space ☆163 Updated last year
- Universal Romanizer that can convert any unicode script to roman (latin) script ☆154 Updated 3 months ago
- Reduce the size of pretrained Hugging Face models via vocabulary trimming. ☆43 Updated last year
- [EMNLP 2021] LM-Critic: Language Models for Unsupervised Grammatical Error Correction ☆119 Updated 3 years ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆48 Updated 2 months ago
- The pipeline for the OSCAR corpus ☆162 Updated 11 months ago
- Transformer based translation quality estimation ☆107 Updated last year
- Repository to collect and categorize Grammatical Error Correction papers. ☆114 Updated last month
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆90 Updated 3 weeks ago
- ☆221 Updated 5 months ago
- A Neural Framework for MT Evaluation ☆508 Updated 3 months ago
- Efficient Low-Memory Aligner ☆139 Updated 2 months ago
- This dataset contains synthetic training data for grammatical error correction. The corpus is generated by corrupting clean sentences fro… ☆157 Updated last month
- Repository for SLURP paper ☆97 Updated 2 years ago
- ☆179 Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆70 Updated 8 months ago