OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
☆57 · Feb 3, 2026 · Updated last month
Alternatives and similar repositories for OpusCleaner
Users who are interested in OpusCleaner are comparing it to the libraries listed below:
- Targeted language identifier, based on FastText and Hunspell. ☆38 · Sep 4, 2025 · Updated 6 months ago
- GlotWeb: Web Indexing for Minority Languages (WWW 2026) ☆17 · Updated this week
- Bicleaner fork that uses neural networks ☆40 · Feb 23, 2026 · Updated last week
- A library for data streaming and augmentation ☆21 · May 5, 2025 · Updated 9 months ago
- The implementation of "Mitigating Hallucinations and Off-target Machine Translation with Source-Contrastive and Language-Contrastive Deco… ☆36 · Aug 29, 2025 · Updated 6 months ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆74 · Apr 1, 2025 · Updated 11 months ago
- A tool that locates, downloads, and extracts machine translation corpora ☆162 · Sep 18, 2025 · Updated 5 months ago
- Do Multilingual Language Models Think Better in English? ☆42 · Aug 3, 2023 · Updated 2 years ago
- ☆10 · Oct 2, 2024 · Updated last year
- Efficient teacher-student models and scripts to make them ☆54 · Dec 16, 2023 · Updated 2 years ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆106 · Apr 20, 2024 · Updated last year
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks ☆12 · Nov 9, 2021 · Updated 4 years ago
- ☆10 · Sep 13, 2022 · Updated 3 years ago
- ☆21 · Feb 13, 2023 · Updated 3 years ago
- Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ☆11 · Apr 6, 2025 · Updated 10 months ago
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆126 · Oct 13, 2025 · Updated 4 months ago
- Bitextor generates translation memories from multilingual websites ☆301 · Nov 11, 2024 · Updated last year
- OpusFilter - Parallel corpus processing toolkit ☆115 · Feb 11, 2026 · Updated 3 weeks ago
- ☆11 · Jun 23, 2022 · Updated 3 years ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ☆18 · Nov 4, 2025 · Updated 4 months ago
- A free, fast and accurate routine to compute the position of the Sun ☆20 · Feb 13, 2024 · Updated 2 years ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆35 · Feb 5, 2026 · Updated 3 weeks ago
- Obtain Word Alignments using Pretrained Language Models (e.g., mBERT) ☆389 · Nov 7, 2023 · Updated 2 years ago
- Fast Neural Machine Translation in C++ - development repository ☆22 · May 12, 2024 · Updated last year
- GlotCC Dataset and Pipeline -- NeurIPS 2024 ☆20 · Apr 6, 2025 · Updated 10 months ago
- Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024 ☆21 · Feb 17, 2026 · Updated 2 weeks ago
- C++ mosestokenizer ☆18 · Mar 13, 2024 · Updated last year
- Source code for the ACL-IJCNLP 2021 paper entitled "T-DNA: Taming Pre-trained Language Models with N-gram Representations for Low-Resourc… ☆19 · Jan 12, 2023 · Updated 3 years ago
- ☆82 · Jan 30, 2026 · Updated last month
- NTREX -- News Test References for MT Evaluation ☆88 · Jun 5, 2024 · Updated last year
- Data Collection System For NLP/Speech Recognition ☆25 · Apr 20, 2021 · Updated 4 years ago
- ☆44 · Feb 11, 2026 · Updated 3 weeks ago
- A repository containing the code for speech translation papers. ☆21 · Mar 11, 2022 · Updated 3 years ago
- Implementation of "SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages" paper, accepted to E… ☆28 · Feb 8, 2023 · Updated 3 years ago
- PyTorch implementation of NAACL 2021 paper "Multi-view Subword Regularization" ☆26 · Jun 2, 2021 · Updated 4 years ago
- Pushing the Limits of Zero-shot End-to-End Speech Translation ☆26 · Dec 12, 2024 · Updated last year
- ☆23 · Nov 15, 2022 · Updated 3 years ago
- Climate Crisis is a variable font designed to help visualise the urgency of climate change. ☆29 · Feb 20, 2026 · Updated last week
- MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki ☆30 · Feb 25, 2026 · Updated last week