hplt-project/OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

☆58

Alternatives and similar repositories for OpusCleaner

Users who are interested in OpusCleaner are comparing it to the libraries listed below.

hplt-project / OpusTrainer
Curriculum training
☆22 · Jun 25, 2025 · Updated last year
mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆38 · Sep 4, 2025 · Updated 10 months ago
fyvo / WMT-Biomed-Test
☆13 · Aug 23, 2024 · Updated last year
bitextor / bicleaner-ai
Bicleaner fork that uses neural networks
☆40 · Feb 23, 2026 · Updated 5 months ago
mozilla / translation-service
This is the repo that hosts the code for Mozilla's translation service
☆32 · Feb 12, 2024 · Updated 2 years ago
ymoslem / MT-Tools
Collection of Common Machine Translation Tools
☆11 · Jul 26, 2022 · Updated 3 years ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
laurieburchell / open-lid-dataset
Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)
☆77 · Apr 1, 2025 · Updated last year
hplt-project / data-analytics-tool
HPLT Analytics
☆15 · Updated this week
transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆11 · Feb 6, 2024 · Updated 2 years ago
juletx / self-translate
Do Multilingual Language Models Think Better in English?
☆42 · Aug 3, 2023 · Updated 2 years ago
google-research / mt-metrics-eval
Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks.
☆132 · Apr 23, 2026 · Updated 3 months ago
cisnlp / Glot500
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
☆107 · Apr 14, 2026 · Updated 3 months ago
cisnlp / simalign
[EMNLP 2020] Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
☆398 · Nov 7, 2023 · Updated 2 years ago
AppraiseDev / Appraise
Appraise code used as part of WMT21 human evaluation campaign
☆30 · Jul 15, 2026 · Updated last week
mprompting / xlmrprompt
☆11 · Jun 23, 2022 · Updated 4 years ago
LuisaMaerz / KnowMAN
KnowMAN: Weakly Supervised Multinomial Adversarial Networks
☆12 · Nov 9, 2021 · Updated 4 years ago
wmt-conference / wmt22-news-systems
☆21 · Feb 13, 2023 · Updated 3 years ago
paracrawl / keops
Tool for manual evaluation of parallel sentences.
☆15 · Jan 26, 2026 · Updated 5 months ago
Helsinki-NLP / OpusTools
☆83 · Jun 24, 2026 · Updated last month
ottowg / gsap-ner
☆10 · Oct 2, 2024 · Updated last year
kpu / fasterText
Library for fast text representation and classification.
☆31 · Jan 9, 2024 · Updated 2 years ago
mt-upc / ZeroSwot
Pushing the Limits of Zero-shot End-to-End Speech Translation
☆25 · Dec 12, 2024 · Updated last year
alexrs / herd
Mixture of Experts (MoE) techniques for enhancing LLM performance through expert-driven prompt mapping and adapter combinations.
☆11 · Feb 11, 2024 · Updated 2 years ago
seznam / vertical-search-blending-dataset
☆13 · Sep 18, 2019 · Updated 6 years ago
UniversalDependencies / UD_Polish-PDB
Polish data.
☆13 · May 6, 2026 · Updated 2 months ago
ahmetustun / hyperx
☆21 · Dec 5, 2022 · Updated 3 years ago
UBC-NLP / afrolid
AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages.
☆39 · Feb 5, 2026 · Updated 5 months ago
bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
browsermt / bergamot-translator
Cross-platform C++ library focusing on optimized machine translation on consumer-grade devices.
☆530 · May 12, 2024 · Updated 2 years ago
lorelupo / divide-and-rule
☆12 · Oct 17, 2022 · Updated 3 years ago
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 8 months ago
fbarez / neuroplasticity
☆14 · Mar 31, 2024 · Updated 2 years ago
swiss-ai / parity-aware-bpe
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [ACL 2026]
☆20 · Apr 18, 2026 · Updated 3 months ago
mgaido91 / FBK-fairseq-ST
A repository containing the code for speech translation papers.
☆21 · Mar 11, 2022 · Updated 4 years ago
philschmid / fine-tune-GPT-2
☆21 · Feb 3, 2021 · Updated 5 years ago
cisnlp / GlotCC
[NeurIPS 2024] 🕸 GlotCC Dataset and Pipeline
☆21 · Apr 6, 2025 · Updated last year
cindyxinyiwang / multiview-subword-regularization
PyTorch implementation of NAACL 2021 paper "Multi-view Subword Regularization"
☆26 · Jun 2, 2021 · Updated 5 years ago
mozilla / translations
The code, training pipeline, and models that power Firefox Translations
☆324 · Updated this week