bitextor/bifixer

Tool to fix bitexts and tag near-duplicates for removal

☆35

Alternatives and similar repositories for bifixer

Users who are interested in bifixer are comparing it to the repositories listed below.

paracrawl / keops
Tool for manual evaluation of parallel sentences.
☆15 · Jan 26, 2026 · Updated 5 months ago
bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
sortiz / tmxt
Transform TMX to text
☆27 · Nov 23, 2022 · Updated 3 years ago
mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆38 · Sep 4, 2025 · Updated 10 months ago
loomchild / segment
Program used to split text into segments
☆28 · Oct 27, 2024 · Updated last year
Helsinki-NLP / OpusFilter
OpusFilter - Parallel corpus processing toolkit
☆115 · Jul 1, 2026 · Updated 2 weeks ago
paracrawl / corset
Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data.
☆21 · Nov 6, 2023 · Updated 2 years ago
langtech-bsc / mt-evaluation
A framework for evaluating Machine Translation models.
☆13 · Apr 21, 2026 · Updated 2 months ago
salesforce / localization-xml-mt
A High-Quality Multilingual Dataset for Structured Documentation Translation
☆39 · May 1, 2025 · Updated last year
MaxyLee / 3AM
Official code and data of "3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset"
☆12 · Dec 8, 2024 · Updated last year
cjbayron / artist2lyrics
Lyrics crawling, pre-processing, embedding generation, model training, and lyrics generation - all in one tool
☆14 · Nov 4, 2018 · Updated 7 years ago
robertostling / eflomal
Efficient Low-Memory Aligner
☆148 · Jan 15, 2025 · Updated last year
microsoft / factored-segmenter
Unsupervised factor-based text tokenizer for natural-language processing applications
☆17 · Jul 24, 2020 · Updated 5 years ago
AI4Bharat / webcorpus
Generate large textual corpora for almost any language by crawling the web
☆13 · Feb 17, 2024 · Updated 2 years ago
browsermt / marian-dev
Fast Neural Machine Translation in C++ - development repository
☆23 · May 12, 2024 · Updated 2 years ago
microsoft / Lightweight-Low-Resource-NMT
Official code for "Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Reso…
☆18 · Oct 9, 2025 · Updated 9 months ago
Helsinki-NLP / OpusTools
☆83 · Jun 24, 2026 · Updated 3 weeks ago
simon-ging / fasttext-numpy2
Library for fast text representation and classification. Fix compatibility with numpy 2
☆15 · Nov 21, 2024 · Updated last year
deep-spin / qaware-decode
A repository for experiments in quality-aware decoding
☆18 · Jun 7, 2022 · Updated 4 years ago
cisnlp / parcoure
ParCourE - Parallel Corpus Explorer
☆12 · Dec 27, 2021 · Updated 4 years ago
masakhane-io / masakhane-reading-group
Agile reading group that works
☆13 · Feb 2, 2022 · Updated 4 years ago
microsoft / MMLMCalibration
Code for EMNLP 2022 Paper: On the Calibration of Massively Multilingual Language Models
☆15 · Jun 12, 2023 · Updated 3 years ago
marian-nmt / sotastream
A library for data streaming and augmentation
☆22 · May 5, 2025 · Updated last year
bitextor / warc2text
Extracts plain text, language identification and more metadata from WARC records
☆23 · Apr 16, 2026 · Updated 3 months ago
hipster-philology / nlp-pie-taggers
Extension for pie to include taggers with their models and pre/postprocessors
☆11 · Jun 23, 2026 · Updated 3 weeks ago
prajdabre / yanmtt
Yet Another Neural Machine Translation Toolkit
☆178 · Mar 7, 2025 · Updated last year
Unbabel / smaug
Python package to augment multilingual data
☆15 · Feb 15, 2023 · Updated 3 years ago
amazon-science / contrastive-controlled-mt
Code and data for the IWSLT 2022 shared task on Formality Control for SLT
☆22 · May 24, 2023 · Updated 3 years ago
M4t1ss / parallel-corpora-tools
Tools for filtering and cleaning parallel and monolingual corpora for machine translation and other natural language processing tasks.
☆42 · Dec 19, 2023 · Updated 2 years ago
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆29 · Apr 17, 2024 · Updated 2 years ago
Helsinki-NLP / mammoth
MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki
☆32 · Jul 13, 2026 · Updated last week
bitextor / bitextor
Bitextor generates translation memories from multilingual websites
☆299 · Nov 11, 2024 · Updated last year
wei-peng-47 / proxy-filter
☆14 · Dec 7, 2020 · Updated 5 years ago
Unbabel / OpenKiwi
Open-Source Machine Translation Quality Estimation in PyTorch
☆233 · Jun 23, 2022 · Updated 4 years ago
facebookresearch / mlqe
We release a dataset based on Wikipedia sentences and the corresponding translations in 6 different languages along with the scores (scal…
☆81 · Aug 31, 2021 · Updated 4 years ago
Unbabel / BConTrasT
☆20 · Aug 17, 2021 · Updated 4 years ago
tag-and-generate / politeness-dataset
Dataset for the politeness transfer task
☆38 · Apr 5, 2021 · Updated 5 years ago
browsermt / students
Efficient teacher-student models and scripts to make them
☆57 · Dec 16, 2023 · Updated 2 years ago
MartinThoma / lidtk
Language Identification Toolkit
☆18 · Aug 25, 2021 · Updated 4 years ago