kpu/preprocess

Corpus preprocessing

☆100

Alternatives and similar repositories for preprocess

Users who are interested in preprocess are comparing it to the libraries listed below.

sortiz / tmxt
Transform TMX to text
☆27 · Nov 23, 2022 · Updated 3 years ago
browsermt / students
Efficient teacher-student models and scripts to make them
☆57 · Dec 16, 2023 · Updated 2 years ago
fyvo / WMT-Biomed-Test
☆13 · Aug 23, 2024 · Updated last year
paracrawl / extractor
☆24 · Nov 29, 2017 · Updated 8 years ago
ucam-smt / ucam-smt
Cambridge SMT System
☆18 · Aug 1, 2017 · Updated 8 years ago
sleepinyourhat / quora-duplicate-questions-util
Converts Quora's new NLU dataset to SNLI txt/jsonl format, plus test/dev split, tokenization.
☆14 · Jan 27, 2017 · Updated 9 years ago
mjpost / bin
bin files
☆13 · Jan 30, 2025 · Updated last year
smatthewenglish / trst
☆12 · Jan 15, 2015 · Updated 11 years ago
mahfuzibnalam / terminology_evaluation
☆21 · May 30, 2022 · Updated 4 years ago
George0828Zhang / torch_cif
A fast parallel PyTorch implementation of the "CIF: Continuous Integrate-and-Fire for End-to-End Speech Recognition" https://arxiv.org/ab…
☆37 · Feb 10, 2024 · Updated 2 years ago
cbaziotis / lm-prior-for-nmt
This repository contains source code for the paper "Language Model Prior for Low-Resource Neural Machine Translation"
☆43 · Mar 16, 2021 · Updated 5 years ago
voidful / MMLM
Toward Multi Modality Language Model - implementation of GPT-4o/Project Astra
☆16 · Dec 10, 2024 · Updated last year
paracrawl / keops
Tool for manual evaluation of parallel sentences.
☆15 · Jan 26, 2026 · Updated 5 months ago
bitextor / bitextor
Bitextor generates translation memories from multilingual websites
☆299 · Nov 11, 2024 · Updated last year
alvations / myth
Myanmar and Thai Language Resources
☆10 · Jul 18, 2022 · Updated 4 years ago
suhaibani / JointReps
Learning word representation jointly using a corpus and a knowledge base (KB)
☆19 · Oct 19, 2018 · Updated 7 years ago
neubig / rapid-adaptation
Reproduction instructions for "Rapid Adaptation of Neural Machine Translation to New Languages"
☆39 · Aug 7, 2018 · Updated 7 years ago
transducens / linguacrawl
Crawling engine that crawls a set of top-level domains looking for documents in a list of languages
☆11 · Feb 6, 2024 · Updated 2 years ago
coastalcph / supersense-data-twitter
Tweets annotated with coarse-grained sense labels (supersenses)
☆13 · Jun 13, 2014 · Updated 12 years ago
lilt / alignment-scripts
Scripts to preprocess training and test data and to run fast_align and giza
☆107 · Nov 2, 2021 · Updated 4 years ago
acocos / cluster_paraphrases
Cluster paraphrases by word sense
☆12 · Jan 3, 2019 · Updated 7 years ago
pmichel31415 / mtnt
Code for the collection and analysis of the MTNT dataset
☆56 · Apr 2, 2019 · Updated 7 years ago
mingruimingrui / fast-mosestokenizer
c++ mosestokenizer
☆18 · Mar 13, 2024 · Updated 2 years ago
roy-ht / pyter
☆27 · Jan 7, 2017 · Updated 9 years ago
shamilcm / pedra
Post-editing Datasets by Rakuten (PEDRa)
☆14 · Jun 23, 2021 · Updated 5 years ago
marian-nmt / marian-examples
Examples, tutorials and use cases for Marian, including our WMT-2017/18 baselines.
☆81 · Apr 8, 2023 · Updated 3 years ago
XapaJIaMnu / gLM
A GPU language model, based on btree backed tries.
☆30 · Mar 6, 2018 · Updated 8 years ago
idiap / phonvoc
Phonetic and phonological vocoding platform
☆17 · Nov 23, 2016 · Updated 9 years ago
thammegowda / mtdata
A tool that locates, downloads, and extracts machine translation corpora
☆165 · Apr 13, 2026 · Updated 3 months ago
jtkim-kaist / end-point-detection
☆10 · Sep 19, 2018 · Updated 7 years ago
Avmb / clweadv
Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders
☆22 · Aug 11, 2016 · Updated 9 years ago
Roxot / mbr-nmt
Sampling-Based Minimum Bayes-Risk Decoding for Neural Machine Translation
☆16 · Oct 14, 2022 · Updated 3 years ago
AppraiseDev / Appraise
Appraise code used as part of WMT21 human evaluation campaign
☆30 · Updated this week
zomux / tree2code
tree2code: Learning Discrete Syntactic Codes for Structural Diverse Translation
☆26 · Dec 27, 2019 · Updated 6 years ago
waylonflinn / bvec
Fast Vector Operations on Pretty Big Data
☆13 · Nov 17, 2015 · Updated 10 years ago
cnap / smt-for-gec
☆12 · Sep 8, 2017 · Updated 8 years ago
bpopeters / mg2p
Multilingual grapheme-to-phoneme conversion
☆20 · Feb 23, 2018 · Updated 8 years ago
robin1001 / kaldi-aslp
☆43 · Jun 25, 2018 · Updated 8 years ago
cltl / svm_wsd
Word Sense Disambiguation system developed on the DutchSemCor project using Support Vector Machines. The input is plain text, and the out…
☆12 · Feb 5, 2019 · Updated 7 years ago