alexandres / terashuf
terashuf shuffles multi-terabyte text files using limited memory
☆218Updated 2 years ago
Alternatives and similar repositories for terashuf:
Users who are interested in terashuf are comparing it to the libraries listed below.
- Fast BPE☆670Updated 10 months ago
- Efficient, check-pointed data loading for deep learning with massive data sets.☆206Updated last year
- Embedding Quantization (Compress Word Embeddings)☆86Updated 5 years ago
- Byte Pair Encoding for Python!☆228Updated 2 years ago
- LASER multilingual sentence embeddings as a pip package☆223Updated last year
- PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset☆123Updated 5 years ago
- Python port of Moses tokenizer, truecaser and normalizer☆495Updated 11 months ago
- A tool that locates, downloads, and extracts machine translation corpora☆154Updated 3 weeks ago
- A tool for holistic analysis of language generation systems☆468Updated 3 years ago
- Corpus preprocessing☆96Updated last year
- Implementation of a linear-chain CRF in PyTorch☆96Updated 4 years ago
- Scripts and configuration files for the Edinburgh neural MT submission to the WMT 16 shared translation task☆138Updated 4 years ago
- ☆322Updated 2 years ago
- Super fast C++ implementation of longest common subsequence/substring☆67Updated last year
- TensorFlow code and pre-trained models for BERT☆114Updated 5 years ago
- Python library & examples for Masked Language Model Scoring (ACL 2020)☆342Updated 2 years ago
- Convert word2vec vectors between binary and plain text format☆136Updated 5 years ago
- A masked language modeling objective to train a model to predict any subset of the target words, conditioned on both the input text and a…☆242Updated 3 years ago
- Implementation of unsupervised smoothed inverse frequency (Best Paper, Repl4NLP @ ACL 2018)☆77Updated 6 years ago
- This is a repository with the data and code for the ACL 2019 paper "When a Good Translation is Wrong in Context: ..." and the EMNLP 2019 …☆97Updated 4 years ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆157Updated 10 months ago
- This project attempts to maintain the SOTA performance in machine translation☆108Updated 4 years ago
- eXtensible Neural Machine Translation☆186Updated 5 years ago
- ☆42Updated 6 years ago
- Neural machine translation soft alignment visualisations for web and command line☆72Updated 3 years ago
- Fast Neural Machine Translation in C++ - development repository☆272Updated 6 months ago
- LAMB Optimizer for Large Batch Training (TensorFlow version)☆120Updated 5 years ago
- ICLR 2018 Quick-Thought vectors☆204Updated 5 years ago
- Robust and Fast tokenizations alignment library for Rust and Python https://tamuhey.github.io/tokenizations/☆192Updated last year
- Neural Text Generation with Unlikelihood Training☆309Updated 3 years ago