alexandres/terashuf

terashuf shuffles multi-terabyte text files using limited memory

☆232

Alternatives and similar repositories for terashuf

Users who are interested in terashuf compare it to the repositories listed below.

fyvo / WMT-Biomed-Test
☆13 · Aug 23, 2024 · Updated last year
microsoft / factored-segmenter
Unsupervised factor-based text tokenizer for natural-language processing applications
☆17 · Jul 24, 2020 · Updated 5 years ago
trufanov-nok / shuf-t
This application shuffles the lines of an input file, optionally skipping the header. It's optimized for files bigger than available RAM.
☆25 · Jan 9, 2017 · Updated 9 years ago
browsermt / students
Efficient teacher-student models and scripts to make them
☆57 · Dec 16, 2023 · Updated 2 years ago
kpu / fasterText
Library for fast text representation and classification.
☆31 · Jan 9, 2024 · Updated 2 years ago
bitextor / bicleaner-ai
Bicleaner fork that uses neural networks
☆40 · Feb 23, 2026 · Updated 4 months ago
ctylim / rhuffle
Line shuffler for huge text files that do not fit in memory
☆13 · Dec 1, 2022 · Updated 3 years ago
robertostling / eflomal
Efficient Low-Memory Aligner
☆148 · Jan 15, 2025 · Updated last year
Mihir3009 / In-BoXBART
In-BoXBART: Get Instructions into Biomedical Multi-task Learning
☆15 · Aug 23, 2022 · Updated 3 years ago
BrightXiaoHan / optimum-ascend
Optimized inference with Ascend and Hugging Face
☆12 · Apr 23, 2024 · Updated 2 years ago
pedrada88 / crossembeddings-twitter
☆14 · May 15, 2020 · Updated 6 years ago
lRomul / gramtion
Twitter bot for generating photo descriptions (alt text)
☆23 · Jul 1, 2021 · Updated 5 years ago
VKCOM / YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
☆979 · Mar 29, 2024 · Updated 2 years ago
marian-nmt / sotastream
A library for data streaming and augmentation
☆22 · May 5, 2025 · Updated last year
huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆179 · Jul 31, 2023 · Updated 2 years ago
dcjones / subsample
Randomly sample lines from massive text files efficiently
☆16 · Apr 1, 2015 · Updated 11 years ago
EdinburghNLP / opus-100-corpus
☆93 · Feb 13, 2024 · Updated 2 years ago
monologg / ko_lm_dataformat
A utility for storing and reading files for Korean LM training 💾
☆35 · Updated this week
wmt-conference / wmt22-news-systems
☆21 · Feb 13, 2023 · Updated 3 years ago
VladimirGl / eastern-front-dataset
This repo contains details about the USSR Eastern Front WWII veterans dataset extracted from the Pamyat Naroda website
☆13 · May 8, 2020 · Updated 6 years ago
bitextor / bifixer
Tool to fix bitexts and tag near-duplicates for removal
☆35 · Sep 4, 2025 · Updated 10 months ago
Unbabel / MT-Telescope
☆33 · Nov 22, 2021 · Updated 4 years ago
microsoft / fastformers
FastFormers - highly efficient transformer models for NLU
☆706 · Mar 21, 2025 · Updated last year
BinWang28 / Sentence-Embedding-S3E
Efficient Sentence Embedding via Semantic Subspace Analysis
☆14 · Feb 25, 2020 · Updated 6 years ago
facebookresearch / cc_net
Tools to download and clean up Common Crawl data
☆1,046 · Apr 25, 2023 · Updated 3 years ago
korean-named-entity / konec
Korean Named Entity Corpus
☆25 · May 12, 2023 · Updated 3 years ago
MicrosoftTranslator / MSLT-Corpus
Microsoft Speech Language Translation (MSLT) Corpus
☆19 · Sep 18, 2017 · Updated 8 years ago
ppleskov / Russian-Language-Model
☆56 · May 12, 2018 · Updated 8 years ago
softwaremill / detectnet-tests
Python scripts and other resources for testing DetectNet on Nvidia DIGITS
☆14 · Oct 10, 2017 · Updated 8 years ago
ChenghaoMou / text-dedup
All-in-one text de-duplication
☆764 · Mar 9, 2026 · Updated 4 months ago
QuoQA-NLP / T5_Translation
↔️ T5 Machine Translation from English to Korean
☆18 · Aug 11, 2022 · Updated 3 years ago
bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
jungokasai / deep-shallow
☆43 · Sep 16, 2020 · Updated 5 years ago
Kaleidophon / awesome-experimental-standards-deep-learning
Repository collecting resources and best practices to improve experimental rigour in deep learning research.
☆27 · Mar 30, 2023 · Updated 3 years ago
roeeaharoni / unsupervised-domain-clusters
Code and data accompanying our ACL 2020 paper, "Unsupervised Domain Clusters in Pretrained Language Models".
☆58 · Aug 22, 2020 · Updated 5 years ago
songys / 2021Langcon
☆11 · Oct 3, 2021 · Updated 4 years ago
andrey-avdeev / telemetry
Easy-to-use util for profiling in production
☆11 · Aug 3, 2023 · Updated 2 years ago
facebookresearch / stopes
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te…
☆309 · Updated this week
UKPLab / on-emergence
Code and files for the paper "Are Emergent Abilities in Large Language Models just In-Context Learning"
☆33 · Jan 9, 2025 · Updated last year