alexandres / terashuf
terashuf shuffles multi-terabyte text files using limited memory
☆229Feb 5, 2023Updated 3 years ago
Alternatives and similar repositories for terashuf
Users who are interested in terashuf are comparing it to the libraries listed below.
- ☆13Aug 23, 2024Updated last year
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- A library for data streaming and augmentation☆21May 5, 2025Updated 9 months ago
- A tool that locates, downloads, and extracts machine translation corpora☆162Sep 18, 2025Updated 4 months ago
- Pipeline for pulling and processing online language model pretraining data from the web☆176Jul 31, 2023Updated 2 years ago
- ↔️ T5 Machine Translation from English to Korean☆18Aug 11, 2022Updated 3 years ago
- Parallelize your computations in parallel-apply fashion.☆33Jul 19, 2019Updated 6 years ago
- Bicleaner fork that uses neural networks☆40Jan 29, 2026Updated 2 weeks ago
- Efficient Low-Memory Aligner☆146Jan 15, 2025Updated last year
- A utility for storing and reading files for Korean LM training 💾☆35Oct 15, 2025Updated 4 months ago
- Efficient teacher-student models and scripts to make them☆54Dec 16, 2023Updated 2 years ago
- Korean Named Entity Corpus☆25May 12, 2023Updated 2 years ago
- Code and data for the paper "Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?"☆26Jun 3, 2025Updated 8 months ago
- Twitter bot for generating photo descriptions (alt text)☆23Jul 1, 2021Updated 4 years ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆160Jun 18, 2024Updated last year
- AI model designed to test the effectiveness in handling external ethical attacks.☆11Feb 9, 2026Updated last week
- 0-Shot Tokenizer Transplant☆14May 16, 2025Updated 9 months ago
- Paper Review about Speech Recognition · NLP☆10Mar 25, 2021Updated 4 years ago
- This is a project for Korean auto spacing☆12Aug 3, 2020Updated 5 years ago
- Optimized inference with Ascend and Hugging Face☆12Apr 23, 2024Updated last year
- Line shuffler for huge text file which does not fit in memory☆13Dec 1, 2022Updated 3 years ago
- ☆94Feb 13, 2024Updated 2 years ago
- ☆21Feb 13, 2023Updated 3 years ago
- All-in-one text de-duplication☆742Jan 2, 2026Updated last month
- Tools to download and cleanup Common Crawl data☆1,039Apr 25, 2023Updated 2 years ago
- A native-Chinese, level-graded benchmark for testing coding ability☆15Apr 11, 2024Updated last year
- ☆11Oct 3, 2021Updated 4 years ago
- All notebooks for FastAI learning purposes.☆15Jun 11, 2019Updated 6 years ago
- Deploy KoGPT with Triton Inference Server☆14Nov 18, 2022Updated 3 years ago
- End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra.☆10Jan 21, 2022Updated 4 years ago
- Tool to fix bitexts and tag near-duplicates for removal☆34Sep 4, 2025Updated 5 months ago
- Accepted to ICLR 2025. MetaMetrics is a calibrated meta-metric designed to evaluate generation tasks across different modalities aligned …☆14Dec 30, 2024Updated last year
- baikal.ai's pre-trained BERT models: descriptions and sample codes☆12Jun 24, 2021Updated 4 years ago
- Easy to use util for profiling in production☆11Aug 3, 2023Updated 2 years ago
- Efficient Sentence Embedding via Semantic Subspace Analysis☆14Feb 25, 2020Updated 5 years ago
- ☆14May 15, 2020Updated 5 years ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.☆57Feb 3, 2026Updated last week
- OSLO: Open Source framework for Large-scale model Optimization☆309Aug 25, 2022Updated 3 years ago
- A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te…☆295Updated this week