alexandres / terashuf
terashuf shuffles multi-terabyte text files using limited memory
☆229 Feb 5, 2023 Updated 3 years ago
Alternatives and similar repositories for terashuf
Users who are interested in terashuf are comparing it to the libraries listed below.
- ☆13 Aug 23, 2024 Updated last year
- Library for fast text representation and classification. ☆31 Jan 9, 2024 Updated 2 years ago
- Unsupervised factor-based text tokenizer for natural-language processing applications ☆17 Jul 24, 2020 Updated 5 years ago
- A library for data streaming and augmentation ☆21 May 5, 2025 Updated 9 months ago
- A tool that locates, downloads, and extracts machine translation corpora ☆162 Sep 18, 2025 Updated 4 months ago
- Pipeline for pulling and processing online language model pretraining data from the web ☆176 Jul 31, 2023 Updated 2 years ago
- ↔️ T5 Machine Translation from English to Korean ☆18 Aug 11, 2022 Updated 3 years ago
- Bicleaner fork that uses neural networks ☆40 Jan 29, 2026 Updated 2 weeks ago
- Efficient Low-Memory Aligner ☆146 Jan 15, 2025 Updated last year
- Unsupervised text tokenizer focused on computational efficiency ☆977 Mar 29, 2024 Updated last year
- A utility for storing and reading files for Korean LM training 💾 ☆35 Oct 15, 2025 Updated 4 months ago
- Efficient teacher-student models and scripts to make them ☆54 Dec 16, 2023 Updated 2 years ago
- Korean Named Entity Corpus ☆25 May 12, 2023 Updated 2 years ago
- Twitter bot for generating photo descriptions (alt text) ☆23 Jul 1, 2021 Updated 4 years ago
- Code and data for the paper "Turning English-centric LLMs Into Polyglots: How Much Multilinguality Is Needed?" ☆26 Jun 3, 2025 Updated 8 months ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ☆160 Jun 18, 2024 Updated last year
- Optimized inference with Ascend and Hugging Face ☆12 Apr 23, 2024 Updated last year
- Paper Review about Speech Recognition · NLP ☆10 Mar 25, 2021 Updated 4 years ago
- BFloat16 Fused Adam Operator for PyTorch ☆16 Nov 16, 2024 Updated last year
- A project for Korean automatic spacing ☆12 Aug 3, 2020 Updated 5 years ago
- AI model designed to test the effectiveness in handling external ethical attacks. ☆11 Feb 9, 2026 Updated last week
- 0-Shot Tokenizer Transplant ☆14 May 16, 2025 Updated 9 months ago
- ☆94 Feb 13, 2024 Updated 2 years ago
- ☆21 Feb 13, 2023 Updated 3 years ago
- All-in-one text de-duplication ☆742 Jan 2, 2026 Updated last month
- Tools to download and cleanup Common Crawl data ☆1,039 Apr 25, 2023 Updated 2 years ago
- End-to-End Korean Automatic Speech Recognition leveraging PyTorch and Hydra. ☆10 Jan 21, 2022 Updated 4 years ago
- ☆11 Oct 3, 2021 Updated 4 years ago
- Deploy KoGPT with Triton Inference Server ☆14 Nov 18, 2022 Updated 3 years ago
- A natively Chinese, graded benchmark for coding ability ☆15 Apr 11, 2024 Updated last year
- All notebooks for FastAI learning purposes.