☆94 Jul 16, 2022 Updated 3 years ago
Alternatives and similar repositories for openwebtext2
Users who are interested in openwebtext2 are comparing it to the libraries listed below
- Downloads 2020 English Wikipedia articles as plaintext ☆27 Mar 25, 2023 Updated 2 years ago
- ☆78 Dec 7, 2023 Updated 2 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆47 Sep 22, 2020 Updated 5 years ago
- ☆32 May 23, 2023 Updated 2 years ago
- ☆14 Oct 4, 2024 Updated last year
- ☆1,636 Apr 27, 2023 Updated 2 years ago
- downloads and parses subtitle dataset from opensubtitles.org ☆16 Apr 19, 2024 Updated last year
- A native Chinese, level-graded benchmark for coding ability ☆15 Apr 11, 2024 Updated last year
- ☆13 Jan 20, 2023 Updated 3 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆86 Dec 6, 2023 Updated 2 years ago
- Character-level conversion between Hebrew text and Latin transliteration using deep learning - a demonstration of seq2seq training. ☆14 Jun 27, 2023 Updated 2 years ago
- Learning High-Quality and General-Purpose Phrase Representations. Findings of EACL 2024 ☆16 Feb 29, 2024 Updated 2 years ago
- ☆16 Dec 11, 2024 Updated last year
- ☆30 Jan 22, 2026 Updated last month
- A simple neural truecaser written in pytorch and allennlp. ☆33 Jun 17, 2024 Updated last year
- A python library to find differences between audio and transcriptions ☆19 Nov 14, 2023 Updated 2 years ago
- ☆14 Feb 9, 2023 Updated 3 years ago
- Starbucks: Improved Training for 2D Matryoshka Embeddings ☆22 Jun 30, 2025 Updated 8 months ago
- Code for our EMNLP 2019 paper titled "Sentence-Level Content Planning and Style Specification for Neural Text Generation" ☆17 May 4, 2020 Updated 5 years ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆204 Aug 10, 2024 Updated last year
- LLM training in simple, raw C/CUDA ☆15 Dec 5, 2024 Updated last year
- Datasets for hackernews posts ☆16 Feb 17, 2022 Updated 4 years ago
- Anonymous ICLR Submission ☆14 Sep 25, 2019 Updated 6 years ago
- ☆16 Jul 20, 2023 Updated 2 years ago
- A toolkit for researchers in the multimodal sound separation. ☆16 Oct 20, 2023 Updated 2 years ago
- Download, parse, and filter data from Phil Papers. Data-ready for The-Pile. ☆19 Aug 28, 2023 Updated 2 years ago
- This repository contains source codes for SoftCTC. Original paper can be found here: https://arxiv.org/abs/2212.02135 ☆19 Mar 7, 2023 Updated 2 years ago
- ☆23 Oct 30, 2023 Updated 2 years ago
- Common tools for data processing ☆22 Dec 8, 2025 Updated 2 months ago
- Participant Kit for the TextGraphs-15 Shared Task on Explanation Regeneration ☆19 Nov 8, 2021 Updated 4 years ago
- Research work aimed at addressing the problem of modeling infinite-length context ☆46 Dec 18, 2025 Updated 2 months ago
- Open-source tools for morphological tagging, segmentation and stemming. ☆40 Jul 11, 2019 Updated 6 years ago
- ☆22 Jun 30, 2021 Updated 4 years ago
- 5Hz Deep-Compression Speech VAE for AR-Diffusion and CALMs ☆57 Nov 19, 2025 Updated 3 months ago
- ☆19 May 6, 2023 Updated 2 years ago
- An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP. ☆391 Mar 26, 2024 Updated last year
- Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed. ☆755 Dec 8, 2022 Updated 3 years ago
- Tools for managing datasets for governance and training. ☆90 Jan 19, 2026 Updated last month
- Conditioned U-Net for Music Source Separation ☆20 May 15, 2021 Updated 4 years ago