sdtblck/youtube_subtitle_dataset

YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training

☆47

Alternatives and similar repositories for youtube_subtitle_dataset

Users that are interested in youtube_subtitle_dataset are comparing it to the libraries listed below.

EleutherAI / pile-pubmedcentral
A script for collecting the PubMed Central dataset in a language modelling friendly format.
☆26 · Feb 16, 2021 · Updated 5 years ago
thoppe / The-Pile-PhilPapers
Download, parse, and filter data from Phil Papers. Data-ready for The-Pile.
☆20 · Aug 28, 2023 · Updated 2 years ago
EleutherAI / openwebtext2
☆94 · Jul 16, 2022 · Updated 4 years ago
lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
yehudagale / fuzzyJoiner
☆13 · Dec 8, 2022 · Updated 3 years ago
LAION-AI / laion5B-paper
Building the laion5B paper
☆36 · May 6, 2022 · Updated 4 years ago
leogao2 / commoncrawl_downloader
☆33 · May 23, 2023 · Updated 3 years ago
leogao2 / lm_dataformat
☆79 · Dec 7, 2023 · Updated 2 years ago
kan-bayashi / Taco2withBERT
Tacotron2 with BERT examples
☆10 · Jul 8, 2019 · Updated 7 years ago
tedhtchang / bert-sentiment-tfjs
Sentiment Analysis using BERT model and Tensorflowjs
☆13 · Jun 2, 2020 · Updated 6 years ago
OSU-STARLAB / Simul-LLM
[ACL 2024] An easily extensible framework for simultaneous, text-to-text neural machine translation (SimulMT) for LLMs.
☆18 · Apr 21, 2025 · Updated last year
noanabeshima / wikipedia-downloader
Downloads 2020 English Wikipedia articles as plaintext
☆27 · Mar 25, 2023 · Updated 3 years ago
EleutherAI / lm_perplexity
☆165 · Mar 5, 2021 · Updated 5 years ago
wordnik / language-museums
A JSON dataset of information about language museums around the world
☆13 · Feb 26, 2020 · Updated 6 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jul 21, 2026 · Updated last week
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
EleutherAI / pilev2
☆13 · Jan 20, 2023 · Updated 3 years ago
3B-Group / ConvRe
🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023)
☆24 · Oct 10, 2023 · Updated 2 years ago
kowey / corenlp-server
Server wrapper for Stanford CoreNLP
☆14 · Nov 4, 2014 · Updated 11 years ago
Mayank-Bhatia / UrbanSound_Classification
Sound classification using neural networks
☆12 · Jun 6, 2018 · Updated 8 years ago
broadinstitute / ml4ht_data_source
Multimodal data loader compatible with pytorch and tensorflow
☆12 · Aug 14, 2024 · Updated last year
DigitalPhonetics / cyclegan-emotion-transfer
CycleGAN-based Emotion Style Transfer as Data Augmentation for Speech Emotion Recognition
☆12 · Oct 7, 2019 · Updated 6 years ago
JonathanRaiman / epub_conversion
Python package for converting xml and epubs to text files
☆33 · Jun 9, 2020 · Updated 6 years ago
ivy-llc / mech
Mechanics functions with end-to-end support for deep learning developers, written in Ivy.
☆14 · Aug 28, 2023 · Updated 2 years ago
jeffreyjeffreywang / SSE
Self-supervised Speech Enhancement network
☆11 · Aug 27, 2020 · Updated 5 years ago
CarperAI / Code-Pile
This repository contains all the code for collecting large scale amounts of code from GitHub.
☆109 · Feb 17, 2023 · Updated 3 years ago
jordipons / ICASSP2017
Designing efficient architectures for modeling temporal features with convolutional neural networks
☆16 · Mar 17, 2017 · Updated 9 years ago
liyucheng09 / Contamination_Detector
Lightweight tool to identify Data Contamination in LLMs evaluation
☆53 · Mar 8, 2024 · Updated 2 years ago
alephic / prompt-fab
Toolkit for building prompt templates for language models
☆11 · Sep 30, 2022 · Updated 3 years ago
mhollfelder / openvent
This repository contains generic information about open-source ventilator applications.
☆21 · Jun 11, 2020 · Updated 6 years ago
chongkong / rocat
🚀 Python asyncio actor library
☆14 · Aug 16, 2017 · Updated 8 years ago
toltoxgh / CoreNLP
Stanford CoreNLP Extensions: Fork to provide the ability to capture Multi-Word Expressions
☆10 · Jun 14, 2022 · Updated 4 years ago
t184256 / hacks
hacks, a python plugin library that doesn't play by the rules
☆18 · Jan 27, 2017 · Updated 9 years ago
chrisdavidmills / selection-api-examples
Examples to demonstrate use of the Selection API.
☆12 · Mar 1, 2017 · Updated 9 years ago
wenjiepei / TAGM
The source code for Temporal Attention-Gated Model.
☆21 · Jul 6, 2017 · Updated 9 years ago
CTSRD-CHERI / gxemul
Adaptation of gxemul to support the CHERI MIPS unit test suite and certain CHERI features
☆15 · Dec 8, 2015 · Updated 10 years ago
harbor-framework / harbor-index
A compact high-signal benchmark for evaluating frontier agents
☆21 · Updated this week
unisonweb / unison-local-ui
The Codebase UI that ships with UCM
☆21 · May 20, 2026 · Updated 2 months ago
maki-nage / rxray
Ray distributed computing integration for RxPY
☆12 · Aug 1, 2021 · Updated 4 years ago