YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆46Sep 22, 2020Updated 5 years ago
Alternatives and similar repositories for youtube_subtitle_dataset
Users that are interested in youtube_subtitle_dataset are comparing it to the libraries listed below.
- A script for collecting the PubMed Central dataset in a language modelling friendly format.☆26Feb 16, 2021Updated 5 years ago
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile.☆15Jun 3, 2023Updated 2 years ago
- ☆17Dec 11, 2024Updated last year
- ☆13Dec 8, 2022Updated 3 years ago
- ☆33May 23, 2023Updated 2 years ago
- ☆78Dec 7, 2023Updated 2 years ago
- Code for "Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media: A Unified Model"☆18Feb 14, 2022Updated 4 years ago
- ☆130Apr 30, 2026Updated last week
- KB data lab☆10Dec 8, 2020Updated 5 years ago
- [EMNLP 2025 Findings] A complete cross-modal RAG system for end-to-end speech-to-speech large models, including ASR-based Retrieval and E…☆31Jul 11, 2025Updated 9 months ago
- downloads and parses subtitle dataset from opensubtitles.org☆15Apr 19, 2024Updated 2 years ago
- A simple module consistently outperforms self-attention and Transformer model on main NMT datasets with SoTA performance.☆86Jul 24, 2023Updated 2 years ago
- ☆164Mar 5, 2021Updated 5 years ago
- [ACL 2024] An easily extensible framework for simultaneous, text-to-text neural machine translation (SimulMT) for LLMs.☆18Apr 21, 2025Updated last year
- Downloads 2020 English Wikipedia articles as plaintext☆27Mar 25, 2023Updated 3 years ago
- A JSON dataset of information about language museums around the world☆13Feb 26, 2020Updated 6 years ago
- Rudimentary snippets facility for ScIDE, implemented in sclang☆13Oct 20, 2022Updated 3 years ago
- ☆1,650Apr 27, 2023Updated 3 years ago
- Web archiving utility library☆11Updated this week
- ☆13Jan 20, 2023Updated 3 years ago
- Tools for training pytorch language models☆27Nov 14, 2020Updated 5 years ago
- ☆14Feb 11, 2022Updated 4 years ago
- Chakra UI Animations is a dependency that offers pre-built animations for your Chakra UI components.☆14Oct 18, 2022Updated 3 years ago
- NCLDR is an open source .NET implementation of CLDR (the Common Locale Data Repository)☆26Jun 19, 2014Updated 11 years ago
- The Codebase UI that ships with UCM☆20Apr 17, 2026Updated 3 weeks ago
- Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding☆15Feb 3, 2022Updated 4 years ago
- Stanford CoreNLP Extensions: Fork to provide the ability to capture Multi-Word Expressions☆10Jun 14, 2022Updated 3 years ago
- ☆11Nov 28, 2015Updated 10 years ago
- Mechanics functions with end-to-end support for deep learning developers, written in Ivy.☆14Aug 28, 2023Updated 2 years ago
- Python package for converting xml and epubs to text files☆33Jun 9, 2020Updated 5 years ago
- Toolkit for building prompt templates for language models☆12Sep 30, 2022Updated 3 years ago
- Examples to demonstrate use of the Selection API.☆12Mar 1, 2017Updated 9 years ago
- ChatGPT Participates in a Computer Science Exam (2023)☆31Mar 21, 2023Updated 3 years ago
- hacks, a python plugin library that doesn't play by the rules☆18Jan 27, 2017Updated 9 years ago
- ☆22Feb 9, 2023Updated 3 years ago
- ☆26Jul 11, 2022Updated 3 years ago
- Browser-based annotation tool for Framenet☆16Jan 27, 2015Updated 11 years ago
- Adaptation of gxemul to support the CHERI MIPS unit test suite and certain CHERI features☆16Dec 8, 2015Updated 10 years ago
- use delay/sleep/wait to async/await ES7