sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆47 · Sep 22, 2020 · Updated 5 years ago
Alternatives and similar repositories for youtube_subtitle_dataset
Users who are interested in youtube_subtitle_dataset are comparing it to the repositories listed below
- A script for collecting the PubMed Central dataset in a language modelling friendly format. ☆25 · Feb 16, 2021 · Updated 5 years ago
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile. ☆15 · Jun 3, 2023 · Updated 2 years ago
- Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder.… ☆11 · May 16, 2024 · Updated last year
- ☆16 · Dec 11, 2024 · Updated last year
- Building the laion5B paper ☆36 · May 6, 2022 · Updated 3 years ago
- ☆78 · Dec 7, 2023 · Updated 2 years ago
- NHS England PhD Internship Projects Pages ☆19 · Oct 3, 2025 · Updated 4 months ago
- Code for paper "Point and Ask: Incorporating Pointing into Visual Question Answering" ☆19 · Oct 4, 2022 · Updated 3 years ago
- Wikipedia based dataset to train relationship classifiers and fact extraction models ☆26 · May 25, 2021 · Updated 4 years ago
- My personal web page ☆11 · Oct 20, 2025 · Updated 3 months ago
- ☆26 · Jul 11, 2022 · Updated 3 years ago
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023) ☆24 · Oct 10, 2023 · Updated 2 years ago
- Downloads 2020 English Wikipedia articles as plaintext ☆27 · Mar 25, 2023 · Updated 2 years ago
- A multi-agent mind implemented using LLMs engaged in ongoing conversation ☆26 · Mar 1, 2023 · Updated 2 years ago
- ☆47 · Oct 28, 2025 · Updated 3 months ago
- ChatGPT Participates in a Computer Science Exam (2023) ☆31 · Mar 21, 2023 · Updated 2 years ago
- Knowledge graph extraction from text using OpenAI ChatGPT for graph extraction and Neo4j for DB storage ☆11 · Feb 26, 2024 · Updated last year
- LLM-powered Q/A over arXiv preprints ☆32 · Apr 5, 2023 · Updated 2 years ago
- Run SWE-bench evaluations remotely ☆56 · Aug 14, 2025 · Updated 6 months ago
- This is a pip package implementing Reinforcement Learning algorithms in non-stationary environments supported by the OpenAI Gym toolkit. ☆33 · Jun 5, 2019 · Updated 6 years ago
- ☆32 · May 23, 2023 · Updated 2 years ago
- MirMachine, a command line tool to detect microRNA homologs in genome sequences. ☆13 · Dec 3, 2025 · Updated 2 months ago
- ☆18 · Mar 19, 2014 · Updated 11 years ago
- A suite of open-ended, non-imitative tasks involving generalizable skills for large language model chatbots and agents to enable bootstra… ☆43 · Jan 31, 2025 · Updated last year
- Material associated with Physics Report "Data science applications to string theory" ☆11 · Jun 20, 2023 · Updated 2 years ago
- Real-ESRGAN aims at developing Practical Algorithms for General Image/Video Restoration. ☆10 · Jul 24, 2024 · Updated last year
- Neural Error Mitigation of Near-Term Quantum Simulations (arXiv:2105.08086) ☆10 · Jul 6, 2022 · Updated 3 years ago
- ☆162 · Mar 5, 2021 · Updated 4 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆86 · Dec 6, 2023 · Updated 2 years ago
- Agentless Lite: RAG-based SWE-Bench software engineering scaffold ☆45 · Apr 15, 2025 · Updated 10 months ago
- A simple module consistently outperforms self-attention and Transformer model on main NMT datasets with SoTA performance. ☆86 · Jul 24, 2023 · Updated 2 years ago
- ☆1,636 · Apr 27, 2023 · Updated 2 years ago
- Super simple, zero config options, <2kb declarative tooltip library with no dependencies. ☆17 · Jun 2, 2023 · Updated 2 years ago
- Old book pages (with groundtruth), formerly used for OCR studies. There are several versions of the set (concerning resolution and binari… ☆15 · Aug 25, 2017 · Updated 8 years ago
- Discord Docsbot, Built on bgent ☆11 · Jun 17, 2024 · Updated last year
- ☆12 · Jan 11, 2026 · Updated last month
- A framework for few-shot evaluation of autoregressive language models. ☆12 · Jul 14, 2025 · Updated 7 months ago
- This program meshes Volumetric Video recorded with LiveScan3D ☆10 · Dec 17, 2020 · Updated 5 years ago
- ☆13 · Updated this week