YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆47 · Sep 22, 2020 · Updated 5 years ago
Alternatives and similar repositories for youtube_subtitle_dataset
Users who are interested in youtube_subtitle_dataset are comparing it to the repositories listed below
- A script for collecting the PubMed Central dataset in a language modelling friendly format. ☆25 · Feb 16, 2021 · Updated 5 years ago
- Download, parse, and filter data from Phil Papers. Data-ready for The-Pile. ☆19 · Aug 28, 2023 · Updated 2 years ago
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile. ☆15 · Jun 3, 2023 · Updated 2 years ago
- Minimal, clean code for video/image "patchnization" - a process commonly used in tokenizing visual data for use in a Transformer encoder.… ☆11 · May 16, 2024 · Updated last year
- Finetuning InstructLLaMA on consumer hardware (copy from https://github.com/tloen/alpaca-lora) ☆11 · Mar 17, 2023 · Updated 2 years ago
- Building the laion5B paper ☆36 · May 6, 2022 · Updated 3 years ago
- downloads and parses subtitle dataset from opensubtitles.org ☆15 · Apr 19, 2024 · Updated last year
- ☆78 · Dec 7, 2023 · Updated 2 years ago
- Dataset: BuzzFeed News “Trending” Strip, 2018–2023 ☆18 · May 24, 2023 · Updated 2 years ago
- NHS England PhD Internship Projects Pages ☆19 · Oct 3, 2025 · Updated 5 months ago
- Code for paper "Point and Ask: Incorporating Pointing into Visual Question Answering" ☆19 · Oct 4, 2022 · Updated 3 years ago
- My personal web page ☆11 · Feb 17, 2026 · Updated 2 weeks ago
- Wikipedia based dataset to train relationship classifiers and fact extraction models ☆26 · May 25, 2021 · Updated 4 years ago
- ☆26 · Jul 11, 2022 · Updated 3 years ago
- 🤖ConvRe🤯: An Investigation of LLMs’ Inefficacy in Understanding Converse Relations (EMNLP 2023) ☆24 · Oct 10, 2023 · Updated 2 years ago
- ☆48 · Oct 28, 2025 · Updated 4 months ago
- A multi-agent mind implemented using LLMs engaged in ongoing conversation ☆25 · Mar 1, 2023 · Updated 3 years ago
- WeatherFusionNet - our solution to the NeurIPS 2022 Weather4cast competition ☆33 · Nov 30, 2023 · Updated 2 years ago
- ChatGPT Participates in a Computer Science Exam (2023) ☆31 · Mar 21, 2023 · Updated 2 years ago
- Knowledge graph extraction from text using OpenAI ChatGPT for graph extraction and Neo4j for DB storage ☆11 · Feb 26, 2024 · Updated 2 years ago
- Talk to your CSV: how to Visualize Your Data with Langchain and Streamlit ☆29 · Aug 26, 2023 · Updated 2 years ago
- Program and links to the material for the GloBIAS Training School 2025, Kobe, Japan. ☆22 · Oct 27, 2025 · Updated 4 months ago
- LLM-powered Q/A over arXiv preprints ☆32 · Apr 5, 2023 · Updated 2 years ago
- This is a pip package implementing Reinforcement Learning algorithms in non-stationary environments supported by the OpenAI Gym toolkit. ☆33 · Jun 5, 2019 · Updated 6 years ago
- ☆32 · May 23, 2023 · Updated 2 years ago
- Run SWE-bench evaluations remotely ☆58 · Aug 14, 2025 · Updated 6 months ago
- ☆35 · Mar 5, 2025 · Updated last year
- ☆10 · Sep 27, 2020 · Updated 5 years ago
- MirMachine, a command line tool to detect microRNA homologs in genome sequences. ☆13 · Updated this week
- Here, I provided the solution for exercises of IBM Quantum Challenge 2020 ☆10 · Oct 27, 2020 · Updated 5 years ago
- Neural Error Mitigation of Near-Term Quantum Simulations (arXiv:2105.08086) ☆10 · Jul 6, 2022 · Updated 3 years ago
- A suite of open-ended, non-imitative tasks involving generalizable skills for large language model chatbots and agents to enable bootstra… ☆44 · Jan 31, 2025 · Updated last year
- ☆163 · Mar 5, 2021 · Updated 5 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆86 · Dec 6, 2023 · Updated 2 years ago
- Agentless Lite: RAG-based SWE-Bench software engineering scaffold ☆45 · Apr 15, 2025 · Updated 10 months ago
- ☆1,636 · Apr 27, 2023 · Updated 2 years ago
- Old book pages (with groundtruth), formerly used for OCR studies. There are several versions of the set (concerning resolution and binari… ☆15 · Aug 25, 2017 · Updated 8 years ago
- maps are everything. ☆10 · Jul 3, 2025 · Updated 8 months ago
- Add AI to the Linux terminal ☆10 · Apr 28, 2024 · Updated last year