sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆43 Updated 4 years ago
Alternatives and similar repositories for youtube_subtitle_dataset
Users who are interested in youtube_subtitle_dataset are comparing it to the libraries listed below.
- ☆90 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆34 Updated 4 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆80 Updated last year
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆57 Updated 2 years ago
- ☆33 Updated 2 years ago
- Open source library for few shot NLP ☆78 Updated last year
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) ☆61 Updated last year
- ☆78 Updated last year
- A library for squeakily cleaning and filtering language datasets. ☆46 Updated last year
- Applying Reinforcement Learning from Human Feedback to language models to teach them to write short story responses to writing prompts. ☆14 Updated 3 years ago
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ☆26 Updated 2 years ago
- Tools for content datamining and NLP at scale ☆43 Updated 11 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated last year
- ☆27 Updated 4 years ago
- ☆192 Updated last year
- Training & Implementation of chatbots leveraging GPT-like architecture with the aitextgen package to enable dynamic conversations. ☆49 Updated 2 years ago
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile. ☆11 Updated 2 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆33 Updated last year
- A TextTiling-based algorithm for text segmentation (aka topic segmentation) that uses neural sentence encoders, as well as extractive sum… ☆47 Updated 2 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆44 Updated last year
- ☆43 Updated 2 years ago
- The pipeline for the OSCAR corpus ☆167 Updated last year
- A question-answering dataset with a focus on subjective information ☆45 Updated last year
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆88 Updated last year
- Common crawl pretrained sentencepiece tokenizers for English and Japanese for various vocabulary sizes. Also development environment for … ☆10 Updated 3 years ago
- ☆98 Updated 2 years ago
- For experiments involving instruct gpt. Currently used for documenting open research questions. ☆71 Updated 2 years ago
- ☆97 Updated 2 years ago
- ☆24 Updated 9 months ago
- ☆111 Updated 2 years ago