sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆39 · Updated 4 years ago
Alternatives and similar repositories for youtube_subtitle_dataset:
Users who are interested in youtube_subtitle_dataset are comparing it to the libraries listed below.
- Documentation effort for the BookCorpus dataset ☆33 · Updated 3 years ago
- ☆86 · Updated 2 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆80 · Updated last year
- Fine tuning experiments for the GPT-2 model by OpenAI. ☆20 · Updated 5 years ago
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆86 · Updated 10 months ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 · Updated 3 years ago
- ☆77 · Updated last year
- TVRecap: A Dataset for Generating Stories with Character Descriptions ☆20 · Updated last year
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 · Updated 2 years ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆45 · Updated last year
- ☆21 · Updated last year
- Code for constructing TLDR corpus from Reddit dataset ☆26 · Updated 3 years ago
- ☆90 · Updated 7 months ago
- Factored Cognition Primer: How to write compositional language model programs ☆48 · Updated last year
- ☆14 · Updated 3 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 · Updated last year
- Consists of the largest (10K) human annotated code-switched semantic parsing dataset & 170K generated utterance using the CST5 augmentati… ☆35 · Updated last year
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆58 · Updated 2 years ago
- LIGHT is a platform for text-situated dialogue research. We originally hosted LIGHT as a live game with dialogue models in a grounded set… ☆68 · Updated last year
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆39 · Updated last year
- LTG-Bert ☆29 · Updated last year
- ☆54 · Updated last year
- Experiments with Hugging Face 🔬 🤗 ☆45 · Updated 4 months ago
- Crawling engine that crawls a set of top-level domains looking for documents in a list of languages ☆10 · Updated 11 months ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 · Updated 4 years ago
- BERT models for many languages created from Wikipedia texts ☆34 · Updated 4 years ago
- A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations ☆54 · Updated 2 years ago
- A TextTiling-based algorithm for text segmentation (aka topic segmentation) that uses neural sentence encoders, as well as extractive sum… ☆44 · Updated last year
- arXiv plain text extraction ☆41 · Updated 2 years ago
- Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit ☆63 · Updated last year