sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆38 Updated 4 years ago
Related projects
Alternatives and complementary repositories for youtube_subtitle_dataset
- ☆86 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- URL downloader supporting checkpointing and continuous checksumming. ☆19 Updated 11 months ago
- ☆76 Updated 11 months ago
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆39 Updated 10 months ago
- Ranking of fine-tuned HF models as base models. ☆35 Updated last year
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆76 Updated 11 months ago
- A question-answering dataset with a focus on subjective information ☆43 Updated 10 months ago
- Training & Implementation of chatbots leveraging GPT-like architecture with the aitextgen package to enable dynamic conversations. ☆46 Updated 2 years ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 Updated 2 years ago
- Tower Parse: Low-Resource Dependency Parsing via Hierarchical Source Selection ☆15 Updated 3 years ago
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆58 Updated 2 years ago
- ☆27 Updated 3 months ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆18 Updated last year
- NLG Best Practices for Data-Efficient Modeling: How to Train Production-Ready Models with Little Data ☆11 Updated 3 years ago
- Scripts supporting the development and serving the Roots Search Tool - https://hf.co/spaces/bigscience-data/roots-search ☆10 Updated last year
- Smol but mighty language model ☆62 Updated last year
- Wikipedia based dataset to train relationship classifiers and fact extraction models ☆25 Updated 3 years ago
- As good as new. How to successfully recycle English GPT-2 to make models for other languages (ACL Findings 2021) ☆46 Updated 3 years ago
- Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21) ☆40 Updated 3 years ago
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ☆26 Updated last year
- Seed Machine Translation Data ☆30 Updated last week
- An ongoing series of notebooks aimed at helping fellow NLP enthusiasts think about applying new tools and techniques to practical tasks. ☆18 Updated 3 years ago
- ☆19 Updated 2 years ago
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆27 Updated 2 years ago
- BERT models for many languages created from Wikipedia texts ☆34 Updated 4 years ago
- Source code and data for Like a Good Nearest Neighbor ☆28 Updated 9 months ago
- Using short models to classify long texts ☆20 Updated last year
- ☆14 Updated last month