sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆43 Updated 4 years ago
Alternatives and similar repositories for youtube_subtitle_dataset:
Users who are interested in youtube_subtitle_dataset are comparing it to the libraries listed below:
- ☆89 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆34 Updated 3 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆81 Updated last year
- ☆77 Updated last year
- ☆33 Updated last year
- Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit ☆63 Updated last year
- ☆148 Updated 4 years ago
- Evaluation suite for large-scale language models. ☆125 Updated 3 years ago
- Experiments with generating opensource language model assistants ☆97 Updated last year
- ☆111 Updated 2 years ago
- ☆43 Updated 2 years ago
- An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline. ☆86 Updated 3 years ago
- A library for squeakily cleaning and filtering language datasets. ☆46 Updated last year
- Multi-Domain Expert Learning ☆67 Updated last year
- Open source library for few shot NLP ☆78 Updated last year
- Repo for the paper "Detecting Logical Fallacies: From Quiz to Climate Change News" (2021) ☆75 Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web ☆177 Updated last year
- Helper scripts and notes that were used while porting various nlp models ☆46 Updated 3 years ago
- 🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0. ☆56 Updated 3 years ago
- code associated with WANLI dataset in Liu et al., 2022 ☆30 Updated last year
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆87 Updated last year
- TVRecap: A Dataset for Generating Stories with Character Descriptions ☆20 Updated last year
- ☆26 Updated last month
- A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ ☆37 Updated 4 years ago
- Official repo for NAACL 2024 Findings paper "LeTI: Learning to Generate from Textual Interactions." ☆63 Updated last year
- Downloads 2020 English Wikipedia articles as plaintext ☆23 Updated 2 years ago
- The Next Generation Multi-Modality Superintelligence ☆71 Updated 7 months ago
- Create soft prompts for fairseq 13B dense, GPT-J-6B and GPT-Neo-2.7B for free in a Google Colab TPU instance ☆28 Updated 2 years ago
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆57 Updated 2 years ago
- ARCHIVED. Please use https://docs.adapterhub.ml/huggingface_hub.html || 🔌 A central repository collecting pre-trained adapter modules ☆68 Updated 10 months ago