sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆43 Updated 4 years ago
Alternatives and similar repositories for youtube_subtitle_dataset:
Users who are interested in youtube_subtitle_dataset often compare it to the repositories listed below.
- ☆90 Updated 2 years ago
- ☆77 Updated last year
- Documentation effort for the BookCorpus dataset ☆34 Updated 3 years ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆81 Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web ☆177 Updated last year
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆57 Updated 2 years ago
- Tools for managing datasets for governance and training. ☆85 Updated 3 months ago
- RaKUn 2.0 - A fast keyword detection algorithm ☆67 Updated 2 weeks ago
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆88 Updated last year
- 🤗 Disaggregators: Curated data labelers for in-depth analysis. ☆65 Updated 2 years ago
- Experiments with generating opensource language model assistants ☆97 Updated last year
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆31 Updated last year
- Helper scripts and notes that were used while porting various nlp models ☆46 Updated 3 years ago
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile. ☆11 Updated last year
- ☆33 Updated last year
- A library for computing diverse text characteristics and using them to analyze data sets and models with ease. ☆40 Updated 2 years ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆93 Updated 2 years ago
- Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit ☆63 Updated last year
- 🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0. ☆56 Updated 3 years ago
- One stop shop for all things carp ☆59 Updated 2 years ago
- Official repo for NAACL 2024 Findings paper "LeTI: Learning to Generate from Textual Interactions." ☆64 Updated last year
- ☆43 Updated 2 years ago
- A TextTiling-based algorithm for text segmentation (aka topic segmentation) that uses neural sentence encoders, as well as extractive sum… ☆46 Updated 2 years ago
- The pipeline for the OSCAR corpus ☆168 Updated last year
- A library for squeakily cleaning and filtering language datasets. ☆47 Updated last year
- This repository contains all the code for collecting large scale amounts of code from GitHub. ☆107 Updated 2 years ago
- Many Natural Language Processing tasks rely on sentence boundary detection (SBD). Although amazing libraries like spacy provide state of … ☆61 Updated 4 years ago
- Inference script for Meta's LLaMA models using Hugging Face wrapper ☆110 Updated 2 years ago
- ☆72 Updated last year
- Simple Annotated implementation of GPT-NeoX in PyTorch ☆110 Updated 2 years ago