sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆38 Updated 3 years ago
Related projects:
- Documentation effort for the BookCorpus dataset ☆30 Updated 3 years ago
- ☆86 Updated 2 years ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 Updated 2 years ago
- TVRecap: A Dataset for Generating Stories with Character Descriptions ☆20 Updated last year
- ☆75 Updated 9 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated last year
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆38 Updated 8 months ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆74 Updated 9 months ago
- A library for squeakily cleaning and filtering language datasets. ☆45 Updated last year
- Experiments with generating opensource language model assistants ☆97 Updated last year
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆84 Updated 6 months ago
- ☆42 Updated last year
- Code for Relevance-guided Supervision for OpenQA with ColBERT (TACL'21) ☆40 Updated 3 years ago
- ☆23 Updated 2 weeks ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆42 Updated 10 months ago
- Code for Stage-wise Fine-tuning for Graph-to-Text Generation ☆26 Updated last year
- A TextTiling-based algorithm for text segmentation (aka topic segmentation) that uses neural sentence encoders, as well as extractive sum… ☆41 Updated last year
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) ☆59 Updated last year
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆91 Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆69 Updated 6 months ago
- ☆27 Updated last month
- URL downloader supporting checkpointing and continuous checksumming. ☆19 Updated 9 months ago
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆27 Updated last year
- ☆31 Updated last year
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 Updated 3 years ago
- Scripts to convert datasets from various sources to Hugging Face Datasets. ☆57 Updated last year
- ☆44 Updated 2 months ago
- Code for our paper Resources and Evaluations for Multi-Distribution Dense Information Retrieval ☆14 Updated 8 months ago
- aiXplain enables python programmers to add AI functions to their software. ☆24 Updated last week