leogao2 / commoncrawl_downloader
☆31 Updated last year
Related projects
Alternatives and complementary repositories for commoncrawl_downloader
- ☆86 Updated 2 years ago
- ☆76 Updated 11 months ago
- A dataset of alignment research and code to reproduce it ☆69 Updated last year
- A library for squeakily cleaning and filtering language datasets. ☆45 Updated last year
- Script for downloading GitHub. ☆88 Updated 4 months ago
- One stop shop for all things carp ☆59 Updated 2 years ago
- Evaluation suite for large-scale language models. ☆124 Updated 3 years ago
- Pipeline for pulling and processing online language model pretraining data from the web ☆174 Updated last year
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆76 Updated 11 months ago
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) ☆60 Updated last year
- For experiments involving instruct gpt. Currently used for documenting open research questions. ☆71 Updated 2 years ago
- Downloads 2020 English Wikipedia articles as plaintext ☆21 Updated last year
- ☆147 Updated 3 years ago
- The NewSHead dataset is a multi-doc headline dataset used in NHNet for training a headline summarization model. ☆36 Updated 2 years ago
- Experiments with generating opensource language model assistants ☆97 Updated last year
- ☆110 Updated 2 years ago
- Implementation of Marge, Pre-training via Paraphrasing, in Pytorch ☆75 Updated 3 years ago
- Tools for managing datasets for governance and training. ☆78 Updated 3 weeks ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆43 Updated 6 months ago
- Source code for the paper "Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints" ☆27 Updated last year
- Dataset collection and preprocessing framework for NLP extreme multitask learning ☆149 Updated 4 months ago
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval ☆27 Updated 2 years ago
- A package for fine-tuning Transformers with TPUs, written in Tensorflow2.0+ ☆37 Updated 3 years ago
- Google's BigBird (Jax/Flax & PyTorch) @ 🤗Transformers ☆47 Updated last year
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ☆58 Updated 2 years ago
- An original implementation of EMNLP 2020, "AmbigQA: Answering Ambiguous Open-domain Questions" ☆117 Updated 2 years ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale ☆153 Updated 11 months ago
- Simple Annotated implementation of GPT-NeoX in PyTorch ☆111 Updated 2 years ago
- Vespa application making an index of the CORD-19 dataset. ☆39 Updated this week