facebookresearch / cc_net
Tools to download and clean up Common Crawl data
☆980 Updated last year
Alternatives and similar repositories for cc_net:
Users who are interested in cc_net compare it with the repositories listed below.
- ☆1,157 Updated 5 months ago
- All-in-one text de-duplication ☆648 Updated 7 months ago
- Dense Passage Retriever is a set of tools and models for open-domain Q&A tasks. ☆1,741 Updated last year
- Autoregressive Entity Retrieval ☆775 Updated last year
- BLEURT is a metric for Natural Language Generation based on transfer learning. ☆711 Updated last year
- Contriever: Unsupervised Dense Information Retrieval with Contrastive Learning ☆705 Updated last year
- Expanding natural instructions ☆965 Updated last year
- XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 ty… ☆636 Updated 2 years ago
- ☆1,486 Updated 2 months ago
- [ACL 2021] LM-BFF: Better Few-shot Fine-tuning of Language Models https://arxiv.org/abs/2012.15723 ☆722 Updated 2 years ago
- An efficient implementation of the popular sequence models for text generation, summarization, and translation tasks. https://arxiv.org/p… ☆429 Updated 2 years ago
- ☆489 Updated 11 months ago
- ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x. ☆571 Updated last year
- ☆1,259 Updated 2 years ago
- [ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval Learns Passage Retrieval, Too https://arxiv.o… ☆604 Updated 2 years ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,354 Updated 9 months ago
- FastFormers - highly efficient transformer models for NLU ☆703 Updated last year
- Crosslingual Generalization through Multitask Finetuning ☆521 Updated 3 months ago
- Fast Inference Solutions for BLOOM ☆563 Updated 3 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆306 Updated last year
- Library for Knowledge Intensive Language Tasks ☆920 Updated 2 years ago
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment ☆781 Updated last year
- BERT score for text generation ☆1,653 Updated 5 months ago
- Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. ☆1,720 Updated this week
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets. ☆1,673 Updated 5 months ago
- Repository containing code for "How to Train BERT with an Academic Budget" paper ☆310 Updated last year
- Prefix-Tuning: Optimizing Continuous Prompts for Generation ☆905 Updated 8 months ago
- Code for using and evaluating SpanBERT. ☆895 Updated last year
- Longformer: The Long-Document Transformer ☆2,072 Updated last year
- SGPT: GPT Sentence Embeddings for Semantic Search ☆857 Updated 11 months ago