commoncrawl / ia-web-commons
Web archiving utility library
☆11Updated last month
Alternatives and similar repositories for ia-web-commons
Users who are interested in ia-web-commons are comparing it to the libraries listed below.
- The pipeline for the OSCAR corpus☆175Updated 2 months ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets☆225Updated last year
- A framework for few-shot evaluation of autoregressive language models.☆105Updated 2 years ago
- Scalable training for dense retrieval models.☆298Updated 6 months ago
- Pipeline for pulling and processing online language model pretraining data from the web☆179Updated 2 years ago
- Pretraining Efficiently on S2ORC!☆178Updated last year
- ☆16Updated last year
- This project studies the performance and robustness of language models and task-adaptation methods.☆155Updated last year
- Search Engines with Autoregressive Language models☆295Updated 2 years ago
- Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering☆174Updated 4 years ago
- ☆38Updated last year
- Unified Learned Sparse Retrieval Framework☆68Updated last year
- Dataset collection and preprocessing framework for NLP extreme multitask learning☆189Updated 6 months ago
- Reproduce results and replicate training for T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)☆464Updated 3 years ago
- ☆72Updated 2 years ago
- DSIR large-scale data selection framework for language model training☆268Updated last year
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks.☆123Updated 2 months ago
- ☆119Updated last year
- Inquisitive Parrots for Search☆199Updated 7 months ago
- PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an…☆285Updated 3 years ago
- Code for Zero-Shot Tokenizer Transfer☆142Updated 11 months ago
- Dense hybrid representations for text retrieval☆64Updated 2 years ago
- Tk-Instruct is a Transformer model that is tuned to solve many NLP tasks by following instructions.☆183Updated 3 years ago
- The official code of EMNLP 2022, "SCROLLS: Standardized CompaRison Over Long Language Sequences".☆69Updated last year
- ☆328Updated 4 years ago
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers☆137Updated last year
- HellaSwag: Can a Machine _Really_ Finish Your Sentence?☆227Updated 5 years ago
- Retrieval-Augmented Generation battle!☆61Updated 5 months ago
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆222Updated 3 weeks ago
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages.☆79Updated 3 years ago