commoncrawl / ia-web-commons
Web archiving utility library
☆11 Updated 2 months ago
Alternatives and similar repositories for ia-web-commons
Users who are interested in ia-web-commons are comparing it to the libraries listed below.
- ☆38 Updated last year
- ☆16 Updated 5 months ago
- minimal pytorch implementation of bm25 (with sparse tensors) ☆101 Updated last year
- ☆72 Updated 2 years ago
- ☆47 Updated 3 years ago
- Pipeline for pulling and processing online language model pretraining data from the web ☆178 Updated last year
- A framework for few-shot evaluation of autoregressive language models. ☆103 Updated 2 years ago
- INCOME: An Easy Repository for Training and Evaluation of Index Compression Methods in Dense Retrieval. Includes BPR and JPQ. ☆24 Updated last year
- The pipeline for the OSCAR corpus ☆167 Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆221 Updated 6 months ago
- Pretraining Efficiently on S2ORC! ☆164 Updated 7 months ago
- ☆98 Updated 2 years ago
- The official code of EMNLP 2022, "SCROLLS: Standardized CompaRison Over Long Language Sequences". ☆69 Updated last year
- Dense hybrid representations for text retrieval ☆62 Updated 2 years ago
- Tools for managing datasets for governance and training. ☆85 Updated last week
- Unified Learned Sparse Retrieval Framework ☆64 Updated last year
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆32 Updated last year
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆93 Updated 2 years ago
- Code for Zero-Shot Tokenizer Transfer ☆128 Updated 4 months ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆27 Updated this week
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆128 Updated last year
- ☆65 Updated last year
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. ☆76 Updated 3 years ago
- Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering ☆170 Updated 4 years ago
- This project studies the performance and robustness of language models and task-adaptation methods. ☆150 Updated last year
- provides a common interface to many IR measure tools ☆84 Updated 3 weeks ago
- SPRINT Toolkit helps you evaluate diverse neural sparse models easily using a single click on any IR dataset. ☆45 Updated last year
- Experiments for efforts to train a new and improved t5 ☆77 Updated last year
- A Python framework for conversational search ☆40 Updated 3 years ago
- SILO Language Models code repository ☆81 Updated last year