commoncrawl / ia-web-commons
Web archiving utility library
☆11Updated 3 months ago
Alternatives and similar repositories for ia-web-commons
Users who are interested in ia-web-commons are comparing it to the libraries listed below.
- ☆38Updated last year
- ☆72Updated 2 years ago
- minimal pytorch implementation of bm25 (with sparse tensors)☆101Updated last year
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188)☆61Updated 2 years ago
- ☆16Updated 6 months ago
- Tools for managing datasets for governance and training.☆85Updated last month
- ☆47Updated 3 years ago
- ☆100Updated 2 years ago
- Code for Zero-Shot Tokenizer Transfer☆133Updated 5 months ago
- A framework for few-shot evaluation of autoregressive language models.☆105Updated 2 years ago
- Unified Learned Sparse Retrieval Framework☆64Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web☆178Updated last year
- This project studies the performance and robustness of language models and task-adaptation methods.☆149Updated last year
- The pipeline for the OSCAR corpus☆169Updated last year
- The official code of EMNLP 2022, "SCROLLS: Standardized CompaRison Over Long Language Sequences".☆70Updated last year
- Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering☆170Updated 4 years ago
- Common tools for data processing☆14Updated 2 months ago
- SILO Language Models code repository☆81Updated last year
- A robust web archive analytics toolkit☆111Updated 3 months ago
- Pretraining Efficiently on S2ORC!☆164Updated 8 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face☆32Updated last year
- The original implementation of Min et al. "Nonparametric Masked Language Modeling" (paper https://arxiv.org/abs/2212.01349)☆157Updated 2 years ago
- [ICLR 2023] Guess the Instruction! Flipped Learning Makes Language Models Stronger Zero-Shot Learners☆116Updated 9 months ago
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages.☆76Updated 3 years ago
- This repository contains the dataset and code for "WiCE: Real-World Entailment for Claims in Wikipedia" in EMNLP 2023.☆41Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets☆221Updated 7 months ago
- ☆48Updated last year
- INCOME: An Easy Repository for Training and Evaluation of Index Compression Methods in Dense Retrieval. Includes BPR and JPQ.☆24Updated last year
- ☆48Updated 4 months ago
- The official code repo for "Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations".☆83Updated last year