google-research-datasets / swim-ir
SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 languages, generated using PaLM 2 and summarize-then-ask prompting.
☆49 · Nov 13, 2023 · Updated 2 years ago
Alternatives and similar repositories for swim-ir
Users who are interested in swim-ir are comparing it to the repositories listed below.
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba ☆34 · Oct 16, 2025 · Updated 3 months ago
- SPRINT Toolkit helps you evaluate diverse neural sparse models easily using a single click on any IR dataset. ☆47 · Jul 25, 2023 · Updated 2 years ago
- The Python Implementation of CRISP: Clustering Multi-Vector Representations for Denoising and Pruning ☆27 · Jul 27, 2025 · Updated 6 months ago
- RankLLM is a Python toolkit for reproducible information retrieval research using rerankers, with a focus on listwise reranking. ☆576 · Updated this week
- 🌏 Modular retrievers for zero-shot multilingual IR. ☆30 · Mar 6, 2024 · Updated last year
- Embroid: Unsupervised Prediction Smoothing Can Improve Few-Shot Classification ☆11 · Aug 12, 2023 · Updated 2 years ago
- An unofficial implementation of Tensor4D with support for the D-NeRF dataset ☆13 · Nov 8, 2023 · Updated 2 years ago
- Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation ☆15 · Apr 23, 2025 · Updated 9 months ago
- BlockRank makes LLMs efficient and scalable for RAG and in-context ranking ☆41 · Dec 12, 2025 · Updated 2 months ago
- Data and code for paper "ODSum: New Benchmarks for Open Domain Multi-Document Summarization" ☆11 · Sep 20, 2024 · Updated last year
- ☆12 · Feb 22, 2024 · Updated last year
- ☆13 · Jan 22, 2025 · Updated last year
- An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in https://plachtaa.github.io ☆16 · Apr 18, 2024 · Updated last year
- Submission archive for the MS MARCO passage ranking leaderboard ☆13 · Apr 21, 2023 · Updated 2 years ago
- [ACL 2025] AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ☆165 · Oct 14, 2025 · Updated 4 months ago
- Tevatron - Unified Document Retrieval Toolkit across Scale, Language, and Modality. Demo in SIGIR 2023, SIGIR 2025. ☆723 · Jan 26, 2026 · Updated 3 weeks ago
- provides a common interface to many IR measure tools ☆95 · Feb 2, 2026 · Updated last week
- Scalable training for dense retrieval models. ☆298 · Jun 10, 2025 · Updated 8 months ago
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets. ☆13 · Nov 21, 2023 · Updated 2 years ago
- QAmeleon introduces synthetic multilingual QA data using PaLM, a 540B large language model. This dataset was generated by prompt tuning P… ☆35 · Aug 15, 2023 · Updated 2 years ago
- Use contrastive learning to train a large language model (LLM) as a retriever ☆12 · Jul 19, 2024 · Updated last year
- ☆48 · Nov 18, 2025 · Updated 2 months ago
- Code for paper "Open-Domain Hierarchical Event Schema Induction by Incremental Prompting and Verification" ☆16 · Jul 4, 2023 · Updated 2 years ago
- ☆14 · Oct 28, 2023 · Updated 2 years ago
- Huggingface implementation of MVDream for easy import ☆16 · Mar 31, 2025 · Updated 10 months ago
- Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: … ☆340 · Jul 6, 2023 · Updated 2 years ago
- Codebase of ACL2024 paper "Spiral of Silence: How is Large Language Model Killing Information Retrieval?—A Case Study on Open Domain Ques… ☆16 · Jun 4, 2024 · Updated last year
- Embedding Recycling for Language models ☆38 · Jul 11, 2023 · Updated 2 years ago
- This repository hosts the dataset for the paper Computer Science Named Entity Recognition in the Open Research Knowledge Graph ☆21 · Jan 8, 2024 · Updated 2 years ago
- Toolkit for domain-specific information retrieval experimentation ☆19 · Updated this week
- ☆24 · Feb 4, 2026 · Updated last week
- ☆18 · Aug 21, 2025 · Updated 5 months ago
- Quick lookup for Instant-angelo (https://github.com/hugoycj/Instant-angelo) results ☆21 · Oct 22, 2023 · Updated 2 years ago
- Rotation equivariance meets local feature matching ☆18 · Oct 20, 2022 · Updated 3 years ago
- Generate BERT vocabularies and pretraining examples from Wikipedias ☆17 · May 11, 2020 · Updated 5 years ago
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. ☆80 · Feb 16, 2022 · Updated 3 years ago
- This repository contains step by step instructions on how to finetune Microsoft's Phi-2 model with your own data. ☆22 · Apr 11, 2024 · Updated last year
- ☆89 · Apr 3, 2025 · Updated 10 months ago
- Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE ☆19 · Sep 22, 2021 · Updated 4 years ago