Pile Deduplication Code
☆18May 15, 2023Updated 2 years ago
Alternatives and similar repositories for pile_dedupe
Users that are interested in pile_dedupe are comparing it to the libraries listed below.
- ☆44Nov 17, 2024Updated last year
- Efficiently computing & storing token n-grams from large corpora☆27Oct 6, 2024Updated last year
- ☆19Jul 31, 2025Updated 9 months ago
- Find duplicate text files.☆14Jan 14, 2025Updated last year
- SIGIR 2021: Proactive Retrieval-based Chatbots based on Relevant Knowledge and Goals☆11Jul 30, 2021Updated 4 years ago
- Example code for drawing figures in research papers☆35Mar 3, 2022Updated 4 years ago
- An extensive and commented list of resources on Learned Sparse Retrieval.☆51Apr 27, 2026Updated last week
- ☆27Mar 21, 2024Updated 2 years ago
- ☆64Apr 9, 2024Updated 2 years ago
- Hugging Face and Pyserini interoperability☆19May 18, 2023Updated 2 years ago
- KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality☆44Dec 1, 2025Updated 5 months ago
- Advanced Reasoning Benchmark Dataset for LLMs☆47Nov 19, 2023Updated 2 years ago
- JAX notebook showing how to LoRA + GPTQ arbitrary models☆10Aug 8, 2023Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks☆101Apr 2, 2024Updated 2 years ago
- ☆19Oct 27, 2025Updated 6 months ago
- Scaling Data-Constrained Language Models☆343Jun 28, 2025Updated 10 months ago
- ☆23Aug 7, 2023Updated 2 years ago
- From Easy to Hard: A Dual Curriculum Learning Framework for Context-Aware Document Ranking☆14Oct 25, 2022Updated 3 years ago
- ☆16Apr 1, 2026Updated last month
- ☆12Mar 22, 2024Updated 2 years ago
- Compress conventional Vision-Language Pre-training data☆52Sep 22, 2023Updated 2 years ago
- Minimum Description Length probing for neural network representations☆20Jan 28, 2025Updated last year
- [ICCV23] Official implementation of eP-ALM: Efficient Perceptual Augmentation of Language Models.☆27Oct 27, 2023Updated 2 years ago
- ☆23Jul 27, 2023Updated 2 years ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…☆30May 23, 2024Updated last year
- An unsorted collection of little tools and scripts I've made that don't fit anywhere else☆19Jul 15, 2022Updated 3 years ago
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference…☆31Nov 14, 2023Updated 2 years ago
- Multiple ways of chunking for data deduplication: Fixed size chunking, Content defined chunking, and File based chunking.☆19Dec 20, 2013Updated 12 years ago
- ☆81Updated this week
- [ACL'21] Dialogue Response Selection with Hierarchical Curriculum Learning☆21Nov 15, 2022Updated 3 years ago
- RAG methods, benchmarks, and toolkits☆19Nov 28, 2024Updated last year
- Zhiyu AI (知予人工智能): From Learner to Researcher☆13Jan 20, 2025Updated last year
- Document Sequence with Subtopic Attention☆19Mar 24, 2023Updated 3 years ago
- Repo for paper "Agentic-R: Learning to Retrieve for Agentic Search" (ACL 2026 Findings)☆79Apr 9, 2026Updated 3 weeks ago
- Codes and files for the paper Are Emergent Abilities in Large Language Models just In-Context Learning☆33Jan 9, 2025Updated last year
- a Hadoop Map Reduce application that retrieves data/articles related to sports from sources like NY Times, Commoncrawl, and Twitter and c…☆13Oct 3, 2019Updated 6 years ago
- Code Release for "Broken Neural Scaling Laws" (BNSL) paper☆59Oct 29, 2023Updated 2 years ago
- An evaluation suite for Retrieval-Augmented Generation (RAG).☆23Apr 26, 2025Updated last year
- An all-in-one framework for Ad-hoc Information Retrieval.☆18Apr 3, 2024Updated 2 years ago