Pile Deduplication Code
☆18 · May 15, 2023 · Updated 2 years ago
Alternatives and similar repositories for pile_dedupe
Users who are interested in pile_dedupe are comparing it to the repositories listed below
- ☆19 · Jul 31, 2025 · Updated 7 months ago
- ☆23 · Jan 27, 2026 · Updated last month
- Unit Scaling demo and experimentation code ☆16 · Mar 12, 2024 · Updated last year
- Minimum Description Length probing for neural network representations ☆20 · Jan 28, 2025 · Updated last year
- Hugging Face and Pyserini interoperability ☆19 · May 18, 2023 · Updated 2 years ago
- ☆44 · Nov 17, 2024 · Updated last year
- ☆22 · Jul 27, 2023 · Updated 2 years ago
- Advanced Reasoning Benchmark Dataset for LLMs ☆47 · Nov 19, 2023 · Updated 2 years ago
- Efficiently computing & storing token n-grams from large corpora ☆26 · Oct 6, 2024 · Updated last year
- ☆27 · Mar 21, 2024 · Updated last year
- Language models scale reliably with over-training and on downstream tasks ☆100 · Apr 2, 2024 · Updated last year
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 · May 23, 2024 · Updated last year
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference… ☆31 · Nov 14, 2023 · Updated 2 years ago
- Scaling Data-Constrained Language Models ☆342 · Jun 28, 2025 · Updated 8 months ago
- [ICCV23] Official implementation of eP-ALM: Efficient Perceptual Augmentation of Language Models. ☆27 · Oct 27, 2023 · Updated 2 years ago
- Code for the ICLR 2024 paper "How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions" ☆71 · Jun 19, 2024 · Updated last year
- ☆64 · Apr 9, 2024 · Updated last year
- A red teaming agent ☆18 · Oct 15, 2025 · Updated 4 months ago
- Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge" ☆78 · Apr 12, 2023 · Updated 2 years ago
- ☆78 · Dec 7, 2023 · Updated 2 years ago
- never forget anything again! combine AI and intelligent tooling for a local knowledge base to track catalogue, annotate, and plan for you… ☆37 · May 14, 2024 · Updated last year
- Repo to reproduce the First-Explore paper results ☆39 · Dec 25, 2024 · Updated last year
- ☆35 · Feb 5, 2024 · Updated 2 years ago
- ☆43 · Oct 13, 2023 · Updated 2 years ago
- Continual Resilient (CoRe) Optimizer for PyTorch ☆11 · Jun 10, 2024 · Updated last year
- Codes and files for the paper Are Emergent Abilities in Large Language Models just In-Context Learning ☆33 · Jan 9, 2025 · Updated last year
- An Educational Framework Based on PyTorch for Deep Learning Education and Exploration ☆10 · Dec 24, 2023 · Updated 2 years ago
- Libraries, guides, blueprints, and sample code, to enable rapidly building 0-1 applications on iOS, Android and web. ☆11 · May 12, 2023 · Updated 2 years ago
- Official code for the paper "Attention as a Hypernetwork" ☆51 · Feb 24, 2026 · Updated last week
- Understanding how features learned by neural networks evolve throughout training ☆41 · Oct 24, 2024 · Updated last year
- Failsafe value retrieval, modification and utils using json-pointer spec ☆14 · Dec 20, 2025 · Updated 2 months ago
- A RAG that can scale 🧑🏻‍💻 ☆11 · May 28, 2024 · Updated last year
- A Model Agnostic function to directly remove specified layers from the LLM ☆10 · May 23, 2024 · Updated last year
- [NeurIPS 2025] Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO ☆79 · Oct 29, 2025 · Updated 4 months ago
- A conditional expression compiler ☆15 · Jun 26, 2025 · Updated 8 months ago
- [ACL 2023] Counterspeeches up my sleeve! Intent Distribution Learning and Persistent Fusion for Intent-Conditioned Counterspeech Generati… ☆10 · Sep 23, 2023 · Updated 2 years ago
- Implementation for paper2repo ☆11 · Dec 7, 2020 · Updated 5 years ago
- Simple-to-use scoring function for arbitrarily tokenized texts. ☆47 · Feb 19, 2025 · Updated last year
- Prompt + regex lab ☆10 · Nov 22, 2023 · Updated 2 years ago