Pile Deduplication Code
☆18May 15, 2023Updated 3 years ago
Alternatives and similar repositories for pile_dedupe
Users that are interested in pile_dedupe are comparing it to the libraries listed below.
- ☆44Nov 17, 2024Updated last year
- ☆27Apr 1, 2026Updated 2 months ago
- A merged read deduplication tool capable of performing merged read deduplication on single-end data.☆13Sep 4, 2024Updated last year
- Awesome Reinforcement Learning from Human Feedback, the secret behind ChatGPT XD☆23Dec 13, 2022Updated 3 years ago
- Deduplication for cfDNA sequencing data☆11Jul 5, 2017Updated 8 years ago
- A Python tool to search for and remove duplicated files in messy datasets☆15Dec 23, 2024Updated last year
- ☆19Jul 31, 2025Updated 10 months ago
- Using queues, tqdm-multiprocess supports multiple worker processes, each with multiple tqdm progress bars, displaying them cleanly throug…☆43Jan 6, 2021Updated 5 years ago
- String deduplication package for Go☆19Jan 10, 2024Updated 2 years ago
- An extensive and commented list of resources on Learned Sparse Retrieval.☆61Updated this week
- Implementation of VQ-VAE with a GPT-style sampler in the JAX and Haiku ecosystem.☆11Nov 23, 2023Updated 2 years ago
- ☆27Mar 21, 2024Updated 2 years ago
- An implementation of DreamerV2 written in JAX, with support for running multiple random seeds of an experiment on a single GPU.☆18Jan 16, 2023Updated 3 years ago
- Find near-duplicate documents using minhashing implemented in Go.☆16Dec 22, 2015Updated 10 years ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆18Aug 28, 2023Updated 2 years ago
- Fork of kingoflolz/mesh-transformer-jax with memory usage optimizations and support for GPT-Neo, GPT-NeoX, BLOOM, OPT and fairseq dense L…☆22Nov 14, 2022Updated 3 years ago
- ☆64Apr 9, 2024Updated 2 years ago
- ☆11Nov 23, 2024Updated last year
- Hugging Face and Pyserini interoperability☆19May 18, 2023Updated 3 years ago
- Unit Scaling demo and experimentation code☆16Mar 12, 2024Updated 2 years ago
- A conversational LoRA for OPT 2.7b☆10Apr 28, 2023Updated 3 years ago
- Advanced Reasoning Benchmark Dataset for LLMs☆47Nov 19, 2023Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks☆101Apr 2, 2024Updated 2 years ago
- Scaling Data-Constrained Language Models☆342Jun 28, 2025Updated 11 months ago
- From Easy to Hard: A Dual Curriculum Learning Framework for Context-Aware Document Ranking☆14Oct 25, 2022Updated 3 years ago
- Down-to-earth large-model engineering, aiming to become a practical encyclopedia of large models☆16Oct 16, 2023Updated 2 years ago
- ☆12Mar 22, 2024Updated 2 years ago
- Code for our SIGIR 2022 paper. CONVINSE is a framework for conversational question answering (ConvQA) over heterogeneous information sour…☆12Oct 4, 2023Updated 2 years ago
- ☆13Mar 1, 2019Updated 7 years ago
- Python library and dashboard for hyperparameter search and model training for computer vision tasks based on PyTorch, Optuna, FiftyOne, D…☆17Jul 14, 2023Updated 2 years ago
- Code for the paper "Curriculum Learning Strategies for IR: An Empirical Study on Conversation Response Ranking" at ECIR'20☆17Dec 8, 2022Updated 3 years ago
- This repository contains scripts for conversion of data required for most commonly found Machine Learning tasks to TFRecords☆13Mar 6, 2021Updated 5 years ago
- A Go library implementing a buzhash rolling hash function☆31Aug 16, 2016Updated 9 years ago
- Compress conventional Vision-Language Pre-training data☆52Sep 22, 2023Updated 2 years ago
- Minimum Description Length probing for neural network representations☆20Jan 28, 2025Updated last year
- Code for our SIGIR 2023 paper. EXPLAIGNN provides a pipeline for conversational question answering (ConvQA) over heterogeneous sources, a…☆12Jul 15, 2023Updated 2 years ago
- [ICCV23] Official implementation of eP-ALM: Efficient Perceptual Augmentation of Language Models.☆27Oct 27, 2023Updated 2 years ago
- ☆23Jul 27, 2023Updated 2 years ago
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e…☆31May 23, 2024Updated 2 years ago