Pile Deduplication Code
☆18May 15, 2023Updated 2 years ago
Alternatives and similar repositories for pile_dedupe
Users that are interested in pile_dedupe are comparing it to the libraries listed below.
- ☆44Nov 17, 2024Updated last year
- ☆24Jan 27, 2026Updated last month
- A merged read deduplication tool capable of performing merged read deduplication on single-end data.☆12Sep 4, 2024Updated last year
- Awesome Reinforcement Learning from Human Feedback, the secret behind ChatGPT XD☆23Dec 13, 2022Updated 3 years ago
- Deduplication for cfDNA sequencing data☆11Jul 5, 2017Updated 8 years ago
- Get a list of deduped files on a ZFS filesystem☆13Oct 14, 2020Updated 5 years ago
- ☆19Jul 31, 2025Updated 7 months ago
- Using queues, tqdm-multiprocess supports multiple worker processes, each with multiple tqdm progress bars, displaying them cleanly throug…☆43Jan 6, 2021Updated 5 years ago
- String deduplication package for Go☆19Jan 10, 2024Updated 2 years ago
- ☆78Dec 7, 2023Updated 2 years ago
- See the issue board for the current status of active and prospective projects!☆65Feb 12, 2022Updated 4 years ago
- An awesome list of Causality and Machine Learning related papers, books and other resources.☆12Nov 13, 2023Updated 2 years ago
- Find duplicate text files.☆15Jan 14, 2025Updated last year
- SIGIR 2021: Proactive Retrieval-based Chatbots based on Relevant Knowledge and Goals☆11Jul 30, 2021Updated 4 years ago
- Some example codes for drawing figures in research paper☆35Mar 3, 2022Updated 4 years ago
- RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems☆17May 25, 2020Updated 5 years ago
- ☆27Mar 21, 2024Updated 2 years ago
- Find near-duplicate documents using minhashing implemented in Go.☆16Dec 22, 2015Updated 10 years ago
- ☆64Apr 9, 2024Updated last year
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆19Aug 28, 2023Updated 2 years ago
- Fork of kingoflolz/mesh-transformer-jax with memory usage optimizations and support for GPT-Neo, GPT-NeoX, BLOOM, OPT and fairseq dense L…☆22Nov 14, 2022Updated 3 years ago
- Unit Scaling demo and experimentation code☆16Mar 12, 2024Updated 2 years ago
- 🕹️ Group and deduplicate concurrent tasks☆29Jan 1, 2026Updated 2 months ago
- Rabin hashing and content-defined chunking for Go☆20Sep 11, 2017Updated 8 years ago
- KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality☆40Dec 1, 2025Updated 3 months ago
- Advanced Reasoning Benchmark Dataset for LLMs☆47Nov 19, 2023Updated 2 years ago
- Language models scale reliably with over-training and on downstream tasks☆101Apr 2, 2024Updated last year
- Scaling Data-Constrained Language Models☆342Jun 28, 2025Updated 8 months ago
- ☆23Aug 7, 2023Updated 2 years ago
- From Easy to Hard: A Dual Curriculum Learning Framework for Context-Aware Document Ranking☆14Oct 25, 2022Updated 3 years ago
- Down-to-earth large-model engineering, aiming to become a practical encyclopedia of large models☆16Oct 16, 2023Updated 2 years ago
- ☆16Sep 29, 2024Updated last year
- Analytical solutions for logistic regression and single-layer softmax☆12Jul 29, 2021Updated 4 years ago
- ☆12Mar 22, 2024Updated 2 years ago
- ☆62Updated this week
- Code for our SIGIR 2022 paper. CONVINSE is a framework for conversational question answering (ConvQA) over heterogeneous information sour…☆12Oct 4, 2023Updated 2 years ago
- [ACL 2024] Benchmarking Knowledge Boundary for Large Language Models: A Different Perspective on Model Evaluation☆10May 26, 2024Updated last year
- A compiler written in Java to compile a subset of instructions called MiniJava.☆10Apr 20, 2015Updated 10 years ago
- Code for the paper "Curriculum Learning Strategies for IR: An Empirical Study on Conversation Response Ranking" at ECIR'20☆17Dec 8, 2022Updated 3 years ago