Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆19 · Aug 28, 2023 · Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- Find near-duplicate documents using minhashing implemented in Go. ☆16 · Dec 22, 2015 · Updated 10 years ago
- Get a list of deduped files on a ZFS filesystem ☆13 · Oct 14, 2020 · Updated 5 years ago
- ☆39 · Jul 28, 2023 · Updated 2 years ago
- RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems ☆17 · May 25, 2020 · Updated 5 years ago
- Deduplication for cfDNA sequencing data ☆11 · Jul 5, 2017 · Updated 8 years ago
- Create snapshot commits on a not checked-out branch without touching the working tree or losing staged changes ☆17 · Updated this week
- Tool to detect (and get rid of) similar images using perceptual hashing (pHash lib) ☆84 · Nov 6, 2016 · Updated 9 years ago
- ☆14 · Dec 9, 2021 · Updated 4 years ago
- Visual hashes ☆26 · Mar 21, 2017 · Updated 9 years ago
- 🕹️ Group and deduplicate concurrent tasks ☆29 · Jan 1, 2026 · Updated 2 months ago
- Rabin hashing and content-defined chunking for Go ☆20 · Sep 11, 2017 · Updated 8 years ago
- A React/MUI component to visualize and explore RDF entities ☆11 · Oct 15, 2024 · Updated last year
- POSIX-compliant Linux shell utility designed to search files based on their extended attributes. ☆13 · Sep 17, 2022 · Updated 3 years ago
- AVR-based monitoring of home electricity consumption ☆15 · Jan 4, 2026 · Updated 2 months ago
- A music spectrum analyser and visualisation program for squeezelite ☆14 · Jan 30, 2023 · Updated 3 years ago
- Variable-sized block deduplication archival backed by Plan9's venti ☆17 · Jul 15, 2024 · Updated last year
- Init and management script for mounting rewritable squashfs-compressed data ☆45 · Jun 20, 2025 · Updated 9 months ago
- Use clonefile to deduplicate files on APFS. ☆56 · Apr 22, 2020 · Updated 5 years ago
- A tool & Python 3 library to decompress anything ☆12 · Jan 24, 2021 · Updated 5 years ago
- Multiple ways of chunking for data deduplication: Fixed size chunking, Content defined chunking, and File based chunking. ☆19 · Dec 20, 2013 · Updated 12 years ago
- Pytest plugin type-checking tests, fixtures, and/or your codebase with @beartype. ☆23 · Mar 3, 2026 · Updated 2 weeks ago
- data related codebase for polyglot project ☆19 · Mar 30, 2023 · Updated 2 years ago
- Utility to list duplicate files in one or more directories based on the file contents ☆24 · Sep 23, 2024 · Updated last year
- A Python FUSE file system that features transparent deduplication and compression which make it ideal for archiving backups. ☆139 · Jul 22, 2010 · Updated 15 years ago
- Scripts to build openrisc toolchain and bootable filesystem ☆12 · Sep 15, 2014 · Updated 11 years ago
- Manipulate tar file metadata, list tar files or convert tar to cpio. For some projects, this can replace fakeroot and cpio, when creating… ☆32 · Feb 11, 2026 · Updated last month
- A python library / model for creating co-references between AMR graph nodes. ☆11 · Dec 11, 2022 · Updated 3 years ago
- super-Django-CC is a simple web interface for commoncrawl.org ☆15 · Dec 8, 2022 · Updated 3 years ago
- Copilot with deepseek and more... ☆13 · Mar 7, 2025 · Updated last year
- FastCDC implementation in Python https://pypi.org/project/fastcdc/ ☆63 · Jun 27, 2024 · Updated last year
- A Golang package that implements CDC chunkers with a generic interface ☆121 · Jan 22, 2026 · Updated 2 months ago
- level2-nlp-generationfornlp-nlp-05-lv3 created by GitHub Classroom ☆14 · Jan 5, 2025 · Updated last year
- Check duplicated files ☆25 · Oct 9, 2018 · Updated 7 years ago
- Supercharged pandas indexing ☆11 · Mar 28, 2021 · Updated 4 years ago
- Custom AppleScript libraries providing a variety of utilities ☆17 · Sep 11, 2023 · Updated 2 years ago
- Fast, lightweight MaxMind GeoIP lookup server written in Rust ☆16 · Mar 10, 2026 · Updated last week
- Go implementation of the FastCDC content-defined chunking algorithm ☆82 · Aug 14, 2023 · Updated 2 years ago
- GFS: a Graph-based File System Enhanced with Semantic Features ☆29 · May 27, 2021 · Updated 4 years ago
- Datasource Components for KnockoutJs for paging, sorting and filtering remote sources. ☆25 · Jul 25, 2013 · Updated 12 years ago