fake-name / IntraArchiveDeduplicator
Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
☆101 · Updated last year
Alternatives and similar repositories for IntraArchiveDeduplicator:
Users who are interested in IntraArchiveDeduplicator are comparing it to the libraries listed below:
- Fast hamming-distance range searches via native GiST Indexing facility in PostgreSQL ☆170 · Updated 5 years ago
- Implementation of perceptual image hash calculation in Python ☆131 · Updated last year
- Hamming distance between hex strings in SQLite ☆25 · Updated 7 years ago
- Detect source resolution of upscaled images ☆242 · Updated 11 months ago
- WarcMiddleware lets users seamlessly download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy. ☆46 · Updated 7 years ago
- A Python binding for libpuzzle. ☆45 · Updated 4 years ago
- PostgreSQL extension for an effective similarity search || mirror of git://sigaev.ru/smlar.git || see https://www.pgcon.org/2012/schedule… ☆122 · Updated 2 months ago
- Perceptual hashing tools for detecting child sexual abuse material ☆181 · Updated 5 months ago
- A multi format lossless image optimizer that uses external tools ☆111 · Updated last month
- Python library for reading and writing warc files ☆239 · Updated 3 years ago
- Tool to detect (and get rid of) similar images using perceptual hashing (pHash lib) ☆82 · Updated 8 years ago
- Check out https://github.com/webrecorder/webrecorder for newer version matching https://webrecorder.io ☆38 · Updated 9 years ago
- Rewriting web proxy and archival tool. At this point, it just tries to download all the things. ☆202 · Updated last week
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 · Updated 7 years ago
- Implement SQLite table-valued functions with Python ☆59 · Updated last year
- Fast text chunking algorithms for Python ☆12 · Updated 4 years ago
- LZ4 bindings for python ☆106 · Updated 9 years ago
- Webrecorders DevTools Protocol Automation Library ☆17 · Updated 2 years ago
- A utility for sorting really big files. http://kmkeen.com/gz-sort/ ☆93 · Updated 6 years ago
- Serapis is a sentence identifier and modeling pipeline / built for Wordnik ☆24 · Updated 8 years ago
- A slim, non-SWIG Python adapter to CTesseract (Tesseract OCR for C). ☆24 · Updated 10 years ago
- Aviation grade news article metadata extraction ☆36 · Updated last year
- Source code of demo app for image comparison ☆74 · Updated 9 years ago
- A Python Perceptual Image Hashing Module ☆210 · Updated 2 years ago
- C language complearn library ☆45 · Updated 9 years ago
- A python implementation of DEPTA ☆83 · Updated 8 years ago
- Memory-efficient standalone server for bitmapist library ☆98 · Updated last year
- Similar images search for PostgreSQL ☆259 · Updated last year
- Tools for handling rotated Quicktime/MP4 files ☆26 · Updated 6 years ago
- Paginating the web ☆37 · Updated 11 years ago