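The document deduplication feature addresses the problem of semantically duplicate documents in search engines, using a semantic fingerprint algorithm based on multiple hashing.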
☆12Aug 17, 2013Updated 12 years ago
Alternatives and similar repositories for deduplication-detecting
Users that are interested in deduplication-detecting are comparing it to the libraries listed below.
- A Python script for fast deduplication of text data at the scale of tens of millions of records☆19Mar 14, 2016Updated 10 years ago
- A merged read deduplication tool capable of performing merged read deduplication on single-end data.☆12Sep 4, 2024Updated last year
- Deduplication for cfDNA sequencing data☆11Jul 5, 2017Updated 8 years ago
- Get a list of deduped files on a ZFS filesystem☆13Oct 14, 2020Updated 5 years ago
- A Python tool to search for and remove duplicated files in messy datasets☆16Dec 23, 2024Updated last year
- This is the source code for Efficient Sequential Recommendation for Long Term User Interest Via Personalization.☆26Nov 18, 2025Updated 4 months ago
- default visualizations that come packaged with the lightning viz notebook☆12Apr 18, 2016Updated 9 years ago
- Chinese word segmentation and new-word discovery based on mutual information and adjacency entropy☆14Jan 22, 2019Updated 7 years ago
- String deduplication package for Go☆19Jan 10, 2024Updated 2 years ago
- Trains an LDA (Latent Dirichlet Allocation) model with the gensim module to compute the similarity between long and short texts.☆12Nov 25, 2020Updated 5 years ago
- Find duplicate text files.☆15Jan 14, 2025Updated last year
- Pile Deduplication Code☆18May 15, 2023Updated 2 years ago
- RapidCDC: Leveraging Duplicate Locality to Accelerate Chunking in CDC-based Deduplication Systems☆17May 25, 2020Updated 5 years ago
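- A text deduplication algorithm, developed for deduplicating news in a recommender system. It uses Yahoo's near-duplicates and shingling algorithm; the server is implemented in C and the client in Java, and they communicate through the Thrift framework. To improve extensibility, deduplication can run on the server, and the server also exposes computation interfaces so that clients can build their own extensions.☆24Feb 25, 2014Updated 12 years ago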
- ☆18Apr 3, 2023Updated 2 years ago
- Find near-duplicate documents using minhashing implemented in Go.☆16Dec 22, 2015Updated 10 years ago
- Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.☆19Aug 28, 2023Updated 2 years ago
- Python MaxMind DB writer☆14Apr 3, 2019Updated 6 years ago
- 🕹️ Group and deduplicate concurrent tasks☆29Jan 1, 2026Updated 2 months ago
- ☆21May 24, 2016Updated 9 years ago
- Rabin hashing and content-defined chunking for Go☆20Sep 11, 2017Updated 8 years ago
- Implementation of DSGAN (not fully completed)☆13Dec 28, 2019Updated 6 years ago
- Performs a poor man's file deduplication recursively on a directory. Deletes duplicate files, and creates symbolic links in their place.☆31Dec 2, 2011Updated 14 years ago
- Express4+angularjs1.4+bootstrap3+mysql+nodejs☆20Sep 11, 2018Updated 7 years ago
- ☆18Jun 27, 2016Updated 9 years ago
- A Python FUSE file system that features transparent deduplication and compression which make it ideal for archiving backups.☆139Jul 22, 2010Updated 15 years ago
- Python library and dashboard for hyperparameter search and model training for computer vision tasks based on PyTorch, Optuna, FiftyOne, D…☆17Jul 14, 2023Updated 2 years ago
- Content Defined Chunking playground☆50Mar 22, 2026Updated last week
- A Go library implementing a buzhash rolling hash function☆31Aug 16, 2016Updated 9 years ago
- This repo contains the codebase for the paper "Unifying Generative and Dense Retrieval for Sequential Recommendation".☆35Jun 16, 2025Updated 9 months ago
- Fast duplicate file detection library☆26Jan 5, 2017Updated 9 years ago
- The pytorch implementation of relational extraction models with PCNN feature extractor and multi-instance learning☆16Mar 8, 2018Updated 8 years ago
- ✨ Epris is a JavaScript library that simplifies interface development☆26May 30, 2022Updated 3 years ago
- Rabin fingerprinting and deduplication library in C☆28Feb 16, 2016Updated 10 years ago
- ☆24Nov 23, 2025Updated 4 months ago
- Multiple ways of chunking for data deduplication: Fixed size chunking, Content defined chunking, and File based chunking.☆19Dec 20, 2013Updated 12 years ago
- One-click script for shadowsocks panel nodes, latest libev version☆18Feb 1, 2019Updated 7 years ago
- Utility to list duplicate files in one or more directories based on the file contents☆24Sep 23, 2024Updated last year
- ☆10Feb 19, 2021Updated 5 years ago