seomoz / simhash-cluster
A cluster implementation of simhash near-duplicate detection
☆32Updated 9 years ago
Related projects
Alternatives and complementary repositories for simhash-cluster
- Python API for Various DB-Backed Simhash Clusters☆64Updated 7 years ago
- Distributed text analysis suite based on Celery☆95Updated last year
- Non-Overlapping Aho-Corasick Python extension, for Python 2 (str and unicode) and Python 3☆50Updated 9 years ago
- An efficient simhash implementation for Python☆125Updated 5 years ago
- A readability parser which can extract title, content, images from html pages☆86Updated 4 years ago
- An easy-install script for LibShortText☆27Updated 9 years ago
- Different approaches to computing document similarity☆28Updated 7 years ago
- A Python implementation of DEPTA☆83Updated 7 years ago
- Replication software, data, and supplementary materials for the paper: O'Connor, Stewart and Smith, ACL-2013, "Learning to Extract Intern…☆26Updated 3 years ago
- Standalone Semanticizer☆32Updated 9 years ago
- Pure Python NLP toolkit☆55Updated 8 years ago
- A GBDT(MART) and LambdaMART training and predicting package☆15Updated 9 years ago
- Tag documents using the top-N words from LDA☆10Updated 9 years ago
- The experiment software underlying two papers published at ECIR-2015 and SEMEVAL-2015.☆37Updated 9 years ago
- tyccl (同义词词林) is a Ruby gem that provides friendly functions for analyzing similarity between Chinese words.☆46Updated 10 years ago
- Zipfian capstone project - Dan Morris☆30Updated 7 years ago
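- Chinese tokenizer and new-word finder: a three-stage mechanical Chinese word segmentation algorithm and an algorithm for discovering out-of-vocabulary new words☆95Updated 8 years ago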
- Yet another Chinese word segmentation package based on character-based tagging heuristics and CRF algorithm☆243Updated 11 years ago
- A simple and fast search engine☆70Updated 2 years ago
- Automatically generate Chinese words from huge text.☆24Updated 10 years ago
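- python-segment is a word segmentation library implemented in pure Python. Its goal is to provide a usable, complete segmentation system and training environment, including a usable dictionary.☆17Updated 11 years ago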
- A tool for semantic relation extraction. The program finds pairs of semantically related words based on the text definitions coming from …☆28Updated 10 years ago
- A Python package for pullword.com☆83Updated 4 years ago
- A distributed, high-performance Word2Vec implementation☆15Updated 9 years ago
- A Chinese Words Segmentation Tool Based on Bayes Model☆78Updated 11 years ago
- Output Scrapy statistics to Graphite/Carbon☆54Updated 11 years ago
- Code for KDD 2014 paper "Mining Topics in Documents: Standing on the Shoulders of Big Data"☆21Updated 9 years ago