scrapinghub/python-simhash

scrapinghub / python-simhash

An efficient simhash implementation for python

☆127

Alternatives and similar repositories for python-simhash

Users who are interested in python-simhash are comparing it to the libraries listed below.

seomoz / simhash-py
Simhash and near-duplicate detection
☆422 · May 15, 2023 · Updated 3 years ago
scrapy-plugins / scrapy-querycleaner
Scrapy spider middleware to clean up query parameters in request URLs
☆24 · Jun 30, 2016 · Updated 10 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
scrapinghub / exporters
Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations
☆39 · May 21, 2024 · Updated 2 years ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
scrapinghub / autologin
A project to attempt to automatically log in to a website given a single seed
☆11 · Jun 17, 2024 · Updated 2 years ago
hybridtheory / floc-simhash
A fast python implementation of the SimHash algorithm.
☆27 · Oct 27, 2021 · Updated 4 years ago
nnnet / superminhash
SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex
☆19 · Nov 18, 2022 · Updated 3 years ago
rug-compling / alpino-docker
Alpino in Docker
☆11 · Apr 16, 2026 · Updated 3 months ago
scrapinghub / webpager
Paginating the web
☆37 · Feb 11, 2014 · Updated 12 years ago
seomoz / simhash-db-py
Python API for Various DB-Backed Simhash Clusters
☆64 · Mar 16, 2017 · Updated 9 years ago
triandicAnt / FacebookCommunityDetection
Find community/segment in an attributed graph of Facebook data.
☆18 · Apr 20, 2017 · Updated 9 years ago
scrapinghub / webstruct
NER toolkit for HTML data
☆259 · May 3, 2024 · Updated 2 years ago
pydepta / pydepta
A python implementation of DEPTA
☆84 · Jan 14, 2017 · Updated 9 years ago
15810856129 / Simhash
Deduplicate massive amounts of text using Simhash
☆12 · Jun 2, 2018 · Updated 8 years ago
paulgb / nbgraph
Inline, interactive graphs inside jupyter/ipython notebooks
☆16 · Aug 19, 2017 · Updated 8 years ago
hslh / magpie-corpus
MAGPIE: A sense-annotated corpus of potentially idiomatic expressions
☆33 · Jun 7, 2020 · Updated 6 years ago
jannson / simhash-py
Simhash and near-duplicate detection
☆17 · Dec 6, 2013 · Updated 12 years ago
skalmadka / web-crawler
Distributed Web Crawler, Parser and Search Engine.
☆10 · Jun 16, 2016 · Updated 10 years ago
dfdeshom / indeed-spider
Indeed web crawler
☆11 · Aug 14, 2018 · Updated 7 years ago
L-Zhe / CoRPG
Code for paper Document-Level Paraphrase Generation with Sentence Rewriting and Reordering by Zhe Lin, Yitao Cai and Xiaojun Wan. This pa…
☆26 · Nov 10, 2021 · Updated 4 years ago
wmde / wikidata-mismatch-finder
A tool to review mismatches between Wikidata and External Databases
☆15 · Jul 15, 2026 · Updated last week
honnibal / text_classification
Relatively simple text classification powered by spaCy
☆41 · Oct 20, 2015 · Updated 10 years ago
scrapinghub / aduana
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆54 · May 21, 2024 · Updated 2 years ago
microth / mateplus
Extension of the mate-tools NLP pipeline
☆68 · Apr 12, 2016 · Updated 10 years ago
dbalan / pipet
Personal snippet manager, store bits of text.
☆19 · Jul 12, 2019 · Updated 7 years ago
jeschkies / nyan
NYAN is a news filtering engine written in Python and some Ruby.
☆15 · Aug 23, 2023 · Updated 2 years ago
aritter / LDA-SP
Includes Code for Inference and Evaluation of Topic Models for Selectional Preferences
☆16 · Mar 10, 2023 · Updated 3 years ago
uzh / fox
A framework for PSL inference.
☆22 · Nov 9, 2015 · Updated 10 years ago
tokestermw / spacy_kenlm
KenLM extension for spaCy 2.0.
☆16 · Dec 6, 2017 · Updated 8 years ago
Gael-Marcheville / dashboard-google-reviews
This project is a simple web application that allows you to manage your Google My Business Reviews to see, filter, and reply to them. It …
☆10 · Apr 25, 2024 · Updated 2 years ago
scrapinghub / scrapy-mosquitera
Restrict crawl and scraping scope using matchers.
☆26 · Jun 8, 2016 · Updated 10 years ago
explosion / spacy-vectors-builder
🌸 Train floret vectors
☆18 · May 4, 2023 · Updated 3 years ago
google / NeuroNER-CSPMC
☆13 · Feb 20, 2020 · Updated 6 years ago
allenai / nlpstack
NLP toolkit (tokenizer, POS-tagger, parser, etc.)
☆43 · Apr 8, 2017 · Updated 9 years ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
ritesh99rakesh / pyMIDAS
Python implementation of MIDAS: Microcluster-Based Detector of Anomalies in Edge Streams
☆38 · Jun 22, 2022 · Updated 4 years ago
scrapinghub / dateparser
python parser for human readable dates
☆2,844 · Updated this week
HandsomeHan515 / python-alipay
Use Python 3, Django, and Django REST framework to implement Alipay payment, including Alipay payments, Alipay server asynchronous notifications, and Alipay refunds
☆12 · May 26, 2018 · Updated 8 years ago