TeamHG-Memex/page-compare

Simple heuristic for measuring web page similarity (& data set)

☆91

Alternatives and similar repositories for page-compare

Users who are interested in page-compare are comparing it to the libraries listed below.

matiskay / html-similarity
Compare HTML similarity using structural and style metrics
☆219 · Updated this week
TeamHG-Memex / extract-html-diff
Extract the difference between two HTML pages
☆33 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆24 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / url-summary
Show a summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / Formasaurus
Formasaurus tells you the type of an HTML form and its fields using machine learning
☆121 · Apr 8, 2026 · Updated 3 months ago
ContinuumIO / scrapy_scrapers
Scraper built with Scrapy.
☆18 · Jun 25, 2026 · Updated last month
TeamHG-Memex / docker-tor-rotator
A rotating SOCKS proxy using Tor, Delegate and HAProxy
☆14 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / tor-proxy
A Tor SOCKS proxy Docker image
☆12 · Apr 8, 2026 · Updated 3 months ago
nasa-jpl-memex / elwha
Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess…
☆17 · Sep 11, 2015 · Updated 10 years ago
mitll / graph-qube
Pattern-of-Behavior Search Tool
☆11 · Jun 20, 2022 · Updated 4 years ago
TeamHG-Memex / imageSimilarity
Given a new image, determine if it is likely derived from a known image.
☆21 · Apr 8, 2026 · Updated 3 months ago
concordusapps / python-hocr
HOCR manipulation and utility library; provides hocr2pdf binary.
☆14 · Mar 5, 2018 · Updated 8 years ago
bertdida / selenium-wpupdater
Automate The Boring Stuff: Updating WordPress
☆13 · Jun 1, 2021 · Updated 5 years ago
snap-stanford / ringo
Next generation graph processing platform
☆12 · Aug 26, 2016 · Updated 9 years ago
TeamHG-Memex / html-text
Extract text from HTML
☆135 · Apr 8, 2026 · Updated 3 months ago
nasa-jpl-memex / memex-gate
General Architecture for Text Engineering
☆50 · Mar 23, 2016 · Updated 10 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆28 · Apr 8, 2026 · Updated 3 months ago
tomazk / Text-Extraction-Evaluation
Framework for evaluating text extraction algorithms implemented as web services
☆42 · Jun 30, 2012 · Updated 14 years ago
SniperOJ / Jeopardy-Platform-Run
Source code of SniperOJ running on server right now
☆12 · Oct 23, 2018 · Updated 7 years ago
USCDataScience / autoextractor
A toolkit for clustering web pages based on various similarity measures.
☆34 · Oct 27, 2021 · Updated 4 years ago
draperlaboratory / user-ale
The User Activity Logging Engine, or User-ALE, is a logging mechanism used to quantitatively assess the behavioural and cognitive state o…
☆13 · Aug 26, 2016 · Updated 9 years ago
TeamHG-Memex / undercrawler
A generic crawler
☆81 · Apr 8, 2026 · Updated 3 months ago
stummjr / scrapy-fieldstats
A Scrapy extension to log items coverage when the spider shuts down
☆18 · Apr 11, 2020 · Updated 6 years ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
sammyer / BoilerPy
Python port of Boilerpipe library
☆16 · Apr 6, 2018 · Updated 8 years ago
btaille / contener
Code for "Contextualized Embeddings in Named-Entity Recognition", ECIR 2020
☆13 · Jul 25, 2024 · Updated 2 years ago
TeamHG-Memex / scrapy-crawl-once
Scrapy middleware that allows crawling only new content
☆80 · Apr 8, 2026 · Updated 3 months ago
plamenbbn / XDATA
PINT Algorithm for XDATA
☆21 · Nov 29, 2016 · Updated 9 years ago
scrapy / xtractmime
https://mimesniff.spec.whatwg.org/ implementation for Python
☆13 · Jul 9, 2026 · Updated 2 weeks ago
kmscom / Browser-Cache-Folder-Changer
This repository distributes a Windows application that lets the user change the cache folder path of popular web browsers.
☆10 · Sep 29, 2025 · Updated 10 months ago
shaneaevans / psearch
Prospective search for Python
☆26 · Mar 7, 2026 · Updated 4 months ago
edgi-govdata-archiving / web-monitoring-processing
Tools for accessing, "diff"-ing, and analyzing archived web pages
☆23 · Updated this week
TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 · Apr 8, 2026 · Updated 3 months ago
vidar-team / hctf_backend
☆47 · Dec 7, 2022 · Updated 3 years ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
NextCenturyCorporation / neon-gtd
Neon Geo-temporal Dashboard
☆14 · Jan 10, 2020 · Updated 6 years ago
kedz / newsblaster
Group workspace for improvements to the Columbia Newsblaster system.
☆31 · May 12, 2016 · Updated 10 years ago
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
andreif / epf
Apple EPF crawler, downloader and parser
☆15 · Jan 7, 2017 · Updated 9 years ago