Simple heuristic for measuring web page similarity (& data set)
☆91 · Apr 8, 2026 · Updated 2 months ago
Alternatives and similar repositories for page-compare
Users that are interested in page-compare are comparing it to the libraries listed below.
- Compare html similarity using structural and style metrics ☆218 · May 11, 2023 · Updated 3 years ago
- extract difference between two html pages ☆33 · Apr 8, 2026 · Updated 2 months ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆24 · Apr 8, 2026 · Updated 2 months ago
- code and data used to build a training dataset for dragnet models ☆10 · Nov 29, 2020 · Updated 5 years ago
- Show summary of a large number of URLs in a Jupyter Notebook ☆19 · Apr 8, 2026 · Updated 2 months ago
- Formasaurus tells you the type of an HTML form and its fields using machine learning ☆121 · Apr 8, 2026 · Updated 2 months ago
- Scraper built with Scrapy. ☆18 · Aug 14, 2024 · Updated last year
- A rotating socks proxy using Tor, Delegate and Haproxy ☆14 · Apr 8, 2026 · Updated 2 months ago
- ☆21 · Jan 23, 2016 · Updated 10 years ago
- Vizlinc ☆15 · Jan 14, 2016 · Updated 10 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess… ☆17 · Sep 11, 2015 · Updated 10 years ago
- Tools for scraping of twitter data, conversion, text analysis and graph construction ☆11 · Aug 1, 2016 · Updated 9 years ago
- Pattern-of-Behavior Search Tool ☆11 · Jun 20, 2022 · Updated 3 years ago
- Extract text from HTML ☆135 · Apr 8, 2026 · Updated 2 months ago
- Automate The Boring Stuff: Updating WordPress ☆13 · Jun 1, 2021 · Updated 5 years ago
- A component that tries to avoid downloading duplicate content ☆28 · Apr 8, 2026 · Updated 2 months ago
- Open source code for MobiPurpose project ☆13 · Mar 25, 2025 · Updated last year
- General Architecture for Text Engineering ☆50 · Mar 23, 2016 · Updated 10 years ago
- Framework for evaluating text extraction algorithms implemented as web services ☆42 · Jun 30, 2012 · Updated 13 years ago
- ☆25 · Jan 26, 2016 · Updated 10 years ago
- A generic crawler ☆79 · Apr 8, 2026 · Updated 2 months ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 · Oct 27, 2021 · Updated 4 years ago
- Pipeline for distributed Natural Language Processing, made in Python ☆64 · Jan 31, 2017 · Updated 9 years ago
- The User Activity Logging Engine, or User-ALE, is a logging mechanism used to quantitatively assess the behavioural and cognitive state o… ☆13 · Aug 26, 2016 · Updated 9 years ago
- Scrapy middleware which allows to crawl only new content ☆80 · Apr 8, 2026 · Updated 2 months ago
- Automatic Item List Extraction ☆85 · Jun 15, 2016 · Updated 10 years ago
- Python port of Boilerpipe library ☆16 · Apr 6, 2018 · Updated 8 years ago
- Group workspace for improvements to the Columbia Newsblaster system. ☆31 · May 12, 2016 · Updated 10 years ago
- Scrapy middleware for the autologin ☆36 · Apr 8, 2026 · Updated 2 months ago
- Web page similarity detection: determines page similarity based on HTML page structure; can be used for similarity calculation, unauthorized-access detection, and similar tasks ☆281 · Jul 27, 2019 · Updated 6 years ago
- This repository distributes a Windows application using which the user can change the cache folder path of popular web browsers. ☆10 · Sep 29, 2025 · Updated 8 months ago
- Web Crawling UI and HTTP API, based on Scrapy and Tornado ☆162 · Apr 8, 2026 · Updated 2 months ago
- ☆47 · Dec 7, 2022 · Updated 3 years ago
- Neon Geo-temporal Dashboard ☆14 · Jan 10, 2020 · Updated 6 years ago
- ☆16 · Apr 10, 2026 · Updated 2 months ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti… ☆24 · Jul 5, 2016 · Updated 9 years ago
- Splash + HAProxy + Docker Compose ☆195 · Apr 8, 2026 · Updated 2 months ago
- SmallK: very fast data clustering tools ☆13 · Apr 3, 2019 · Updated 7 years ago
- ☆15 · Nov 9, 2020 · Updated 5 years ago