Simple heuristic for measuring web page similarity (& data set)
☆91Apr 8, 2026Updated last week
Alternatives and similar repositories for page-compare
Users that are interested in page-compare are comparing it to the libraries listed below.
- Compare HTML similarity using structural and style metrics☆218May 11, 2023Updated 2 years ago
- A command line tool to cluster HTML pages based on structural and style similarity.☆20Jan 13, 2026Updated 3 months ago
- Extract the difference between two HTML pages☆33Apr 8, 2026Updated last week
- Show summary of a large number of URLs in a Jupyter Notebook☆19Apr 8, 2026Updated last week
- Formasaurus tells you the type of an HTML form and its fields using machine learning☆121Apr 8, 2026Updated last week
- ☆21Jan 23, 2016Updated 10 years ago
- Empirical Time and Memory Complexity Estimation☆11Jun 17, 2019Updated 6 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess…☆17Sep 11, 2015Updated 10 years ago
- Tools for scraping of twitter data, conversion, text analysis and graph construction☆11Aug 1, 2016Updated 9 years ago
- Extract text from HTML☆135Apr 8, 2026Updated last week
- Next generation graph processing platform☆12Aug 26, 2016Updated 9 years ago
- A component that tries to avoid downloading duplicate content☆28Apr 8, 2026Updated last week
- General Architecture for Text Engineering☆50Mar 23, 2016Updated 10 years ago
- Source code of SniperOJ running on server right now☆12Oct 23, 2018Updated 7 years ago
- ☆25Jan 26, 2016Updated 10 years ago
- A generic crawler☆79Apr 8, 2026Updated last week
- A toolkit for clustering web pages based on various similarity measures.☆34Oct 27, 2021Updated 4 years ago
- Pipeline for distributed Natural Language Processing, made in Python☆65Jan 31, 2017Updated 9 years ago
- The User Activity Logging Engine, or User-ALE, is a logging mechanism used to quantitatively assess the behavioural and cognitive state o…☆13Aug 26, 2016Updated 9 years ago
- A Scrapy extension to log items coverage when the spider shuts down☆19Apr 11, 2020Updated 6 years ago
- Scrapy middleware that allows crawling only new content☆79Apr 8, 2026Updated last week
- Automatic Item List Extraction☆86Jun 15, 2016Updated 9 years ago
- BiLSTM+CRF☆10Jan 15, 2019Updated 7 years ago
- Scrapy middleware for autologin☆36Apr 8, 2026Updated last week
- Group workspace for improvements to the Columbia Newsblaster system.☆31May 12, 2016Updated 9 years ago
- Web Crawling UI and HTTP API, based on Scrapy and Tornado☆160Apr 8, 2026Updated last week
- Web page similarity detection: determines page similarity based on HTML page structure; can be used for similarity calculation, unauthorized-access detection, etc.☆283Jul 27, 2019Updated 6 years ago
- ☆47Dec 7, 2022Updated 3 years ago
- ☆16Apr 10, 2026Updated last week
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti…☆24Jul 5, 2016Updated 9 years ago
- Splash + HAProxy + Docker Compose☆195Apr 8, 2026Updated last week
- USC GoFFish Graph Analytics Framework☆33Jul 10, 2014Updated 11 years ago
- SmallK: very fast data clustering tools☆14Apr 3, 2019Updated 7 years ago
- SNAP repository for Ringo☆14Jul 25, 2017Updated 8 years ago
- Code and templates required to build the DARPA open catalog.☆18Mar 23, 2016Updated 10 years ago
- Python binding for gumbo-parser using Cython☆14Aug 16, 2016Updated 9 years ago
- Library to populate items using XPath and CSS with a convenient API☆48Jan 29, 2026Updated 2 months ago
- A crawler for http://books.toscrape.com☆42Aug 8, 2023Updated 2 years ago
- Seed acquisition tool to bootstrap focused crawlers☆23Apr 24, 2017Updated 8 years ago