Simple heuristic for measuring web page similarity (& data set)
☆90 · Feb 23, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for page-compare
Users who are interested in page-compare compare it to the libraries listed below.
- Compare HTML similarity using structural and style metrics ☆218 · May 11, 2023 · Updated 2 years ago
- A command-line tool to cluster HTML pages based on structural and style similarity. ☆20 · Jan 13, 2026 · Updated last month
- Extract the differences between two HTML pages ☆33 · Feb 10, 2026 · Updated 3 weeks ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆24 · Feb 10, 2026 · Updated 3 weeks ago
- Show summary of a large number of URLs in a Jupyter Notebook ☆17 · Feb 10, 2026 · Updated 3 weeks ago
- Formasaurus tells you the type of an HTML form and its fields using machine learning ☆120 · Feb 23, 2026 · Updated 2 weeks ago
- A Scrapy extension to log item coverage when the spider shuts down ☆19 · Apr 11, 2020 · Updated 5 years ago
- Given a new image, determine if it is likely derived from a known image. ☆20 · Feb 10, 2026 · Updated 3 weeks ago
- A generic crawler ☆79 · Feb 10, 2026 · Updated 3 weeks ago
- Framework for evaluating text extraction algorithms implemented as web services ☆42 · Jun 30, 2012 · Updated 13 years ago
- Pipeline for distributed Natural Language Processing, made in Python ☆65 · Jan 31, 2017 · Updated 9 years ago
- ☆12 · Apr 7, 2015 · Updated 10 years ago
- Scraper built with Scrapy. ☆18 · Aug 14, 2024 · Updated last year
- ☆21 · Jan 23, 2016 · Updated 10 years ago
- Automatic Item List Extraction ☆86 · Jun 15, 2016 · Updated 9 years ago
- A component that tries to avoid downloading duplicate content ☆28 · Feb 10, 2026 · Updated 3 weeks ago
- Vizlinc ☆15 · Jan 14, 2016 · Updated 10 years ago
- Manage and load dataprotocols.org Data Packages ☆27 · Sep 17, 2015 · Updated 10 years ago
- Pattern-of-Behavior Search Tool ☆11 · Jun 20, 2022 · Updated 3 years ago
- Code and data used to build a training dataset for dragnet models ☆10 · Nov 29, 2020 · Updated 5 years ago
- ☆13 · Jun 14, 2016 · Updated 9 years ago
- Scrapy middleware that allows crawling only new content ☆79 · Feb 10, 2026 · Updated 3 weeks ago
- General Architecture for Text Engineering ☆49 · Mar 23, 2016 · Updated 9 years ago
- ☆16 · Nov 9, 2020 · Updated 5 years ago
- Automate The Boring Stuff: Updating WordPress ☆13 · Jun 1, 2021 · Updated 4 years ago
- ☆22 · Feb 29, 2024 · Updated 2 years ago
- Source code of SniperOJ running on server right now ☆12 · Oct 23, 2018 · Updated 7 years ago
- Fast structured perceptron sequential labeler ☆15 · Dec 8, 2015 · Updated 10 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 · Jan 16, 2024 · Updated 2 years ago
- Python binding for gumbo-parser using Cython ☆14 · Aug 16, 2016 · Updated 9 years ago
- ☆16 · Apr 24, 2024 · Updated last year
- Group workspace for improvements to the Columbia Newsblaster system. ☆31 · May 12, 2016 · Updated 9 years ago
- Intelligent Web Data Extractor ☆74 · Dec 5, 2022 · Updated 3 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 · Oct 27, 2021 · Updated 4 years ago
- ☆25 · Jan 26, 2016 · Updated 10 years ago
- Next generation graph processing platform ☆12 · Aug 26, 2016 · Updated 9 years ago
- Basic linked data fragments endpoint. ☆15 · Apr 20, 2017 · Updated 8 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess… ☆17 · Sep 11, 2015 · Updated 10 years ago
- ☆20 · Mar 31, 2017 · Updated 8 years ago