A simple algorithm for clustering web pages, suitable for crawlers
☆35 Mar 6, 2017 Updated 9 years ago
Alternatives and similar repositories for page_clustering
Users that are interested in page_clustering are comparing it to the libraries listed below.
- Find which links on a web page are pagination links ☆29 Jan 12, 2017 Updated 9 years ago
- A Python library to detect and extract listing data from HTML pages. ☆109 May 5, 2017 Updated 9 years ago
- ☆23 Apr 26, 2018 Updated 8 years ago
- Automatic Item List Extraction ☆86 Jun 15, 2016 Updated 9 years ago
- A Python implementation of DEPTA ☆83 Jan 14, 2017 Updated 9 years ago
- Visual regression testing without the flakiness. ☆31 Apr 2, 2026 Updated last month
- Scrapy spider middleware to clean up query parameters in request URLs ☆24 Jun 30, 2016 Updated 9 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 Oct 27, 2021 Updated 4 years ago
- NER toolkit for HTML data ☆259 May 3, 2024 Updated 2 years ago
- Paginating the web ☆37 Feb 11, 2014 Updated 12 years ago
- Training selenium agents to simplify UI navigation and reliability ☆32 Aug 23, 2025 Updated 8 months ago
- MongoDB extensions for Scrapy ☆44 Oct 2, 2014 Updated 11 years ago
- Create "perfect" snapshots of web pages ☆34 Apr 10, 2026 Updated 3 weeks ago
- Junit Extensions for Test Impact Analysis ☆44 Feb 7, 2023 Updated 3 years ago
- Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe… ☆55 May 21, 2024 Updated last year
- Adaptive crawler which uses Reinforcement Learning methods ☆169 Apr 8, 2026 Updated last month
- Spider templates for automatic crawlers. ☆34 Mar 26, 2026 Updated last month
- Combine multiple subscriptions into a single subscription with multiple items ☆12 Jul 29, 2021 Updated 4 years ago
- Repository for ru-syntax command line tool. ☆16 Mar 8, 2022 Updated 4 years ago
- Tool to flatten stream of JSON-like objects, configured via schema ☆33 Oct 19, 2019 Updated 6 years ago
- Scrapy environment with Tor for anonymous IP routing and Privoxy for HTTP proxy ☆20 Jul 5, 2016 Updated 9 years ago
- 🗿Stones: Persistent key-value containers, compatible with Python dict ☆17 Jul 15, 2024 Updated last year
- Package to facilitate URL clustering ☆71 Feb 24, 2016 Updated 10 years ago
- A project to attempt to automatically login to a website given a single seed ☆11 Jun 17, 2024 Updated last year
- A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks. ☆15 Feb 9, 2014 Updated 12 years ago
- Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations ☆39 May 21, 2024 Updated last year
- Extract differences between two HTML pages ☆33 Apr 8, 2026 Updated last month
- ☆13 Jul 16, 2013 Updated 12 years ago
- RUSSE: Russian Semantic Evaluation. ☆16 Mar 1, 2022 Updated 4 years ago
- Microdata schema for historical data. ☆31 Jun 12, 2012 Updated 13 years ago
- Application for transparency in the distribution and realization of village funds ☆13 Dec 9, 2015 Updated 10 years ago
- Facebook's contrib fb303 library ☆28 Jun 14, 2010 Updated 15 years ago
- Compare two different image search engine approaches developed with Deep Learning algorithms ☆14 Jan 6, 2021 Updated 5 years ago
- TwoFold (2✂︎f). Text files breathe fire. ☆23 Jan 28, 2026 Updated 3 months ago
- ☆14 Jun 27, 2019 Updated 6 years ago
- redis-cluster-dockerfile ☆11 May 18, 2015 Updated 10 years ago
- Understanding POS tags and building a POS tagger from scratch ☆11 Jun 9, 2018 Updated 7 years ago
- ☆20 Nov 16, 2014 Updated 11 years ago
- ☆19 Oct 6, 2025 Updated 7 months ago