A simple algorithm for clustering web pages, suitable for crawlers
☆33 · Mar 6, 2017 · Updated 9 years ago
Alternatives and similar repositories for page_clustering
Users that are interested in page_clustering are comparing it to the libraries listed below.
- Find which links on a web page are pagination links ☆29 · Jan 12, 2017 · Updated 9 years ago
- A Python library to detect and extract listing data from HTML pages. ☆110 · May 5, 2017 · Updated 9 years ago
- ☆23 · Apr 26, 2018 · Updated 8 years ago
- Automatic Item List Extraction ☆85 · Jun 15, 2016 · Updated 10 years ago
- A python implementation of DEPTA ☆83 · Jan 14, 2017 · Updated 9 years ago
- Scrapy spider middleware to clean up query parameters in request URLs ☆24 · Jun 30, 2016 · Updated 9 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 · Oct 27, 2021 · Updated 4 years ago
- NER toolkit for HTML data ☆259 · May 3, 2024 · Updated 2 years ago
- Paginating the web ☆37 · Feb 11, 2014 · Updated 12 years ago
- MongoDB extensions for Scrapy ☆44 · Oct 2, 2014 · Updated 11 years ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 · Jul 3, 2017 · Updated 8 years ago
- Create "perfect" snapshots of web pages ☆33 · May 23, 2026 · Updated 3 weeks ago
- Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe… ☆54 · May 21, 2024 · Updated 2 years ago
- Repository for ru-syntax command line tool. ☆15 · Mar 8, 2022 · Updated 4 years ago
- Tool to flatten stream of JSON-like objects, configured via schema ☆33 · Oct 19, 2019 · Updated 6 years ago
- Scrapy environment with Tor for anonymous IP routing and Privoxy as the HTTP proxy ☆20 · Jul 5, 2016 · Updated 9 years ago
- Extract text from HTML ☆135 · Apr 8, 2026 · Updated 2 months ago
- HTML5 audio/video clipper ☆13 · Mar 7, 2018 · Updated 8 years ago
- 🗿Stones: Persistent key-value containers, compatible with Python dict ☆17 · Jul 15, 2024 · Updated last year
- Intelligent Web Data Extractor ☆74 · Dec 5, 2022 · Updated 3 years ago
- ☆10 · Apr 22, 2024 · Updated 2 years ago
- A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks. ☆15 · Feb 9, 2014 · Updated 12 years ago
- ☆10 · May 13, 2026 · Updated last month
- Implementation of Monte Carlo Word Movers Distance in Python with TensorFlow ☆12 · Sep 12, 2016 · Updated 9 years ago
- Podclips is an iOS app that allows users to cut out and share clips from their favourite podcasts ☆15 · Mar 25, 2018 · Updated 8 years ago
- Exporters is an extensible export pipeline library that supports filtering, transformation, and several sources and destinations ☆39 · May 21, 2024 · Updated 2 years ago
- Simple Python3 Supervisor library ☆14 · Jun 2, 2026 · Updated 2 weeks ago
- An application for transparency in the distribution and realization of village funds ☆13 · Dec 9, 2015 · Updated 10 years ago
- TwoFold (2✂︎f). Text files breathe fire. ☆24 · Jan 28, 2026 · Updated 4 months ago
- Python with a twist of R syntax ☆10 · May 6, 2019 · Updated 7 years ago
- A simple Solr client for Go ☆15 · Feb 6, 2018 · Updated 8 years ago
- A collection of excellent reverse-engineering articles I have read, worth a look ☆10 · Mar 7, 2022 · Updated 4 years ago
- ☆10 · Nov 28, 2025 · Updated 6 months ago
- Script to rotate webserver log file to AWS S3 ☆28 · Jul 10, 2014 · Updated 11 years ago
- txmpp is a C++ XMPP library. ☆11 · Aug 1, 2024 · Updated last year
- anydice roller ☆12 · May 26, 2018 · Updated 8 years ago
- ☆19 · Oct 6, 2025 · Updated 8 months ago
- Scrapy exporter for Big Data formats ☆16 · Mar 10, 2026 · Updated 3 months ago
- Deeplack is a python script designed for comparing images (screenshots) using DeepAI to detect changes on websites. ☆14 · Jun 19, 2019 · Updated 6 years ago