A simple algorithm for clustering web pages, suitable for crawlers
☆35 · Mar 6, 2017 · Updated 9 years ago
Alternatives and similar repositories for page_clustering
Users that are interested in page_clustering are comparing it to the libraries listed below.
- Find which links on a web page are pagination links. ☆29 · Jan 12, 2017 · Updated 9 years ago
- A Python library to detect and extract listing data from HTML pages. ☆109 · May 5, 2017 · Updated 8 years ago
- Automatic Item List Extraction. ☆86 · Jun 15, 2016 · Updated 9 years ago
- A Python implementation of DEPTA. ☆83 · Jan 14, 2017 · Updated 9 years ago
- Lists raids and such. ☆16 · Nov 2, 2022 · Updated 3 years ago
- A toolkit for clustering web pages based on various similarity measures. ☆34 · Oct 27, 2021 · Updated 4 years ago
- Restrict crawl and scraping scope using matchers. ☆26 · Jun 8, 2016 · Updated 9 years ago
- NER toolkit for HTML data. ☆259 · May 3, 2024 · Updated last year
- Training selenium agents to simplify UI navigation and reliability. ☆32 · Aug 23, 2025 · Updated 7 months ago
- MongoDB extensions for Scrapy. ☆44 · Oct 2, 2014 · Updated 11 years ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support. ☆16 · Jul 3, 2017 · Updated 8 years ago
- Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe… ☆55 · May 21, 2024 · Updated last year
- Spider templates for automatic crawlers. ☆34 · Updated this week
- Repository for ru-syntax command line tool. ☆16 · Mar 8, 2022 · Updated 4 years ago
- Tool to flatten stream of JSON-like objects, configured via schema. ☆33 · Oct 19, 2019 · Updated 6 years ago
- Scrapy environment with Tor for anonymous ip routing and Privoxy for http proxy. ☆20 · Jul 5, 2016 · Updated 9 years ago
- Extract text from HTML. ☆135 · Feb 10, 2026 · Updated last month
- HTML5 audio/video clipper. ☆13 · Mar 7, 2018 · Updated 8 years ago
- 🗿Stones: Persistent key-value containers, compatible with Python dict. ☆17 · Jul 15, 2024 · Updated last year
- Package to facilitate URL clustering. ☆71 · Feb 24, 2016 · Updated 10 years ago
- ☆10 · Apr 22, 2024 · Updated last year
- Uses Python 3, Django, and Django REST framework to implement Alipay payment. Includes Alipay payment, Alipay server asynchronous notification, and Alipay refunds. ☆12 · May 26, 2018 · Updated 7 years ago
- A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks. ☆15 · Feb 9, 2014 · Updated 12 years ago
- ☆10 · Mar 10, 2026 · Updated 2 weeks ago
- Implementation of Monte Carlo Word Movers Distance in Python with TensorFlow. ☆12 · Sep 12, 2016 · Updated 9 years ago
- Podclips is an iOS app that allows users to cut out and share clips from their favourite podcasts. ☆15 · Mar 25, 2018 · Updated 8 years ago
- ☆13 · Jul 16, 2013 · Updated 12 years ago
- RUSSE: Russian Semantic Evaluation. ☆16 · Mar 1, 2022 · Updated 4 years ago
- Simple Python3 Supervisor library. ☆14 · Mar 2, 2026 · Updated 3 weeks ago
- TwoFold (2✂︎f). Text files breathe fire. ☆23 · Jan 28, 2026 · Updated 2 months ago
- https://www.kaggle.com/c/rsna-intracranial-hemorrhage-detection/ ☆19 · Oct 20, 2019 · Updated 6 years ago
- redis-cluster-dockerfile. ☆11 · May 18, 2015 · Updated 10 years ago
- Alembic extension that adds support for arbitrary user-defined objects like views or functions in autogenerate command. ☆13 · Feb 6, 2025 · Updated last year
- A collection of excellent reverse-engineering articles I have read; worth a look. ☆10 · Mar 7, 2022 · Updated 4 years ago
- Understand POS tags and build a POS tagger from scratch. ☆11 · Jun 9, 2018 · Updated 7 years ago
- ☆20 · Nov 16, 2014 · Updated 11 years ago
- ☆18 · Oct 6, 2025 · Updated 5 months ago
- Android reverse engineering, shared from time to time: 抖聘, 惠借贷款, 平安健康, 小红书, 搜狐汽车, 用药助手, 美丽修行, 马蜂窝, 美图秀秀. ☆12 · Mar 24, 2021 · Updated 5 years ago
- A script that simplifies working with archetypes in Hugo! (@gohugoio) Also supports bulk file creation/editing via a single .csv! 🐍 ☆17 · Nov 15, 2021 · Updated 4 years ago