A generic crawler
☆78Feb 10, 2026Updated 3 weeks ago
Alternatives and similar repositories for undercrawler
Users that are interested in undercrawler are comparing it to the libraries listed below
- A component that tries to avoid downloading duplicate content☆27Feb 10, 2026Updated 3 weeks ago
- Extract the difference between two HTML pages☆32Feb 10, 2026Updated 3 weeks ago
- Scrapy middleware for autologin☆36Feb 10, 2026Updated 3 weeks ago
- Show summary of a large number of URLs in a Jupyter Notebook☆17Feb 10, 2026Updated 3 weeks ago
- A project that attempts to automatically log in to a website given a single seed☆128Feb 23, 2026Updated last week
- Site Hound (previously THH) is a Domain Discovery Tool☆23Feb 10, 2026Updated 3 weeks ago
- Detect and classify pagination links☆15Sep 9, 2020Updated 5 years ago
- Extract text from HTML☆134Feb 10, 2026Updated 3 weeks ago
- Detect and classify pagination links☆105Feb 10, 2026Updated 3 weeks ago
- Splash + HAProxy + Docker Compose☆195Feb 10, 2026Updated 3 weeks ago
- Web Crawling UI and HTTP API, based on Scrapy and Tornado☆160Feb 10, 2026Updated 3 weeks ago
- Formasaurus tells you the type of an HTML form and its fields using machine learning☆119Feb 23, 2026Updated last week
- A Python library to detect and extract listing data from an HTML page.☆108May 5, 2017Updated 8 years ago
- Scrapy middleware that allows crawling only new content☆79Feb 10, 2026Updated 3 weeks ago
- https://mimesniff.spec.whatwg.org/ implementation for Python☆13Jan 16, 2024Updated 2 years ago
- Automatic Item List Extraction☆86Jun 15, 2016Updated 9 years ago
- This is the facade for installation and access to the individual components☆15Feb 10, 2026Updated 3 weeks ago
- Paginating the web☆37Feb 11, 2014Updated 12 years ago
- Crochet-based blocking API for Scrapy.☆46Feb 24, 2017Updated 9 years ago
- The missing datasets manager. Like Homebrew but for datasets. CLI tool to search and discover datasets!☆41May 29, 2017Updated 8 years ago
- Adaptive crawler which uses Reinforcement Learning methods☆168Feb 10, 2026Updated 3 weeks ago
- Scrapy downloader middleware that stores response HTMLs to disk.☆18Jan 14, 2026Updated last month
- Given a new image, determine if it is likely derived from a known image.☆20Feb 10, 2026Updated 3 weeks ago
- Python implementation of the Parsley language for extracting structured data from web pages☆92Oct 26, 2017Updated 8 years ago
- Use pyppeteer from a Scrapy spider☆59Feb 5, 2020Updated 6 years ago
- Small set of utilities to simplify writing Scrapy spiders.☆49Jul 24, 2015Updated 10 years ago
- Simple heuristic for measuring web page similarity (& data set)☆90Feb 23, 2026Updated last week
- This project deals with hierarchical classification of web pages based on dmoz dataset.☆14Apr 10, 2014Updated 11 years ago
- Write OpenCL kernels in Rust.☆12Sep 28, 2013Updated 12 years ago
- WebDAV client for Rust☆10Jun 6, 2018Updated 7 years ago
- Scraper built with Scrapy.☆18Aug 14, 2024Updated last year
- Scrapy entrypoint for Scrapinghub job runner☆26Updated this week
- A scrapy extension to store requests and responses information in storage service☆27Mar 11, 2022Updated 3 years ago
- Failover AWS Spot Instances☆11Dec 8, 2017Updated 8 years ago
- Tools for scraping of twitter data, conversion, text analysis and graph construction☆11Aug 1, 2016Updated 9 years ago
- Algorithms for "schema matching"☆26Jul 6, 2016Updated 9 years ago
- Use multiple proxies with Scrapy☆773Feb 10, 2026Updated 3 weeks ago
- Tool to flatten stream of JSON-like objects, configured via schema☆33Oct 19, 2019Updated 6 years ago
- A SQLite extension that embeds a Lua interpreter into SQLite☆18Jun 20, 2020Updated 5 years ago