pablohoffman / awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
☆13 · Updated 9 years ago
Alternatives and similar repositories for awesome-web-scraping
Users who are interested in awesome-web-scraping are comparing it to the libraries listed below
- A scrapy extension to store requests and responses information in storage service ☆26 · Updated 3 years ago
- Small set of utilities to simplify writing Scrapy spiders. ☆49 · Updated 9 years ago
- A Scrapy pipeline to categorize items using MonkeyLearn ☆37 · Updated 8 years ago
- Python implementation of the Parsley language for extracting structured data from web pages ☆92 · Updated 7 years ago
- Plots various graphs for a series of plaintext files in a directory ☆19 · Updated 9 years ago
- Python library with common functionality for writing web scrapers ☆102 · Updated 10 years ago
- Restrict crawl and scraping scope using matchers. ☆26 · Updated 9 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 · Updated last year
- Tool to flatten stream of JSON-like objects, configured via schema ☆33 · Updated 5 years ago
- A Python library for finding feed links on websites. ☆52 · Updated 3 years ago
- Scrapy schema validation pipeline and Item builder using JSON Schema ☆44 · Updated 4 years ago
- A helper to create web scrapers using scrapy selector in a Model based structure ☆57 · Updated 2 years ago
- Tools that will make writing tests, bots and scrapers using Selenium much easier ☆140 · Updated 7 months ago
- A component that tries to avoid downloading duplicate content ☆27 · Updated 7 years ago
- A generic crawler ☆78 · Updated 7 years ago
- Utility library to turn country names into ISO two-letter codes ☆70 · Updated last month
- A command line replacement for zapier and ifttt. ☆39 · Updated 7 years ago
- Paginating the web ☆37 · Updated 11 years ago
- [UNMAINTAINED] Deploy, run and monitor your Scrapy spiders. ☆11 · Updated 10 years ago
- Detect and classify pagination links ☆15 · Updated 4 years ago
- Tweet Lake is a commandline interface to Twitter Streaming API and big data project that extracts interesting stats out of tweet corpus. ☆20 · Updated 3 years ago
- CDNjs for Humans. ☆39 · Updated 7 years ago
- Must-read articles and books about Python. Inspired by https://github.com/s16h/py-must-watch ☆11 · Updated 8 years ago
- Simple library to cleanup and prettify url patterns and emails ☆138 · Updated 3 years ago
- Find which links on a web page are pagination links ☆29 · Updated 8 years ago
- Definitions of Python jargon to help Python beginners understand Pythonista gobbledygook ☆54 · Updated 5 years ago
- Proxy-list management application for Django ☆23 · Updated 7 years ago
- Scrapy middleware that allows crawling only new content ☆80 · Updated 2 years ago
- Python and pandas tools to perform various analyses on different types of word lists ☆16 · Updated 10 years ago
- A simple command line interface to the datamade/dedupe library. ☆42 · Updated 2 years ago