omar-elmaria / python_scrapy_airflow_pipeline
This repo contains a full-fledged Python-based script that scrapes a JavaScript-rendered website, cleans the data, and pushes the results to a cloud-based database. The workflow is orchestrated on Airflow to run automatically.
☆13 Updated 2 years ago
Alternatives and similar repositories for python_scrapy_airflow_pipeline:
Users who are interested in python_scrapy_airflow_pipeline are comparing it to the libraries listed below.
- Spider templates for automatic crawlers. ☆28 Updated this week
- Web scraping Page Objects core library ☆99 Updated last month
- The first Python validation package that uses type, MIME, extension, magic numbers, and size to validate files. ✅ ☆68 Updated 3 weeks ago
- Page Object pattern for Scrapy ☆121 Updated last month
- Asynchronous alternative to the requests-ip-rotator library ☆40 Updated 2 months ago
- Parsing JavaScript objects into Python data structures ☆203 Updated 3 weeks ago
- Common interface for data container classes ☆67 Updated last week
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 Updated last year
- Create files with fake data. In many formats. With no effort. ☆91 Updated 2 weeks ago
- 🕶 Awesome list of Scrapy tools and libraries ☆59 Updated 4 years ago
- Zyte API integration for Scrapy ☆38 Updated 2 weeks ago
- Simple library for exploring/scraping the web or testing a website you’re developing ☆127 Updated 2 years ago
- Library to populate items using XPath and CSS with a convenient API ☆48 Updated last week
- Scrapy middleware that allows crawling only new content ☆80 Updated 2 years ago
- Pluggable DSL that uses pipes to perform a series of linear transformations to extract data ☆16 Updated 8 months ago
- ☆62 Updated last year
- Various Python 3.6+ helper classes/functions amalgamated into a single package: privex-helpers ☆16 Updated 5 months ago
- A pure-Python robots.txt parser with support for modern conventions. ☆64 Updated last week
- Lightweight browser hot reload for Python ASGI web apps ☆151 Updated 11 months ago
- A template for a FastAPI based Serverless Framework microservice running on AWS Lambda ☆92 Updated 9 months ago
- Repository Patterns for Python ☆176 Updated last year
- Celery worker for running asyncio coroutine tasks ☆37 Updated 5 months ago
- ODM with Pydantic made simple ☆24 Updated this week
- Django package that provides auto indexing and searching capabilities for Django model instances using RediSearch. ☆93 Updated this week
- Scrapy extension that gives you all the scraping monitoring, alerting, scheduling, and data validation you will need straight out of the… ☆36 Updated 8 months ago
- Python Saleor App/Extension boilerplate. Batteries included. ☆53 Updated last year
- Detect and classify pagination links ☆102 Updated 4 years ago
- Run a Scrapy spider programmatically from a script or a Celery task - no project required. ☆122 Updated 10 months ago
- Python client and types generator for the Chrome DevTools Protocol (CDP) ☆70 Updated 3 weeks ago
- A better requests and urllib. A simple package for hitting multiple URLs and performing GET/POST requests in parallel. ☆42 Updated last year