TeamHG-Memex/scrapy-crawl-once

TeamHG-Memex / scrapy-crawl-once

Scrapy middleware that allows crawling only new content

☆80

Alternatives and similar repositories for scrapy-crawl-once

Users who are interested in scrapy-crawl-once are comparing it to the libraries listed below.

scrapy-plugins / scrapy-deltafetch
Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls
☆276 · Feb 26, 2025 · Updated last year
TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 · Apr 8, 2026 · Updated 3 months ago
alecxe / scrapy-beautifulsoup
Simple Scrapy middleware to process non-well-formed HTML with BeautifulSoup
☆22 · Sep 26, 2016 · Updated 9 years ago
scrapy-plugins / scrapy-pagestorage
A Scrapy extension to store request and response information in a storage service
☆27 · Mar 11, 2022 · Updated 4 years ago
TeamHG-Memex / MaybeDont
A component that tries to avoid downloading duplicate content
☆28 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / scrapy-rotating-proxies
Use multiple proxies with Scrapy
☆775 · Apr 8, 2026 · Updated 3 months ago
scrapy-plugins / scrapy-dotpersistence
A Scrapy extension to sync the `.scrapy` folder to an S3 bucket
☆18 · Mar 28, 2022 · Updated 4 years ago
TeamHG-Memex / undercrawler
A generic crawler
☆81 · Apr 8, 2026 · Updated 3 months ago
scrapinghub / aduana
Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe…
☆54 · May 21, 2024 · Updated 2 years ago
scrapy / xtractmime
https://mimesniff.spec.whatwg.org/ implementation for Python
☆13 · Jul 9, 2026 · Updated 2 weeks ago
stummjr / scrapy-fieldstats
A Scrapy extension to log item coverage when the spider shuts down
☆18 · Apr 11, 2020 · Updated 6 years ago
stefanw / scrapa
Python 3 AsyncIO-powered scraping framework with batteries included
☆20 · Sep 8, 2016 · Updated 9 years ago
scrapy-plugins / scrapy-magicfields
Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes etc.
☆56 · Mar 16, 2022 · Updated 4 years ago
ejulio / spider-feeder
A library to make it easier to load input URLs to start Scrapy processes
☆14 · Feb 21, 2021 · Updated 5 years ago
rafaelcapucho / scrapy-eagle
Scrapy Eagle is a tool that allows us to run any Scrapy-based project in a distributed fashion and monitor how it is going on and how many…
☆24 · Sep 4, 2020 · Updated 5 years ago
TeamHG-Memex / sitehound-frontend
Site Hound (previously THH) is a Domain Discovery Tool
☆24 · Apr 8, 2026 · Updated 3 months ago
hiroshi-manabe / CRFSegmenter
A multi-language segmenter using high-order CRF.
☆17 · Feb 27, 2020 · Updated 6 years ago
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
scrapedia / scrapy-pipelines
A collection of pipelines for Scrapy
☆16 · Apr 27, 2026 · Updated 3 months ago
xelzmm / proxy_server_crawler
An awesome public proxy server crawler based on the Scrapy framework
☆90 · May 17, 2017 · Updated 9 years ago
TeamHG-Memex / scrapy-kafka-export
Scrapy extension which writes crawled items to Kafka
☆31 · Apr 8, 2026 · Updated 3 months ago
TeamHG-Memex / html-text
Extract text from HTML
☆135 · Apr 8, 2026 · Updated 3 months ago
shadowofseaice / yabs.nvim
☆11 · Aug 14, 2022 · Updated 3 years ago
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆194 · Apr 29, 2022 · Updated 4 years ago
scrapinghub / aile
Automatic Item List Extraction
☆85 · Jun 15, 2016 · Updated 10 years ago
commonsearch / gumbocy
Python binding for gumbo-parser using Cython
☆14 · Aug 16, 2016 · Updated 9 years ago
tumi8 / cca-privacy
TLS Client Certificate Authentication and its Privacy Implications
☆15 · Jul 25, 2017 · Updated 9 years ago
cdrx / scrapyd-authenticated
Docker container running scrapyd with HTTP authentication
☆41 · May 14, 2024 · Updated 2 years ago
orangain / scrapy-s3pipeline
Scrapy pipeline to store chunked items into Amazon S3 or Google Cloud Storage bucket.
☆76 · Mar 18, 2022 · Updated 4 years ago
TeamHG-Memex / page-compare
Simple heuristic for measuring web page similarity (& data set)
☆91 · Apr 8, 2026 · Updated 3 months ago
zytedata / html-text
☆20 · Oct 6, 2025 · Updated 9 months ago
scrapinghub / scrapy-mosquitera
Restrict crawl and scraping scope using matchers.
☆26 · Jun 8, 2016 · Updated 10 years ago
kyunghyuncho / skip-thoughts
☆23 · Jun 30, 2015 · Updated 11 years ago
rmax / scrapy-boilerplate
Small set of utilities to simplify writing Scrapy spiders.
☆50 · Jul 24, 2015 · Updated 11 years ago
scrapinghub / webpager
Paginating the web
☆37 · Feb 11, 2014 · Updated 12 years ago
scrapy / scurl
Performance-focused replacement for Python urllib
☆21 · Apr 13, 2026 · Updated 3 months ago
redapple / parslepy
Python implementation of the Parsley language for extracting structured data from web pages
☆92 · Oct 26, 2017 · Updated 8 years ago
rmax / databrewer-recipes
DataBrewer Recipes Repository.
☆21 · Jul 5, 2016 · Updated 10 years ago
SimitTomar / webdriverio-cucumber-pom-boilerplate
A WebdriverIO & Cucumber Boilerplate based on Page Object Model!
☆10 · Jan 26, 2023 · Updated 3 years ago