apache/stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm

☆986

Alternatives and similar repositories for stormcrawler

Users who are interested in stormcrawler are comparing it to the libraries listed below.

crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆259 · Jul 2, 2026 · Updated 2 weeks ago
crawler-commons / url-frontier
API definition, resources and reference implementation of URL Frontiers
☆63 · Jun 9, 2026 · Updated last month
USCDataScience / sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
☆421 · Mar 30, 2023 · Updated 3 years ago
DigitalPebble / stormcrawler-docker
Resources for running StormCrawler with Docker services
☆10 · Nov 10, 2024 · Updated last year
apache / nutch
Apache Nutch is an extensible and scalable web crawler
☆3,264 · Updated this week
lucidworks / storm-solr
Storm / Solr Integration
☆19 · Feb 2, 2024 · Updated 2 years ago
istresearch / scrapy-cluster
This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
☆1,225 · Nov 7, 2023 · Updated 2 years ago
internetarchive / heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
☆3,280 · Updated this week
DigitalPebble / behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
☆283 · Apr 25, 2018 · Updated 8 years ago
Norconex / crawler
Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or fi…
☆203 · Updated this week
Aloisius / nutch
CommonCrawl Test version of Nutch
☆16 · Jul 10, 2014 · Updated 12 years ago
BruceDone / awesome-crawler
A collection of awesome web crawlers and spiders in different languages
☆7,255 · Jun 16, 2024 · Updated 2 years ago
yasserg / crawler4j
Open Source Web Crawler for Java
☆4,620 · Nov 4, 2021 · Updated 4 years ago
scrapinghub / portia
Visual scraping for Scrapy
☆9,506 · Jun 26, 2024 · Updated 2 years ago
helgeho / Web2Warc
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
☆26 · Oct 9, 2017 · Updated 8 years ago
brendonboshell / supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and con…
☆381 · Dec 30, 2022 · Updated 3 years ago
scrapinghub / splash
Lightweight, scriptable browser as a service with an HTTP API
☆4,190 · Aug 2, 2024 · Updated last year
archivesunleashed / aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
☆158 · Dec 5, 2025 · Updated 7 months ago
internetarchive / brozzler
brozzler - distributed browser-based web crawler
☆809 · Jul 7, 2026 · Updated 2 weeks ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 2 weeks ago
apache / storm
Apache Storm
☆6,692 · Updated this week
vespa-engine / vespa
The AI search platform
☆7,023 · Updated this week
sematext / query-segmenter
Solr Query Segmenter for structuring unstructured queries
☆22 · May 12, 2021 · Updated 5 years ago
gsh199449 / DistributedCrawler
Maven version of DistributeCrawler
☆10 · Jun 20, 2022 · Updated 4 years ago
VIDA-NYU / ache
ACHE is a web crawler for domain-specific search.
☆484 · Aug 31, 2025 · Updated 10 months ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
kohlschutter / boilerpipe
Work in progress transmit from Google Code
☆1,126 · Jan 3, 2018 · Updated 8 years ago
cocrawler / cocrawler
CoCrawler is a versatile web crawler built using modern tools and concurrency.
☆194 · Apr 29, 2022 · Updated 4 years ago
apache / incubator-heron
Apache Heron (Incubating) is a realtime, distributed, fault-tolerant stream processing engine from Twitter
☆3,629 · Mar 1, 2023 · Updated 3 years ago
b-cube / nutch-crawler
Apache Nutch fork tuned for web services and data discovery.
☆10 · May 18, 2015 · Updated 11 years ago
tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆22 · Nov 15, 2022 · Updated 3 years ago
ikreymer / cdx-index-client
A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/
☆203 · Oct 7, 2018 · Updated 7 years ago
tribbloid / spookystuff
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
☆140 · Jan 5, 2026 · Updated 6 months ago
TeamHG-Memex / imageSimilarity
Given a new image, determine if it is likely derived from a known image.
☆21 · Apr 8, 2026 · Updated 3 months ago
mesos / storm
Storm on Mesos!
☆140 · Aug 17, 2021 · Updated 4 years ago
weblyzard / nilsimsa
A Java library for computing and comparing Nilsimsa string similarity hashes.
☆11 · May 24, 2022 · Updated 4 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 · Feb 18, 2015 · Updated 11 years ago
apache / pinot
Apache Pinot - A realtime distributed OLAP datastore
☆6,116 · Updated this week
eugeneware / warc
Parse WARC (Web Archive Files) as a node.js stream
☆23 · Oct 20, 2014 · Updated 11 years ago