internetarchive/heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.

☆3,283

Alternatives and similar repositories for heritrix3

Users who are interested in heritrix3 are comparing it to the repositories listed below.

apache / nutch
Apache Nutch is an extensible and scalable web crawler
☆3,265 · Updated this week
yasserg / crawler4j
Open Source Web Crawler for Java
☆4,620 · Nov 4, 2021 · Updated 4 years ago
CrawlScript / WebCollector
WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web, you can set up …
☆3,086 · Feb 10, 2026 · Updated 5 months ago
internetarchive / brozzler
brozzler - distributed browser-based web crawler
☆809 · Jul 7, 2026 · Updated 2 weeks ago
code4craft / webmagic
A scalable web crawler framework for Java.
☆11,684 · Dec 20, 2025 · Updated 7 months ago
xtuhcy / gecco
Easy to use lightweight web crawler
☆2,512 · Jan 23, 2026 · Updated 5 months ago
webrecorder / pywb
Core Python Web Archiving Toolkit for replay and recording of web archives
☆1,682 · Apr 10, 2026 · Updated 3 months ago
iipc / openwayback
The OpenWayback Development
☆522 · Jan 3, 2024 · Updated 2 years ago
machawk1 / wail
Web Archiving Integration Layer: One-Click User Instigated Preservation
☆398 · Jun 19, 2026 · Updated last month
iipc / awesome-web-archiving
An Awesome List for getting started with web archiving
☆2,605 · Apr 27, 2026 · Updated 2 months ago
internetarchive / warcprox
WARC writing MITM HTTP/S proxy
☆456 · Jun 17, 2026 · Updated last month
shunfa / crawlzilla
☆76 · Sep 13, 2022 · Updated 3 years ago
ArchiveTeam / grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
☆1,601 · May 23, 2025 · Updated last year
internetarchive / warc
Python library for reading and writing warc files
☆249 · Mar 7, 2022 · Updated 4 years ago
internetarchive / umbra
A queue-controlled browser automation tool for improving web crawl quality
☆68 · May 28, 2026 · Updated last month
ArchiveTeam / wpull
Wget-compatible web downloader and crawler.
☆612 · Apr 29, 2024 · Updated 2 years ago
N0taN3rd / Squidwarc
Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
☆178 · May 19, 2020 · Updated 6 years ago
Rhizome-Conifer / conifer
Collect and revisit web pages.
☆1,542 · May 12, 2026 · Updated 2 months ago
netarchivesuite / solrwayback
A search interface and wayback machine for the UKWA Solr based warc-indexer framework.
☆145 · Jul 13, 2026 · Updated last week
apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆986 · Updated this week
zhegexiaohuozi / SeimiCrawler
A simple, agile, distributed Java crawler framework with Spring Boot support.
☆1,991 · Jun 24, 2026 · Updated 3 weeks ago
ArchiveBox / ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and mor…
☆27,988 · Updated this week
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆461 · Jun 10, 2026 · Updated last month
iipc / warc-specifications
Centralised repository for WARC usage specifications.
☆129 · Apr 4, 2026 · Updated 3 months ago
webrecorder / replayweb.page
Serverless replay of web archives directly in the browser
☆965 · Jul 13, 2026 · Updated last week
internetarchive / bookreader
The Internet Archive BookReader
☆1,160 · Updated this week
internetarchive / wayback
IA's public Wayback Machine (moved from SourceForge)
☆849 · Mar 1, 2024 · Updated 2 years ago
nla / outbackcdx
Web archive index server based on RocksDB
☆43 · Jul 9, 2026 · Updated last week
scrapy / scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
☆63,273 · Updated this week
webrecorder / browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
☆1,088 · Updated this week
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago
machawk1 / warcreate
Chrome extension to "Create WARC files from any webpage"
☆229 · Dec 5, 2025 · Updated 7 months ago
internetarchive / warctools
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)
☆176 · Aug 18, 2025 · Updated 11 months ago
helgeho / Web2Warc
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
☆26 · Oct 9, 2017 · Updated 8 years ago
alibaba / druid
A database connection pool built for monitoring, from the Alibaba Cloud computing platform DataWorks (https://help.aliyun.com/document_detail/137663.html) team
☆28,182 · Jun 28, 2026 · Updated 3 weeks ago
helgeho / ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…
☆161 · Oct 8, 2025 · Updated 9 months ago
webrecorder / webrecorder-player
Webrecorder Player for Desktop (OSX/Windows/Linux). (Built with Electron + Webrecorder)
☆445 · Sep 17, 2020 · Updated 5 years ago
YahooArchive / anthelion
Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages.
☆2,830 · Dec 17, 2015 · Updated 10 years ago
jhy / jsoup
jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
☆11,377 · Updated this week