Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
☆3,227 · Jun 3, 2026 · Updated last week
Alternatives and similar repositories for heritrix3
Users who are interested in heritrix3 are comparing it to the libraries listed below.
- Apache Nutch is an extensible and scalable web crawler ☆3,158 · May 31, 2026 · Updated last week
- Open Source Web Crawler for Java ☆4,625 · Nov 4, 2021 · Updated 4 years ago
- WebCollector is an open source web crawler framework based on Java. It provides some simple interfaces for crawling the Web, you can setup … ☆3,093 · Feb 10, 2026 · Updated 4 months ago
- brozzler - distributed browser-based web crawler ☆799 · May 19, 2026 · Updated 3 weeks ago
- A scalable web crawler framework for Java. ☆11,679 · Dec 20, 2025 · Updated 5 months ago
- Easy to use lightweight web crawler(易用的轻量化网络爬虫) ☆2,514 · Jan 23, 2026 · Updated 4 months ago
- Core Python Web Archiving Toolkit for replay and recording of web archives ☆1,667 · Apr 10, 2026 · Updated 2 months ago
- The OpenWayback Development ☆521 · Jan 3, 2024 · Updated 2 years ago
- Web Archiving Integration Layer: One-Click User Instigated Preservation ☆397 · Apr 23, 2026 · Updated last month
- An Awesome List for getting started with web archiving ☆2,563 · Apr 27, 2026 · Updated last month
- WARC writing MITM HTTP/S proxy ☆453 · Jun 3, 2026 · Updated last week
- ☆76 · Sep 13, 2022 · Updated 3 years ago
- The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns ☆1,572 · May 23, 2025 · Updated last year
- Wget-compatible web downloader and crawler. ☆609 · Apr 29, 2024 · Updated 2 years ago
- Python library for reading and writing warc files ☆249 · Mar 7, 2022 · Updated 4 years ago
- Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head ☆175 · May 19, 2020 · Updated 6 years ago
- Collect and revisit web pages. ☆1,542 · May 12, 2026 · Updated last month
- A scalable, mature and versatile web crawler based on Apache Storm