helgeho/Web2Warc

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

☆26

Alternatives and similar repositories for Web2Warc

Users who are interested in Web2Warc are comparing it to the libraries listed below.

helgeho / ArchiveSpark
An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…
☆161 · Oct 8, 2025 · Updated 9 months ago
web-archive-group / hackathon
☆14 · Feb 28, 2017 · Updated 9 years ago
web-archive-group / WAHR
Web Archives for Historical Research
☆13 · Jun 12, 2017 · Updated 9 years ago
bsdphk / AardWARC
Museum-quality bit-archive storage management
☆11 · Mar 25, 2026 · Updated 3 months ago
peterk / warcworker
A dockerized, queued high fidelity web archiver based on Squidwarc
☆62 · Jul 9, 2024 · Updated 2 years ago
WASAPI-Community / data-transfer-apis
WASAPI data transfer APIs
☆50 · Apr 23, 2022 · Updated 4 years ago
renevoorburg / robustify.js
A javascript for fighting link rot and content drift using link decoration and web archives.
☆17 · Oct 31, 2024 · Updated last year
wolfgangmeyers / go-warc
A golang library to work with WARC files from the common crawl
☆15 · Feb 20, 2018 · Updated 8 years ago
oduwsdl / MementoEmbed
A service that provides archive-aware oEmbed-compatible embeddable surrogates (social cards, thumbnails, etc.) for archived web pages (me…
☆14 · Nov 15, 2021 · Updated 4 years ago
DocNow / unshrtn
A LevelDB backed URL unshortening microservice written in JavaScript
☆31 · Dec 10, 2022 · Updated 3 years ago
yasmina85 / OffTopic-Detection
This repository contains a tool and a collections dataset for detecting off-topic pages in Web archive collections.
☆17 · Aug 20, 2015 · Updated 10 years ago
arocho / generative-art-workshop
Data-Driven Generative Art using processing.py
☆11 · Mar 2, 2017 · Updated 9 years ago
archivesunleashed / graphpass
GraphPass is a utility to filter networks and provide a default visualization output for Gephi or SigmaJS.
☆17 · Nov 14, 2020 · Updated 5 years ago
unt-libraries / py-wasapi-client
A client for the Archive-It And Webrecorder WASAPI Data Transfer API
☆16 · Oct 18, 2019 · Updated 6 years ago
webis-de / wasp
☆28 · Jun 30, 2026 · Updated 3 weeks ago
ncordon / smartdata
R package for data preprocessing
☆13 · Dec 18, 2019 · Updated 6 years ago
web-archive-group / ELXN42-Article
☆10 · Apr 26, 2016 · Updated 10 years ago
hmakki72 / pymarc_utilities
Pymarc Utilities is a set of functions that help manipulate large MARC files. Pymarc Utilities works with the Pymarc library for wo…
☆23 · Jun 25, 2026 · Updated 3 weeks ago
richardlehane / webarchive
golang readers for ARC and WARC webarchive formats
☆20 · Apr 3, 2023 · Updated 3 years ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 3 weeks ago
chrpr / dpla-analytics
☆11 · Nov 4, 2015 · Updated 10 years ago
eugeneware / warc
Parse WARC (Web Archive Files) as a node.js stream
☆23 · Oct 20, 2014 · Updated 11 years ago
archivesunleashed / docker-aut
Docker image for the Archives Unleashed Toolkit
☆12 · Nov 17, 2022 · Updated 3 years ago
ukwa / webarchive-discovery
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…
☆133 · Nov 21, 2025 · Updated 8 months ago
alexstorer / lexisparse
A simple tool to process plain text output from Lexis Nexis news searches
☆10 · Aug 20, 2018 · Updated 7 years ago
project-open-data / G8_Metadata_Mapping
G8 Metadata Mapping
☆24 · Jun 18, 2013 · Updated 13 years ago
project-open-data / catalog-generator
A multi-format tool to generate and maintain agency.gov/data catalog files.
☆23 · Oct 1, 2019 · Updated 6 years ago
JiaWu-Repository / DeepFD-pyTorch
Deep Structure Learning for Fraud Detection (ICDM 2018)
☆10 · Oct 2, 2020 · Updated 5 years ago
chfoo / huhhttp
An evil web server.
☆13 · May 9, 2015 · Updated 11 years ago
datatogether / warc
Golang WARC (Web ARChive) Library
☆30 · Aug 6, 2019 · Updated 6 years ago
ianmilligan1 / Historian-WARC-1
The Historian's WARC Toolkit
☆16 · May 14, 2015 · Updated 11 years ago
qedsoftware / multipage-ocr
(Python) Execute tesseract OCR on a multi-page PDF.
☆19 · Jun 30, 2023 · Updated 3 years ago
archivesunleashed / twut
An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.
☆10 · Mar 17, 2026 · Updated 4 months ago
webrecorder / cdxj-indexer
CDXJ Indexing of WARC/ARCs
☆35 · May 11, 2026 · Updated 2 months ago
web-archive-group / heritrix-walkthrough
☆10 · Jun 10, 2016 · Updated 10 years ago
instedd / cdx
Connected Diagnostics Platform
☆11 · Aug 1, 2025 · Updated 11 months ago
m4rk3r / lan-before-time
☆12 · Jan 18, 2016 · Updated 10 years ago
Rhizome-Conifer / conifer-deploy
Conifer setup and deployment via Ansible
☆12 · Jun 15, 2020 · Updated 6 years ago
phonedude / cs595-s21
CS 495/595 Web Security
☆10 · Feb 27, 2022 · Updated 4 years ago