iipc/webarchive-commons

Common web archive utility code.

☆65

Alternatives and similar repositories for webarchive-commons

Users who are interested in webarchive-commons are comparing it to the repositories listed below.

internetarchive / ia-hadoop-tools
☆23 · Feb 22, 2024 · Updated 2 years ago
ianmilligan1 / Historian-WARC-1
The Historian's WARC Toolkit
☆16 · May 14, 2015 · Updated 11 years ago
internetarchive / webarchive-commons
☆15 · Sep 8, 2016 · Updated 9 years ago
iipc / warc-specifications
Centralised repository for WARC usage specifications.
☆129 · Apr 4, 2026 · Updated 3 months ago
DocNow / waybackprov
utility to fetch provenance information from Internet Archive's Wayback Machine
☆15 · Feb 5, 2026 · Updated 5 months ago
vinaygoel / archive-analysis
Tools to analyze web archives
☆20 · Jul 12, 2016 · Updated 10 years ago
lintool / clueweb
Hadoop tools for manipulating ClueWeb collections
☆26 · Jul 15, 2016 · Updated 10 years ago
internetarchive / umbra
A queue-controlled browser automation tool for improving web crawl quality
☆68 · May 28, 2026 · Updated last month
iai-group / arXivDigest
☆27 · Feb 20, 2026 · Updated 5 months ago
helgeho / Web2Warc
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
☆26 · Oct 9, 2017 · Updated 8 years ago
vrittis / OpenTextSummarizer
fork of .net port and adaptation of libots, initially by PatrickBurrows
☆20 · Aug 15, 2019 · Updated 6 years ago
alard / warc-proxy
Serving content from a WARC
☆61 · Jan 5, 2013 · Updated 13 years ago
irgroup / repro_eval
A Python Interface to Reproducibility Measures of System-Oriented IR Experiments
☆11 · Dec 2, 2025 · Updated 7 months ago
ucsdlib / dams
DAMS ontology and data model documentation
☆25 · Nov 20, 2015 · Updated 10 years ago
osirrc / osirrc2019-library
Official library of images for the SIGIR 2019 Open-Source IR Replicability Challenge (OSIRRC 2019)
☆13 · Jul 7, 2019 · Updated 7 years ago
jermnelson / Discover-Aristotle
A Django-based bibliographic, repository, and access framework for building cataloging applications. Project documentation available at
☆27 · Apr 14, 2023 · Updated 3 years ago
UAlbanyArchives / describingWebArchives
Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs
☆11 · Aug 10, 2018 · Updated 7 years ago
weltliteratur / vossanto
Vossian Antonomasia
☆10 · Jun 19, 2026 · Updated last month
osirrc / jig
Jig for the Open-Source IR Replicability Challenge (OSIRRC)
☆13 · Dec 8, 2022 · Updated 3 years ago
rightsstatements / data-model
rightsstatements.org data model
☆13 · Apr 21, 2026 · Updated 3 months ago
ruby-microservices / noid
Nice Opaque Identifier
☆16 · Sep 21, 2023 · Updated 2 years ago
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated 3 weeks ago
archivesunleashed / auk
Rails application for the Archives Unleashed Cloud.
☆11 · Jun 30, 2021 · Updated 5 years ago
crawler-commons / crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
☆259 · Updated this week
edsu / memento-cli
A command line utility for listing and searching snapshots in web archives
☆20 · Jun 4, 2026 · Updated last month
DocNow / awesome-social-media-archiving
Tools for helping you work with web platform archive downloads.
☆18 · Mar 27, 2020 · Updated 6 years ago
cescoffier / Commons-Image-IO
A small Java Library to manipulate images. It relies on Apache Commons Imaging and javax.image.io
☆21 · Apr 26, 2021 · Updated 5 years ago
WASAPI-Community / data-transfer-apis
WASAPI data transfer APIs
☆50 · Apr 23, 2022 · Updated 4 years ago
leifos / simiir
A toolkit for simulating interactive information retrieval
☆21 · Sep 7, 2018 · Updated 7 years ago
jellever / DominantColor
Basic implementation for calculating the dominant color in an image.
☆12 · Jan 1, 2016 · Updated 10 years ago
maturban / WARCMerge
Merging WARCs into a single WARC file
☆15 · Aug 29, 2014 · Updated 11 years ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
webis-de / lightning-ir
One-stop shop for running and fine-tuning transformer-based language models for retrieval
☆65 · Jul 9, 2026 · Updated 2 weeks ago
rajbot / CDX-Writer
Python script to create CDX index files of WARC data
☆16 · Sep 7, 2018 · Updated 7 years ago
ukwa / webarchive-discovery
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…
☆133 · Nov 21, 2025 · Updated 8 months ago
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 · Feb 15, 2017 · Updated 9 years ago
uttesh / exude
Simple java library to filter the stopping,stemming words from input data or file and link
☆20 · Oct 7, 2018 · Updated 7 years ago
ukwa / w3act
w3act is an annotation and curation tool for building web archive collections
☆21 · Jan 30, 2024 · Updated 2 years ago