internetarchive / Zeno
State-of-the-art web crawler
★81 Updated this week
Related projects
Alternatives and complementary repositories for Zeno
- CDXJ Indexing of WARC/ARCs ★21 Updated this week
- Command line tool for digging into WARC files ★34 Updated last week
- Web application for distributed compute analysis of Archive-It web archive collections. ★15 Updated 2 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ★42 Updated 6 years ago
- Web archive index server based on RocksDB ★32 Updated 3 weeks ago
- wabac.js - Web Archive Browsing Augmentation Client ★100 Updated last week
- Digital Preservation of HTTP in documentary heritage. ★22 Updated last year
- WARC and ARC indexing and discovery tools. ★116 Updated 3 months ago
- A search interface and wayback machine for the UKWA Solr based warc-indexer framework. ★102 Updated 3 months ago
- Command line tool to convert a file in the WARC format to a file in the ZIM format ★44 Updated this week
- Specifications developed and maintained by the Webrecorder community. ★123 Updated 2 months ago
- Centralised repository for WARC usage specifications. ★100 Updated 2 months ago
- A Memento Aggregator CLI and Server in Go ★57 Updated 5 months ago
- High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. ★117 Updated this week
- Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) ★152 Updated 4 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ★25 Updated 3 months ago
- Comparing warc files ★14 Updated 5 years ago
- Converts WARC files to static HTML ★39 Updated 4 months ago
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ★33 Updated last month
- Experimental proxy and wrapper for safely embedding Web Archives (warc, warc.gz, wacz) into web pages. ★23 Updated 3 weeks ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives. ★137 Updated 8 months ago
- A Github Action for turning Markdown into ReSpec HTML ★13 Updated 5 months ago
- Wombat.js client-side rewriting library ★82 Updated last week
- An open source set of decks for learning about digital preservation. ★23 Updated 4 years ago
- ★14 Updated 10 months ago
- A collection of tools for archiving and analysing the internet. ★69 Updated 2 years ago
- A framework for quick web archiving; canonical repository: https://gitea.arpa.li/JustAnotherArchivist/qwarc ★27 Updated 3 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ★144 Updated last month
- Collection of resources, papers, blog posts, and other documentation around working on and with Archivematica. ★19 Updated 10 months ago
- JavaScript module and CLI tool for working with web archive data using the WACZ format specification. ★13 Updated 2 months ago