internetarchive / Zeno
State-of-the-art web crawler
★83, updated this week
Related projects
Alternatives and complementary repositories for Zeno
- A tool for detecting viruses and NSFW material in WARC files (★11, updated 3 months ago)
- wabac.js - Web Archive Browsing Augmentation Client (★100, updated this week)
- Command line tool for digging into WARC files (★35, updated 3 weeks ago)
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. (★42, updated 6 years ago)
- Wombat.js client-side rewriting library (★84, updated this week)
- CDXJ Indexing of WARC/ARCs (★21, updated last week)
- Centralised repository for WARC usage specifications. (★100, updated this week)
- Web archive index server based on RocksDB (★32, updated this week)
- Digital Preservation of HTTP in documentary heritage. (★22, updated last year)
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. (★33, updated 2 weeks ago)
- Command line tool to convert a file in the WARC format to a file in the ZIM format (★45, updated last week)
- Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) (★152, updated 4 years ago)
- A search interface and wayback machine for the UKWA Solr based warc-indexer framework. (★102, updated 2 weeks ago)
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki (★25, updated 3 months ago)
- High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web. (★117, updated this week)
- Specifications developed and maintained by the Webrecorder community. (★124, updated this week)
- WARC and ARC indexing and discovery tools. (★117, updated 3 months ago)
- A framework for quick web archiving; canonical repository: https://gitea.arpa.li/JustAnotherArchivist/qwarc (★27, updated 3 years ago)
- search interface for scholarly works (★80, updated 3 months ago)
- Python library for reading and writing warc files (★237, updated 2 years ago)
- Converts WARC files to static HTML (★39, updated 4 months ago)
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives. (★137, updated 8 months ago)
- Web application for distributed compute analysis of Archive-It web archive collections. (★15, updated 2 months ago)
- Streaming WARC/ARC library for fast web archive IO (★387, updated last week)
- A Memento Aggregator CLI and Server in Go (★57, updated 6 months ago)
- A collection of tools for archiving and analysing the internet. (★70, updated 2 years ago)
- ★39, updated 7 months ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… (★145, updated 2 months ago)
- Comparing warc files (★15, updated 5 years ago)
- Index Common Crawl archives in tabular format (★106, updated this week)