benhoyt / soft404
Soft 404 (dead page) detector in Python
☆13Updated 6 years ago
Alternatives and similar repositories for soft404
Users that are interested in soft404 are comparing it to the libraries listed below
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive…☆26Updated 2 years ago
- Shepherding our web archives from crawl to access.☆10Updated last year
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.☆144Updated last year
- A classifier for detecting soft 404 pages☆56Updated 2 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki☆27Updated 11 months ago
- HyperLogLogLog: Counting Distinct Elements With One Log More☆18Updated 3 years ago
- ☆49Updated 2 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆151Updated last month
- Experimental continuous web crawler for web archiving☆9Updated 2 years ago
- Web archive index server based on RocksDB☆34Updated last week
- Tools to construct and process Common Crawl webgraphs☆92Updated 2 weeks ago
- Perpetual Access To The Scholarly Record☆120Updated 11 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Updated 7 years ago
- WARC and ARC indexing and discovery tools.☆127Updated 2 weeks ago
- 📚 A compilation of research relevant to Data Together's efforts tackling the general problem of data resilience & interactivity☆96Updated 6 years ago
- Social Feed Manager user interface application.☆155Updated last year
- The content management system and frontend website for The Tech, MIT's oldest and largest newspaper.☆22Updated last month
- Distributed similarity search☆9Updated 5 years ago
- Sort-friendly URI Reordering Transform (SURT) python module☆42Updated 11 months ago
- Prototype SOLR-powered web archive exploration UI.☆43Updated 5 years ago
- ☆16Updated 10 years ago
- A Memento Aggregator CLI and Server in Go☆67Updated 4 months ago
- Core Assignment Calculator / Research Project Calculator☆10Updated 7 years ago
- Web Archives for Historical Research☆13Updated 8 years ago
- Digital Preservation of HTTP in documentary heritage.☆22Updated 2 years ago
- Pageviews Analysis tool for Wikimedia Foundation wikis☆142Updated 2 weeks ago
- A distributed system for mining common crawl using SQS, AWS-EC2 and S3☆21Updated 11 years ago
- WebGraph framework with extensions☆23Updated 10 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆179Updated 6 months ago
- ☆8Updated 6 years ago