benhoyt / soft404
Soft 404 (dead page) detector in Python
☆13Updated 6 years ago
Alternatives and similar repositories for soft404:
Users who are interested in soft404 are comparing it to the libraries listed below
- A classifier for detecting soft 404 pages☆57Updated last year
- Shepherding our web archives from crawl to access.☆10Updated last year
- Distributed similarity search☆9Updated 4 years ago
- Experimental continuous web crawler for web archiving☆9Updated 2 years ago
- Streaming WARC/ARC library for fast web archive IO☆393Updated last month
- ☆11Updated 7 months ago
- Discord bot by Sanich for https://youtu.be/1lzPIhTaPDY☆13Updated 3 years ago
- Prototype SOLR-powered web archive exploration UI.☆43Updated 4 years ago
- Text-Induced Corpus Clean-up☆20Updated last year
- WARC and ARC indexing and discovery tools.☆118Updated 5 months ago
- Web archive index server based on RocksDB☆34Updated 2 months ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.☆140Updated 10 months ago
- Web application for distributed compute analysis of Archive-It web archive collections.☆15Updated 4 months ago
- A search interface and wayback machine for the UKWA Solr based warc-indexer framework.☆105Updated this week
- Archive Research Services Workshop☆31Updated 7 years ago
- The OpenWayback Development☆490Updated last year
- A Memento Aggregator CLI and Server in Go☆61Updated 7 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆43Updated 7 years ago
- WASAPI data transfer APIs☆43Updated 2 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆147Updated 4 months ago
- A simple Python library for searching on DuckDuckGo.☆36Updated last year
- World Wide Web site! For the Scholars' Lab!☆12Updated this week
- Python package for harvesting records from OAI-PMH provider(s).☆62Updated 2 years ago
- Seeder - Czech webarchive curating tool and public site☆15Updated last month
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive…☆24Updated 2 years ago
- In progress Jupyter Book with Scholarly API programmatic code examples.☆21Updated last week
- Warcbase is an open-source platform for managing and analyzing web archives☆162Updated 7 years ago
- A commandline tool and Python library for archiving data from Facebook using the Graph API.☆77Updated 6 years ago
- A queue-controlled browser automation tool for improving web crawl quality☆60Updated 4 years ago
- Illuminating the scope and content of digital text collections☆13Updated 9 years ago