benhoyt / soft404
Soft 404 (dead page) detector in Python
☆13Updated 6 years ago
Alternatives and similar repositories for soft404
Users that are interested in soft404 are comparing it to the libraries listed below
- Shepherding our web archives from crawl to access.☆10Updated last year
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.☆144Updated last year
- A Memento Aggregator CLI and Server in Go☆65Updated 3 months ago
- WARC and ARC indexing and discovery tools.☆124Updated 3 months ago
- A collection of tools for archiving and analysing the internet.☆77Updated 2 years ago
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive…☆26Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Updated 7 years ago
- ☆26Updated 3 weeks ago
- An approximate nearest-neighbor search for text reuse.☆12Updated 4 years ago
- Tracking significant changes to the Twitter API or platform as a whole☆20Updated 3 years ago
- ☆8Updated 5 years ago
- Experimental continuous web crawler for web archiving☆9Updated 2 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki☆27Updated 10 months ago
- A commandline tool and Python library for archiving data from Facebook using the Graph API.☆78Updated 7 years ago
- Tools to analyze web archives☆20Updated 8 years ago
- Web Archiving Course☆22Updated last year
- track changes to the news, where news is anything with an RSS feed☆178Updated 5 years ago
- ☆14Updated 8 years ago
- Web Archives for Historical Research☆13Updated 8 years ago
- Algorithms for URL Classification☆19Updated 10 years ago
- Web application for distributed compute analysis of Archive-It web archive collections.☆19Updated 3 months ago
- Tools to construct and process Common Crawl webgraphs☆92Updated last month
- ☆25Updated 2 years ago
- Simple web app that monitors major French real estate websites (seloger.com, leboncoin.com, pap.com)☆12Updated 2 years ago
- Docker image for the Archives Unleashed Toolkit☆12Updated 2 years ago
- A PDF classifier ensemble with REST API service☆23Updated 4 years ago
- Crowdsourcing platform for full text transcription and tagging. https://crowd.loc.gov☆162Updated this week
- WASAPI data transfer APIs☆45Updated 3 years ago
- Illuminating the scope and content of digital text collections☆13Updated 9 years ago
- Tools for tracking stories on news homepages☆48Updated 5 years ago