benhoyt / soft404
Soft 404 (dead page) detector in Python
☆13 · Updated 7 years ago
Alternatives and similar repositories for soft404
Users who are interested in soft404 are comparing it to the repositories listed below.
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆28 · Updated last year
- Discord bot by Sanich for https://youtu.be/1lzPIhTaPDY ☆13 · Updated 4 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ☆156 · Updated 3 months ago
- Perpetual Access To The Scholarly Record ☆120 · Updated last year
- Fast extraction of all external links from wikipedia ☆13 · Updated 7 years ago
- A classifier for detecting soft 404 pages ☆58 · Updated last week
- track changes to the news, where news is anything with an RSS feed ☆182 · Updated 5 years ago
- InvenioRDM Product Roadmap ☆13 · Updated last year
- Scientific analysis of collaborative communities ☆157 · Updated 8 months ago
- Generate RDFS vocabulary files from YAML ☆21 · Updated this week
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive… ☆26 · Updated 3 years ago
- A Memento Aggregator CLI and Server in Go ☆76 · Updated 10 months ago
- Distributed similarity search ☆10 · Updated 5 years ago
- Social Feed Manager user interface application. ☆156 · Updated last year
- Tool for showing Freebase and Google Knowledge Graph entries ☆22 · Updated 2 years ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆132 · Updated 2 months ago
- ☆52 · Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Updated 8 years ago
- S2RDF (SPARQL on Spark for RDF) is a SPARQL query processor for Hadoop based on Spark SQL. It uses the relational interface of Spark for … ☆13 · Updated 7 years ago
- Common Crawl fork of Apache Nutch ☆40 · Updated 3 weeks ago
- Webgraph++ code (http://cnets.indiana.edu/groups/nan/webgraph/) ☆33 · Updated last year
- Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head ☆172 · Updated 5 years ago
- Source code for domain classification (scholar or non-scholar) of a web query. ☆11 · Updated 9 years ago
- Python script that can remove watermark from TikTok videos ☆16 · Updated 5 years ago
- The Openlink Structured Data Sniffer (OSDS) is a plugin for the Chrome, Firefox and Opera browsers that detects and shows structured data… ☆129 · Updated 4 years ago
- The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives. ☆152 · Updated last month
- Docker image for the Archives Unleashed Toolkit ☆12 · Updated 3 years ago
- The OpenWayback Development ☆507 · Updated 2 years ago
- Python library for reading and writing warc files ☆247 · Updated 3 years ago
- A commandline tool and Python library for archiving data from Facebook using the Graph API. ☆78 · Updated 8 years ago