Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28 Jul 31, 2024 Updated last year
Alternatives and similar repositories for sandcrawler
Users that are interested in sandcrawler are comparing it to the libraries listed below.
- Web archive index server based on RocksDB ☆43 Updated this week
- Official Python package for ArchiveBox, the self-hosted internet archiving solution. ☆12 Oct 5, 2024 Updated last year
- Parses Wikipedia citation templates in Python ☆17 Mar 26, 2025 Updated last year
- search interface for scholarly works ☆85 Aug 2, 2024 Updated last year
- A tool for detecting viruses and NSFW material in WARC files ☆18 Apr 14, 2026 Updated 2 weeks ago
- ☆32 Apr 10, 2026 Updated 3 weeks ago
- Demo app built using AngularJS with Backand serving as the back end ☆13 Mar 1, 2017 Updated 9 years ago
- Material parsers and other tools and scripts, initially developed for Grobid Superconductor ☆13 Feb 21, 2025 Updated last year
- A prototype server to swarm multiple DATs for Webrecorder ☆14 Apr 27, 2019 Updated 7 years ago
- ██████╗ ███████╗██████╗ ██╔══██╗██╔════╝██╔══██╗ ██████╔╝█████╗ ██║ ██║ ██╔══██╗██╔══╝ ██║ ██║ ██║ ██║███████╗██████╔╝ ╚═╝ ╚═╝╚═══… ☆11 Feb 17, 2022 Updated 4 years ago
- Interfacing the Unpaywall Database with Python ☆33 Feb 19, 2024 Updated 2 years ago
- 🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces. ☆57 Aug 15, 2024 Updated last year
- Scripts for Internet Archive ☆14 Mar 26, 2025 Updated last year
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 Feb 27, 2023 Updated 3 years ago
- Archiving GitHub ☆11 Aug 5, 2025 Updated 8 months ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed… ☆158 Oct 8, 2025 Updated 6 months ago
- My Docker-based setup for monitoring a Mastodon instance with Prometheus ☆11 Dec 8, 2022 Updated 3 years ago
- code and data used to build a training dataset for dragnet models ☆10 Nov 29, 2020 Updated 5 years ago
- Repository hosting the common code for the entity-fishing clients ☆10 Mar 26, 2026 Updated last month
- Citation Classification using hybrid neural network model for Wikipedia References ☆31 Dec 8, 2022 Updated 3 years ago
- The EHRI project's portal interface. ☆15 Updated this week
- consume data from Environment and Climate Change Canada ☆13 Jul 20, 2020 Updated 5 years ago
- Verifiable Credential Extensions ☆12 Feb 12, 2025 Updated last year
- WASAPI data transfer APIs ☆50 Apr 23, 2022 Updated 4 years ago
- Run pkg.scripts subtasks in a runner-agnostic way (npm/yarn, whichever launched the main script) ☆11 Dec 25, 2023 Updated 2 years ago
- Analytic platform for the HAL research archive (in development) ☆12 Oct 2, 2020 Updated 5 years ago
- Small string compression using smaz compression algorithm. Fast, because it's in C. Supports Python 3+ ☆13 Oct 18, 2025 Updated 6 months ago
- Load, build and explore Patstat using the Google Cloud Platform ☆10 Jan 19, 2019 Updated 7 years ago
- Poor man's simple harvester for arXiv resources ☆14 Jul 14, 2023 Updated 2 years ago
- The Wikinflection Corpus, from the paper "Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus" (Metheni… ☆12 Dec 15, 2023 Updated 2 years ago
- Scraper for German democracy documents ☆44 Sep 12, 2023 Updated 2 years ago
- A reddit bot that finds original publish dates on linked articles. ☆10 Nov 30, 2024 Updated last year
- A tool for collecting page-level metadata of digitized book-like objects to share with the Internet Archive. ☆13 Mar 9, 2026 Updated last month
- A machine learning software for extracting astronomical entities from scholarly documents ☆10 Oct 31, 2022 Updated 3 years ago
- GreenLambert macOS IDA plugin to deobfuscate strings ☆14 Oct 4, 2021 Updated 4 years ago
- Specifications for better computing ☆10 Nov 19, 2019 Updated 6 years ago
- A simple 404 page that uses the pathname as input to generate a 404 message. ☆13 Apr 28, 2018 Updated 8 years ago
- Standard implementation of TRC404 ☆10 Jan 20, 2025 Updated last year
- A browser extension providing Open Access bibliographical services ☆18 Dec 9, 2022 Updated 3 years ago