Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28Jul 31, 2024Updated last year
Alternatives and similar repositories for sandcrawler
Users that are interested in sandcrawler are comparing it to the libraries listed below.
- Web archive index server based on RocksDB☆43Jun 8, 2026Updated 3 weeks ago
- Converts HTTrack crawls to WARC files☆34Aug 6, 2024Updated last year
- Homebrew formula for the ArchiveBox self-hosted internet archiving solution.☆29Jun 14, 2026Updated 2 weeks ago
- Tools to analyze web archives☆20Jul 12, 2016Updated 9 years ago
- EpochFS is a versioned cloud file system with git-like branching, transaction support.☆17Apr 23, 2026Updated 2 months ago
- Parses Wikipedia citation templates in Python☆17Mar 26, 2025Updated last year
- A tool for detecting viruses and NSFW material in WARC files☆18Jun 9, 2026Updated 3 weeks ago
- Material parsers and other tools and scripts, initially developed for Grobid Superconductor☆14Feb 21, 2025Updated last year
- Translation of query languages to serialized KoralQuery protocol☆15Jun 4, 2026Updated 3 weeks ago
- 🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.☆57Aug 15, 2024Updated last year
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German☆12Feb 27, 2023Updated 3 years ago
- Archiving GitHub☆11Aug 5, 2025Updated 10 months ago
- Python script to create CDX index files of WARC data☆21Sep 4, 2025Updated 9 months ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆161Oct 8, 2025Updated 8 months ago
- My Docker-based setup for monitoring a Mastodon instance with Prometheus☆11Dec 8, 2022Updated 3 years ago
- Repository hosting the common code for the entity-fishing clients☆10May 18, 2026Updated last month
- ☆11Oct 5, 2023Updated 2 years ago
- Citation Classification using hybrid neural network model for Wikipedia References☆31Dec 8, 2022Updated 3 years ago
- The EHRI project's portal interface.☆15Updated this week
- consume data from Environment and Climate Change Canada☆13Jul 20, 2020Updated 5 years ago
- Verifiable Credential Extensions☆12Feb 12, 2025Updated last year
- WASAPI data transfer APIs☆50Apr 23, 2022Updated 4 years ago
- Run pkg.scripts subtasks in a runner-agnostic way (npm/yarn, whichever launched the main script)☆11Dec 25, 2023Updated 2 years ago
- Analytic platform for the HAL research archive (in development)☆12Oct 2, 2020Updated 5 years ago
- Poor man's simple harvester for arXiv resources☆14Jul 14, 2023Updated 2 years ago
- The Wikinflection Corpus, from the paper "Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus" (Metheni…☆12Dec 15, 2023Updated 2 years ago
- Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!☆16Oct 30, 2024Updated last year
- Tools and configurations for translating SNMP into Prometheus☆14Jun 14, 2026Updated 2 weeks ago
- A tool for collecting page-level metadata of digitized book-like objects to share with the Internet Archive.☆14Jun 10, 2026Updated 3 weeks ago
- A machine learning software for extracting astronomical entities from scholarly documents☆10Oct 31, 2022Updated 3 years ago
- ☆12Apr 16, 2025Updated last year
- WindSR Dataset contains more than 22,000 pairs of HR/LR wind speed images, which are processed using the NASA's GEOS-5 Nature Run dataset…☆12Jan 18, 2024Updated 2 years ago
- The reasoning carried out during 404CTF 2023 to solve the proposed challenges☆10Dec 16, 2023Updated 2 years ago
- Utility to compile string of chemical terms into data structure with chemical formula and composition☆13Sep 17, 2021Updated 4 years ago
- A default backend (404 page) for nginx-ingress in Kubernetes☆13Jan 23, 2018Updated 8 years ago
- Nim and awk based bot for Wikipedia☆12Feb 28, 2020Updated 6 years ago
- Software used on the HAL platform☆12Jul 13, 2021Updated 4 years ago
- Anomaly detection in time-series networks. Spatio-temporal Anomaly Detection☆12Jan 9, 2020Updated 6 years ago
- Web privacy analysis of Sweden's 290 municipalities.☆11Nov 18, 2022Updated 3 years ago