Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28Jul 31, 2024Updated last year
Alternatives and similar repositories for sandcrawler
Users that are interested in sandcrawler are comparing it to the libraries listed below.
- Official Python package for ArchiveBox, the self-hosted internet archiving solution.☆12Oct 5, 2024Updated last year
- Perpetual Access To The Scholarly Record☆121Jul 31, 2024Updated last year
- Homebrew formula for the ArchiveBox self-hosted internet archiving solution.☆28Updated this week
- Tools to analyze web archives☆20Jul 12, 2016Updated 9 years ago
- Trough: Big data, small databases.☆42Jul 25, 2024Updated last year
- Parses Wikipedia citation templates in Python☆17Mar 26, 2025Updated last year
- search interface for scholarly works☆85Aug 2, 2024Updated last year
- A tool for detecting viruses and NSFW material in WARC files☆18Updated this week
- ☆32May 3, 2026Updated last month
- Material parsers and other tools and scripts, initially developed for Grobid Superconductor☆14Feb 21, 2025Updated last year
- ☆11Feb 17, 2022Updated 4 years ago
- produce a stream of citation data coming off wikimedia☆12Mar 28, 2017Updated 9 years ago
- 🗄 Save an archived copy of websites from Pocket/Pinboard/Bookmarks/RSS. Outputs HTML, PDFs, and more...☆38Aug 12, 2018Updated 7 years ago
- Scripts for Internet Archive☆14Mar 26, 2025Updated last year
- Archiving GitHub☆11Aug 5, 2025Updated 10 months ago
- Python script to create CDX index files of WARC data☆21Sep 4, 2025Updated 9 months ago
- code and data used to build a training dataset for dragnet models☆10Nov 29, 2020Updated 5 years ago
- 404Games Wastelands V2 - Chernarus☆25Jun 25, 2013Updated 12 years ago
- Citation Classification using hybrid neural network model for Wikipedia References☆31Dec 8, 2022Updated 3 years ago
- consume data from Environment and Climate Change Canada☆13Jul 20, 2020Updated 5 years ago
- Verifiable Credential Extensions☆12Feb 12, 2025Updated last year
- Run pkg.scripts subtasks in a runner-agnostic way (npm/yarn, whichever launched the main script)☆11Dec 25, 2023Updated 2 years ago
- Analytic platform for the HAL research archive (in development)☆12Oct 2, 2020Updated 5 years ago
- [WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages☆17Apr 14, 2026Updated last month
- Conifer setup and deployment via Ansible☆12Jun 15, 2020Updated 5 years ago
- Poor man's simple harvester for arXiv resources☆14Jul 14, 2023Updated 2 years ago
- The Wikinflection Corpus, from the paper "Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus" (Metheni…☆12Dec 15, 2023Updated 2 years ago
- Tools and configurations for translating SNMP into Prometheus☆14Apr 11, 2026Updated 2 months ago
- Yet another Solar System simulator, written in Go.☆13Dec 9, 2020Updated 5 years ago
- Specifications for better computing☆10Nov 19, 2019Updated 6 years ago
- SMOR (Stuttgart Morphology) with alternative lemmatization component☆13Aug 10, 2023Updated 2 years ago
- A simple 404 page that uses the pathname as input to generate a 404 message.☆13Apr 28, 2018Updated 8 years ago
- Search and Proxy for Google web fonts☆17Sep 28, 2024Updated last year
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- Utility to compile string of chemical terms into data structure with chemical formula and composition☆13Sep 17, 2021Updated 4 years ago
- A default backend (404 page) for nginx-ingress in Kubernetes☆13Jan 23, 2018Updated 8 years ago
- Anomaly detection in time-series networks. Spatio-temporal Anomaly Detection☆12Jan 9, 2020Updated 6 years ago
- Web privacy analysis of Sweden's 290 municipalities.☆11Nov 18, 2022Updated 3 years ago
- ☆17Jul 17, 2025Updated 10 months ago