Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28Jul 31, 2024Updated last year
Alternatives and similar repositories for sandcrawler
Users that are interested in sandcrawler are comparing it to the libraries listed below.
- Web archive index server based on RocksDB☆38Apr 1, 2026Updated last week
- Official Python package for ArchiveBox, the self-hosted internet archiving solution.☆12Oct 5, 2024Updated last year
- ☆30Jun 6, 2024Updated last year
- Homebrew formula for the ArchiveBox self-hosted internet archiving solution.☆28Oct 5, 2024Updated last year
- Trough: Big data, small databases.☆42Jul 25, 2024Updated last year
- Parses Wikipedia citation templates in Python☆17Mar 26, 2025Updated last year
- A tool for detecting viruses and NSFW material in WARC files☆18Dec 16, 2025Updated 3 months ago
- ☆31Updated this week
- Material parsers and other tools and scripts, initially developed for Grobid Superconductor☆13Feb 21, 2025Updated last year
- A prototype server to swarm multiple DATs for Webrecorder☆14Apr 27, 2019Updated 6 years ago
- ██████╗ ███████╗██████╗ ██╔══██╗██╔════╝██╔══██╗ ██████╔╝█████╗ ██║ ██║ ██╔══██╗██╔══╝ ██║ ██║ ██║ ██║███████╗██████╔╝ ╚═╝ ╚═╝╚═══…☆11Feb 17, 2022Updated 4 years ago
- Translation of query languages to serialized KoralQuery protocol☆14Mar 30, 2026Updated last week
- Interfacing the Unpaywall Database with Python☆33Feb 19, 2024Updated 2 years ago
- 🎭 An introduction to the Internet Archiving ecosystem, tooling, and some of the ethical dilemmas that the community faces.☆57Aug 15, 2024Updated last year
- produce a stream of citation data coming off Wikimedia☆12Mar 28, 2017Updated 9 years ago
- Scripts for Internet Archive☆14Mar 26, 2025Updated last year
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆158Oct 8, 2025Updated 6 months ago
- code and data used to build a training dataset for dragnet models☆10Nov 29, 2020Updated 5 years ago
- A tool for harvesting media files from Open Access articles for upload into Wikimedia Commons☆25Jul 3, 2016Updated 9 years ago
- Repository hosting the common code for the entity-fishing clients☆10Mar 26, 2026Updated 2 weeks ago
- ☆11Oct 5, 2023Updated 2 years ago
- The EHRI project's portal interface.☆15Mar 9, 2026Updated last month
- consume data from Environment and Climate Change Canada☆13Jul 20, 2020Updated 5 years ago
- WASAPI data transfer APIs☆50Apr 23, 2022Updated 3 years ago
- The grobidmonkey package is an open-source package designed for postprocessing GROBID outputs.☆12Mar 27, 2024Updated 2 years ago
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Feb 27, 2026Updated last month
- Conifer setup and deployment via Ansible☆12Jun 15, 2020Updated 5 years ago
- The Wikinflection Corpus, from the paper "Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus" (Metheni…☆12Dec 15, 2023Updated 2 years ago
- Tools and configurations for translating SNMP into Prometheus☆14Mar 28, 2026Updated 2 weeks ago
- Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!☆16Oct 30, 2024Updated last year
- Scraper for German democracy documents☆44Sep 12, 2023Updated 2 years ago
- A Reddit bot that finds original publish dates on linked articles.☆10Nov 30, 2024Updated last year
- A tool for collecting page-level metadata of digitized book-like objects to share with the Internet Archive.☆14Mar 9, 2026Updated last month
- Specifications for better computing☆10Nov 19, 2019Updated 6 years ago
- Single file C header for UTF-x-to-y conversions + helpers☆13Jun 11, 2023Updated 2 years ago
- SMOR (Stuttgart Morphology) with alternative lemmatization component☆13Aug 10, 2023Updated 2 years ago
- Search and Proxy for Google web fonts☆16Sep 28, 2024Updated last year
- A timezone converter for online events☆10May 10, 2020Updated 5 years ago
- Natural language detection, Java bindings for CLD2☆17Feb 26, 2026Updated last month