Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49Feb 15, 2017Updated 9 years ago
Alternatives and similar repositories for elasticrawl
Users who are interested in elasticrawl are comparing it to the libraries listed below
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Aug 12, 2018Updated 7 years ago
- A simple Ruby example of how to process Common Crawl files using Elastic MapReduce☆29Mar 25, 2012Updated 13 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework☆168Jan 27, 2026Updated last month
- CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop☆38Dec 17, 2024Updated last year
- Python script to create CDX index files of WARC data☆16Sep 7, 2018Updated 7 years ago
- Fcrepo4 webapp plus optional fcrepo dependencies☆13Sep 30, 2020Updated 5 years ago
- ☆17Apr 19, 2025Updated 10 months ago
- [DEPRECATED] Landing page for cloud.gov. New repo: https://github.com/18F/cg-site☆12Nov 22, 2016Updated 9 years ago
- GraphPass is a utility to filter networks and provide a default visualization output for Gephi or SigmaJS.☆17Nov 14, 2020Updated 5 years ago
- Apache Nutch fork tuned for web services and data discovery.☆10May 18, 2015Updated 10 years ago
- Docker containers for running VIVO☆13Oct 26, 2016Updated 9 years ago
- Rails engine for working with OpenAnnotations stored in Fedora4☆13Aug 4, 2016Updated 9 years ago
- Common web archive utility code.☆62Mar 2, 2026Updated last week
- A collection of ready-to-use messaging applications with fcrepo-camel☆12Dec 12, 2025Updated 2 months ago
- Virtual patent marking crawler at iproduct.epfl.ch☆15Sep 13, 2017Updated 8 years ago
- Warcbase is an open-source platform for managing and analyzing web archives☆162Dec 8, 2017Updated 8 years ago
- Collaborative collection development for web archives☆19Sep 5, 2019Updated 6 years ago
- A Ruby client for the OpenAI API with support for multiple API configurations in a single app, robust and simple error handling, and network-l…☆20Feb 22, 2026Updated 2 weeks ago
- A simple task orchestration library for running complex processes or workflows in Ruby☆28Oct 4, 2024Updated last year
- Ansible Roles and Playbooks for Princeton University Library☆19Mar 2, 2026Updated last week
- Jekyll plugin to embed static IIIF images in Jekyll pages☆23Mar 29, 2022Updated 3 years ago
- OpenRefine Reconciliation Framework in Python and Flask☆22May 1, 2023Updated 2 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…☆52Jun 12, 2020Updated 5 years ago
- An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed…☆157Oct 8, 2025Updated 5 months ago
- Human Readable audit reporting for users of PaperTrail gem☆25Oct 22, 2023Updated 2 years ago
- A crawler based on Phantom. Allows discovery of dynamic content and supports custom scrapers.☆24May 31, 2017Updated 8 years ago
- Repository for the markdownlint-mdl-action GitHub Action☆25Dec 26, 2025Updated 2 months ago
- An LDP implementation backed by BlazeGraph☆26Oct 31, 2017Updated 8 years ago
- Python Linked Data Fragment Server.☆30Jun 4, 2018Updated 7 years ago
- Write PubMed search results with two display options (citation or listview) to PDF or Word☆13Oct 18, 2020Updated 5 years ago
- 😱 A synchronous HTTP screenshot service for headless Chrome☆33Jul 29, 2024Updated last year
- Prototype SOLR-powered web archive exploration UI.☆43Jun 3, 2020Updated 5 years ago
- A Ruby gem that can convert HTML to formatted plain text.☆42Feb 14, 2019Updated 7 years ago
- Metadata ingestion system for Digital Public Library of America☆33Feb 27, 2026Updated last week
- My Angular2 ToDo project☆10Apr 2, 2016Updated 9 years ago
- ☆16Oct 3, 2025Updated 5 months ago
- Deadly Boss Mods (DBM) - Vanilla and Season of Discovery mods (Classic and Retail)☆12Mar 1, 2026Updated last week
- Ruby client for the Gem API☆10Jan 18, 2022Updated 4 years ago
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts☆59Sep 5, 2012Updated 13 years ago