rossf7/elasticrawl

Launch AWS Elastic MapReduce jobs that process Common Crawl data.

☆49

Alternatives and similar repositories for elasticrawl

Users who are interested in elasticrawl compare it to the repositories listed below.

rossf7 / wikireverse
Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.
☆38 · Aug 12, 2018 · Updated 7 years ago
cloud-gov / cg-landing
[DEPRECATED] Landing page for cloud.gov. New repo: https://github.com/18F/cg-site
☆12 · Nov 22, 2016 · Updated 9 years ago
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated last week
petewarden / common_crawl_types
A simple Ruby example of how to process Common Crawl files using Elastic MapReduce
☆29 · Mar 25, 2012 · Updated 14 years ago
commoncrawl / cc-mrjob
Demonstration of using Python to process the Common Crawl dataset with the mrjob framework
☆168 · Jan 27, 2026 · Updated 5 months ago
rajbot / CDX-Writer
Python script to create CDX index files of WARC data
☆16 · Sep 7, 2018 · Updated 7 years ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
gwu-libraries / vivo-docker
Docker containers for running VIVO
☆13 · Oct 26, 2016 · Updated 9 years ago
gedankenstuecke / scihub_analysis
Analyzing the April 2016 Data about the Usage of Sci-Hub
☆28 · May 24, 2016 · Updated 10 years ago
sul-dlss-deprecated / triannon
Rails engine for working with storage of OpenAnnotations stored in Fedora4
☆13 · Aug 4, 2016 · Updated 9 years ago
iipc / warcaroo
☆18 · Apr 29, 2026 · Updated 2 months ago
trivio / common_crawl_index
Index URLs in Common Crawl
☆197 · Sep 19, 2017 · Updated 8 years ago
unt-libraries / django-premis-event-service
Django app for managing PREMIS Events
☆14 · Apr 28, 2026 · Updated 2 months ago
skeskali / MPFormLetters
Templates for form letters to Canadian MPs
☆20 · Jan 30, 2017 · Updated 9 years ago
fcrepo-exts / fcrepo-camel-toolbox
A collection of ready-to-use messaging applications with fcrepo-camel
☆13 · Jul 1, 2026 · Updated last week
internetarchive / arklet
ARK minter, binder, resolver
☆23 · May 28, 2026 · Updated last month
llawlor / meteorite
Two-way data binding for Rails
☆13 · Sep 1, 2017 · Updated 8 years ago
eldraco / unanomaly
A generic data anomaly finder. You can use a beautiful web page, drag-and-drop your csv dataset and easily find the top N anomalies in th…
☆33 · Oct 13, 2022 · Updated 3 years ago
newsreader / eso-and-ceo
Events and Situations Ontology
☆14 · Apr 20, 2018 · Updated 8 years ago
ept / warc-hadoop
WARC (Web Archive) Input and Output Formats for Hadoop
☆38 · Dec 7, 2014 · Updated 11 years ago
brianwc / bulk_scotus
The JSON files from CourtListener.com for the Supreme Court of the United States
☆12 · Jul 9, 2015 · Updated 11 years ago
pulibrary / princeton_ansible
Ansible Roles and Playbooks for Princeton University Library
☆19 · Updated this week
fcrepo / fcrepo-specification
Fedora API Specification
☆17 · May 6, 2021 · Updated 5 years ago
jpbruinsslot / warc3
Python 3 library for reading and writing warc files
☆21 · Jan 29, 2018 · Updated 8 years ago
nbgallery / ipydeps
☆13 · Jul 2, 2025 · Updated last year
secML / secML.github.io
Website for Security and Privacy of Machine Learning
☆14 · Dec 27, 2021 · Updated 4 years ago
rgrishman / ice
Ice is a rapid information extraction customizer
☆15 · Apr 26, 2021 · Updated 5 years ago
kelseyhightower / hub-credential-helper
☆17 · Apr 24, 2023 · Updated 3 years ago
jermnelson / linked-data-fragments
Python Linked Data Fragment Server.
☆30 · Jun 4, 2018 · Updated 8 years ago
diadem / OXPath
XPath extension for extraction from interactive web sites. NOTE: This code is currently out of sync. A more recent, but precompiled versi…
☆26 · Feb 27, 2013 · Updated 13 years ago
vinaygoel / archive-analysis
Tools to analyze web archives
☆20 · Jul 12, 2016 · Updated 10 years ago
quandyfactory / sobidata
Download your Social Bicycles (SoBi) route data and save it locally in various formats.
☆10 · Jun 9, 2017 · Updated 9 years ago
johnb30 / gdelt_download
Set of scripts to aid in the download of the GDELT data files from gdelt.utdallas.edu
☆18 · May 14, 2014 · Updated 12 years ago
zouloux / phoenix-space-grid
Phoenix script for mac Virtual Spaces
☆12 · Apr 17, 2025 · Updated last year
littleweaver / wagtail-react-deck
decoupled Wagtail/React slide deck
☆12 · Jul 14, 2016 · Updated 9 years ago
apassant / doom-cover
Automatically generate Doom-metal album covers
☆10 · May 29, 2015 · Updated 11 years ago
gazayas / key_change
☆10 · May 18, 2017 · Updated 9 years ago
web-archive-group / ELXN42-Article
☆10 · Apr 26, 2016 · Updated 10 years ago
TeamHG-Memex / scrapy-kafka-export
Scrapy extension which writes crawled items to Kafka
☆31 · Apr 8, 2026 · Updated 3 months ago