trivio/common_crawl_index

Index URLs in Common Crawl

☆197

Alternatives and similar repositories for common_crawl_index

Users who are interested in common_crawl_index are comparing it to the libraries listed below.

paxan / ccooo
Common Crawl One-Oh-One (aka "A Common Crawl Experiment")
☆26 · Oct 31, 2014 · Updated 11 years ago
ikreymer / cc-index-server
Deployment of pywb as a CommonCrawl Index Server
☆22 · Oct 6, 2017 · Updated 8 years ago
internetarchive / webarchive-commons
☆15 · Sep 8, 2016 · Updated 9 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
commoncrawl / commoncrawl
Common Crawl support library to access 2008-2012 crawl archives (ARC files)
☆508 · Nov 29, 2017 · Updated 8 years ago
internetarchive / ia-hadoop-tools
☆23 · Feb 22, 2024 · Updated 2 years ago
Smerity / cs205_ga
How deep does Google Analytics go? Efficiently tackling Common Crawl using AWS & MapReduce
☆17 · Feb 5, 2014 · Updated 12 years ago
commoncrawl / commoncrawl-crawler
The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)
☆226 · Dec 22, 2022 · Updated 3 years ago
commoncrawl / commoncrawl-examples
A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)
☆66 · Aug 5, 2016 · Updated 9 years ago
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 · Feb 15, 2017 · Updated 9 years ago
Smerity / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆56 · Apr 26, 2021 · Updated 5 years ago
gregr / ina
experimental computational medium and supporting tools
☆24 · Updated this week
decultured / Python-Language-Detector
Python Language Detector
☆16 · Jul 19, 2013 · Updated 13 years ago
petewarden / common_crawl_types
A simple Ruby example of how to process Common Crawl files using Elastic MapReduce
☆29 · Mar 25, 2012 · Updated 14 years ago
commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆66 · Jul 3, 2026 · Updated 3 weeks ago
DigitalPebble / behemoth
Behemoth is an open source platform for large scale document analysis based on Apache Hadoop.
☆282 · Apr 25, 2018 · Updated 8 years ago
ag-sc / lemon.dbpedia
lemon lexicon for DBpedia
☆28 · Oct 13, 2015 · Updated 10 years ago
socialsensor / storm-focused-crawler
Collects multimedia content shared through social networks.
☆19 · Feb 18, 2015 · Updated 11 years ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 3 weeks ago
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 · Jan 28, 2024 · Updated 2 years ago
E3-JSI / newsfeed
A pipeline for crawling of RSS feeds and the associated content. Demo at newsfeed.ijs.si.
☆20 · Nov 12, 2012 · Updated 13 years ago
matpalm / common-crawl-quick-hacks
common crawl quick hack examples
☆19 · Feb 11, 2015 · Updated 11 years ago
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 4 months ago
commoncrawl / gzipstream
gzipstream allows Python to process multi-part gzip files from a streaming source
☆23 · Feb 24, 2017 · Updated 9 years ago
wpm / Hadoop-GATE
A Hadoop job that runs GATE applications
☆15 · Oct 16, 2013 · Updated 12 years ago
wiseman / energid_nlp
Natural language parsers and conceptual memory
☆15 · Aug 2, 2012 · Updated 13 years ago
trec-kba / streamcorpus
common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text
☆35 · Sep 30, 2016 · Updated 9 years ago
xiaoganghan / wikientities
Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts
☆59 · Sep 5, 2012 · Updated 13 years ago
liris / atami
gevent-based RSS/Atom feed aggregator/filter written in Python
☆17 · May 1, 2013 · Updated 13 years ago
RoyalCaliber / vertexAPI2
A vertex-centric CUDA/C++ API for large graph analytics on GPUs using the Gather-Apply-Scatter abstraction
☆24 · May 4, 2014 · Updated 12 years ago
YahooArchive / Glimmer
An RDF Search Engine
☆58 · Aug 19, 2017 · Updated 8 years ago
dhutchis / LaraDB
A platform for unified linear and relational algebra analytics, built on the Accumulo NoSQL database
☆13 · Feb 9, 2022 · Updated 4 years ago
ianmilligan1 / Historian-WARC-1
The Historian's WARC Toolkit
☆16 · May 14, 2015 · Updated 11 years ago
cheshire3 / cheshire3
Cheshire3 Search Engine and Information Framework
☆16 · Oct 3, 2015 · Updated 10 years ago
aprolog-lang / aprolog
αProlog
☆18 · Jul 9, 2023 · Updated 3 years ago
CLLKazan / UIMA-Ext
The set of Apache UIMA addons & utilities. Some of them are language-independent. The others may be Russian language-specific.
☆28 · Oct 8, 2021 · Updated 4 years ago
cemoody / wizlang
Amazing language representation
☆79 · Dec 11, 2014 · Updated 11 years ago
ericprud / SWObjects
Semantic Web swiss army knife C++ libraries
☆15 · Updated this week