Index Common Crawl archives in tabular format
☆125 · Feb 19, 2026 · Updated last week
Alternatives and similar repositories for cc-index-table
Users that are interested in cc-index-table are comparing it to the libraries listed below
- Process Common Crawl data with Python and Spark ☆452 · Jan 20, 2026 · Updated last month
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆199 · Jan 23, 2026 · Updated last month
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Dec 4, 2017 · Updated 8 years ago
- Common web archive utility code. ☆61 · Feb 6, 2026 · Updated 3 weeks ago
- Scientific articles using or citing Common Crawl data ☆28 · Jan 9, 2026 · Updated last month
- Source real estate prices from the Common Crawl. ☆27 · Oct 22, 2018 · Updated 7 years ago
- Tools to construct and process Common Crawl webgraphs ☆105 · Feb 19, 2026 · Updated last week
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Oct 6, 2017 · Updated 8 years ago
- ☆24 · Mar 20, 2024 · Updated last year
- Build wordlists from the common-crawl index ☆12 · Oct 9, 2022 · Updated 3 years ago
- A UI designer for constructing AI applications with OpenSearch ☆16 · Updated this week
- Repository for public code and data associated with the paper "Fake News on Twitter During the 2016 U.S. Presidential Election" ☆12 · Dec 5, 2019 · Updated 6 years ago
- This is a solution accelerator for creating personalized content recommendations based on user activity. ☆13 · Mar 26, 2024 · Updated last year
- Meta-Analysis of Robust04 Papers (Yang et al., SIGIR 2019) ☆12 · May 25, 2019 · Updated 6 years ago
- An academic open source and open data web crawler ☆27 · Nov 20, 2017 · Updated 8 years ago
- Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration ☆15 · Jun 4, 2024 · Updated last year
- Introduction to Spatial Information Systems I repository ☆14 · Jul 13, 2017 · Updated 8 years ago
- Scripts to load the GDELT data set into MongoDB ☆14 · Dec 8, 2022 · Updated 3 years ago
- Expose Datasette instances to LLM as a tool ☆26 · May 27, 2025 · Updated 9 months ago
- Utility for cui2vec in Go ☆13 · Feb 25, 2023 · Updated 3 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 · Jun 12, 2020 · Updated 5 years ago
- Wikidata authority file mapping tool ☆11 · Sep 2, 2018 · Updated 7 years ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆168 · Jan 27, 2026 · Updated last month
- This is the ultimate web scraping tool for extracting the most relevant data points from products on Walmart.com! This powerful scraper i… ☆19 · Mar 6, 2023 · Updated 2 years ago
- R code needed to reproduce Relationship between Reddit Comment Score and Comment Length for 1.66 Billion Comments visualization ☆17 · Jul 8, 2015 · Updated 10 years ago
- Materials to reproduce findings in our story, "Google’s Top Search Result? Surprise! It’s Google" ☆34 · Jul 28, 2020 · Updated 5 years ago
- Create and edit WARC and WACZ files ☆24 · Dec 6, 2024 · Updated last year
- ☆20 · Jan 2, 2026 · Updated 2 months ago
- Ranking Entity Types using the Web of Data ☆30 · Nov 22, 2016 · Updated 9 years ago
- Repository of data on web domains. ☆19 · May 24, 2023 · Updated 2 years ago
- Tokenization across languages. Useful as preprocessing for subword tokenization. ☆21 · Feb 7, 2023 · Updated 3 years ago
- Common crawl extractor ☆84 · May 21, 2024 · Updated last year
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 · Aug 12, 2018 · Updated 7 years ago
- A whirlwind tour of Common Crawl's data using Python ☆35 · Feb 17, 2026 · Updated last week
- Pocketsphinx-based Linux Voice Dictation ☆25 · Jun 12, 2020 · Updated 5 years ago
- Script to transform the Disconnect block-list into Safebrowsing v2 format for Firefox Tracking Protection ☆16 · Updated this week
- Create maintainable nomad job files ☆25 · Mar 25, 2024 · Updated last year
- Tools and libraries for interacting with the Netograph API ☆48 · Mar 7, 2023 · Updated 2 years ago
- Inspired by Google C4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese… ☆135 · Jun 7, 2023 · Updated 2 years ago