commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆67Aug 17, 2025Updated 5 months ago
Alternatives and similar repositories for cc-downloader
Users that are interested in cc-downloader are comparing it to the libraries listed below
- Mini Model Daemon☆12Nov 9, 2024Updated last year
- Rails application for the Archives Unleashed Cloud.☆11Jun 30, 2021Updated 4 years ago
- Scientific articles using or citing Common Crawl data☆28Jan 9, 2026Updated last month
- utility to fetch provenance information from Internet Archive's Wayback Machine☆14Feb 5, 2026Updated last week
- ☆14Feb 28, 2017Updated 8 years ago
- Simple, fast dictionary-based language detector for short texts.☆20Feb 5, 2026Updated last week
- A utility to recursively search for files by name in a filesystem, also looking inside archives to an arbitrary depth.☆13Dec 8, 2023Updated 2 years ago
- GraphPass is a utility to filter networks and provide a default visualization output for Gephi or SigmaJS.☆17Nov 14, 2020Updated 5 years ago
- Napkin is a simple tool to produce statistical analysis of a text☆12Feb 25, 2024Updated last year
- Internet Archive's Sparkling Data Processing Library☆15Feb 6, 2026Updated last week
- ☆17Mar 31, 2025Updated 10 months ago
- Simplified version of a common crawl fetcher☆17Dec 24, 2025Updated last month
- ReproZip for the Preservation of Web Applications☆17May 6, 2024Updated last year
- Submodule of evalverse forked from [google-research/instruction_following_eval](https://github.com/google-research/google-research/tree/m…☆14May 4, 2024Updated last year
- A UserScript to detect GPT generated comments on Hackernews.☆13Dec 10, 2022Updated 3 years ago
- A client for the Archive-It And Webrecorder WASAPI Data Transfer API☆16Oct 18, 2019Updated 6 years ago
- LLM FX: A LLM Server Desktop Client free for everyone!☆33Dec 19, 2025Updated last month
- Web application for distributed compute analysis of Archive-It web archive collections.☆20Oct 9, 2025Updated 4 months ago
- A tool for collecting archival slivers of the web and web archives☆17Feb 18, 2025Updated 11 months ago
- Crawler that retrieves commoncrawl's crawled hosts and their corresponding IPs☆21Sep 1, 2025Updated 5 months ago
- Erku is an IPTV and video on demand client for the Roku OS.☆12Dec 29, 2024Updated last year
- Tools for helping you work with web platform archive downloads.☆18Mar 27, 2020Updated 5 years ago
- A robust web archive analytics toolkit☆130Oct 15, 2025Updated 3 months ago
- Base45☆22Apr 30, 2024Updated last year
- CocktailParty is a data broker system based on phoenix framework☆23Apr 23, 2025Updated 9 months ago
- BPE modification that implements removing of the intermediate tokens during tokenizer training.☆26Nov 25, 2024Updated last year
- The classic BBC Micro game Elite… in teletext☆22Jan 15, 2026Updated 3 weeks ago
- golang readers for ARC and WARC webarchive formats☆20Apr 3, 2023Updated 2 years ago
- Converts WARC files to static HTML☆51Sep 18, 2025Updated 4 months ago
- ☆24Mar 12, 2025Updated 11 months ago
- Capture a URL with Playwright☆30Updated this week
- Service for creating Twitter datasets for research and archiving.☆26Dec 7, 2022Updated 3 years ago
- A Simple Network Stream Recorder☆35Mar 23, 2019Updated 6 years ago
- A script to change authorship to ODT and DOCX comments, redlines and whatnot.☆34Feb 26, 2025Updated 11 months ago
- Some random tools for working with the GGUF file format☆30Nov 24, 2023Updated 2 years ago
- Continual pretraining of foundation LLM using ⚡ Lightning Fabric☆37Nov 27, 2024Updated last year
- ☆34Mar 22, 2025Updated 10 months ago
- Maltego Transform to put entities into MISP events☆28Jul 24, 2021Updated 4 years ago
- Statistics of Common Crawl monthly archives mined from URL index files☆208Feb 3, 2026Updated last week