commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆57 · Updated last month
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader are comparing it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆192 · Updated 2 weeks ago
- Tools to construct and process Common Crawl webgraphs ☆96 · Updated 3 weeks ago
- Small python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆56 · Updated 2 weeks ago
- Common crawl extractor ☆79 · Updated last year
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆183 · Updated 8 months ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆227 · Updated last week
- A sentence segmentation library with wide language support optimized for speed and utility. ☆67 · Updated 2 months ago
- search interface for scholarly works ☆86 · Updated last year
- Simplified version of a common crawl fetcher ☆16 · Updated this week
- Faster, modernized fork of the language identification tool langid.py ☆56 · Updated 9 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 · Updated last month
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 · Updated 2 years ago
- Extracts plain text, language identification and more metadata from WARC records ☆23 · Updated last week
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- A set of utilities for processing MediaWiki XML dump data. ☆57 · Updated 7 months ago
- Libraries, Archives and Museums (LAM) ☆85 · Updated 2 years ago
- State-of-the-art web crawler 🔱 ☆326 · Updated this week
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated 11 months ago
- LLM plugin for embeddings using sentence-transformers ☆71 · Updated 4 months ago
- an experimental implementation of Burrow's delta in Python 3 ☆21 · Updated 3 years ago
- ☆67 · Updated last year
- Streaming WARC/ARC library for fast web archive IO ☆430 · Updated 9 months ago
- Tracking instruction-tuned LLM openness. Paper: Liesenfeld, Andreas, Alianda Lopez, and Mark Dingemanse. 2023. “Opening up ChatGPT: Track… ☆119 · Updated 6 months ago
- Datasette enrichment for analyzing row data using OpenAI's GPT models ☆20 · Updated last year
- 🔢 Work with static vector models ☆29 · Updated 4 months ago
- A tool for detecting viruses and NSFW material in WARC files ☆16 · Updated last year
- Fast and robust date extraction from web pages, with Python or on the command-line ☆140 · Updated last month
- image-to-text model for PDF.js ☆46 · Updated 6 months ago
- WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections. ☆258 · Updated 7 months ago