hynky1999 / CmonCrawl
Common crawl extractor
☆75 Updated 11 months ago
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl compare it to the libraries listed below.
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆137 Updated 4 months ago
- Efficient few-shot learning with cross-encoders. ☆51 Updated last year
- Spider ported to Python ☆82 Updated 3 months ago
- Pre-train Static Word Embeddings ☆60 Updated last month
- 👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated) ☆23 Updated 2 years ago
- Detect and redact PII locally with SOTA performance ☆47 Updated last month
- Blazing fast fuzzy text search for Python. ☆43 Updated 3 weeks ago
- Writing Blog Posts with Generative Feedback Loops! ☆47 Updated last year
- GPU prices aggregator for cloud providers ☆37 Updated 2 weeks ago
- Various Jupyter notebooks about Common Crawl data ☆53 Updated last month
- Supervised instruction finetuning for LLM with HF trainer and Deepspeed ☆35 Updated last year
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆61 Updated last year
- Fast and robust date extraction from web pages, with Python or on the command-line ☆126 Updated 4 months ago
- A CLI tool for managing OpenAI batch processing jobs with ease. ☆35 Updated 2 weeks ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆66 Updated 6 months ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆178 Updated last week
- NLP with Rust for Python 🦀🐍 ☆62 Updated this week
- A robust web archive analytics toolkit ☆107 Updated last month
- ☆20 Updated last year
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆50 Updated 7 months ago
- Tools to construct and process Common Crawl webgraphs ☆90 Updated last week
- Zero-trust AI APIs for easy and private consumption of open-source LLMs ☆40 Updated 9 months ago
- Efficient BM25 with DuckDB 🦆 ☆48 Updated 4 months ago
- Library for fast text representation and classification. ☆28 Updated last year
- One Line To Build Zero-Data Classifiers in Minutes ☆54 Updated 7 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆39 Updated last year
- Python API for https://vespa.ai, the open big data serving engine ☆123 Updated last week
- utilities for loading and running text embeddings with onnx ☆44 Updated 9 months ago
- get structured output from LLM's ☆33 Updated 2 years ago
- Demo example of consumer goods categorization ☆28 Updated last year