hynky1999 / CmonCrawl
Common Crawl extractor
☆75 Updated 10 months ago
Alternatives and similar repositories for CmonCrawl:
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆135 Updated 2 months ago
- Spider ported to Python ☆68 Updated last month
- Unofficial PyTorch implementation of the Dom-LM paper. ☆33 Updated 2 years ago
- Python API for https://vespa.ai, the open big data serving engine ☆116 Updated this week
- A robust web archive analytics toolkit ☆100 Updated 3 months ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 Updated 6 months ago
- Index Common Crawl archives in tabular format ☆113 Updated last week
- Demo example of consumer goods categorization ☆26 Updated last year
- Utilities for loading and running text embeddings with ONNX ☆44 Updated 7 months ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆132 Updated 3 months ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆175 Updated this week
- 80x faster and 95% accurate language identification with fastText ☆150 Updated last year
- ☆11 Updated 3 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆124 Updated 2 months ago
- Pre-train Static Word Embeddings ☆49 Updated 2 weeks ago
- Efficient few-shot learning with cross-encoders. ☆49 Updated last year
- Python library for extracting entities, relationships and schemas from documents ☆37 Updated 3 months ago
- A CLI tool for managing OpenAI batch processing jobs with ease. ☆34 Updated 6 months ago
- Tools to construct and process webgraphs from Common Crawl data ☆87 Updated last week
- ☆33 Updated last year
- Get structured output from LLMs ☆32 Updated last year
- Code for "The Whole Truth and Nothing But the Truth: Faithful and Controllable Dialogue Response Generation with Dataflow Transduction an… ☆10 Updated 10 months ago
- Tree-based indexes for neural search ☆29 Updated last year
- A Python utility for downloading Common Crawl data ☆234 Updated last year
- 📝 Reference-Free automatic summarization evaluation with potential hallucination detection ☆100 Updated last year
- Voyage AI Official Python Library ☆53 Updated 3 months ago
- ☆20 Updated last year
- A visual tool to interpret and understand PyTorch machine learning models ☆16 Updated last year
- A library to extract the main content from HTML. Developed for supplying information to LLMs and for feeding data into LangChain and LlamaIndex. ☆35 Updated 10 months ago