hynky1999 / CmonCrawl
Common Crawl extractor
☆84 Updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆208 Updated this week
- Tools to construct and process Common Crawl webgraphs ☆104 Updated 3 weeks ago
- A News Article Collection Library ☆22 Updated 2 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆158 Updated last month
- Blazing fast fuzzy text search for Python. ☆51 Updated 9 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆143 Updated 2 months ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆48 Updated last year
- Completion After Prompt Probability. Make your LLM make a choice ☆82 Updated last year
- Unofficial PyTorch implementation of Dom-LM paper. ☆33 Updated 2 years ago
- Efficient few-shot learning with cross-encoders. ☆61 Updated last year
- A robust web archive analytics toolkit ☆127 Updated 3 months ago
- Zero-trust AI APIs for easy and private consumption of open-source LLMs ☆41 Updated last year
- Pre-train Static Word Embeddings ☆94 Updated 4 months ago
- Detect and redact PII locally with SOTA performance ☆87 Updated 9 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆298 Updated 8 months ago
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆26 Updated 10 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆346 Updated 3 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆154 Updated this week
- GPU-Powered Topic Modelling ☆71 Updated 3 years ago
- A polite and user-friendly downloader for Common Crawl data ☆67 Updated 5 months ago
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆155 Updated this week
- 📚 Datasets and models for instruction-tuning ☆238 Updated 2 years ago
- Spider ported to Python ☆101 Updated 11 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆196 Updated this week
- Unofficial Python bindings for the Rust llm library. 🐍❤️🦀 ☆76 Updated 2 years ago
- Vector Database with support for late interaction and token level embeddings. ☆54 Updated 7 months ago
- PyLate efficient inference engine ☆69 Updated 2 weeks ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆146 Updated 9 months ago
- 🔢 Work with static vector models ☆36 Updated 8 months ago
- ☆185 Updated 2 years ago