hynky1999 / CmonCrawl
Common crawl extractor
☆80, updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters (☆149, updated 10 months ago)
- Statistics of Common Crawl monthly archives mined from URL index files (☆194, updated last week)
- Fast and robust date extraction from web pages, with Python or on the command-line (☆141, updated 2 months ago)
- Index Common Crawl archives in tabular format (☆121, updated this week)
- Tools to construct and process Common Crawl webgraphs (☆99, updated last week)
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. (☆142, updated 6 months ago)
- GPU-Powered Topic Modelling (☆69, updated 2 years ago)
- Spider ported to Python (☆94, updated 9 months ago)
- Blazing fast fuzzy text search for Python. (☆47, updated 6 months ago)
- A polite and user-friendly downloader for Common Crawl data (☆57, updated 2 months ago)
- A News Article Collection Library (☆22, updated 2 years ago)
- Article extraction benchmark: dataset and evaluation scripts (☆335, updated last month)
- A fork of Dragnet that also extracts author, headline, date, and keywords from context, as well as built-in metadata extraction, all in one pac… (☆295, updated 5 months ago)
- A research Python package for detecting, categorizing, and assessing the severity of personally identifiable information (PII) (☆94, updated 3 weeks ago)
- Detect and redact PII locally with SOTA performance (☆79, updated 7 months ago)
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine (☆185, updated last week)
- Chrome Extension for exploring Hugging Face datasets 🔎 (☆49, updated last year)
- Completion After Prompt Probability. Make your LLM make a choice (☆80, updated 11 months ago)
- This repository serves as a collection of scrapers procuring and structuring various legal datasets (☆17, updated 2 years ago)
- Python API for https://vespa.ai, the open big data serving engine (☆144, updated last week)
- LLM-powered autonomous agent with hierarchical task management (☆52, updated 2 years ago)
- Various Jupyter notebooks about Common Crawl data (☆59, updated 6 months ago)
- A Pythonic library providing a lightweight interface to LLMs (☆129, updated 5 months ago)
- NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, … (☆86, updated 11 months ago)
- A robust web archive analytics toolkit (☆119, updated 2 weeks ago)
- 📚 Datasets and models for instruction-tuning (☆237, updated 2 years ago)
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… (☆23, updated 7 months ago)
- Pre-train Static Word Embeddings (☆87, updated last month)
- A dataset for pretraining language models targeted for legal tasks. (☆138, updated 3 years ago)
- 💭 Build autonomous agents, retrieval augmented generation (RAG) processes and language model powered chat applications (☆302, updated 5 months ago)