hynky1999 / CmonCrawl
Common crawl extractor
☆80 Updated last year
Alternatives and similar repositories for CmonCrawl
Users that are interested in CmonCrawl are comparing it to the libraries listed below
- Statistics of Common Crawl monthly archives mined from URL index files ☆193 Updated last week
- Tools to construct and process Common Crawl webgraphs ☆98 Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆148 Updated 9 months ago
- Spider ported to Python ☆94 Updated 8 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆141 Updated 2 months ago
- A polite and user-friendly downloader for Common Crawl data ☆56 Updated last month
- This repository serves as a collection of scrapers procuring and structuring various legal datasets ☆18 Updated 2 years ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆48 Updated last year
- Index Common Crawl archives in tabular format ☆122 Updated 2 months ago
- Blazing fast fuzzy text search for Python. ☆47 Updated 5 months ago
- ☆20 Updated last year
- 💭 Build autonomous agents, retrieval augmented generation (RAG) processes and language model powered chat applications ☆300 Updated 4 months ago
- A fork of Dragnet that also extract author, headline, date, keywords from context, as well as built in metadata extraction all in one pac… ☆294 Updated 4 months ago
- Various Jupyter notebooks about Common Crawl data ☆58 Updated 6 months ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆142 Updated 6 months ago
- A pythonic library providing light-weighted interface with LLMs ☆129 Updated 4 months ago
- Detect and redact PII locally with SOTA performance ☆76 Updated 6 months ago
- A News Article Collection Library ☆22 Updated 2 years ago
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆113 Updated 2 years ago
- Article extraction benchmark: dataset and evaluation scripts ☆331 Updated 2 weeks ago
- Python API for https://vespa.ai, the open big data serving engine ☆143 Updated last week
- Completion After Prompt Probability. Make your LLM make a choice ☆80 Updated 11 months ago
- Python client for txtai ☆14 Updated last month
- Entity resolution, also known as Data Matching or Record linkage is the task of finding a data set that refer to the same or similar real… ☆29 Updated 6 months ago
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆24 Updated 7 months ago
- 🖍️ Highlight text in documents ☆109 Updated 5 months ago
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction ☆80 Updated last year
- GPU-Powered Topic Modelling ☆70 Updated 2 years ago
- A dataset for pretraining language models targeted for legal tasks. ☆138 Updated 3 years ago
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆114 Updated last week