hynky1999 / CmonCrawl
Common Crawl extractor
☆75 · Updated 9 months ago
Alternatives and similar repositories for CmonCrawl:
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Spider ported to Python ☆68 · Updated last month
- Index Common Crawl archives in tabular format ☆113 · Updated this week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆135 · Updated 2 months ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆175 · Updated this week
- 👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated) ☆23 · Updated last year
- Efficient few-shot learning with cross-encoders. ☆50 · Updated last year
- Python API for https://vespa.ai, the open big data serving engine ☆116 · Updated this week
- A News Article Collection Library ☆22 · Updated last year
- Vector Database with support for late interaction and token level embeddings. ☆53 · Updated 5 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆168 · Updated 2 months ago
- Parallel wasm Barnes-Hut t-SNE implementation written in Rust. ☆17 · Updated 9 months ago
- Completion After Prompt Probability. Make your LLM make a choice ☆74 · Updated 4 months ago
- Various Jupyter notebooks about Common Crawl data ☆51 · Updated 3 weeks ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆38 · Updated last year
- A robust web archive analytics toolkit ☆99 · Updated 3 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆124 · Updated 2 months ago
- Unofficial PyTorch implementation of the Dom-LM paper. ☆33 · Updated 2 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 · Updated last year
- This is the repo for the container that holds the models for the text2vec-transformers module ☆49 · Updated last month
- Transform Unstructured Data into Synthetic Datasets ☆26 · Updated 6 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆305 · Updated 10 months ago
- LLM prompt language based on Jinja. Banks provides tools and functions to build prompts text and chat messages from generic blueprints. I… ☆81 · Updated last week
- This repo is for handling Question Answering, especially for Multi-hop Question Answering ☆67 · Updated last year
- Code and data for "StructLM: Towards Building Generalist Models for Structured Knowledge Grounding" (COLM 2024) ☆76 · Updated 4 months ago
- Tools to construct and process webgraphs from Common Crawl data ☆87 · Updated this week
- LLM plugin for embeddings using sentence-transformers ☆52 · Updated last month
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction ☆67 · Updated 7 months ago
- One Line To Build Zero-Data Classifiers in Minutes ☆36 · Updated 5 months ago
- H&M Fashion Image similarity search with Weaviate and DocArray ☆42 · Updated 11 months ago