hynky1999 / CmonCrawl
Common Crawl extractor
☆75 · Updated 11 months ago
Alternatives and similar repositories for CmonCrawl:
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆137 · Updated 3 months ago
- Efficient few-shot learning with cross-encoders. ☆51 · Updated last year
- Python API for https://vespa.ai, the open big data serving engine ☆121 · Updated this week
- High-level library for batched embeddings generation, blazingly fast web-based RAG and quantized indexes processing ⚡ ☆66 · Updated 5 months ago
- Demo example of consumer goods categorization ☆27 · Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files ☆177 · Updated last week
- Unofficial PyTorch implementation of the DOM-LM paper. ☆33 · Updated 2 years ago
- Pre-train Static Word Embeddings ☆56 · Updated 2 weeks ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆127 · Updated 4 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆312 · Updated last year
- Notebooks for training universal zero-shot classifiers on many different tasks ☆124 · Updated 3 months ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆43 · Updated 4 years ago
- 🔢 Work with static vector models ☆28 · Updated this week
- A robust web archive analytics toolkit ☆103 · Updated 3 weeks ago
- NLP with Rust for Python 🦀🐍 ☆62 · Updated 10 months ago
- Completion After Prompt Probability. Make your LLM make a choice ☆76 · Updated 5 months ago
- A Python client library for SerpApi. ☆84 · Updated 9 months ago
- Universal text classifier for generative models ☆24 · Updated 9 months ago
- 👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated) ☆23 · Updated last year
- Various Jupyter notebooks about Common Crawl data ☆52 · Updated 3 weeks ago
- This is the repo for the container that holds the models for the text2vec-transformers module ☆51 · Updated 3 weeks ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆108 · Updated 11 months ago
- Tree-based indexes for neural-search ☆31 · Updated last year
- Python library for extracting entities, relationships and schemas from documents ☆38 · Updated 4 months ago
- Weekly free datasets from global news sites ☆22 · Updated this week
- This repo is for handling Question Answering, especially for Multi-hop Question Answering ☆67 · Updated last year
- LLM finetuning ☆42 · Updated last year
- Self-hosted version of Microsoft's OmniParser Image-to-text model ☆64 · Updated 4 months ago