commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆50 Updated 2 weeks ago
Alternatives and similar repositories for cc-notebooks:
Users who are interested in cc-notebooks compare it to the repositories listed below
- Tools to construct and process webgraphs from Common Crawl data ☆85 Updated 2 weeks ago
- Fastlaw's purpose is to replace generic word embeddings for work on supervised machine learning NLP-tasks with legal texts. ☆37 Updated 5 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Index Common Crawl archives in tabular format ☆110 Updated 2 months ago
- MirrorDataGenerator is a python tool that generates synthetic data based on user-specified causal relations among features in the data. I… ☆21 Updated 2 years ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆55 Updated last month
- 🌸 Train floret vectors ☆18 Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files ☆170 Updated last week
- ☆21 Updated 9 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 2 years ago
- spaCy entry points for Curated Transformers ☆26 Updated 4 months ago
- Search through Facebook Research's PyTorch BigGraph Wikidata-dataset with the Weaviate vector search engine ☆31 Updated 3 years ago
- Building a Job Dataset ☆21 Updated 2 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- ☆76 Updated 2 years ago
- A robust web archive analytics toolkit ☆98 Updated 2 months ago
- Vespa application making an index of the CORD-19 dataset. ☆39 Updated 3 weeks ago
- Process Common Crawl data with Python and Spark ☆416 Updated this week
- A News Article Collection Library ☆22 Updated last year
- Python API for https://vespa.ai, the open big data serving engine ☆113 Updated this week
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 Updated 10 months ago
- Source codes for the paper "Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints" ☆27 Updated 2 years ago
- ☆22 Updated 9 months ago
- LLM plugin for clustering embeddings ☆68 Updated 11 months ago
- ☆89 Updated 2 years ago
- ☆33 Updated last year
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph ☆23 Updated 11 months ago
- Scientific articles using or citing Common Crawl data ☆13 Updated 2 weeks ago
- ☆63 Updated last month
- This repository contains the code and data download links to reproduce building the WDC Products Benchmark. ☆12 Updated last year