commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆49 Updated this week
Alternatives and similar repositories for cc-notebooks:
Users who are interested in cc-notebooks are comparing it to the repositories listed below.
- Tools to construct and process webgraphs from Common Crawl data ☆84 Updated last month
- Index Common Crawl archives in tabular format ☆110 Updated 2 months ago
- Building a Job Dataset ☆21 Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 Updated 5 years ago
- Process Common Crawl data with Python and Spark ☆412 Updated last month
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph ☆23 Updated 11 months ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆55 Updated last month
- Statistics of Common Crawl monthly archives mined from URL index files ☆167 Updated 3 weeks ago
- Fastlaw's purpose is to replace generic word embeddings for work on supervised machine learning NLP-tasks with legal texts. ☆37 Updated 5 years ago
- 🌸 Train floret vectors ☆18 Updated last year
- Vespa application making an index of the CORD-19 dataset. ☆39 Updated last week
- MirrorDataGenerator is a python tool that generates synthetic data based on user-specified causal relations among features in the data. I… ☆21 Updated 2 years ago
- A simple converter from SpaCy Entities (Spans) to Huggingface BILOU formatted data (tokens and ner_tags) ☆14 Updated 4 months ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆9 Updated 3 years ago
- Graph databases, Knowledge Graphs, SPARQL ☆75 Updated 3 years ago
- A News Article Collection Library ☆22 Updated last year
- ☆22 Updated 9 months ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆32 Updated last year
- Virtual patent marking crawler at iproduct.epfl.ch ☆14 Updated 7 years ago
- Sentence Embedding as a Service ☆14 Updated last year
- A simple web application for searching Word2Vec embeddings derived from approximately 2,000 law reports published by the Incorporated… ☆26 Updated 2 years ago
- Aim-spaCy integration ☆34 Updated last year
- ☄️ Parallel and distributed training with spaCy and Ray ☆53 Updated last year
- 🧬 A VS Code extension for annotating data with Prodigy ☆30 Updated 3 years ago
- spaCy entry points for Curated Transformers ☆26 Updated 4 months ago
- Generate a SQLite database from Wikipedia & Wikidata dumps. ☆30 Updated 10 months ago
- A simple library for training named entity recognition model from partially annotated data ☆22 Updated last year
- ☆76 Updated last year
- Information extraction from English and German texts based on predicate logic ☆135 Updated last year