commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆51 Updated last month
Alternatives and similar repositories for cc-notebooks:
Users interested in cc-notebooks compare it to the repositories listed below.
- Tools to construct and process webgraphs from Common Crawl data ☆87 Updated last week
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Vespa application making an index of the CORD-19 dataset. ☆39 Updated 2 months ago
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values … ☆15 Updated 6 years ago
- Index Common Crawl archives in tabular format ☆113 Updated last week
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆56 Updated 3 months ago
- [archived] ☆18 Updated 3 years ago
- MirrorDataGenerator is a Python tool that generates synthetic data based on user-specified causal relations among features in the data. I… ☆21 Updated 2 years ago
- Efficient BM25 with DuckDB 🦆 ☆44 Updated 3 months ago
- arXiv plain text extraction ☆41 Updated 2 years ago
- PyTorch implementation of a BiLSTM model for the Wikification project. ☆19 Updated 4 years ago
- spaCy entry points for Curated Transformers ☆27 Updated 5 months ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- ☆33 Updated last year
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph ☆23 Updated last year
- Streamlit demo app to demonstrate the features of transformers interpret with multiple models. ☆25 Updated 3 years ago
- 🌸 Train floret vectors ☆18 Updated last year
- spaCy match and replace, maintaining conjugation ☆35 Updated 2 years ago
- ☆30 Updated 2 years ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆175 Updated this week
- Fastlaw's purpose is to replace generic word embeddings for work on supervised machine learning NLP-tasks with legal texts. ☆37 Updated 5 years ago
- Analyze trends in articles published on arXiv ☆17 Updated last year
- A simple web application for searching Word2Vec embeddings derived from approximately 2,000 law reports published by The Incorporated… ☆26 Updated 2 years ago
- A collection of utilities for writing labeling functions, transformation functions, and slicing functions. ☆20 Updated 4 years ago
- ☄️ Parallel and distributed training with spaCy and Ray ☆53 Updated last year
- Aim-spaCy integration ☆34 Updated last year
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 Updated 2 years ago
- A Python library for the Semantic Scholar (S2) API with typed pydantic objects and various nifty functionalities. ☆21 Updated 3 years ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆32 Updated 9 months ago