commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆52Updated last month
Alternatives and similar repositories for cc-notebooks:
Users who are interested in cc-notebooks are comparing it to the libraries listed below:
- Tools to construct and process Common Crawl webgraphs☆90Updated last month
- Index Common Crawl archives in tabular format☆118Updated last month
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution.☆57Updated 2 weeks ago
- A News Article Collection Library☆22Updated 2 years ago
- A personal knowledge base that I can dump information to and help me learn☆24Updated 10 months ago
- Extracting Entities with Limited Evidence☆16Updated 2 years ago
- Topic Inference with Zeroshot models☆61Updated last year
- Graph databases, Knowledge Graphs, SPARQ☆81Updated 3 years ago
- DEPRECATED--all functionality moved to nbdev☆15Updated 2 years ago
- A simple web application for searching Word2Vec embeddings derived from approximately 2,000 law reports published by The Incorporated…☆26Updated 2 years ago
- Analyze trends in articles published on arXiv☆17Updated 2 years ago
- Streamlit demo app to demonstrate the features of transformers interpret with multiple models.☆25Updated 3 years ago
- Fastlaw's purpose is to replace generic word embeddings for work on supervised machine learning NLP-tasks with legal texts.☆38Updated 5 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy.☆41Updated 5 years ago
- spaCy entry points for Curated Transformers☆29Updated 7 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction☆24Updated 2 years ago
- ☆54Updated last year
- Efficient BM25 with DuckDB 🦆☆48Updated 4 months ago
- A collection of utilities for writing labeling functions, transformation functions, and slicing functions.☆21Updated 5 years ago
- Production-grade embedding generation, for any length of text, for transformer models.☆23Updated this week
- A Python library for the Semantic Scholar (S2) API with typed pydantic objects and various nifty functionalities.☆21Updated 4 years ago
- ☆30Updated 2 years ago
- A utility for labeling clusters of text data.☆28Updated 3 years ago
- 🌸 Train floret vectors☆18Updated 2 years ago
- Transforming textual descriptions into process models using deep learning☆14Updated 5 years ago
- Building a Job Dataset☆22Updated 3 years ago
- A curated list of ML awesome frameworks & libraries for text data☆16Updated 2 years ago
- Deployment of pywb as a CommonCrawl Index Server☆21Updated 7 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi…☆33Updated 11 months ago
- Dataset and code for three Web crawling-related papers from SIGIR-2019, NeurIPS-2019, and ICML-2020.☆40Updated 3 months ago