commoncrawl / cc-notebooks
Various Jupyter notebooks about Common Crawl data
☆54 Updated 2 months ago
Alternatives and similar repositories for cc-notebooks
Users who are interested in cc-notebooks are comparing it to the repositories listed below.
- Tools to construct and process Common Crawl webgraphs ☆91 Updated 3 weeks ago
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆59 Updated last week
- Index Common Crawl archives in tabular format ☆122 Updated last month
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- A News Article Collection Library ☆23 Updated 2 years ago
- Scientific articles using or citing Common Crawl data ☆25 Updated this week
- Graph databases, Knowledge Graphs, SPARQ ☆81 Updated 3 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆43 Updated 4 years ago
- A search engine for Open Data ☆53 Updated 2 years ago
- Vespa application making an index of the CORD-19 dataset. ☆39 Updated 5 months ago
- Experimental form data extraction for journalism ☆77 Updated 4 years ago
- Leverage your LangChain trace data for fine tuning ☆41 Updated 10 months ago
- spaCy entry points for Curated Transformers ☆31 Updated 3 weeks ago
- TextGraphs + LLMs + graph ML for entity extraction, linking, ranking, and constructing a lemma graph ☆24 Updated last year
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 Updated last year
- Open Access PDF harvester ☆40 Updated last year
- ☆79 Updated 2 years ago
- Process Common Crawl data with Python and Spark ☆433 Updated 3 weeks ago
- 🌸 Train floret vectors ☆18 Updated 2 years ago
- Python based Wikidata framework for easy dataframe extraction ☆44 Updated last year
- Aim-spaCy integration ☆34 Updated last year
- Scripts to load the GDELT data set into MongoDB ☆12 Updated 2 years ago
- Source codes for the paper "Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints" ☆28 Updated 2 years ago
- Streamlit demo app to demonstrate the features of transformers interpret with multiple models. ☆25 Updated 4 years ago
- A collection of utilities for writing labeling functions, transformation functions, and slicing functions. ☆22 Updated 5 years ago
- 🤗 HuggingFace Inference Toolkit for Google Cloud Vertex AI (similar to SageMaker's Inference Toolkit, but for Vertex AI and unofficial) ☆17 Updated last year
- NSS Capstone project to use natural language modeling, classification, and information extraction to get the exact employee count values … ☆15 Updated 6 years ago
- Building a Job Dataset ☆22 Updated 3 years ago
- Awesome Orchest projects, both official and submitted by the community. ☆25 Updated last year