google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆18 Updated last week
Alternatives and similar repositories for common-crawl-domain-names
Users who are interested in common-crawl-domain-names are comparing it to the libraries listed below.
- ☆33 Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- Scraper for Facebook, Gab, Google and TikTok ☆21 Updated last week
- This repository contains code for fine-tuning GPT-2 on 76k quotes and then making a Twitter bot out of it. Demo: @PeeingThoughts ☆12 Updated 2 years ago
- Neural Elastic Inference and Search ☆19 Updated 5 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- A tool that is built using several open source services and uses OpenAI's GPT-2 as a base model. ☆4 Updated 2 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆177 Updated 5 months ago
- Classify a job description (or noisy job title) into an ONET job title ☆19 Updated 8 years ago
- An authorship attribution project with particular emphasis on Twitter analysis ☆16 Updated 3 years ago
- Privacy browser extension using machine learning to summarize privacy policies ☆24 Updated 9 months ago
- Matrix-based News Aggregation to Explore Media Bias ☆20 Updated 7 years ago
- LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship ☆39 Updated 5 years ago
- Common Crawl extractor ☆76 Updated last year
- Tools & scripts to infer new Wikipedia infobox to ontology mappings ☆20 Updated 8 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 Updated last year
- 📜 Neural Text Simplification to Improve Chatbot Performance ☆13 Updated 6 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆43 Updated 4 years ago
- ☆11 Updated 6 years ago
- Training a model without a dataset for natural language inference (NLI) ☆25 Updated 4 years ago
- The News Landscape Toolkit (NELA) ☆15 Updated 4 years ago
- ☆22 Updated this week
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 Updated 2 months ago
- A repository of fact-checked and social media data on the 2023 Israel–Hamas war ☆8 Updated last year
- Disambiguating biomedical and clinical concepts with word embeddings ☆14 Updated 7 years ago
- This repository contains code and data download instructions for the workshop paper "Improving Hierarchical Product Classification using … ☆17 Updated 4 years ago
- ☆22 Updated 2 years ago
- Script and sample dataset of all urban dictionary entry names (around 1.4 million total) ☆91 Updated 3 years ago
- How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies ☆11 Updated 3 years ago