google-research-datasets / common-crawl-domain-names

Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").

☆18

Alternatives and similar repositories for common-crawl-domain-names

Users who are interested in common-crawl-domain-names are comparing it to the repositories listed below.

leogao2 / commoncrawl_downloader
☆33 Updated 2 years ago
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 Updated 7 years ago
niczem / trawler
scraper for facebook, gab, google and tiktok
☆21 Updated last week
jingw222 / twitterbot-gpt2
This repository contains code for fine-tuning GPT-2 on 76k quotes and then making a Twitter bot out of it. Demo: @PeeingThoughts
☆12 Updated 2 years ago
koursaros-ai / microservices
Neural Elastic Inference and Search
☆19 Updated 5 years ago
johnbumgarner / newshound
This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t…
☆33 Updated 2 years ago
soumyadip1995 / TextBrain
A tool built on several open-source services that uses OpenAI's GPT-2 as a base model.
☆4 Updated 2 years ago
cocrawler / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆177 Updated 5 months ago
afshinrahimi / jobdescription2jobtitle
classify a job description (or noisy job title) into a ONET job title
☆19 Updated 8 years ago
analyticascent / stylext
An authorship attribution project with particular emphasis on Twitter analysis
☆16 Updated 3 years ago
privacy-tech-lab / privee
Privacy browser extension using machine learning to summarize privacy policies
☆24 Updated 9 months ago
fhamborg / NewsBirdServer
Matrix-based News Aggregation to Explore Media Bias
☆20 Updated 7 years ago
trendsci / linkrun
LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship
☆39 Updated 5 years ago
hynky1999 / CmonCrawl
Common crawl extractor
☆76 Updated last year
dbpedia / mappings-autogeneration
Tools & scripts to infer new Wikipedia infobox to ontology mappings
☆20 Updated 8 years ago
ikreymer / cc-index-server
Deployment of pywb as a CommonCrawl Index Server
☆21 Updated 7 years ago
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 Updated last year
vincent9514 / Text-Rewriting-Simplification
📜Neural Text Simplification to Improve Chatbot Performance
☆13 Updated 6 years ago
sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆43 Updated 4 years ago
inferlink / landmark-extractor
☆11 Updated 6 years ago
krandiash / gpt3-nli
Training a model without a dataset for natural language inference (NLI)
☆25 Updated 4 years ago
BenjaminDHorne / The-NELA-Toolkit
The News Landscape Toolkit (NELA)
☆15 Updated 4 years ago
mlcommons / dynabench
☆22 Updated this week
jonathandunn / common_crawl_corpus
Scripts for building a geo-located web corpus using Common Crawl data
☆11 Updated 2 months ago
Gautamshahi / WarClaim
A repository of fact-checked and social media data on the 2023 Israel–Hamas war
☆8 Updated last year
clips / yarn
Disambiguating biomedical and clinical concepts with word embeddings
☆14 Updated 7 years ago
wbsg-uni-mannheim / productCategorization
This repository contains code and data download instructions for the workshop paper "Improving Hierarchical Product Classification using …
☆17 Updated 4 years ago
agermanidis / OpenGPT-2
☆22 Updated 2 years ago
mattbierner / urban-dictionary-word-list
Script and sample dataset of all urban dictionary entry names (around 1.4 million total)
☆91 Updated 3 years ago
tayebiarasteh / retweet
How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies
☆11 Updated 3 years ago