google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆17 Updated 4 years ago
Related projects
Alternatives and complementary repositories for common-crawl-domain-names
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆32 Updated last year
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆159 Updated last month
- Matrix-based News Aggregation to Explore Media Bias ☆20 Updated 6 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆42 Updated 6 years ago
- A tool that is built using several open source services and uses OpenAI's GPT-2 as a base model. ☆4 Updated last year
- Bilingual sentence similarity classifier using TensorFlow ☆19 Updated 5 years ago
- Scraper for Facebook, Gab, Google and TikTok ☆22 Updated 4 months ago
- Architecture of Twint scraper which allows downloading tweets on many instances without API restrictions ☆10 Updated 3 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 2 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- A curated list of Natural Language Generation papers, tutorials, and blogs. ☆12 Updated 5 years ago
- Detects if a sentence is in a subjective or objective form ☆24 Updated last year
- A list of over 5000 US news domains and their social media accounts ☆41 Updated last year
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆56 Updated 9 months ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆157 Updated this week
- Index Common Crawl archives in tabular format ☆106 Updated this week
- A classifier that distinguishes political from non-political news articles. ☆29 Updated last year
- Multilingual Language Modeling Toolkit ☆11 Updated 7 years ago
- The News Landscape Toolkit (NELA) ☆15 Updated 4 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆38 Updated 4 years ago
- Social Media Machine Translation Toolkit ☆20 Updated 11 years ago
- This dataset contains naturally-occurring English sentences that feature non-trivial noun-verb ambiguity. ☆35 Updated 5 years ago
- Legal document classification with EuroVoc descriptors on 22 languages. ☆25 Updated last year
- A Flask application for analyzing activity on an online discussion forum, using scraping, indexing, analytics, relational graph and NLP. ☆11 Updated 3 years ago
- Common Crawl Index Server ☆65 Updated 10 months ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 Updated 5 years ago
- Aiohttp web server API, which scrapes Google and returns scrape results as response. Supports proxies, multiple geos and number of result… ☆53 Updated 9 months ago
- NUANCED is a user-centric conversational recommendation dataset that contains 5.1k annotated dialogues and 26k high-quality user turns. ☆18 Updated 3 years ago