google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆17 Updated 4 years ago
Alternatives and similar repositories for common-crawl-domain-names:
Users who are interested in common-crawl-domain-names are comparing it to the repositories listed below.
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆169 Updated 2 months ago
- Scraper for Facebook, Gab, Google and TikTok ☆22 Updated 8 months ago
- ☆12 Updated 5 months ago
- The ScriptBase Corpus ☆43 Updated 6 years ago
- Cleans Reddit Text Data ☆81 Updated 4 years ago
- This is a document concerning Data Readiness in the context of machine learning and Natural Language Processing. ☆11 Updated 3 years ago
- LinkRun - Data Engineering project done in 3 weeks during the Insight fellowship ☆38 Updated 4 years ago
- Architecture of the Twint scraper, which allows downloading tweets on many instances without API restrictions ☆10 Updated 4 years ago
- A list of over 5000 US news domains and their social media accounts ☆45 Updated 2 years ago
- DomainsProject.org HTTP worker ☆22 Updated 2 years ago
- Index Common Crawl archives in tabular format ☆113 Updated last week
- Language-Agnostic Website Embedding and Classification ☆41 Updated last year
- Data and code related to the report "Truth, Lies, and Automation: How Language Models Could Change Disinformation" ☆27 Updated 3 years ago
- Deep Dependency Representation ☆16 Updated 6 years ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- Unreliable News Index (for Columbia Journalism Review) ☆56 Updated 3 years ago
- Matrix-based News Aggregation to Explore Media Bias ☆20 Updated 6 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 Updated 7 years ago
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis ☆13 Updated 2 years ago
- Algorithms for training state-of-the-art neural topic models ☆33 Updated this week
- FEVER (Fact Extraction and VERification) Annotation Platform and Baselines ☆107 Updated 10 months ago
- arXiv plain text extraction ☆41 Updated 2 years ago
- Automatically exported from code.google.com/p/wiki-links ☆42 Updated 9 years ago
- Dataset and model for disentangling chat on IRC ☆54 Updated 10 months ago
- Supplementary materials for DeepCPCFG ☆23 Updated 3 years ago
- ☆14 Updated 4 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- Source code for the paper "Examining the Ordering of Rhetorical Strategies in Persuasive Requests" ☆17 Updated 3 years ago
- Email Datasets can be found here ☆63 Updated 5 years ago