google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆17 · Updated 4 years ago
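For orientation, here is a minimal sketch of how such a word-boundary corpus could be consumed. The file name `data.tsv` and the two-column tab-separated layout are assumptions for illustration, not documented facts about this repository.

```python
# A minimal usage sketch, not the official loader. It assumes the corpus is a
# two-column tab-separated file (the file name "data.tsv" and the column layout
# are assumptions): the raw domain token in the first column and its
# space-segmented form in the second, e.g. "commoncrawl" -> "common crawl".
import csv
from typing import Iterator, Tuple

def load_domain_segmentations(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (raw_token, segmented_form) pairs from a two-column TSV file."""
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:
                yield row[0], row[1]

if __name__ == "__main__":
    # Build a simple lookup table from the concatenated domain token to its
    # human-annotated word boundaries.
    segmentations = dict(load_domain_segmentations("data.tsv"))
    print(segmentations.get("commoncrawl"))  # expected: "common crawl"
```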
Alternatives and similar repositories for common-crawl-domain-names:
Users interested in common-crawl-domain-names are comparing it to the libraries listed below.
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine (☆166, updated last month); see the CDX query sketch after this list
- Statistics of Common Crawl monthly archives mined from URL index files (☆171, updated 2 weeks ago)
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around the world… (☆32, updated last year)
- Training a model without a dataset for natural language inference (NLI) (☆25, updated 4 years ago)
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual support) (☆47, updated last year)
- Label Efficient Learning From Explanations (☆23, updated 2 years ago)
- Classify a job description (or noisy job title) into an ONET job title (☆18, updated 8 years ago)
- content.rdf.u8.gz (☆10, updated 4 years ago)
- Code and Dataset for Memeify: A Large-scale Meme Generation System (☆25, updated 4 years ago)
- Index Common Crawl archives in tabular format (☆110, updated 3 months ago)
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system (☆44, updated 7 years ago)
- Unsupervised method for extracting quotation-speaker pairs from large news corpora (☆29, updated 6 years ago)
- CrowdTruth framework for crowdsourcing ground truth for training & evaluation of AI systems (☆58, updated 10 months ago)
- This repository contains the Domain Discovery Tool (DDT) project. DDT is an interactive system that helps users explore and better understand… (☆45, updated 3 years ago)
- Extraction of the five journalistic W-questions (5W) from news articles (☆19, updated 6 years ago)
- TweetBERT: A Pretrained Language Representation Model for Twitter Text Analysis (☆13, updated 2 years ago)
- Adversarial Training on Transformer Networks to discover check-worthy factual claims (☆73, updated last year)
- Expletives vomiting library... (☆13, updated 7 years ago)
- Hybrid Approaches to Detect Comments Violating Macro Norms on Reddit (☆26, updated 5 years ago)
- A dataset of atomic Wikipedia edits containing insertions and deletions of a contiguous chunk of text in a sentence. This dataset contains… (☆106, updated 5 years ago)
- Official implementation of the paper "IteraTeR: Understanding Iterative Revision from Human-Written Text" (ACL 2022) (☆78, updated last year)
- Cleans Reddit Text Data (☆81, updated 4 years ago)
- ☆14, updated 4 years ago
- Code for NAACL 2022 paper "Reframing Human-AI Collaboration for Generating Free-Text Explanations" (☆31, updated last year)
- Creative writing with an AI (OpenAI's GPT-2) in a Medium-style text editor (☆15, updated 5 years ago)
- Data and code related to the report "Truth, Lies, and Automation: How Language Models Could Change Disinformation" (☆27, updated 3 years ago)
- Source codes for the paper "Examining the Ordering of Rhetorical Strategies in Persuasive Requests" (☆17, updated 3 years ago)
- Game code and data for Fool Me Twice: Entailment from Wikipedia Gamification https://arxiv.org/abs/2104.04725 (☆18, updated this week)
- A few-shot learning method based on siamese networks (☆28, updated last year)
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML (☆62, updated last month)
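Related to the CDX-index entries above, the following is a hedged sketch of querying the public Common Crawl CDX index server directly over HTTP. It is not taken from any repository listed here; the crawl ID `CC-MAIN-2023-50` is only an example, and the list of available crawls is published at https://index.commoncrawl.org/collinfo.json.

```python
# Hedged sketch: look up captures of a domain in Common Crawl's public CDX
# index server. Uses only the standard library; the crawl ID below is an
# example and should be replaced with a current crawl.
import json
import urllib.parse
import urllib.request

def cdx_lookup(domain: str, crawl: str = "CC-MAIN-2023-50"):
    """Return a few CDX records (as dicts) for captures of the given domain."""
    query = urllib.parse.urlencode({"url": domain, "output": "json", "limit": "5"})
    url = f"https://index.commoncrawl.org/{crawl}-index?{query}"
    with urllib.request.urlopen(url) as resp:
        # The server returns newline-delimited JSON, one record per capture.
        lines = resp.read().decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line]

if __name__ == "__main__":
    for record in cdx_lookup("commoncrawl.org"):
        print(record.get("timestamp"), record.get("url"), record.get("status"))
```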