google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆20 Updated 5 months ago
Alternatives and similar repositories for common-crawl-domain-names
Users interested in common-crawl-domain-names are comparing it to the repositories listed below.
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆189 Updated 3 weeks ago
- Index Common Crawl archives in tabular format ☆124 Updated last week
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 8 years ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆203 Updated last week
- Deep Dependency Representation ☆16 Updated 7 years ago
- Cleans Reddit Text Data ☆84 Updated 5 years ago
- Tools to construct and process Common Crawl webgraphs ☆102 Updated last week
- Source code for the paper "Examining the Ordering of Rhetorical Strategies in Persuasive Requests" ☆18 Updated 4 years ago
- Scripts for building a geo-located web corpus using Common Crawl data ☆11 Updated last month
- ☆32 Updated 2 years ago
- Adversarial Training on Transformer Networks to discover check-worthy factual claims ☆83 Updated 2 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆45 Updated 5 years ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 Updated 2 years ago
- An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline. ☆86 Updated 4 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆205 Updated 7 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- Target-dependent sentiment classification in news articles reporting on political events. Includes a high-quality data set of over 11k se… ☆156 Updated 4 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆142 Updated last month
- Pre-trained models and code and data to train and use models from "Pushing the Limits of Paraphrastic Sentence Embeddings with Millions o… ☆103 Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 6 years ago
- Process Common Crawl data with Python and Spark ☆448 Updated 3 weeks ago
- A Python module to process data for Frame Semantic Parsing ☆23 Updated 5 years ago
- Unreliable News Index (for Columbia Journalism Review) ☆56 Updated 3 years ago
- FEVER (Fact Extraction and VERification) Annotation Platform and Baselines ☆119 Updated last year
- arXiv plain text extraction ☆41 Updated 3 years ago
- This Python module can be used to obtain antonyms, synonyms, hypernyms, hyponyms, homophones and definitions. ☆125 Updated last year
- Social Media Mining Toolkit (SMMT) main repository ☆137 Updated 3 years ago
- The WebSplit Benchmark introducing "Split and Rephrase" task ☆63 Updated 7 years ago
- annotated hateful speech ☆24 Updated 6 years ago
- CrowdTruth framework for crowdsourcing ground truth for training & evaluation of AI systems ☆62 Updated last year