google-research-datasets / common-crawl-domain-names
Corpus of domain names scraped from Common Crawl and manually annotated to add word boundaries (e.g. "commoncrawl" to "common crawl").
☆17 Updated 3 years ago
Related projects:
- scraper for facebook, gab, google and tiktok ☆22 Updated 2 months ago
- Unreliable News Index (for Columbia Journalism Review) ☆55 Updated 2 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆31 Updated last year
- ☆31 Updated last year
- Chrome extension to scrape a user's entire timeline, bypassing the Twitter API 3200 tweet limit ☆25 Updated last year
- arXiv plain text extraction ☆41 Updated last year
- Code and Dataset for Memeify: A Large-scale Meme Generation System ☆25 Updated 4 years ago
- Architecture of a Twint scraper that allows downloading tweets across many instances without API restrictions ☆10 Updated 3 years ago
- Automatically exported from code.google.com/p/wiki-links