gfjreg / CommonCrawl
A distributed system for mining Common Crawl using SQS, AWS EC2, and S3
☆14 Updated 10 years ago
Related projects
Alternatives and complementary repositories for CommonCrawl
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 Updated 9 months ago
- Traptor -- A distributed Twitter feed ☆26 Updated 2 years ago
- Elwha is a Java application for monitoring topics, sentiment and events on Twitter streams with the ability to generate notification mess… ☆14 Updated 9 years ago
- Example of how to pre-process news articles with textbox and index them on Elastic Search ☆13 Updated 7 years ago
- Scraper built with Scrapy. ☆14 Updated 3 months ago
- Virtual patent marking crawler at iproduct.epfl.ch ☆14 Updated 7 years ago
- Automated NLP sentiment predictions - batteries included, or use your own data ☆18 Updated 6 years ago
- Data cleaning made easy ☆8 Updated 6 years ago
- FeedCrunch.IO - Take RSS Feeds to the next level with personalized recommendations ☆15 Updated 2 years ago
- An open source search engine written in C/C++ for Linux on Intel/AMD. From gigablast dot com. See the README.md file below for instructio… ☆23 Updated 6 years ago
- Site Hound (previously THH) is a Domain Discovery Tool ☆23 Updated 3 years ago
- Source real estate prices from the Common Crawl. ☆27 Updated 6 years ago
- This is a set of ontologies used by different parts of the Open Semantic Framework. These ontologies should normally be loaded in OSF usi… ☆14 Updated 10 years ago
- Machine learning model to recommend related content ☆19 Updated last year
- Watchman: An open-source social-media event-detection system ☆20 Updated 6 years ago
- Minimum Entropy is a DDL-hosted question/answer site for beginners who need answers to Data Science questions. ☆16 Updated 8 years ago
- Google Refine extension for adding columns (extending data) from DBpedia ☆39 Updated 11 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 Updated 2 years ago
- Whit is an open source SMS service, which allows you to query CrunchBase, Wikipedia, and several other data APIs. ☆198 Updated 11 years ago
- General Architecture for Text Engineering ☆45 Updated 8 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 Updated 5 years ago
- Neural Elastic Inference and Search ☆19 Updated 5 years ago
- Slides to learn a little natural language processing (NLP) with Python. Written in reST with S5/Docutils. ☆28 Updated 12 years ago
- A pipeline for crawling of RSS feeds and the associated content. Demo at newsfeed.ijs.si. ☆21 Updated 12 years ago
- Extensions for using Scrapy on Amazon AWS ☆32 Updated 11 years ago
- Python video summarization. Visit the public API at -- www.shorten.tv (EDIT: The domain expired and YouTube blocked it ..) ☆81 Updated 2 years ago
- RxNLP APIs for clustering sentences, extracting topics, counting words & n-grams, extracting text from html or URL, computing similarity … ☆15 Updated 4 years ago
- Paginating the web ☆37 Updated 10 years ago
- Common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text ☆34 Updated 8 years ago