rodricios / crawl-to-the-future
An attempt at creating a gold standard dataset for backtesting yesterday's and today's content extractors
☆35Mar 19, 2015Updated 10 years ago
Alternatives and similar repositories for crawl-to-the-future
Users that are interested in crawl-to-the-future are comparing it to the libraries listed below
- Fast structured perceptron sequential labeler☆15Dec 8, 2015Updated 10 years ago
- The missing datasets manager. Like Homebrew but for datasets. CLI tool to search and discover datasets!☆41May 29, 2017Updated 8 years ago
- get facebook data☆10Sep 14, 2014Updated 11 years ago
- This project deals with hierarchical classification of web pages based on dmoz dataset.☆14Apr 10, 2014Updated 11 years ago
- A semantic web crawler☆20Sep 20, 2010Updated 15 years ago
- Fito is a python library that helps to organize your data so you can access it in a more understandable and easy way☆10Feb 26, 2018Updated 7 years ago
- Fast links parser for Python & Humans☆11Dec 27, 2012Updated 13 years ago
- An exercise in unsupervised machine learning: extract an article's text from HTML documents.☆432Jan 16, 2026Updated last month
- Pattern-of-Behavior Search Tool☆11Jun 20, 2022Updated 3 years ago
- System for mining Wikipedia Usage data to read our collective mind☆20Sep 28, 2014Updated 11 years ago
- Manage and load dataprotocols.org Data Packages☆27Sep 17, 2015Updated 10 years ago
- Vizlinc☆15Jan 14, 2016Updated 10 years ago
- Failover AWS Spot Instances☆11Dec 8, 2017Updated 8 years ago
- Mirror of Apache usergrid JavaScript SDK☆15Apr 28, 2017Updated 8 years ago
- Linked SDMX☆17Oct 26, 2014Updated 11 years ago
- A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data.☆13Feb 12, 2016Updated 10 years ago
- iServe is what we refer to as service warehouse which unifies service publication, analysis, and discovery through the use of lightweigh…☆24Feb 18, 2016Updated 9 years ago
- A recommender system for GitHub repositories☆14Jun 21, 2014Updated 11 years ago
- Python interfaces to popular bulk WHOIS servers such as Shadowserver and Team Cymru.☆21Sep 12, 2011Updated 14 years ago
- Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum☆18Jul 1, 2022Updated 3 years ago
- A Lit web-component for viewing a Whisper JSON transcript file☆14Updated this week
- Links parts of input text to Wikipedia articles☆16Sep 9, 2012Updated 13 years ago
- Tweets annotated with coarse-grained sense labels (supersenses)☆13Jun 13, 2014Updated 11 years ago
- A library of examples showing how to use the Common Crawl corpus (2008-2012, ARC format)☆65Aug 5, 2016Updated 9 years ago
- Extract data from websites using basic statistical magic☆505Oct 2, 2020Updated 5 years ago
- Collects multimedia content shared through social networks.☆19Feb 18, 2015Updated 10 years ago
- ☆16Jun 7, 2018Updated 7 years ago
- Content-based Recommendation Generator☆13Jan 21, 2015Updated 11 years ago
- Convert Wikidata Items to vector embeddings☆33Oct 1, 2025Updated 4 months ago
- Tools for web page segmentation. In development☆17Nov 7, 2018Updated 7 years ago
- TellMeFirst is a tool for classifying and enriching textual documents via Linked Open Data.☆25Sep 1, 2022Updated 3 years ago
- Investigative tool for extracting relevant areas from many documents☆14Nov 17, 2015Updated 10 years ago
- Structured Data Extractor. An application to extract structured data from web pages. It uses Data Extraction Based on Partial Tree Alignm…☆49Jun 9, 2012Updated 13 years ago
- Browser add-on and web server to support collection and analysis of web browsing data.☆14Mar 9, 2016Updated 9 years ago
- mltk - Moz Language Tool Kit☆12Mar 6, 2015Updated 10 years ago
- Design patterns for the ontology-lexicon interface using lemon and OWL☆21Jul 27, 2018Updated 7 years ago
- A Flask+Elasticsearch UI for exploring the DC Inbox dataset from http://web.stevens.edu/dcinbox/Home.html☆16Jan 21, 2022Updated 4 years ago
- An experiment in visualizing your Solr index via term counts, document counts, and memory usage per field and data type.☆15Feb 13, 2015Updated 11 years ago
- Asynchronous HTTP client built on top of Crochet and Twisted☆20Mar 3, 2021Updated 4 years ago