fcibecchini / smart-crawler
A smart distributed crawler that infers navigation models of structured websites, used to cluster pages based on their structure and extract data from them.
☆9Updated 4 years ago
Alternatives and similar repositories for smart-crawler:
Users that are interested in smart-crawler are comparing it to the libraries listed below
- Movielens collaborative filtering with Solr streaming expression☆11Updated 8 years ago
- Python and Scala APIs for enhanced Spark analytics☆12Updated 8 years ago
- phData Pulse application log aggregation and monitoring☆13Updated 4 years ago
- Text similarity based on Word2Vec vectors.☆11Updated 8 years ago
- Code and Data Samples for Big Data Warehousing.☆10Updated 9 years ago
- Connect DBVisualizer to Hortonworks HiveServer2☆9Updated 10 years ago
- ☆11Updated 9 years ago
- Example application demonstrating how to integrate all of the components of Hortonworks DataFlow.☆14Updated 7 years ago
- Visualization of results returned by Solr 6 graph queries☆10Updated 8 years ago
- Named Entity Extraction on Twitter Stream using Apache Spark Streaming and Stanford CoreNLP☆15Updated 8 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node☆17Updated 2 years ago
- Collects multimedia content shared through social networks.☆19Updated 10 years ago
- A toolkit for clustering web pages based on various similarity measures.☆33Updated 3 years ago
- ☆16Updated 8 years ago
- KnowledgeStore☆20Updated 7 years ago
- An Apache Spark app for moving data between Apache Hive and Apache Phoenix/HBase☆14Updated 9 years ago
- Notes from Stanford NLP class☆24Updated 11 years ago
- ☆9Updated 9 years ago
- Preliminary Solr DQ / Data Quality experiments and prototype, and SolrJ wrapper utilities☆26Updated last month
- Document Image Classification☆11Updated 6 years ago
- A bridge to Apache Atlas for provenance metadata created in the course of using Apache NiFi☆15Updated 2 years ago
- Named Entity Recognition demo with the NLTK☆13Updated 13 years ago
- ☆19Updated 7 years ago
- Traptor -- A distributed Twitter feed☆26Updated 2 years ago
- JavaScript library to talk to multiple OLAP backends from multiple frontends☆17Updated 12 years ago
- A subgroup discovery tool that can use ontological domain knowledge (RDF graphs) in the learning process. Subgroup descriptions contain t…☆12Updated 7 years ago
- An HTTP proxy for Elasticsearch, Solr (etc.) to prevent a 100% full disk situation.☆11Updated 6 years ago
- Simple FieldCache based query introspection Solr Search Component - solves the 'red sofa' problem☆12Updated last month
- Sample code for Splice Community☆10Updated 2 years ago
- From zero to a Storm cluster for real-time classification using sklearn☆12Updated 10 years ago