fcibecchini / smart-crawler
A smart distributed crawler that infers navigation models of structured websites, used to cluster pages based on their structure and extract data from them.
☆9 Updated 4 years ago
Alternatives and similar repositories for smart-crawler:
Users who are interested in smart-crawler are comparing it to the libraries listed below:
- Python and Scala APIs for enhanced Spark analytics ☆12 Updated 8 years ago
- ☆20 Updated 8 years ago
- Java library for Concrete, a data serialization format for NLP ☆6 Updated 5 years ago
- Collects multimedia content shared through social networks. ☆19 Updated 10 years ago
- Extract statistics from Wikipedia Dump files. ☆26 Updated 3 years ago
- Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts ☆59 Updated 12 years ago
- Provides the implementation of a topic detection framework developed for the MULTISENSOR project. ☆9 Updated 9 years ago
- ☆10 Updated last year
- Code and Data Samples for Big Data Warehousing. ☆10 Updated 9 years ago
- Simple FieldCache based query introspection Solr Search Component - solves the 'red sofa' problem ☆12 Updated 2 months ago
- ☆16 Updated 8 years ago
- This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the re… ☆12 Updated 7 months ago
- Short Text Similarity as described in https://dl.acm.org/citation.cfm?id=2806475 ☆16 Updated 6 years ago
- Framework for making streamcorpus data ☆11 Updated 8 years ago
- Graphical techniques for text mining. ☆19 Updated 9 years ago
- An Apache Lucene TokenFilter that uses word2vec vectors for term expansion. ☆24 Updated 11 years ago
- Neural Elastic Inference and Search ☆19 Updated 5 years ago
- DKPro WSD: A Java framework for word sense disambiguation ☆20 Updated 2 years ago
- ☆22 Updated last year
- System for mining Wikipedia Usage data to read our collective mind ☆21 Updated 10 years ago
- D3 and Play based visualization for entity-relation graphs, especially for NLP and information extraction