USCDataScience/sparkler

USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.

☆421

Alternatives and similar repositories for sparkler

Users that are interested in sparkler are comparing it to the libraries listed below.

apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆986 · Updated this week
USCDataScience / autoextractor
A toolkit for clustering web pages based on various similarity measures.
☆34 · Oct 27, 2021 · Updated 4 years ago
chrismattmann / lucene-geo-gazetteer
Uses Apache Lucene, OpenNLP, and GeoNames to extract locations from text and geocode them.
☆38 · Jun 5, 2026 · Updated last month
chrismattmann / trec-dd-polar
A dataset downloaded from the deep and scientific web across three major Polar data centers for use in research.
☆13 · Sep 8, 2017 · Updated 8 years ago
USCDataScience / polar.usc.edu
Polar USC activities related to NSF Polar CyberInfrastructure program at the University of Southern California
☆15 · Jan 15, 2023 · Updated 3 years ago
chrismattmann / imagecat
ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest 10s of millions of files (image…
☆96 · Aug 26, 2018 · Updated 7 years ago
apache / drat
A distributed, parallelized (Map Reduce) wrapper around Apache RAT™ to allow it to complete on large code repositories of multiple file t…
☆31 · Feb 4, 2020 · Updated 6 years ago
ContinuumIO / nutchpy
For interacting with Nutch via Python
☆29 · Jul 5, 2026 · Updated 2 weeks ago
USCDataScience / AgePredictor
Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum
☆18 · Jul 1, 2022 · Updated 4 years ago
khundman / marve
For extracting measurements and related entities from text
☆58 · May 6, 2020 · Updated 6 years ago
USCDataScience / dl4j-kerasimport-examples
This repository contains deeplearning4j examples for importing and making use of models trained in keras
☆27 · May 7, 2017 · Updated 9 years ago
lewismc / bash-httpd
bash-httpd is a web server written in bash, the GNU bourne shell replacement.
☆29 · Jul 20, 2024 · Updated 2 years ago
chrismattmann / nutch-python
Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.
☆39 · Apr 15, 2016 · Updated 10 years ago
b-cube / nutch-crawler
Apache Nutch fork tuned for web services and data discovery.
☆10 · May 18, 2015 · Updated 11 years ago
thammegowda / tika-ner-corenlp
Stanford CoreNLP NER addon for Apache Tika's NamedEntityParser
☆13 · Feb 26, 2022 · Updated 4 years ago
lucidworks / spark-solr
Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
☆445 · Sep 4, 2025 · Updated 10 months ago
ParallelProcessingLab / fogfaas
FoGFaaS: Add serverless computing (faas) to ifogsim
☆22 · Mar 30, 2025 · Updated last year
chrismattmann / tika-similarity
Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
☆108 · Jun 2, 2026 · Updated last month
nasa-jpl-memex / weapons
MEMEX Weapons Pilot for the illegal weapons domain.
☆15 · May 20, 2016 · Updated 10 years ago
sematext / query-segmenter
Solr Query Segmenter for structuring unstructured queries
☆22 · May 12, 2021 · Updated 5 years ago
USCDataScience / NLTKRest
This is a REST Server endpoint built using Flask and Python.
☆24 · Nov 16, 2022 · Updated 3 years ago
lucidworks / data-quality
Preliminary Solr DQ / Data Quality experiments and prototype, and SolrJ wrapper utilities
☆26 · Jan 27, 2025 · Updated last year
databricks / spark-corenlp
Stanford CoreNLP wrapper for Apache Spark
☆419 · Nov 15, 2018 · Updated 7 years ago
scrapinghub / frontera
A scalable frontier for web crawlers
☆1,332 · Jun 6, 2025 · Updated last year
nasa-jpl-memex / memex-explorer
Viewers for statistics and dashboarding of Domain Search Engine data
☆128 · Jan 19, 2016 · Updated 10 years ago
mitll / vizlinc
Vizlinc
☆15 · Jan 14, 2016 · Updated 10 years ago
lucidworks / hadoop-solr
Code to index HDFS to Solr using MapReduce
☆51 · Nov 27, 2018 · Updated 7 years ago
lucidworks / query-autofiltering-component
A Query Autofiltering SearchComponent for Solr that can translate free-text queries into structured queries using index metadata
☆25 · Oct 16, 2018 · Updated 7 years ago
tokenmill / crawling-framework
Easily crawl news portals or blog sites using Storm Crawler.
☆22 · Nov 15, 2022 · Updated 3 years ago
tribbloid / spookystuff
Scalable query engine for web scraping/data mashup/acceptance QA, powered by Apache Spark
☆140 · Jan 5, 2026 · Updated 6 months ago
SciSpark / SciSpark
Scientific Spark - a NASA AIST14 project
☆88 · Mar 31, 2018 · Updated 8 years ago
jayunit100 / SparkStreamingApps
A spark sbt blueprint to build your own spark apps off of (for cloud native runtime, see the kube/spark examples)
☆57 · Jun 1, 2019 · Updated 7 years ago
mitre / rhapsode
Advanced desktop search/corpus exploration prototype
☆21 · Jun 23, 2021 · Updated 5 years ago
zouzias / spark-lucenerdd-examples
Examples of spark-lucenerdd
☆15 · Oct 6, 2023 · Updated 2 years ago
uma-pi1 / OPIEC-pipeline
☆14 · Feb 26, 2022 · Updated 4 years ago
apache / sdap-nexus
Mirror of Apache sdap (Incubating)
☆25 · Dec 21, 2025 · Updated 7 months ago
nasa-jpl-memex / topic_space
Topic modeling web application
☆40 · Jul 23, 2015 · Updated 10 years ago
sematext / solr-researcher
Solr SearchComponent for altering and re-executing queries that produce poor results
☆13 · May 12, 2021 · Updated 5 years ago
ericwhyne / open-catalog-generator
Code and templates required to build the DARPA open catalog.
☆18 · Mar 23, 2016 · Updated 10 years ago