Apache Spark

Apache Spark is an open-source distributed computing system for big data processing and analytics, offering an alternative to the traditional MapReduce model with better performance and ease of use. It is a unified analytics engine that handles large-scale data processing by keeping intermediate data in memory where possible and building on the resilient distributed dataset (RDD) abstraction. Spark provides APIs for Java, Scala, Python, and R, and it works with data sources such as the Hadoop Distributed File System (HDFS), Apache HBase, and Apache Cassandra. Its ecosystem includes libraries for SQL (Spark SQL), streaming data (Spark Streaming), machine learning (MLlib), and graph processing (GraphX), so a single engine can cover a wide range of data processing and analytics workloads. For application developers, Spark provides the tools to scale applications and express complex data transformations, enabling faster insights from data-intensive applications.
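To make the RDD idea more concrete, here is a minimal pure-Python sketch of a partitioned dataset whose transformations are recorded lazily and only run when an action is called. The `ToyRDD` class and all of its methods are hypothetical and written for illustration only; this is not Spark's API or its implementation, and it runs on a single machine.

```python
from functools import reduce


class ToyRDD:
    """Hypothetical toy model of an RDD: partitioned data plus a recorded chain of steps."""

    def __init__(self, partitions, lineage=()):
        self._partitions = [list(p) for p in partitions]  # source data, split into chunks
        self._lineage = tuple(lineage)                    # pending (kind, function) steps

    @classmethod
    def parallelize(cls, items, num_partitions=2):
        items = list(items)
        # Round-robin split so every partition gets a share of the data.
        return cls([items[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformations are lazy: only record the step, do no work yet.
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._partitions, self._lineage + (("filter", pred),))

    def _compute_partition(self, part):
        # Replay the recorded steps over one partition. A lost partition
        # could be rebuilt from its source data in exactly the same way.
        for kind, fn in self._lineage:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # Action: triggers the actual computation, partition by partition.
        return [x for p in self._partitions for x in self._compute_partition(p)]

    def reduce(self, fn):
        # Action: combine all computed elements into a single value.
        return reduce(fn, self.collect())


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
print(sorted(even_squares.collect()), total)  # [4, 16, 36, 64, 100] 220
```

In this sketch, `map` and `filter` return a new object immediately and nothing is computed until `collect` or `reduce` is called. Real Spark distributes the partitions across a cluster and can cache computed results in memory, which this toy model does not attempt.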

View the most prominent open source Apache Spark projects in the list below.

Popular Apache Spark repositories:

apache / spark
Apache Spark - A unified analytics engine for large-scale data processing
☆41,661 Updated this week
DataTalksClub / data-engineering-zoomcamp
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
☆32,341 Updated last month
donnemartin / data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce,…
☆28,461 Updated last year
getredash / redash
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.
☆27,655 Updated last week
yeasy / docker_practice
Learn and understand Docker&Container technologies, with real DevOps practice!
☆25,520 Updated 7 months ago
heibaiying / BigData-Notes
A beginner's guide to big data
☆16,608 Updated last year
FavioVazquez / ds-cheatsheets
List of Data Science Cheatsheets to rule the world
☆15,656 Updated last year
GaiZhenbiao / ChuanhuChatGPT
GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.
☆15,395 Updated this week
zhisheng17 / flink-learning
Flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, source code analysis, and more. Topics include Flink Connector, Metrics, Library, DataStream API, Ta…
☆14,895 Updated 5 months ago
aalansehaiyang / technology-talk
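[Big Tech Interview Column] A technical guide for Java programmers, with interview questions, system architecture, career tips, mainstream middleware, and more, to help you become a better version of yourself!
☆14,575 Updated 3 weeks ago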
horovod / horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
☆14,571 Updated 2 weeks ago
apache / doris
Apache Doris is an easy-to-use, high performance and unified analytics database.
☆14,110 Updated this week
deeplearning4j / deeplearning4j
Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for keras, tensorflow, and …
☆14,072 Updated last week
wangzhiwubigdata / God-Of-BigData
Focused on big data learning and interviews; start your path to big data mastery. Flink/Spark/Hadoop/Hbase/Hive...
☆10,233 Updated 2 years ago
mage-ai / mage-ai
🧙 Build, run, and manage data pipelines for integrating and transforming data.
☆8,447 Updated this week
delta-io / delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Tr…
☆8,215 Updated this week
tobymao / sqlglot
Python SQL Parser and Transpiler
☆8,151 Updated this week
h2oai / h2o-3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random F…
☆7,260 Updated this week
Alluxio / alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
☆7,058 Updated 3 months ago
Angel-ML / angel
A Flexible and Powerful Parameter Server for large-scale machine learning
☆6,771 Updated 2 weeks ago
apache / zeppelin
Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
☆6,549 Updated this week
donnemartin / dev-setup
macOS development environment setup: Easy-to-understand instructions with automated setup scripts for developer tools like Vim, Sublime …
☆6,210 Updated 2 years ago
microsoft / SynapseML
Simple and Distributed Machine Learning
☆5,159 Updated this week
tencentmusic / cube-studio
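cube studio is an open-source, cloud-native, one-stop AI platform for machine learning, deep learning, and large models, covering the full MLOps workflow: integration with big data platforms, online notebook development, drag-and-drop pipeline orchestration, multi-node multi-GPU distributed training, hyperparameter search, vGPU virtualization for inference services, edge computing, automated labeling on the annotation platform, deeps…
☆4,517 Updated 2 months ago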
PipelineAI / pipeline
PipelineAI
☆4,171 Updated last year
JohnSnowLabs / spark-nlp
State of the Art Natural Language Processing
☆4,027 Updated this week
Cyb3rWard0g / HELK
The Hunting ELK
☆3,870 Updated last year
yahoo / TensorFlowOnSpark
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
☆3,867 Updated 2 years ago
RoaringBitmap / RoaringBitmap
A better compressed bitset in Java: used by Apache Spark, Netflix Atlas, Apache Pinot, Tablesaw, and many others
☆3,720 Updated this week
lw-lin / CoolplaySpark
CoolplaySpark: Spark source code analysis, Spark libraries, and more
☆3,487 Updated 3 years ago
awslabs / deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
☆3,482 Updated this week
liyupi / sql-generator
🔨 Generate structured SQL statements from JSON. Built with Vue3 + TypeScript + Vite + Ant Design + MonacoEditor; a simple project (logic-heavy, light on UI) that is good for practice.
☆3,463 Updated last year
apache / linkis
Apache Linkis builds a computation middleware layer to facilitate connection, governance and orchestration between the upper applications…
☆3,385 Updated last week
WeBankFinTech / DataSphereStudio
DataSphereStudio is a one stop data application development& management portal, covering scenarios including data exchange, desensitizati…
☆3,198 Updated 4 months ago
spark-notebook / spark-notebook
Interactive and Reactive Data Science using Scala and Spark.
☆3,151 Updated 2 years ago
MoRan1607 / BigDataGuide
Big data learning: learn big data from scratch, with study videos and interview materials for each stage of learning.
☆3,047 Updated 2 months ago
kubeflow / spark-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
☆3,001 Updated this week
apache / paimon
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch …
☆2,939 Updated this week
lakesoul-io / LakeSoul
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data…
☆2,922 Updated this week