NYUBigDataProject/SparkClean

A Scalable Data Cleaning Library for PySpark.

☆29

Alternatives and similar repositories for SparkClean

Users who are interested in SparkClean are comparing it to the libraries listed below.

SvenskaSpel / cobra-policytool
Manage Apache Atlas and Ranger configuration for your Hadoop environment.
☆16 • May 4, 2021 • Updated 5 years ago
xiaorancs / feature-select
featselector is a feature selector based on statistical analysis and model selection.
☆14 • Mar 4, 2019 • Updated 7 years ago
mozilla / python_mozetl
ETL jobs for Firefox Telemetry
☆29 • May 7, 2026 • Updated 2 months ago
cevoaustralia / glue-vscode
Local Development of AWS Glue with Docker and Visual Studio Code
☆14 • Nov 29, 2021 • Updated 4 years ago
DIYBigData / spark-data-analysis-projects
A collection of data analysis projects done using PySpark via Jupyter notebooks.
☆10 • Oct 8, 2022 • Updated 3 years ago
DC-777 / ML-construction-cost-prediction
In this work, we compared the predictive capabilities of six different machine learning algorithms - linear regression, random forest, ex…
☆17 • Sep 21, 2020 • Updated 5 years ago
itversity / retail_db_json
☆14 • Sep 14, 2021 • Updated 4 years ago
belaalb / CEVAE-VampPrior
CEVAE with VampPrior
☆11 • Jul 18, 2018 • Updated 8 years ago
scravy / pysparkextra
☆10 • Jun 29, 2021 • Updated 5 years ago
pplonski / gafe
Genetic Algorithm Feature Engineering
☆15 • Oct 3, 2017 • Updated 8 years ago
FavioVazquez / ODSC_India_2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
☆26 • Sep 1, 2018 • Updated 7 years ago
RishiSankineni / Machine-Learning-Pipeline-LR-Pyspark
Power Plant ML Pipeline Application - Apache Spark
☆12 • Dec 12, 2016 • Updated 9 years ago
redapt / pyspark-s3-parquet-example
This repo demonstrates how to load a sample Parquet formatted file from an AWS S3 Bucket. A python job will then be submitted to a Apach…
☆19 • Jun 23, 2016 • Updated 10 years ago
tsdataclinic / open-data-week
This data analysis provided information for the March 6th, 2018, NYC Open Data Week event hosted by the Two Sigma Data Clinic, "The State…
☆13 • Jan 9, 2025 • Updated last year
chuktuk / Alzheimers_Disease_Analysis
This repo contains a data science project to identify patients at high-risk of Alzheimer's disease.
☆12 • Feb 20, 2021 • Updated 5 years ago
cs224 / pybnl
Python interface to bnlearn and other probabilistic graphical model libraries
☆10 • Mar 26, 2020 • Updated 6 years ago
adityajain10 / pyspark-mlib-based-stock-predictor
PredictorFinc is a scalable supervised machine learning model that predicts stock price change through Decision Tree Regressor using data …
☆12 • Sep 5, 2023 • Updated 2 years ago
lanceseidman / GoogleMapsReviewScrapeJS
Scrape the latest Google Review from Google Maps using Node.js & Puppeteer
☆13 • Jun 24, 2018 • Updated 8 years ago
prakashdontaraju / google-cloud-ecommerce
ecommerce GCP Streaming pipeline: Cloud Storage, Compute Engine, Pub/Sub, Dataflow, Apache Beam, BigQuery and Tableau; GCP Batch pipelin…
☆11 • Mar 9, 2022 • Updated 4 years ago
CSKrishna / Optimal-bidding-policy-using-Policy-Gradient-in-a-Multi-agent-Contextual-Bandit-setting
We use policy gradient to help agents learn optimal policies in a competitive multi-agent contextual bandit setting
☆12 • Mar 9, 2018 • Updated 8 years ago
MLWhiz / Spark_Projects
Spark Projects for the Berkeley Data Science Course
☆13 • Aug 12, 2015 • Updated 10 years ago
AndreyBozhko / TaxiOptimizer
My Data Engineering project @ Insight Data Science
☆10 • Jul 23, 2018 • Updated 8 years ago
bishwarup307 / BNP_Paribas_Cardiff_Claim_Management
Automate claim approval in personal insurance sector.
☆20 • Apr 21, 2016 • Updated 10 years ago
mdlindsey / DealerData
Open-source software for tracking and analyzing CarMax vehicle data
☆13 • May 29, 2018 • Updated 8 years ago
vefthym / MinoanER
Minoan ER is an Entity Resolution (ER) framework, built by researchers in Crete (the land of the ancient Minoan civilization). Entity res…
☆18 • Nov 18, 2020 • Updated 5 years ago
nathanhaigh / snakemake-tutorial
☆13 • Jan 8, 2020 • Updated 6 years ago
SisiMa1729 / Causal_Feature_Selection
Causal Feature Selection Tutorial for AMIA2018
☆12 • Nov 3, 2018 • Updated 7 years ago
vsouza / spark-kinesis-redshift
Example project for consuming AWS Kinesis streaming data and saving it to Amazon Redshift using Apache Spark
☆11 • May 22, 2018 • Updated 8 years ago
AWS-Big-Data-Projects / Run-a-Spark-job-within-Amazon-EMR
Run a Spark job within Amazon EMR
☆12 • Sep 12, 2020 • Updated 5 years ago
AWS-Big-Data-Projects / AWS-EMR
Analyzing Big Data with Amazon EMR
☆12 • Sep 14, 2020 • Updated 5 years ago
ketgo / marshmallow-pyspark
Marshmallow serializer integration with pyspark
☆12 • Dec 29, 2023 • Updated 2 years ago
francescotescari / noiseprint2
noiseprint2 is a port of noiseprint to TensorFlow 2 and Keras
☆12 • Feb 20, 2021 • Updated 5 years ago
sbl-sdsc / mmtf-proteomics
Methods for mapping proteomics data on 3D protein structure.
☆15 • Jan 18, 2020 • Updated 6 years ago
codspire / chicago-taxi-trips-analysis
Analysis of City Of Chicago Taxi Trip Dataset Using AWS EMR, Spark, PySpark, Zeppelin and Airbnb's Superset
☆15 • Jul 16, 2017 • Updated 9 years ago
mozilla-services / mozilla-pipeline-schemas
Schemas for Mozilla's data ingestion pipeline and data lake outputs
☆52 • Updated this week
rebremer / devopsai_databricks
DevOps for AI project using Azure Databricks, Azure DevOps and Azure Machine Learning Service
☆15 • Jul 21, 2021 • Updated 5 years ago
ragraw26 / FreddieMac_Single_Loan_Analysis_MachineLearning
Freddie Mac Single Loan Data Analysis & Machine Learning (Regression / Classification)
☆12 • Jun 11, 2017 • Updated 9 years ago
zekeriyyaa / Traffic-Data-Analysis-with-Apache-Spark-Based-on-Mobile-Robot-Data
Mobile robot data were analyzed with Apache Spark to extract five different statistical results such as travel time, waiting time, average…
☆15 • Apr 5, 2022 • Updated 4 years ago
JanMisker / Audio-Pipes
Chrome extension to redirect WebAudio between webpages
☆14 • Jun 10, 2021 • Updated 5 years ago