Projects

Discover projects that showcase my skills in data science, machine learning, and more.

Predictive Housing Price Analytics Using Advanced Data Science Techniques

Nov 2023 – Jan 2024

Implemented comprehensive data preprocessing and feature engineering, including handling missing values, encoding, and normalization. Employed visualizations to identify and address outliers and skewness, improving data quality for modeling.
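The preprocessing steps named above (missing-value handling, encoding, normalization) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the project's actual code; the toy records and the helper names `impute_median`, `min_max`, and `one_hot` are hypothetical.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"area": 1200.0, "type": "house"},
    {"area": None, "type": "condo"},
    {"area": 2000.0, "type": "house"},
    {"area": 1600.0, "type": "townhouse"},
]

def impute_median(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Return the sorted categories and one indicator vector per label."""
    categories = sorted(set(labels))
    return categories, [[int(label == c) for c in categories] for label in labels]

areas = min_max(impute_median([r["area"] for r in rows]))
categories, encoded = one_hot([r["type"] for r in rows])
```

Median imputation is used here because it is less sensitive to outliers than the mean, which matters for skewed features such as prices or areas.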

Developed and tuned several machine learning models, including Random Forest and LightGBM, using grid search and feature selection. LightGBM achieved the best prediction accuracy of the models compared.
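Grid search, as mentioned above, evaluates every combination of candidate hyperparameters and keeps the best-scoring one. The sketch below shows the idea with the standard library only. The `toy_score` function is a hypothetical stand-in for a cross-validated model score, and the grid values are illustrative rather than those used in the project.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for a cross-validated model score (higher is better).
def toy_score(params):
    return -abs(params["num_leaves"] - 31) - 100 * abs(params["learning_rate"] - 0.1)

param_grid = {"num_leaves": [15, 31, 63], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_score)
```

In practice `score_fn` would train a model with the given parameters and return its validation score, so the cost grows with the product of the grid sizes.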

View on Google Colab

Handwritten Digits Classification using Ensemble Learning

Mar 2023 – Apr 2023

Implemented and compared Random Forest and Gradient Boosting classifiers on the MNIST dataset, highlighting differences between parallel and sequential ensemble approaches.

Analyzed and visualized model performance to identify effective hyperparameters and compare the two classifiers.

View on Google Colab

PySpark Data Processing

Mar 2023

Designed and implemented PySpark scripts for data manipulation and analysis on Dataproc clusters, using Spark SQL and DataFrame transformations for querying and aggregation.

Applied data engineering techniques (adaptive query execution, dynamic partition pruning, data repartitioning) to fine-tune Spark SQL queries and DataFrames, introducing a sorting-based method that notably reduced query execution times.

View on GitHub

Optimizing Large-Scale Climate Data Analysis for Performance and Resource Management

Feb 2023 – Apr 2023

Conducted an in-depth analysis of a massive climate dataset (GHCN data) on NYU HPC supercomputer clusters, using Dask in Python to parallelize and scale computations efficiently.

Reduced query execution time by 50 percent through partitioning, processing, and conversion to the Parquet format.

View on GitHub

Advanced Big Data Analysis with SQLite on Music Metadata Database

Jan 2023 – Mar 2023

Used Python’s sqlite3 library to query the database efficiently, creating well-defined indexes that gave faster response times for complex queries.
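As an illustration of how an index changes query execution in sqlite3, here is a minimal sketch on an in-memory database. The `track` table, its columns, and the index name `idx_track_artist` are hypothetical and not the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, artist TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO track (artist, title) VALUES (?, ?)",
    [(f"artist_{i % 100}", f"title_{i}") for i in range(1000)],
)

query = "SELECT title FROM track WHERE artist = ?"

def plan(sql, args):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " ".join(row[-1] for row in rows)

# Without an index, SQLite has to scan the whole table.
plan_before = plan(query, ("artist_7",))

# With an index on the filtered column, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_track_artist ON track (artist)")
plan_after = plan(query, ("artist_7",))

titles = [row[0] for row in conn.execute(query, ("artist_7",))]
```

Comparing `plan_before` and `plan_after` is a quick way to confirm that a query actually uses the index before measuring response times.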

View on GitHub

Machine Learning Project on Movie Ratings Data

Oct 2022 – Dec 2022

Used linear and multiple regression to predict movie ratings from ratings of other movies, gender identity, sibship status, and social viewing preferences, reporting average coefficient of determination (R²) values.

Implemented ridge and LASSO regression models with hyperparameter tuning to predict ratings for a subset of 30 movies, analyzing performance and feature importance. Employed logistic regression and cross-validation to predict movie enjoyment based on average user ratings, calculating AUC values to assess model quality.
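For reference, the AUC used to assess the logistic regression model can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition with plain Python; the function name `roc_auc` and the sample labels and scores are hypothetical and not taken from the project.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Runs in O(n_pos * n_neg), fine for small data.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which makes the metric convenient for comparing models across cross-validation folds.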

View on Google Colab