Projects
Discover projects that showcase my skills in data science, machine learning, and more.
Predictive Housing Price Analytics Using Advanced Data Science Techniques
Nov 2023 – Jan 2024
Implemented comprehensive data preprocessing and feature engineering, including handling missing values, encoding, and normalization. Employed visualizations to identify and address outliers and skewness, improving data quality for modeling.
Developed and tuned several machine learning models, including Random Forest and LightGBM, using grid search and feature selection. LightGBM achieved the best prediction accuracy of the models tested.
View on Google Colab
Handwritten Digits Classification using Ensemble Learning
Mar 2023 – Apr 2023
Implemented and compared Random Forest and Gradient Boosting classifiers on the MNIST dataset, highlighting differences between parallel and sequential ensemble approaches.
Analyzed and visualized model performance to identify well-performing hyperparameters and compare the two classifiers.
View on Google Colab
PySpark Data Processing
Mar 2023
Designed and implemented advanced PySpark scripts for data manipulation and analysis on Dataproc Clusters, utilizing Spark SQL and DataFrame transformations for enhanced data querying and aggregation.
Applied data engineering techniques (adaptive query execution, dynamic partition pruning, data repartitioning) to fine-tune Spark SQL queries and DataFrames, introducing a sorting-based method that notably reduced query execution times.
View on GitHub
Optimizing Large-Scale Climate Data Analysis for Performance and Resource Management
Feb 2023 – Apr 2023
Conducted an in-depth analysis of a massive climate dataset (GHCN data) on NYU HPC clusters, using Dask to parallelize and scale computations efficiently.
Improved query execution time by 50 percent through partitioning, processing, and conversion to Parquet.
View on GitHub
Advanced Big Data Analysis with SQLite on Music Metadata Database
Jan 2023 – Mar 2023
Used Python’s sqlite3 library to query a music metadata database, creating well-defined indexes that gave faster response times for complex queries.
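As a minimal sketch of the indexing idea, assuming a hypothetical `tracks` table (the table, column, and index names here are illustrative and not taken from the project), SQLite's `EXPLAIN QUERY PLAN` can show whether a query switches from a full table scan to an index lookup once an index exists:

```python
import sqlite3

# Hypothetical schema and data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tracks (id INTEGER PRIMARY KEY, artist TEXT, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO tracks (artist, title, year) VALUES (?, ?, ?)",
    [(f"artist_{i % 100}", f"title_{i}", 1990 + i % 30) for i in range(1000)],
)


def plan_details(connection, query, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
    return [row[-1] for row in rows]


query = "SELECT title FROM tracks WHERE artist = ?"

# Without an index on `artist`, SQLite has to scan the whole table.
plan_before = plan_details(conn, query, ("artist_7",))

# Create an index on the filtered column, then inspect the plan again.
conn.execute("CREATE INDEX idx_tracks_artist ON tracks (artist)")
plan_after = plan_details(conn, query, ("artist_7",))

matches = conn.execute(query, ("artist_7",)).fetchall()
print(plan_before)
print(plan_after)
```

Comparing the two printed plans is a quick way to confirm that an index is actually being used before measuring response times.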
View on GitHub
Machine Learning Project on Movie Ratings Data
Oct 2022 – Dec 2022
Used simple and multiple linear regression to predict movie ratings from ratings of other movies, gender identity, sibship status, and social viewing preferences, reporting average coefficient of determination (R²) values.
Implemented ridge and LASSO regression models with hyperparameter tuning to predict ratings for a subset of 30 movies, analyzing performance and feature importance. Employed logistic regression and cross-validation to predict movie enjoyment based on average user ratings, calculating AUC values to assess model quality.
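For readers unfamiliar with AUC, the sketch below computes it from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the sample data are illustrative assumptions; how the project itself computed AUC is not shown here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 0/1 class labels (1 = positive, e.g. "enjoyed").
    scores: iterable of predicted scores or probabilities, same length.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Illustrative data, not taken from the movie ratings dataset.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

This quadratic-time version is meant only to make the metric concrete; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.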
View on Google Colab