Predict daily bike rentals in the Capital Bikeshare system (Washington D.C.) using historical data. Accurate forecasts help optimize bike availability, reduce costs, and improve user satisfaction by accounting for weather, seasonality, and temporal factors.
Urban bikesharing systems face challenges in managing bike distribution amid variable demand. The goal is to build a regression model that estimates daily rental counts (cnt) based on features such as temperature, humidity, weather conditions, seasonality, and holidays.
Source: Public Kaggle dataset (daily records from 2011–2012)
Size: 731 records, 16 columns (including the target)
Key features: season, yr, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed, casual, registered
Target: cnt (total rentals = casual + registered)
Inspected the data structure, checked for duplicates and missing values, visualized distributions (histograms, boxplots) and scatter matrices, and ran a correlation analysis to identify key predictors (e.g., cnt correlates strongly and positively with temp and atemp)
One-hot encoding: season, weathersit, and other categorical variables
Feature scaling/normalization
Train/test split
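The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration on a synthetic stand-in table, not the project's actual notebook code: the random values, the 80/20 split ratio, the random seed, and the choice of StandardScaler are all assumptions. Since cnt is defined as casual + registered, the sketch also drops those two columns from the predictors to avoid leaking the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the daily table (hypothetical values, real column names).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "season": rng.integers(1, 5, n),
    "weathersit": rng.integers(1, 4, n),
    "temp": rng.random(n),
    "hum": rng.random(n),
    "windspeed": rng.random(n),
    "casual": rng.integers(0, 1000, n),
    "registered": rng.integers(0, 5000, n),
})
df["cnt"] = df["casual"] + df["registered"]

# casual + registered equals cnt, so keeping them as predictors would leak the target.
X = df.drop(columns=["cnt", "casual", "registered"])
y = df["cnt"]

# One-hot encode the categorical codes; drop_first removes one redundant column per variable.
X = pd.get_dummies(X, columns=["season", "weathersit"], drop_first=True, dtype=float)

# Assumed 80/20 split; the source does not state the ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training rows only, then apply it to both splits.
num_cols = ["temp", "hum", "windspeed"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train[num_cols] = scaler.transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

Fitting the scaler after the split keeps test-set statistics out of the training pipeline, which matters when the reported test R² is meant to reflect unseen data.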
Implemented and compared multiple regression techniques:
Linear models:
Linear Regression
Ridge
Lasso
Polynomial Regression (degrees 2–3)
Partial Least Squares (PLS)
Ensemble methods:
Random Forest
XGBoost
Hyperparameter tuning via GridSearchCV with 5-fold cross-validation.
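As a rough illustration of this tuning step, the sketch below runs GridSearchCV with 5-fold cross-validation on synthetic data. scikit-learn's GradientBoostingRegressor stands in for XGBoost here, and the parameter grid is hypothetical; the grid actually searched in the project is not given in this summary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (assumption), used only to make the example runnable.
rng = np.random.default_rng(1)
X = rng.random((200, 4))
y = 3000 * X[:, 0] + 1000 * np.sin(6 * X[:, 1]) + rng.normal(0, 150, 200)

# Hypothetical grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the sign
)
search.fit(X, y)

# Flip the sign back to get the cross-validated RMSE of the best setting.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 1))
```

After fitting, `search.best_estimator_` is already refit on the full data passed to `fit`, so it can be evaluated directly on a held-out test set.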
Metrics: RMSE, MAE, R²
Experiments with categorical encodings to assess impact on performance
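For reference, the three metrics named above (RMSE, MAE, R²) can be computed directly from predictions, as in this small sketch. The four values are made up for illustration and the `regression_metrics` helper is a hypothetical name; `sklearn.metrics` provides equivalent functions such as `mean_absolute_error` and `r2_score`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))          # penalizes large errors more
    mae = np.mean(np.abs(residuals))                 # average absolute error
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Made-up daily counts and predictions, for illustration only.
rmse, mae, r2 = regression_metrics([100, 200, 300, 400], [110, 190, 320, 380])
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.2f}")
# prints: RMSE=15.81  MAE=15.00  R2=0.98
```

RMSE and MAE are in the same units as cnt (rentals per day), while R² is unitless, which is why the two kinds of metric are usually reported together.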
Best model: XGBoost (without additional categorical adjustments)
Test R²: 0.896
RMSE: ~645
XGBoost captured non-linear patterns effectively
Performance across all models was constrained by the small dataset (731 daily records).
Opportunities: expand the dataset, apply time-series modeling
Pandas, seaborn, matplotlib (correlation matrices, histograms, scatter plots)
Regression modeling, feature engineering (polynomial features, scaling), hyperparameter tuning (GridSearchCV), cross-validation (KFold), evaluation (MSE, RMSE, MAE, R²)
Regularization (Ridge/Lasso), dimensionality reduction (PLS), ensemble learning (Random Forest, XGBoost)
Python (scikit-learn, XGBoost, NumPy), Jupyter Notebook for reproducible workflows
Addressed overfitting, multicollinearity, non-linearity; proposed temporal modeling for weather dependencies