Small Business Loan Default Prediction

Project Overview

This project is focused on predicting whether a small business in the United States will default on an SBA (Small Business Administration) loan. Given the critical role of small businesses in job creation and economic growth, accurate loan default prediction can help financial institutions make informed lending decisions, mitigate risks, and support business sustainability.

Context

As a Data Scientist at the United States Small Business Administration (US SBA), your role is to develop a machine learning model to assess the risk of loan default before granting loans. The challenge is to perform binary classification (i.e., should the loan be granted? Yes or No).

Data Considerations & Challenges

Data Leakage

One of the most important aspects of this project is to avoid data leakage. This means that features that would not be available at the time of loan approval should not be used for prediction. For example:

The total amount repaid should not be used, as it is only available after the loan has been granted.
Features derived from future events should be carefully considered to prevent biased results.

Performance Metrics

Since this is a classification problem, we need to evaluate the model using appropriate metrics such as:

Confusion Matrix
ROC-AUC Score
Precision & Recall
F1 Score

A well-balanced approach to these metrics is critical to ensure that the model minimizes false negatives (approving bad loans) while maintaining a reasonable false positive rate (rejecting good loans).

Project Workflow

This project is divided into two major phases:

Phase 1: Data Cleaning, Exploration & Feature Engineering (EDA)

Load and inspect the dataset.
Handle missing values and outliers.
Perform univariate and bivariate analysis.
Conduct statistical tests where necessary.
Engineer new features and perform feature selection.

Phase 2: Model Training & Evaluation

Prepare the dataset for modeling (train-test split, scaling, encoding, etc.).
Implement and evaluate multiple models:
- Logistic Regression
- Boosting models: XGBoost, CatBoost, LightGBM
Interpret model outputs:
- Feature importance analysis.
- Bonus: Use SHAP values for explainability.

Implementation

The project is implemented using Jupyter Notebooks, where the following steps are documented:

Data Cleaning & Preprocessing: Handling missing values, outliers, encoding categorical variables, and feature selection.
Exploratory Data Analysis (EDA): Visualizing distributions, correlations, and key insights.
Modeling & Evaluation: Training various models, hyperparameter tuning, and assessing model performance.
Feature Importance & Explainability: Understanding which features influence loan repayment predictions.

Tools & Libraries Used

Python
Pandas, NumPy (Data manipulation & analysis)
Matplotlib, Seaborn (Data visualization)
Scikit-learn (Machine learning models & metrics)
XGBoost, CatBoost, LightGBM (Boosting algorithms)
SHAP (Feature interpretability)
Jupyter Notebook (Code execution & documentation)

Usage

Running the Notebooks

Clone the repository:

git clone <https://github.com/MichAdebayo/brief-loan-prediction>
cd brief-loan-prediction

Install required dependencies:
```
pip install -r requirements.txt
```
Open Jupyter Notebook and run the analysis:
```
jupyter notebook
```

Conclusion

This project provides an end-to-end solution for predicting small business loan defaults using machine learning. The insights and models developed can help financial institutions and policymakers make data-driven lending decisions while minimizing risks. Future work can include:

Implementing additional models and ensemble techniques.
Deploying the model as an API for real-time loan approval predictions.
Incorporating alternative data sources for enhanced predictive performance.

For further improvements or contributions, feel free to open an issue or submit a pull request!

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
assets		assets
model		model
notebooks		notebooks
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Small Business Loan Default Prediction

Project Overview

Context

Data Considerations & Challenges

Data Leakage

Performance Metrics

Project Workflow

Phase 1: Data Cleaning, Exploration & Feature Engineering (EDA)

Phase 2: Model Training & Evaluation

Implementation

Tools & Libraries Used

Usage

Running the Notebooks

Conclusion

Author

About

Uh oh!

Releases

Packages

Uh oh!

Languages

MichAdebayo/sba-loan-default-prediction

Folders and files

Latest commit

History

Repository files navigation

Small Business Loan Default Prediction

Project Overview

Context

Data Considerations & Challenges

Data Leakage

Performance Metrics

Project Workflow

Phase 1: Data Cleaning, Exploration & Feature Engineering (EDA)

Phase 2: Model Training & Evaluation

Implementation

Tools & Libraries Used

Usage

Running the Notebooks

Conclusion

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages