
Commit e776886
ML foundation review module completed
1 parent 9908252 commit e776886

20 files changed: +5140 −0 lines

ML - Applied Machine Learning - Algorithms/01.Review of Foundation/01.Foundations - Clean Data.ipynb

Lines changed: 1325 additions & 0 deletions
@@ -0,0 +1,243 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Foundations: Split data into train, validation, and test set\n",
    "\n",
    "Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.\n",
    "\n",
    "In this section, we will split the data into train, validation, and test set in preparation for fitting a basic model in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read in Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>Fare</th>\n",
       "      <th>Family_cnt</th>\n",
       "      <th>Cabin_ind</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>7.2500</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>38.0</td>\n",
       "      <td>71.2833</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>26.0</td>\n",
       "      <td>7.9250</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>35.0</td>\n",
       "      <td>53.1000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>35.0</td>\n",
       "      <td>8.0500</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived  Pclass  Sex   Age     Fare  Family_cnt  Cabin_ind\n",
       "0         0       3    0  22.0   7.2500           1          0\n",
       "1         1       1    1  38.0  71.2833           1          1\n",
       "2         1       3    1  26.0   7.9250           0          0\n",
       "3         1       1    1  35.0  53.1000           1          1\n",
       "4         0       3    0  35.0   8.0500           0          0"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "titanic_df = pd.read_csv('../Data/titanic_cleaned.csv')\n",
    "titanic_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split into train, validation, and test set\n",
    "\n",
    "![Split Data](img/split_data.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = titanic_df.drop(['Survived'], axis=1)\n",
    "labels = titanic_df['Survived']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.4, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.6\n",
      "0.2\n",
      "0.2\n"
     ]
    }
   ],
   "source": [
    "for dataset in [y_train, y_val, y_test]:\n",
    "    print(round(len(dataset) / len(labels), 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Write out all data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train.to_csv('../Data/train_features.csv', index=False)\n",
    "X_val.to_csv('../Data/val_features.csv', index=False)\n",
    "X_test.to_csv('../Data/test_features.csv', index=False)\n",
    "\n",
    "y_train.to_csv('../Data/train_labels.csv', index=False)\n",
    "y_val.to_csv('../Data/val_labels.csv', index=False)\n",
    "y_test.to_csv('../Data/test_labels.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
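The notebook's two-step split is worth seeing as a stand-alone sketch: carving off 40% of the rows and then splitting that holdout in half yields the 60/20/20 train/validation/test proportions the notebook prints. Since `../Data/titanic_cleaned.csv` is not available outside the repo, the snippet below substitutes synthetic data of the same shape (six feature columns, binary label); everything else mirrors the notebook's calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned Titanic data; the notebook itself
# reads '../Data/titanic_cleaned.csv' instead.
rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 6))
labels = rng.integers(0, 2, size=1000)

# Step 1: carve off 40% of the rows as a temporary holdout set.
X_train, X_hold, y_train, y_hold = train_test_split(
    features, labels, test_size=0.4, random_state=42)

# Step 2: split the holdout in half -> 20% validation, 20% test overall.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)

for name, y in [('train', y_train), ('val', y_val), ('test', y_test)]:
    print(name, round(len(y) / len(labels), 2))  # 0.6, 0.2, 0.2
```

Fixing `random_state` in both calls makes the split reproducible, which matters here because the notebook persists the six resulting CSVs for reuse in later sections.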

0 commit comments