Commit 0144db4

Author: Eric Wynn
Commit message: Overfitting and Logistic Regression docs and code
Parent: 8d87d93

File tree

14 files changed: +401 additions, -81 deletions


code/overview/overfitting/example.py

Lines changed: 0 additions & 50 deletions
This file was deleted.
Lines changed: 32 additions & 0 deletions
import matplotlib.pyplot as plt


def real_funct(x):
    return [-(i**2) for i in x]


def over_funct(x):
    return [-0.5*(i**3) - (i**2) for i in x]


def under_funct(x):
    return [6*i + 9 for i in x]


# Create x values, and run them through each function.
x = range(-3, 4, 1)
real_y = real_funct(x)
over_y = over_funct(x)
under_y = under_funct(x)

# Use matplotlib to plot the functions so they can be visually compared.
plt.plot(x, real_y, 'k', label='Real function')
plt.plot(x, over_y, 'r', label='Overfit function')
plt.plot(x, under_y, 'b', label='Underfit function')
plt.legend()
plt.show()

# Output the data in a well formatted way, for the more numerically inclined.
print("An underfit model may output something like this:")
for i in range(0, 7):
    print("x: " + str(x[i]) + ", real y: " + str(real_y[i]) + ", y: " + str(under_y[i]))

print("An overfit model may look a little like this:")
for i in range(0, 7):
    print("x: " + str(x[i]) + ", real y: " + str(real_y[i]) + ", y: " + str(over_y[i]))
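The over- and underfit curves above are written by hand. As a complement, the same effect can be produced by actually fitting polynomials of different degrees; the following is a sketch, not part of this commit, assuming NumPy is available and using noisy samples of the same true function y = -x^2:

```python
import numpy as np

# Noisy samples of the true function y = -x^2 from the example above.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 7)
y = -x**2 + rng.normal(scale=0.1, size=x.size)

# Fit polynomials of increasing degree. Degree 1 underfits (a line cannot
# follow a parabola), degree 2 matches the true model, and degree 6 has
# enough freedom to pass through every noisy point -- the seed of overfitting.
errors = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x, y, degree)
    errors[degree] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print("degree", degree, "training MSE:", errors[degree])
```

Note that the training error alone is misleading here: degree 6 has the lowest training error precisely because it is memorizing the noise, which is why the next sections evaluate on data the model has not seen.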
Lines changed: 57 additions & 0 deletions
from sklearn.linear_model import LogisticRegression
import numpy as np
import random


# Defines the classification for the training data.
def true_classifier(i):
    if i >= 700:
        return 1
    return 0


# Generate a random dataset which includes random scores from 0 to 1000.
x = np.array([random.randint(0, 1000) for i in range(0, 1000)])

# The model will expect a 2D array, so we must reshape.
# For the model, the 2D array must have rows equal to the number of samples,
# and columns equal to the number of features.
# For this example, we have 1000 samples and 1 feature.
x = x.reshape((-1, 1))

# For each point, y is a pass/fail for the grade. The simple threshold is arbitrary,
# and can be changed as you would like. Classes are 1 for success and 0 for failure.
y = [true_classifier(x[i][0]) for i in range(0, 1000)]

# Again, we need a numpy array, so we convert.
y = np.array(y)

# Our goal will be to train a logistic regression model to do pass/fail to the same threshold.
model = LogisticRegression(solver='liblinear')

# The fit method actually fits the model to our training data.
model = model.fit(x, y)

# Create 100 random samples to try against our model as test data.
samples = [random.randint(0, 1000) for i in range(0, 100)]
# Once again, we need a 2D numpy array.
samples = np.array(samples)
samples = samples.reshape(-1, 1)

# Now we use our model against the samples. proba is the probability, and _class is the class.
_class = model.predict(samples)
proba = model.predict_proba(samples)

num_accurate = 0

# Finally, output the results, formatted for nicer viewing.
# The format is [<sample value>]: Class <class number>, probability [<probability for class 0> <probability for class 1>]
# So, the probability array is the probability of failure, followed by the probability of passing.
# In an example run, [7]: Class 0, probability [9.99966694e-01 3.33062825e-05]
# means that for value 7, the class is 0 (failure) and the probability of failure is 99.9%.
for i in range(0, 100):
    if true_classifier(samples[i][0]) == _class[i]:
        num_accurate = num_accurate + 1
    print(str(samples[i]) + ": Class " + str(_class[i]) + ", probability " + str(proba[i]))
# Skip a line to separate the overall result from the sample output.
print("")
print(str(num_accurate) + " out of 100 correct.")
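The script above already evaluates on fresh random samples rather than the training set, which is the right instinct for catching overfitting. scikit-learn can make that split explicit; the following is a sketch, not part of this commit, using `train_test_split` on the same synthetic pass/fail data so the accuracy check runs only on held-out samples:

```python
import numpy as np
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same synthetic data as above: scores 0-1000, with 700 as the passing threshold.
random.seed(0)
x = np.array([random.randint(0, 1000) for _ in range(1000)]).reshape(-1, 1)
y = np.array([1 if v >= 700 else 0 for v in x.ravel()])

# Hold out 20% of the data; the model never sees it during fit().
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=0)

model = LogisticRegression(solver='liblinear').fit(x_train, y_train)

# score() returns mean accuracy. Similar training and validation accuracy
# suggests the model is not overfit to the training set.
print("train accuracy:", model.score(x_train, y_train))
print("validation accuracy:", model.score(x_val, y_val))
```

Because the held-out 200 samples played no part in fitting, their accuracy is a more honest estimate of how the model would behave on new scores.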

docs/source/content/overview/overfitting.rst

Lines changed: 81 additions & 7 deletions
----------------------------
Overview
----------------------------
When using machine learning, there are many ways to go wrong.
Some of the most common issues in machine learning are **overfitting** and **underfitting**.
To understand these concepts, let's imagine a machine learning model that is
trying to learn to classify numbers, and has access to a training set of data and a testing set of data.

----------------------------
Overfitting
----------------------------

A model suffers from **Overfitting** when it has learned too much from the
training data, and does not perform well in practice as a result.
This is usually caused by the model having too much exposure to the training data.
For the number classification example, if the model is overfit in this way, it
may be picking up on tiny details that are misleading, like stray marks as an indication of a specific number.

The estimate looks pretty good when you look at the middle of the graph, but the edges have large error.
In practice, this error isn't always at edge cases and can pop up anywhere.
The noise in training can cause error as seen in the graph below.

.. figure:: _img/Overfit_small.png
   :scale: 10 %
   :alt: Overfit

   (Created using https://www.desmos.com/calculator/dffnj2jbow)

In this example, the data is overfit by a polynomial of too high a degree.
The indicated points are true to the function y = x^2, but the estimate does not approximate the function well outside of those points.

----------------------------
Underfitting
----------------------------

A model suffers from **Underfitting** when it has not learned enough from the
training data, and does not perform well in practice as a result.
As a direct contrast to the previous idea, this issue is caused by not letting
the model learn enough from the training data.
In the number classification example, if the training set is too small or the
model has not had enough attempts to learn from it, then it will not be able to pick out key features of the numbers.

The issue with this estimate is clear to the human eye: the model should be
nonlinear, and is instead just a simple line.
In machine learning, this could be a result of underfitting: the model has not
had enough exposure to training data to adapt to it, and is currently in a simple state.

.. figure:: _img/Underfit.PNG
   :scale: 50 %
   :alt: Underfit

   (Created using Wolfram Alpha)

----------------------------
Motivation
----------------------------

Finding a good fit is one of the central problems in machine learning.
Gaining a good grasp of how to avoid fitting problems before even worrying
about specific methods can keep models on track.
The mindset of hunting for a good fit, rather than throwing more learning
time at a model, is very important to have.

----------------------------
Code
----------------------------

The example code for overfitting shows some basic examples based in polynomial
interpolation, trying to find the equation of a graph.
In the overfitting.py_ file, you can see that there is a true function being
modeled, as well as some estimates that are shown to not be accurate.

.. _overfitting.py: https://github.com/machinelearningmindset/machine-learning-course/blob/master/code/overview/overfitting/overfitting.py

The estimates are representations of overfitting and underfitting.
For overfitting, a higher degree polynomial is used (x cubed instead of squared).
While the data is relatively close for the chosen points, there are some artifacts outside of them.
The example of underfitting, however, does not even achieve accuracy at many of the points.
Underfitting is similar to having a linear model when trying to model a quadratic function.
The model does well on the point(s) it trained on, in this case the point used
for the linear estimate, but poorly otherwise.

----------------------------
Conclusion
----------------------------

Check out the cross-validation and regularization sections for information on
how to avoid overfitting in machine learning models.
Ideally, a good fit looks something like this:

.. figure:: _img/GoodFit.PNG
   :scale: 50 %
   :alt: Good Fit

   (Created using Wolfram Alpha)

When using machine learning in any capacity, issues such as overfitting
frequently come up, and having a grasp of the concept is very important.
The modules in this section are among the most important in the whole repository,
since regardless of the implementation, machine learning always includes these fundamentals.
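As a preview of the cross-validation idea the conclusion points to, the under/overfitting comparison can be scored automatically. This is a sketch under the assumption that scikit-learn is available; it is not part of the course code. `cross_val_score` evaluates each candidate model only on folds it was not fitted on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of the section's running example, y = -x^2.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(60, 1))
y = -X.ravel() ** 2 + rng.normal(scale=0.5, size=60)

# Score polynomial models of several degrees with 5-fold cross-validation.
# Each validation fold is never seen during fitting, so an overfit model
# cannot hide behind a low training error.
scores = {}
for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores[degree] = cross_val_score(model, X, y, cv=5).mean()
    print("degree", degree, "mean CV R^2:", round(scores[degree], 3))
```

The underfit degree-1 model scores poorly on the held-out folds, while the degree-2 model, which matches the true function, scores well; this is the mechanism behind the "validation set" advice for avoiding both failure modes.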
