
Commit 756d259

hokie45astorfi authored and committed
Knn revision (#20)
* Added code and finished tutorial for knn
* Fixed some grammar in tutorial
* Added changes requested by Eric, added a k-d diagram to tutorial, and made some minor changes
* Fixed image size
* Added content that Eric requested
* Fixed the one small change
1 parent 6d4d3dc commit 756d259

File tree

6 files changed: +214 −6 lines


code/supervised/KNN/knn.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# All the libraries we need for KNN
+import numpy as np
+import matplotlib.pyplot as plt
+
+from sklearn.neighbors import KNeighborsClassifier
+# This is used for our dataset
+from sklearn.datasets import load_breast_cancer
+
+
+# =============================================================================
+# We are using sklearn datasets to create the set of data points about breast cancer
+# data is the set of data points
+# target is the classification of those data points
+# More information can be found at:
+# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer
+# =============================================================================
+dataCancer = load_breast_cancer()
+
+# The data[:, x:n] slice gets two features from the data given.
+# The : part gets all the rows in the matrix, and 0:2 gets the first 2 columns.
+# If you want a different pair of features you can replace 0:2 with 1:3, 2:4, ... 28:30;
+# there are 30 features in the set, so the upper bound can only go up to 30.
+# If we wanted a 3-dimensional plot, the difference between x and n would need to be 3 instead of 2.
+data = dataCancer.data[:, 0:2]
+target = dataCancer.target
+
+# =============================================================================
+# This creates the KNN classifier and specifies the algorithm being used and the
+# k nearest neighbors used for the algorithm. More information about KNeighborsClassifier
+# can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
+#
+# Then it trains the model using the breast cancer dataset.
+# =============================================================================
+model = KNeighborsClassifier(n_neighbors = 9, algorithm = 'auto')
+model.fit(data, target)
+
+
+# Plots the points
+plt.scatter(data[:, 0], data[:, 1], c=target, s=30, cmap=plt.cm.prism)
+
+# Creates the axis bounds for the grid
+axis = plt.gca()
+x_limit = axis.get_xlim()
+y_limit = axis.get_ylim()
+
+# Creates a grid to evaluate the model
+x = np.linspace(x_limit[0], x_limit[1])
+y = np.linspace(y_limit[0], y_limit[1])
+X, Y = np.meshgrid(x, y)
+xy = np.c_[X.ravel(), Y.ravel()]
+
+# Creates the boundary that will separate the data
+boundary = model.predict(xy)
+boundary = boundary.reshape(X.shape)
+
+
+# Plot the decision boundary
+axis.contour(X, Y, boundary, colors = 'k')
+
+# Shows the graph
+plt.show()

docs/source/content/supervised/knn.rst

Lines changed: 143 additions & 6 deletions
@@ -2,32 +2,169 @@
 k-Nearest Neighbors
 ====================
 
-K-Nearest Neighbors (KNN) is a basic classifier for machine learning. So we are trying to identify what class an object is in. To do this we look at the closest points (neighbors) to the object and the class with the majority of neighbors will be the class that we identify the object to be in. The k is the number of nearest neighbors to the object. So if k = 1 then the class the object would be in is the class of the closest neighbor. Let's look at an example.
+.. contents::
+  :local:
+  :depth: 3
+
+Introduction
+-------------
+
+K-Nearest Neighbors (KNN) is a basic classifier for machine learning.
+A **classifier** takes an already labeled data set, and then it tries to
+label new data points into one of the categories.
+So, we are trying to identify what class an object is in. To do this we
+look at the closest points (neighbors) to the object, and the class with
+the majority of neighbors will be the class that we identify the object
+to be in. The k is the number of nearest neighbors to the object. So, if
+k = 1 then the class the object would be in is the class of the closest
+neighbor. Let's look at an example.
 
 .. figure:: _img/knn.png
    :scale: 50 %
    :alt: KNN
 
 Ref: https://coxdocs.org
 
-So in this example we are trying to classify the red star to be either a green square or a blue octagon. So first if we look at the inner circle where k = 3, we can see that there are 2 blue octagons and 1 green square. So there is a majority of blue octagons, so the red star would be classified as a blue octagon. Now we take a look at k = 5, the outer circle. In this one there is 2 blue octagons and 3 green squares. So the red star would be classified as a green square.
+In this example we are trying to classify the red star as either
+a green square or a blue octagon. First, if we look at the inner circle
+where k = 3, we can see that there are 2 blue octagons and 1 green square.
+Since the blue octagons are the majority, the red star would be classified
+as a blue octagon. Now we look at k = 5, the outer circle. In this one
+there are 2 blue octagons and 3 green squares. Then, the red star would be
+classified as a green square.
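
To make the majority vote concrete, here is a minimal sketch of it in plain Python; the points and labels below are made up for illustration and are not the figure's actual data.

.. code:: python

    from collections import Counter

    import numpy as np

    # Hypothetical labeled points (2-D features) and their classes
    points = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [0.9, 1.1]])
    labels = ["square", "square", "octagon", "octagon", "square"]

    def knn_vote(query, points, labels, k=3):
        # Euclidean distance from the query point to every labeled point
        distances = np.linalg.norm(points - query, axis=1)
        # Indices of the k closest points
        nearest = np.argsort(distances)[:k]
        # The class with the majority of the k nearest labels wins
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]

    print(knn_vote(np.array([1.1, 1.0]), points, labels, k=3))  # -> "square"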
 
 How does it work?
 -----------------
 
-We will look at two different ways to go about this. The two ways we will look at is the brute force method and the K-D tree method.
+We will look at two different ways to go about this. The two ways are
+the brute force method and the K-D tree method.
 
 Brute Force Method
 --------------------
 
-This is the simpliest method. Basically it's just calculating the euclidean distance from the object being classified to each point in the set. You want to use this method when the dimensions are small or the number of points are small.
+This is the simplest method. Basically, it's just calculating the **Euclidean
+distance** from the object being classified to each point in the set. The Euclidean distance
+is simply the length of the line segment that connects two points. The brute force method is
+useful when the dimensions of the points are small or the number of points is small.
+As the number of points increases, the number of times the method has to calculate
+the Euclidean distance also increases, so the performance of the method drops. Luckily,
+the K-D tree method is better equipped for larger sets of data.
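
In scikit-learn you can force this behavior with ``algorithm='brute'``. A small sketch on random data; the sizes and the labeling rule here are arbitrary, chosen only for illustration.

.. code:: python

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))            # 1000 points with 2 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a simple made-up labeling

    # 'brute' computes the distance to all 1000 points for every query
    model = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
    model.fit(X, y)
    print(model.predict([[0.2, -0.1]]))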
 
 K-D Tree Method
 -----------------
 
-This method tries to improve the running time by reducing the amount of times we calculate the euclidean distance. The idea behind this method is that if we know that two data points are close to each other and we calculate the euclidean distance to one of them and then we know that distance is roughly close to the other point. If you have a larger data set it is better to use this method.
+This method tries to improve the running time by reducing the number of times we
+calculate the Euclidean distance. The idea behind this method is that if we know
+that two data points are close to each other, then once we calculate the Euclidean distance
+to one of them, we know it is roughly the distance to the other point as well.
+Here is an example of what a K-D tree looks like.
+
+.. figure:: _img/KNN_KDTree.jpg
+   :scale: 50 %
+   :alt: KNN K-d tree
+
+Ref: https://slideplayer.com/slide/3273367/
+
+In a K-D tree, each node represents and holds data from an n-dimensional
+graph; each node corresponds to a box in that graph. First we build a K-D tree out of a set of data; then,
+when it's time to classify a point, we just look at where the point falls in the
+tree and calculate the Euclidean distance only to the points it is close to, until we reach
+k neighbors.
+
+If you have a larger data set it is recommended to use this method. This is because the cost of creating
+the K-D tree is relatively low when the data set is large, and the cost of classifying a point stays
+nearly constant as the data gets larger.
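
Scikit-learn also exposes the tree directly, so you can see the build-once, query-many pattern; a sketch on random data, with the sizes chosen arbitrarily:

.. code:: python

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10000, 2))

    tree = KDTree(X)                            # built once, up front
    dist, ind = tree.query([[0.0, 0.0]], k=9)   # the 9 nearest neighbors of one query
    print(ind)    # indices of the 9 closest points
    print(dist)   # their Euclidean distances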
 
 Choosing k
 -----------
 
-Choosing k typically depends on the dataset you are looking at. You never want to choose k = 2 because it has a very high chance that there won't be a majority class, so in the example above the there would be one of each so we wouldn't be able to classify the red star. Typically k you want the value of k to be small. As k goes to infinity all unidentified data points will always be classified to one class or the other depending on which class has more data points. So typically you don't want this to happen, so it is wise to choose a k that is relatively small.
+Choosing k typically depends on the dataset you are looking at. You never want to
+choose k = 2 because there is a very high chance that there won't be a majority class;
+in the example above there would be one of each, so we wouldn't be able to
+classify the red star. Typically, you want the value of k to be small. As k goes to
+infinity, all unidentified data points will always be classified to one class or the other,
+depending on which class has more data points. You don't want this to happen,
+so it is wise to choose a k that is relatively small.
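
One practical way to pick a small k is to try a few odd values and keep whichever scores best under cross-validation; a sketch using the same breast cancer data, with an arbitrary list of candidates:

.. code:: python

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Odd values of k avoid ties in a two-class vote
    for k in [1, 3, 5, 7, 9, 11]:
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, scores.mean())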
+
+Conclusion
+------------
+
+Here are some things to take away:
+
+- The different methods of KNN only affect the performance, not the output
+- The brute force method is best when the dimensions of the points or the number of points are small
+- The K-D tree method is best when you have a larger data set
+- Scikit-learn's KNN classifier has an 'auto' option, which decides which method to use given the data it's trained on
+
+Choosing the value of k will drastically change how the data is classified. A higher k value will ignore outliers in the data
+and a lower one will give more weight to them. If the k value is too high it will not be able to classify the data, so k needs to
+be relatively small.
+
+Motivation
+------------
+
+So why would someone use this classifier over another? Is this the best classifier? The answer to these questions is that it depends.
+There is no classifier that is best; it all depends on the data that a classifier is given. KNN might be the best for one dataset but
+not another. It's good to know about other classifiers like `Support Vector Machines`_, and then decide which one best classifies
+a given dataset.
+
+Code Example
+-------------
+
+Check out our code, `knn.py`_, to learn how to implement a k-nearest neighbor classifier using Python's Scikit-learn library.
+More information about Scikit-learn can be found `here`_.
+
+`knn.py`_ classifies a set of data on breast cancer, loaded from Scikit-learn's dataset library.
+The program will take the data and plot it on a graph, then use the KNN algorithm to best separate the data.
+The output should look like this:
+
+.. figure:: _img/knn_output_k9.png
+   :scale: 50%
+   :alt: KNN k = 9 output
+
+The green points are classified as benign.
+The red points are classified as malignant.
+The boundary line is the prediction that the classifier makes. This boundary line is determined by the k value; for this instance,
+k = 9.
+
+This loads the data from Scikit-learn's dataset library. You can change the data to whatever you would like.
+Just make sure you have data points and an array of targets to classify those data points.
+
+.. code:: python
+
+    dataCancer = load_breast_cancer()
+    data = dataCancer.data[:, :2]
+    target = dataCancer.target
+
+You can also change the k value, or n_neighbors value, which will change the classifier's predictions. It is suggested that you
+choose a k that is relatively small.
+
+You can also change the algorithm used; the options are
+{'auto', 'ball_tree', 'kd_tree', 'brute'}. These don't change the output of the prediction; they just
+change the time it takes to predict the data.
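
To convince yourself that the algorithm choice only affects speed, you could time each option on identical data; a rough sketch, where the data is random and the timings will vary by machine:

.. code:: python

    import time

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(2)
    X = rng.normal(size=(20000, 2))
    y = (X[:, 0] > 0).astype(int)   # a made-up two-class labeling

    for alg in ['brute', 'kd_tree', 'ball_tree']:
        model = KNeighborsClassifier(n_neighbors=9, algorithm=alg).fit(X, y)
        start = time.perf_counter()
        model.predict(X[:5000])
        # The predictions are identical across algorithms; only the timing changes
        print(alg, round(time.perf_counter() - start, 3), "seconds")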
+
+Try changing the value of n_neighbors to 1 in the code below.
+
+.. code:: python
+
+    model = KNeighborsClassifier(n_neighbors = 9, algorithm = 'auto')
+    model.fit(data, target)
+
+If you change the value of n_neighbors to 1, the model will classify by the single point closest to the query point. The output should look like this:
+
+.. figure:: _img/knn_output_k1.png
+   :scale: 50%
+   :alt: KNN k = 1 output
+
+Comparing this output to k = 9, you can see a large difference in how it classifies the data. If you want to ignore outliers you
+will want a higher k value; otherwise choose a smaller k like 1, 3, or 5. You can experiment by choosing a very high k, greater than 100.
+Eventually the algorithm will classify all the data into 1 class, and there will be no line to split the data.
+
+.. _here: https://scikit-learn.org
+
+.. _knn.py: https://github.com/machinelearningmindset/machine-learning-course/blob/master/code/supervised/KNN/knn.py
+
+.. _Support Vector Machines: linear_SVM.rst

docs/source/content/supervised/linear_SVM.rst

Lines changed: 8 additions & 0 deletions
@@ -2,6 +2,13 @@
 Linear Support Vector Machines
 ==============================
 
+.. contents::
+  :local:
+  :depth: 3
+
+Introduction
+-------------
+
 A **Support Vector Machine** (SVM for short) is another machine learning algorithm that is used to classify data.
 The point of SVMs is to try and find a line or **hyperplane** to divide a dimensional space which best classifies
 the data points. If we were trying to divide two classes A and B, we would try to best separate the two classes with a
@@ -129,6 +136,7 @@ by using kernel tricks, so try using the different kernels like Radial Basis Fun
 
 Code Example
 -------------
+
 Check out our code, `linear_svm.py`_ to learn how to implement a linear SVM using Python's Scikit-learn library.
 More information about Scikit-Learn can be found `here`_.
 
0 commit comments
