
Commit 1c2de1e

merguastorfi authored and committed
PCA (#25)
* Initial pca writeup
* Expand some sections and fix changes requested in review
* Add pca code for python3
* Add code example discussion to pca.rst
* Add print output to code
1 parent 13295e0 commit 1c2de1e

File tree

8 files changed: +265 -0 lines changed

code/unsupervised/PCA/pca.py

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# A value we picked to always display the same results
# Feel free to change this to any value greater than 0 to view different random outcomes
seed = 9000

# We're using a seeded random state so we always get the same outcome
seeded_state = np.random.RandomState(seed=seed)

# Generate 150 random points (x, y pairs) from a Gaussian distribution,
# i.e. most of the points fall close to the average with a few outliers
rand_points = seeded_state.randn(150, 2)

# The @ operator performs matrix multiplication, and serves to skew our
# Gaussian points so they are correlated along a particular direction
points = rand_points @ seeded_state.rand(2, 2)
x = points[:, 0]
y = points[:, 1]

# Now we have a sample dataset of 150 points to perform PCA on, so
# go ahead and display this in a plot
plt.scatter(x, y, alpha=0.5)
plt.title("Sample Dataset")

print("Plotting our created dataset...\n")
print("Points:")
for p in points[:10, :]:
    print("({:7.4f}, {:7.4f})".format(p[0], p[1]))
print("...\n")

plt.show()

# Find two principal components from our given dataset
pca = PCA(n_components=2)
pca.fit(points)

# Once fitted, we have access to the inner mean_, components_, and
# explained_variance_ attributes; use these to add some arrows to our plot
plt.scatter(x, y, alpha=0.5)
plt.title("Sample Dataset with Principal Component Lines")
for var, component in zip(pca.explained_variance_, pca.components_):
    plt.annotate(
        "",
        xy=component * np.sqrt(var) * 2 + pca.mean_,
        xytext=pca.mean_,
        arrowprops={
            "arrowstyle": "->",
            "linewidth": 2
        }
    )

print("Plotting our calculated principal components...\n")

plt.show()

# Reduce the dimensionality of our data using a PCA transformation
pca = PCA(n_components=1)
transformed_points = pca.fit_transform(points)

# The inverse transformation simply maps the reduced data back into the
# original space. In practice this step is unnecessary; without it, the
# reduced data for this example would just lie along a single axis.
# We use it here purely for visualization.
inverse = pca.inverse_transform(transformed_points)
t_x = inverse[:, 0]
t_y = inverse[:, 1]

# Plot the original and transformed data sets
plt.scatter(x, y, alpha=0.3)
plt.scatter(t_x, t_y, alpha=0.7)
plt.title("Sample Dataset (Blue) and Transformed Dataset (Orange)")

print("Plotting our dataset with a dimensionality reduction...")

plt.show()
6 binary image files added (_img/pca1.png through _img/pca6.png)
pca.rst

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
Principal Component Analysis
============================

.. contents::
  :local:
  :depth: 2

Introduction
------------

Principal component analysis is a technique for taking a large set of
interconnected variables and distilling it into a smaller set of new
variables that best suit a model. This process of focusing on only a
few variables is called **dimensionality reduction**, and it helps
reduce the complexity of our dataset. At its root, principal component
analysis *summarizes* data.

.. figure:: _img/pca4.png

  Ref: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

Motivation
----------

Principal component analysis is extremely useful for deriving an
overall, linearly independent trend from a dataset with many variables.
It allows you to extract important relationships from variables that
may or may not be related. Another application of principal component
analysis is visualization - instead of plotting a number of different
variables, you can create just a few principal components and plot
them.

Dimensionality Reduction
------------------------

There are two types of dimensionality reduction: feature elimination
and feature extraction.

**Feature elimination** simply involves pruning features from a dataset
that we deem unnecessary. A downside of feature elimination is that we
lose any potential information contained in the dropped features.

**Feature extraction**, however, creates new variables by combining
existing features. At the cost of some simplicity or interpretability,
feature extraction allows you to retain all of the important
information held within the features.

Principal component analysis deals with feature extraction (rather than
elimination) by creating a set of independent variables called
principal components. The short sketch below contrasts the two
approaches.

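As a rough illustration (not part of the course code), the snippet
below applies both approaches to a hypothetical feature matrix ``X``
with four columns; the array name and shapes are made up for the
example:

.. code:: python

  import numpy as np
  from sklearn.decomposition import PCA

  # Hypothetical dataset: 100 samples, 4 features
  X = np.random.rand(100, 4)

  # Feature elimination: keep the first two columns and drop the rest,
  # losing whatever information the dropped columns carried
  X_eliminated = X[:, :2]

  # Feature extraction: combine all four columns into two new variables
  # (principal components) that keep as much information as possible
  X_extracted = PCA(n_components=2).fit_transform(X)
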
PCA Example
-----------

Principal component analysis is performed by considering all of our
variables and calculating a set of direction and magnitude pairs
(vectors) to represent them. For example, let's consider a small
example dataset plotted below:

.. figure:: _img/pca1.png

  Ref: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

Here we can see two directions, represented by the red and green lines.
In this scenario, the red line has the greater magnitude, as the points
are spread across a greater distance along it than along the green
direction. Principal component analysis will use the vector with the
greater magnitude to transform the data into a smaller feature space,
reducing dimensionality. For example, the above graph would be
transformed into the following:

.. figure:: _img/pca2.png

  Ref: https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c

By transforming our data in this way, we've ignored the feature that is
less important to our model - that is, variation along the red
dimension has a greater impact on our results than variation along the
green.

The mathematics behind principal component analysis is left out of this
discussion for brevity, but if you're interested in learning about it,
we highly recommend visiting the references listed at the bottom of
this page.

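For intuition, the direction and magnitude pairs described above are
the eigenvectors and eigenvalues of the data's covariance matrix. The
sketch below computes them directly with NumPy on the same kind of
random dataset the example script generates; it is illustrative only
and not part of the example script:

.. code:: python

  import numpy as np

  # The same kind of dataset that the example script generates
  rng = np.random.RandomState(seed=9000)
  points = rng.randn(150, 2) @ rng.rand(2, 2)

  # PCA centers the data and diagonalizes its covariance matrix
  covariance = np.cov(points - points.mean(axis=0), rowvar=False)
  eigenvalues, eigenvectors = np.linalg.eigh(covariance)

  # eigh returns eigenvalues in ascending order, so the last eigenvector
  # is the direction of greatest variance - the first principal component
  first_component = eigenvectors[:, -1]
  print(first_component, eigenvalues[-1])
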
Number of Components
--------------------

In the example above, we took a two-dimensional feature space and
reduced it to a single dimension. In most scenarios, though, you will
be working with far more than two variables. Principal component
analysis can be used to remove just a single feature, but it is often
useful to reduce by several. There are several strategies you can
employ to decide how many feature reductions to perform:

1. **Arbitrarily**

   This simply involves picking a number of features to keep for your
   given model. This method is highly dependent on your dataset and
   what you want to convey. For instance, it may be beneficial to
   represent your higher-order data on a 2D space for visualization.
   In this case, you would perform feature reduction until you have
   two features.

2. **Percent of cumulative variability**

   Part of the principal component analysis calculation involves
   finding a proportion of variance that approaches 1 as each
   successive principal component is included. This method of choosing
   the number of feature reduction steps involves selecting a target
   variance percentage (a code sketch follows this list). For
   instance, let's look at a graph of cumulative variance at each
   level of PCA for a theoretical dataset:

   .. figure:: _img/pca3.png

     Ref: https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca

   The above image is called a scree plot, and it shows the cumulative
   and individual proportion of variance for each principal component.
   If we wanted at least 80% cumulative variance, we would use at
   least 6 principal components based on this scree plot. Aiming for
   100% variance is not generally recommended; reaching it with fewer
   components than original features is only possible when the dataset
   contains redundant data.

3. **Percent of individual variability**

   Instead of adding principal components until we reach a target
   cumulative percent of variability, we can add principal components
   until a new component wouldn't contribute much variability on its
   own. In the plot above, we might choose to use 3 principal
   components, since the components after that point contribute
   comparatively little additional variability.

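The second strategy maps directly onto scikit-learn: the fitted
``explained_variance_ratio_`` attribute holds the per-component
proportions of variance, and their cumulative sum tells you how many
components reach a target. The sketch below assumes a made-up feature
matrix ``X``; everything except the scikit-learn attribute and
parameter names is invented for the example:

.. code:: python

  import numpy as np
  from sklearn.decomposition import PCA

  # Hypothetical dataset: 200 samples, 10 features
  X = np.random.rand(200, 10)

  # Fit with all components, then inspect the cumulative variance
  pca = PCA().fit(X)
  cumulative = np.cumsum(pca.explained_variance_ratio_)
  n_components = int(np.argmax(cumulative >= 0.80)) + 1
  print("Components needed for 80% cumulative variance:", n_components)

  # scikit-learn can also make the selection itself: passing a float
  # between 0 and 1 keeps enough components to reach that fraction
  reduced = PCA(n_components=0.80).fit_transform(X)
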
Conclusion
----------

Principal component analysis is a technique for summarizing data, and
it is highly flexible depending on your use case. It can be valuable
for both displaying and analyzing a large number of possibly dependent
variables. Techniques for performing principal component analysis range
from arbitrarily choosing the number of principal components to keep,
to automatically selecting components until a target variance is
reached.

Code Example
------------

Our example code, `pca.py`_, shows you how to perform principal
component analysis on a dataset of random x, y pairs. The script goes
through a short process of generating this data, then calls sklearn's
PCA module:

.. _pca.py: https://github.com/machinelearningmindset/machine-learning-course/blob/master/code/unsupervised/PCA/pca.py

.. code:: python

  # Find two principal components from our given dataset
  pca = PCA(n_components=2)
  pca.fit(points)

Each step in the process includes helpful visualizations using
matplotlib. For instance, the principal components fitted above are
plotted as two vectors on the dataset:

.. figure:: _img/pca5.png

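The arrows in that figure are drawn from the fitted ``mean_``,
``components_``, and ``explained_variance_`` attributes with
matplotlib's ``annotate``, as in `pca.py`_:

.. code:: python

  # Draw an arrow from the mean along each principal component,
  # scaled by the square root of its explained variance
  for var, component in zip(pca.explained_variance_, pca.components_):
      plt.annotate(
          "",
          xy=component * np.sqrt(var) * 2 + pca.mean_,
          xytext=pca.mean_,
          arrowprops={"arrowstyle": "->", "linewidth": 2}
      )
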
The script also shows how to perform dimensionality reduction,
discussed above. In sklearn, this is done by simply calling the
transform method once a PCA is fitted, or by doing both steps at the
same time with fit_transform:

.. code:: python

  # Reduce the dimensionality of our data using a PCA transformation
  pca = PCA(n_components=1)
  transformed_points = pca.fit_transform(points)

The end result of our transformation is just a series of X values,
though the code example performs an inverse transformation in order to
plot the result in the following graph:

.. figure:: _img/pca6.png

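The transformed (orange) points in the plot above come from a single
inverse-transform call on the fitted PCA object, as in `pca.py`_:

.. code:: python

  # Map the one-dimensional data back into the original 2D space,
  # purely so it can be plotted alongside the original points
  inverse = pca.inverse_transform(transformed_points)
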
References
----------

1. http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
2. https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c
3. https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60
4. https://en.wikipedia.org/wiki/Principal_component_analysis
5. https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues
6. https://www.centerspace.net/clustering-analysis-part-i-principal-component-analysis-pca
