#############################
Convolutional Neural Networks
#############################

.. contents::
  :local:
  :depth: 2


********
Overview
********
In the last module, we started our dive into deep learning by talking about
multi-layer perceptrons. In this module, we will learn about **convolutional
neural networks**, also called **CNNs** or **ConvNets**. CNNs differ from
other neural networks in that sequential layers are not necessarily fully
connected. This means that a subset of the input neurons may feed into only a
single neuron in the next layer. Another interesting feature of CNNs is their
inputs. With other neural networks we might use vectors as inputs, but with
CNNs we are typically working with images and other objects with many
dimensions. *Figure 1* shows two sample images that are each 6 pixels by 6
pixels. The first image is in color and has three channels for red, green,
and blue values. The second image is black-and-white and has only one channel
for gray values.

.. figure:: _img/Images.png

   **Figure 1. Two sample images and their color channels**
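
Before feeding an image into a CNN, we represent it as an array of numbers.
As a minimal sketch (with invented pixel values), the two images in
*Figure 1* could be stored as arrays of shape height x width x channels:

.. code-block:: python

   import numpy as np

   # A 6x6 color image has three channels (red, green, blue).
   color_image = np.random.randint(0, 256, size=(6, 6, 3))
   # A 6x6 black-and-white image has a single gray channel.
   gray_image = np.random.randint(0, 256, size=(6, 6, 1))

   print(color_image.shape, gray_image.shape)  # (6, 6, 3) (6, 6, 1)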


**********
Motivation
**********
CNNs are widely used in computer vision, where we are trying to analyze
visual imagery. CNNs can also be used for other applications such as natural
language processing. We will be focusing on the former case here because it
is one of the most common applications of CNNs.

Because we assume that we're working with images, we can design our
architecture so that it specifically does a good job of analyzing images.
Images have a height, a width, and one or more channels for color. In an
image, there might be lines and edges that make up shapes, as well as more
complex structures such as cars and faces. We will potentially need to
identify a large set of relevant features in order to properly classify an
image. But just identifying individual features in an image usually isn't
enough. Say we have an image that may or may not be a face. If we saw three
noses, an eye, and an ear, we probably wouldn't call it a face even though
those are common features of a face. So we must also care about where
features are located in the image and their proximity to other features.
This is a lot of information to keep track of! Fortunately, the architecture
of CNNs covers a lot of these requirements.


************
Architecture
************
The architecture of a CNN can be broken down into an input layer, a set of
hidden layers, and an output layer. These are shown in *Figure 2*.

.. figure:: _img/Layers.png

   **Figure 2. The layers of a CNN**

The hidden layers are where the magic happens. The hidden layers break down
our input image in order to identify the features present in it. The initial
layers focus on low-level features such as edges, while the later layers get
progressively more abstract. At the end of all the layers, we have a fully
connected layer with one neuron for each of our classification values. What
we end up with is a probability for each of the classification values. We
choose the classification with the highest probability as our guess for what
the image shows.

Below, we will talk about some types of layers we might use in our hidden
layers. Remember that sequential layers are not necessarily fully connected,
with the exception of the final output layer.

Convolutional Layers
====================
The first type of layer we will discuss is called a **convolutional layer**.
The name comes from the concept of a convolution in mathematics. Roughly, a
convolution is an operation that acts on two input functions and produces an
output function combining the information present in the inputs. The first
input will be our image and the second input will be some sort of filter,
such as a blur or sharpen filter. When we combine our image with the filter,
we extract some information about the image. This process is shown in
*Figure 3*. This is precisely how a CNN goes about extracting features.

.. figure:: _img/Filtering.png

   **Figure 3. An image before and after filtering**

In the human eye, a single neuron is only responsible for a small region of
our field of view. It is through many neurons with overlapping regions that
we are able to see the world. CNNs are similar. The neurons in a
convolutional layer are only responsible for analyzing a small region of the
input image, but they overlap so that we ultimately analyze the whole image.
Let's examine that filter concept we mentioned above.

The **filter** or **kernel** is one of the functions used in the convolution.
The filter will likely have a smaller height and width than the input image
and can be thought of as a window sliding over the image. *Figure 4* shows a
sample filter and the region of the image it will interact with in the first
step of the convolution.

.. figure:: _img/Filter1.png

   **Figure 4. A sample filter and sample window of an image**

As the filter moves across the image, we calculate values for the
convolution's output, called a **feature map**. At each step, we multiply
the entries of the image window and the filter elementwise and sum up all
the products. This sum becomes an entry in the feature map. This process is
shown in *Figure 5*.

.. figure:: _img/Filter2.png

   **Figure 5. Calculating an entry in the feature map**

After the window traverses the entire image, we have the complete feature
map. This is shown in *Figure 6*.

.. figure:: _img/Filter3.png

   **Figure 6. The complete feature map**
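
To make the sliding-window arithmetic concrete, here is a minimal NumPy
sketch of the process shown in *Figures 4-6*. The image and filter values
are invented for illustration. (Strictly speaking, this computes a
cross-correlation, which is what deep learning libraries actually implement
under the name "convolution.")

.. code-block:: python

   import numpy as np

   def convolve2d(image, kernel):
       """Slide the kernel over the image (stride 1, no padding),
       multiplying elementwise and summing at each position."""
       ih, iw = image.shape
       kh, kw = kernel.shape
       out_h, out_w = ih - kh + 1, iw - kw + 1
       feature_map = np.zeros((out_h, out_w))
       for i in range(out_h):
           for j in range(out_w):
               window = image[i:i + kh, j:j + kw]
               feature_map[i, j] = np.sum(window * kernel)
       return feature_map

   image = np.arange(36).reshape(6, 6)     # a fake 6x6 grayscale image
   kernel = np.array([[1, 0, -1],
                      [1, 0, -1],
                      [1, 0, -1]])         # a simple vertical-edge filter
   print(convolve2d(image, kernel).shape)  # (4, 4)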

In the example above, we moved the filter one unit horizontally or one unit
vertically from its previous position. This step size is called the
**stride**. We could have used other values for the stride, but a stride of
one is the most common choice.

You may have noticed that the feature map we ended up with has a smaller
height and width than the original image. This is a result of the way we
moved the filter around the image. If we wanted the feature map to have the
same height and width, we could **pad** the image. This involves adding zero
entries around the image so that moving the filter over it preserves the
original dimensions in the feature map. *Figure 7* illustrates this process.

.. figure:: _img/Padding.png

   **Figure 7. Padding before applying a filter**
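
The effect of filter size, stride, and padding on the feature map's
dimensions can be summarized in one standard formula. For an input of width
:math:`W`, a filter of width :math:`F`, a padding of :math:`P` zeros on each
side, and a stride of :math:`S`, the width of the feature map is

.. math::

   O = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1

and the same formula applies to the height. For example, the 6-wide image
and 3-wide filter above, with no padding and a stride of one, give
:math:`O = (6 - 3 + 0)/1 + 1 = 4`, a 4-wide feature map.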

A feature map represents one type of feature we're analyzing the image for.
Often, we want to analyze the image for a bunch of features, so we end up
with a bunch of feature maps! The output of the convolutional layer is a set
of feature maps. *Figure 8* shows the process of going from an image to the
resulting feature maps.

.. figure:: _img/Convo_Output.png

   **Figure 8. The output of a convolutional layer**

After a convolutional layer, it is common to have a **ReLU** (rectified
linear unit) layer. The purpose of this layer is to introduce non-linearity
into the system. Real-world problems are rarely nice and linear, so we want
our CNN to account for this when it trains. A good explanation of this layer
requires math that we don't expect you to know. If you are curious about the
topic, you can find an explanation here_.

.. _here: https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning
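
Even without the math, the operation itself is simple to state: ReLU
replaces every negative value with zero and leaves positive values
unchanged. A minimal sketch applied to an invented feature map:

.. code-block:: python

   import numpy as np

   def relu(x):
       # Negative entries become 0; positive entries pass through.
       return np.maximum(0, x)

   feature_map = np.array([[-2.0, 1.5],
                           [ 0.5, -3.0]])
   print(relu(feature_map))
   # [[0.  1.5]
   #  [0.5 0. ]]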

Pooling Layers
==============
The next type of layer we will cover is called a **pooling layer**. The
purpose of pooling layers is to reduce the spatial size of the problem. This
in turn reduces the number of parameters needed for processing and the total
amount of computation in the CNN. There are several options for pooling, but
we will cover the most common approach, **max pooling**.

In max pooling, we slide a window over the input and take the maximum value
in the window at each step. This process is shown in *Figure 9*.

.. figure:: _img/Pooled.png

   **Figure 9. Max pooling on a feature map**
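
As a minimal sketch of the idea (with invented input values), here is 2x2
max pooling with a stride of two in NumPy:

.. code-block:: python

   import numpy as np

   def max_pool(feature_map, size=2, stride=2):
       h, w = feature_map.shape
       out_h = (h - size) // stride + 1
       out_w = (w - size) // stride + 1
       pooled = np.zeros((out_h, out_w))
       for i in range(out_h):
           for j in range(out_w):
               # Take the largest value in each window.
               window = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size]
               pooled[i, j] = window.max()
       return pooled

   fm = np.array([[1, 3, 2, 4],
                  [5, 6, 1, 2],
                  [7, 2, 9, 1],
                  [3, 4, 1, 8]])
   print(max_pool(fm))
   # [[6. 4.]
   #  [7. 9.]]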

Max pooling is good because it maintains important features of the input,
reduces noise by ignoring small values, and reduces the spatial size of the
problem. We can use pooling layers after convolutional layers to keep the
amount of computation manageable.

Fully Connected Layers
======================
The last type of layer we will discuss is called a **fully connected layer**.
Fully connected layers are used to make the final classification in the CNN.
They work exactly like they do in other neural networks. Before moving to the
first fully connected layer, we must flatten our input values into a
one-dimensional vector that the layer can interpret. *Figure 10* shows a
simple example of converting a multi-dimensional input into a
one-dimensional vector.

.. figure:: _img/Flatten.png

   **Figure 10. Flattening input values**
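
In code, flattening is a single reshape. A minimal sketch with invented
values:

.. code-block:: python

   import numpy as np

   feature_maps = np.arange(12).reshape(3, 2, 2)  # three fake 2x2 feature maps
   flattened = feature_maps.reshape(-1)           # one vector of length 12
   print(flattened.shape)                         # (12,)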

After doing this, we may have several fully connected layers before the
final output layer. The output layer uses some function, such as softmax_,
to convert the neuron values into a probability distribution over our
classes. This means that each class is assigned a probability of being the
right label for the image, and all of those probabilities sum to one. This
is clearly visible in *Figure 11*.

.. _softmax: https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax

.. figure:: _img/Layers_Final.png

   **Figure 11. The final probabilistic outputs**
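
Softmax itself is short enough to sketch directly. Assuming some invented
output-neuron values for three classes:

.. code-block:: python

   import numpy as np

   def softmax(z):
       # Subtracting the max keeps the exponentials numerically stable.
       exp_z = np.exp(z - np.max(z))
       return exp_z / exp_z.sum()

   logits = np.array([2.0, 1.0, 0.1])  # fake output-neuron values
   probs = softmax(logits)
   print(probs)                        # approximately [0.659 0.242 0.099]
   print(probs.sum())                  # 1.0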


********
Training
********
Now that we have the architecture in place for CNNs, we can move on to
training. Training a CNN is essentially the same as training a normal neural
network. There is some added complexity due to the convolutional layers, but
the strategies for training remain the same. Techniques such as gradient
descent with backpropagation can be used to train filter values and other
parameters in the network. As with all the other training we have covered,
having a large training set will improve the performance of the CNN. The
catch with training CNNs and other deep learning models is that they are
much more complex than the models we covered in earlier modules. This makes
training much more computationally expensive, to the point where we would
need specialized hardware like GPUs to run our code. However, we get what we
pay for, because deep learning models are much more powerful than the models
covered in earlier modules.
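
As a rough sketch of what this looks like in practice, here is a small CNN
assembled in Keras. The layer sizes are arbitrary, and ``x_train`` and
``y_train`` are assumed to be a prepared dataset of 28x28 grayscale images
labeled with one of 10 classes; this is an illustration of the ideas above,
not a tuned model.

.. code-block:: python

   from tensorflow.keras import layers, models

   model = models.Sequential([
       # Convolution + ReLU, then max pooling, repeated twice.
       layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
       layers.MaxPooling2D((2, 2)),
       layers.Conv2D(64, (3, 3), activation='relu'),
       layers.MaxPooling2D((2, 2)),
       # Flatten, then fully connected layers ending in softmax.
       layers.Flatten(),
       layers.Dense(64, activation='relu'),
       layers.Dense(10, activation='softmax'),
   ])

   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])

   # model.fit(x_train, y_train, epochs=5)  # assumes the dataset exists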


*******
Summary
*******
In this module, we learned about convolutional neural networks. CNNs differ
from other neural networks because they usually take images as input and can
have hidden layers that are not fully connected. CNNs are powerful tools
widely used in image classification applications. By using a variety of
hidden layers, we can extract features from an image and use them to
probabilistically guess a classification. CNNs are also complex models, and
understanding how they work can be an intimidating task. We hope that the
information presented gives you a better understanding of how CNNs work so
that you can continue to learn about them and deep learning.


**********
References
**********
#. https://towardsdatascience.com/convolutional-neural-networks-for-beginners-practical-guide-with-python-and-keras-dc688ea90dca
#. https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8
#. https://medium.freecodecamp.org/an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050
#. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
#. https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
#. https://www.kaggle.com/dansbecker/rectified-linear-units-relu-in-deep-learning
#. https://en.wikipedia.org/wiki/Convolutional_neural_network#ReLU_layer