# Convolutional Neural Networks

• Author: Johannes Maucher
• Last Update: 13th December 2017

There exist different types of deep neural networks, e.g.

• Convolutional Neural Networks (CNNs)
• Deep Belief Networks (DBNs)
• Stacked Autoencoders

and many others, including variants of these.

Among these different types, CNNs are currently the most relevant ones. This notebook describes the overall architecture and the different layer types applied in a CNN. Since the most prominent application of CNNs is object recognition in images, the descriptions in this notebook refer to this use case.

## Overall Architecture

The picture below shows a very famous CNN, the so-called AlexNet, which won the ImageNet contest (ILSVRC) in 2012. In the classification task of this contest, 1000 different object categories must be recognized in images. The training set of the contest comprises roughly 1.2 million labeled images, a subset of the full ImageNet collection of about 15 million labeled images. On a computer with two GTX 580 3GB GPUs, training took about 6 days. AlexNet achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry. AlexNet can be considered an important milestone in the development and application of deep neural networks.

The input to AlexNet is a 3-channel RGB image. The dense layer at the output consists of 1000 neurons, each referring to one of the 1000 object categories distinguished in the ImageNet data. The index of the output neuron with the maximum value indicates the most probable object category.
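The decision rule and the top-5 error rate mentioned above can be illustrated with a small sketch. The scores and labels below are made-up numbers for 4 samples and 6 classes (not ImageNet data), and `top_k_error` is a hypothetical helper written for this illustration. With only 6 classes the sketch uses $k=2$ instead of $k=5$; a sample counts as correct if its true label is among the $k$ classes with the highest scores.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    # indices of the k largest scores per row (order within the k does not matter)
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# Hypothetical output scores: 4 samples (rows), 6 classes (columns)
scores = np.array([
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.60, 0.15, 0.05],
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
])
labels = np.array([1, 2, 4, 5])  # hypothetical true classes

predicted = np.argmax(scores, axis=1)      # index of the maximum output per sample
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(predicted, top1, top2)
```

In this toy setting only the first sample is classified correctly by the maximum output, while three of the four true labels are among the two highest scores.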

Between input- and output-layer, there are basically three different types of layers:

• convolutional layers
• pooling layers
• fully connected layers

Moreover, normalisation-layers, dropout-layers, deconvolution-layers and dilation-layers are also frequently applied in CNNs. In the following sections, these layer types are described.

%matplotlib inline
import scipy.ndimage as ndi
import numpy as np
from matplotlib import pyplot as plt
import matplotlib.cm as cm
grays=plt.get_cmap('gray')

np.set_printoptions(precision=2)
#grays(1)
#print cm.hot(0.3)
#print cm.hot(0.4)


## Convolution Layer

A key concept of CNNs is the convolutional layer type. This layer type applies convolutional filtering, which is a well known image processing (or more general: signal processing) technique to extract features from the given input. In contrast to conventional convolutional filtering, in CNNs the filter coefficients and thus the relevant features are learned. In this section first the concept of conventional convolutional filtering is described. Then subsection Convolutional Layer in CNNs discusses the application of this concept in CNNs.

### Concept of 2D-Convolution Filtering

Convolution filtering of a 2D-input of size $(r \times c)$ with a 2D-filter of size $(a \times b)$ is defined by applying the filter at each possible position and calculating the scalar product of the filter coefficients and the input values covered by the filter at that position. This scalar product is the filter output at the corresponding position. If the filter is applied with a stepsize of $s=1$, there exist $(r-a+1) \times (c-b+1)$ possible positions for calculating the scalar product. Hence, the output of the convolution filtering is of size $$(r-a+1) \times (c-b+1).$$

Hereafter, we assume square inputs of size $(r \times r)$ and square filters of size $(a \times a)$.

The picture below shows the filtering of a $(10 \times 10)$ input with an average filter of size $(3 \times 3)$. In this picture the filtering operation is only shown for the upper right and the lower left position. Actually, there are $(8 \times 8)$ elements in the output.

Note: In signal processing, convolution and correlation are not the same. Convolution is correlation with the filter rotated by 180 degrees. In the context of neural networks, where the filter coefficients are learned, this distinction can be ignored. As is common in machine learning literature, this notebook uses the terms convolution and correlation interchangeably.
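The relation between convolution and correlation stated in the note can be checked numerically. The following sketch uses a made-up, asymmetric $(3 \times 3)$ filter and a made-up input; it compares `scipy.ndimage.convolve` with `scipy.ndimage.correlate` applied to the filter rotated by 180 degrees.

```python
import numpy as np
import scipy.ndimage as ndi

# Small test input and asymmetric filter (made-up values)
x = np.arange(25, dtype=np.float64).reshape(5, 5)
w = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0],
              [3.0, 0.0, -2.0]])

# Rotating a 2D filter by 180 degrees is the same as flipping both axes
w_rot = w[::-1, ::-1]

conv = ndi.convolve(x, w, mode='constant')
corr_rot = ndi.correlate(x, w_rot, mode='constant')
corr = ndi.correlate(x, w, mode='constant')

same_as_rotated = np.allclose(conv, corr_rot)  # convolution vs. correlation with rotated filter
same_as_plain = np.allclose(conv, corr)        # convolution vs. correlation with the same filter
print(same_as_rotated, same_as_plain)
```

For a symmetric filter such as the average filter used later in this notebook, both operations give identical results, which is one more reason why the distinction is often dropped.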

#### Stepsize

The filter need not be shifted across the input in steps of $s=1$. Larger stepsizes yield a correspondingly smaller output. The picture below compares filtering with a stepsize of $s=1$ (upper part) to filtering the same input with a stepsize of $s=2$ (lower part).

#### Zero-Padding

Besides the stepsize, padding is an important parameter of convolutional filtering. A padding of $p>0$ means that $p$ columns of zeros are attached at the left and right boundary and $p$ rows of zeros at the upper and lower boundary of the input. In the picture below, a padding of $p=1$ (lower part) is compared with convolutional filtering without padding (upper part). Often zero-padding is chosen such that the size of the output equals the size of the input. For a stepsize of $s=1$ and an odd filter width $a$, this is the case if $$p=\frac{a-1}{2}$$

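For a square input with side length $r$, a square filter with side length $a$, a padding of $p$, and a stepsize of $s$, the square output of convolution filtering has a side length of $$o=\left\lfloor\frac{r-a+2p}{s}\right\rfloor+1.$$ The floor operation matters only if $s$ does not divide $r-a+2p$. In this case the last rows and columns of the (padded) input, which do not fit a complete filter position, are ignored.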
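The output-size formula can be checked with a naive implementation of filtering with padding and stepsize. This is only a sketch for illustration: `conv_output_size` and `filter2d` are hypothetical helper names, and square inputs and filters are assumed, as in the text.

```python
import numpy as np

def conv_output_size(r, a, p=0, s=1):
    """Side length of the output for input size r, filter size a, padding p, stepsize s."""
    return (r - a + 2 * p) // s + 1

def filter2d(x, w, p=0, s=1):
    """Naive filtering (correlation) of a square input with zero-padding p and stepsize s."""
    xp = np.pad(x, p, mode='constant', constant_values=0)
    a = w.shape[0]
    o = conv_output_size(x.shape[0], a, p, s)
    out = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            region = xp[i * s:i * s + a, j * s:j * s + a]
            out[i, j] = np.sum(region * w)  # scalar product of filter and covered region
    return out

x = np.zeros((10, 10))
x[3:7, 3:7] = 1
w = np.ones((3, 3)) / 9.0  # average filter

# output shapes for different (padding, stepsize) combinations
shapes = {
    (0, 1): filter2d(x, w, p=0, s=1).shape,
    (1, 1): filter2d(x, w, p=1, s=1).shape,
    (0, 2): filter2d(x, w, p=0, s=2).shape,
    (1, 2): filter2d(x, w, p=1, s=2).shape,
}
print(shapes)
```

For the $(10 \times 10)$ input and the $(3 \times 3)$ filter, the side lengths are 8 without padding, 10 with $p=1$, and 4 and 5 for a stepsize of $s=2$ without and with padding, respectively.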

#### Example: 2D-convolution filter

Below, a 2D input x1 of shape (10,10) is defined as a numpy array. The signal values are 1 (white) in the central $(4 \times 4)$ region and 0 (black) elsewhere. The matplotlib method imshow() is used to display the 2-dimensional signal.

x1=np.zeros((10,10))
x1[3:7,3:7]=1
print(x1)

[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  1.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  1.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  1.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  1.  1.  1.  1.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

dx=0.1
dy=0.1
plt.figure(num=None, figsize=(14,7), dpi=80, facecolor='w', edgecolor='k')
plt.imshow(x1,cmap=plt.cm.gray,interpolation='none')
# annotate each pixel with its value, using a text color that contrasts with the pixel
for col in range(10):
    for row in range(10):
        number="%d"%x1[row,col]
        plt.text(col-dx,row+dy,number,fontsize=14,
                 color=grays(1-x1[row,col]))
plt.title('2-Dimensional Input Image')
plt.show()


A 2-dimensional average-filter shall be applied on this image. We implement this filter as a numpy-array:

FSIZE=3
avgFilter=1.0/FSIZE**2*np.ones((FSIZE,FSIZE))
print("Coefficients of average filter:\n",avgFilter)

Coefficients of average filter:
[[ 0.11  0.11  0.11]
[ 0.11  0.11  0.11]
[ 0.11  0.11  0.11]]


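The filtering operation is performed in the following code cell. Note that with mode='constant' the correlate() function treats all values outside the input as zeros (the default fill value), so that the output has the same shape as the input. For an odd filter width $a$ this corresponds to zero-padding of size $p=\frac{a-1}{2}$.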

avgImg=ndi.correlate(x1,avgFilter, output=np.float64, mode='constant')
#print "Result of average filtering:\n",avgImg


The filter result calculated above is visualized in the code cell below:

dx=0.2
plt.figure(num=None, figsize=(14,7), dpi=80, facecolor='w', edgecolor='k')
plt.imshow(avgImg,cmap=plt.cm.gray,interpolation='none')
# annotate each pixel with its filter response
for col in range(10):
    for row in range(10):
        number="%1.2f"%avgImg[row,col]
        plt.text(col-dx,row+dy,number,fontsize=10,
                 color=grays(1-avgImg[row,col]))
plt.title('Response on Average filter')
plt.show()
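The padding behaviour of correlate() noted above can be verified with a short sketch. It compares the result of `ndi.correlate` with mode='constant' against an explicit variant that first pads the input with $p=\frac{a-1}{2}$ rows and columns of zeros and then evaluates the scalar product at every position. The variable names are chosen for this illustration only.

```python
import numpy as np
import scipy.ndimage as ndi

x1 = np.zeros((10, 10))
x1[3:7, 3:7] = 1
a = 3
avg_filter = np.ones((a, a)) / a**2

# Filtering as done in this notebook: zeros are assumed outside the input
same_size = ndi.correlate(x1, avg_filter, output=np.float64, mode='constant')

# Explicit variant: pad with p zeros, then evaluate all fully covered positions
p = (a - 1) // 2
padded = np.pad(x1, p, mode='constant', constant_values=0)
explicit = np.zeros_like(x1)
for i in range(x1.shape[0]):
    for j in range(x1.shape[1]):
        explicit[i, j] = np.sum(padded[i:i + a, j:j + a] * avg_filter)

matches = np.allclose(same_size, explicit)
# Dropping the p border rows and columns gives the result without padding
no_padding = same_size[p:-p, p:-p]
print(matches, same_size.shape, no_padding.shape)
```

The trimmed array has the size $(r-a+1) \times (r-a+1) = (8 \times 8)$ expected for filtering without padding.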


#### Example: Gradient Filters for Edge Detection

A 2-dimensional filter, which calculates the gradient in x-direction can be implemented as 2-dimensional numpy-array:

gradx=np.array([[-1.0, 0.0, 1.0],[-2.0, 0.0, 2.0],[-1.0, 0.0, 1.0],])
print(gradx)

[[-1.  0.  1.]
[-2.  0.  2.]
[-1.  0.  1.]]


The filter defined above is the well known Sobel Filter, which is frequently applied for edge-detection. The response of this filter applied to our example image can be calculated as follows:

gradxImg=ndi.correlate(x1,gradx, output=np.float64, mode='constant')
#print "Result of x-gradient Filtering:\n",gradxImg
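To see what the x-gradient filter responds to, a few individual values of the response can be inspected. The sketch below redefines the example input and the Sobel filter so that it runs on its own; the names `left_edge`, `inside`, `right_edge` and `background` are introduced for this illustration only.

```python
import numpy as np
import scipy.ndimage as ndi

x1 = np.zeros((10, 10))
x1[3:7, 3:7] = 1
gradx = np.array([[-1.0, 0.0, 1.0],
                  [-2.0, 0.0, 2.0],
                  [-1.0, 0.0, 1.0]])

response = ndi.correlate(x1, gradx, output=np.float64, mode='constant')

left_edge = response[4, 3]    # transition from 0 to 1 in x-direction
inside = response[4, 4]       # constant region inside the square
right_edge = response[4, 6]   # transition from 1 to 0 in x-direction
background = response[0, 0]   # constant region outside the square
print(left_edge, inside, right_edge, background)
```

The response is positive where the signal increases in x-direction, negative where it decreases, and zero in regions of constant value.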


Visualization of the Sobel-x-gradient calculation:

dx=0.2
plt.figure(num=None, figsize=(14,7), dpi=80, facecolor='w', edgecolor='k')
plt.imshow(gradxImg,cmap=plt.cm.gray,interpolation='none')
# annotate each pixel with its filter response
for col in range(10):
    for row in range(10):
        number="%1.2f"%gradxImg[row,col]
        plt.text(col-dx,row+dy,number,fontsize=10,
                 color=grays(1-gradxImg[row,col]))
plt.title('Response on x-gradient filtering with Sobel')
plt.show()


For determining the gradient in y-direction the Sobel filter for the x-gradient must just be transposed:

grady=np.transpose(gradx)
print(grady)

[[-1. -2. -1.]
[ 0.  0.  0.]
[ 1.  2.  1.]]


The response of this y-gradient Sobel filter applied to our example image can be calculated and visualized as follows:

gradyImg=ndi.correlate(x1,grady, output=np.float64, mode='constant')
#print "Result of y-gradient Filtering:\n",gradyImg

dx=0.2
plt.figure(num=None, figsize=(14,7), dpi=80, facecolor='w', edgecolor='k')
plt.imshow(gradyImg,cmap=plt.cm.gray,interpolation='none')
# annotate each pixel with its filter response
for col in range(10):
    for row in range(10):
        number="%1.2f"%gradyImg[row,col]
        plt.text(col-dx,row+dy,number,fontsize=10,
                 color=grays(1-gradyImg[row,col]))
plt.title('Response on y-gradient filtering with Sobel')
plt.show()


#### Applying the Sobel filter for detecting horizontal and vertical edges in an image

In the previous subsection the Sobel filter was applied to a small 2D-input. Now, both filters are applied to a real image. The greyscale image, its gradients in x- and y-direction, and the magnitude of the gradient are plotted below:

from PIL import Image
im = np.array(Image.open("a4weiss.jpg").convert('L'))
plt.imshow(im,cmap='Greys_r')



imx=ndi.correlate(im,gradx, output=np.float64, mode='constant')
imy=ndi.correlate(im,grady, output=np.float64, mode='constant')


The output of the Sobel filter is stored in the numpy arrays imx and imy, respectively. Moreover, the magnitude of the gradient is calculated and plotted to the matplotlib figure.

magnitude=np.sqrt(imx**2+imy**2)
plt.figure(num=None, figsize=(18,12), dpi=80, facecolor='w', edgecolor='k')
plt.subplot(2,2,1)
plt.imshow(im,cmap=plt.cm.gray)
plt.title("Original")
plt.subplot(2,2,2)
plt.imshow(magnitude,cmap=plt.cm.gray)
plt.title("Magnitude of Gradient")
plt.subplot(2,2,3)
plt.imshow(imx,cmap=plt.cm.gray)
plt.title('Derivative in x-direction')
plt.subplot(2,2,4)
plt.title('Derivative in y-direction')
plt.imshow(imy,cmap=plt.cm.gray)
plt.show()


### Convolutional Layer in CNNs

In the car-image example above, 3 different filters were applied to a single input. Each filter extracted a specific feature:

• the x-component of the gradient,
• the y-component of the gradient,
• the magnitude of the gradient.

In Convolutional Neural Networks (CNNs) the concept of convolution filtering is realized as demonstrated above. However, in CNNs

• the filter coefficients are not defined by the user. Instead they are learned in the training phase, such that the filters are able to detect patterns which frequently occur in the training data. In the context of CNNs the filter coefficients are called weights.
• the output of a convolutional filter is called a feature map. All elements of a single feature map are calculated by the same set of shared weights.
• the output elements of a convolutional filter are typically fed element-wise to an activation function, which maps single scalars to other scalars.
• for a given input not only one, but multiple feature maps are calculated in parallel. Different feature maps have different sets of shared weights. In the car-image example above, three different filters calculated 3 different feature maps on a single input.
• the input does not contain only one, but multiple parallel 2D-arrays of equal size. These parallel 2D-arrays are called channels.

#### Multiple Channels at the input of the Convolutional Layer

In CNNs the input to a convolutional layer is in general not a single 2D-array, but a set of multiple arrays (channels). For example, in object recognition the input of the first convolutional layer $conv_1$ is usually the image. The number of channels is then 3 for an RGB image and 1 for a greyscale image. For all of the following convolutional layers $conv_i, \; i>1,$ the input is typically the set of (pooled) feature maps of the previous layer $conv_{i-1}$. In the picture below the calculation of the values of a single feature map from $L=3$ channels at the input of the convolutional layer is depicted. Note that for calculating the values of a single feature map, an individual filter is learned and applied for each channel. The filter responses of all channels are then summed element-wise.

The configuration sketched in the picture above is implemented in the code cell below. Each channel is implemented as a numpy-array channeli. Each of the 3 filters is also implemented as a numpy-array filteri:

channel1=np.array([[8,9,9,0,-1],
[3,2,5,1,1],
[2,3,4,1,-3],
[2,4,-1,-1,-2],
[1,1,1,1,1],])
channel2=np.array([[2,-2,0,1,-1],
[3,2,0,0,2],
[-1,1,-1,2,-2],
[1,0,3,4,-2],
[0,0,0,1,1],])
channel3=np.array([[0,1,1,2,0],
[-1,0,1,3,2],
[-1,-1,0,0,0],
[-1,0,3,0,2],
[1,2,3,1,1],])
filter1=np.array([
[0.1,-0.1,0],
[0,0.2,0],
[-0.2,-0.1,0.3]
])
filter2=np.array([
[0,0,0.3],
[-0.1,0.3,-0.2],
[-0.3,-0.1,0.1]
])
filter3=np.array([
[0.2,0,0.1],
[-0.1,0.2,0.3],
[-0.2,-0.1,0.3]
])


The convolution is realized by the scipy function correlate(). The output of the convolution filtering, i.e. the feature map, is also stored in a numpy array. Note that with mode='constant' the function correlate() applies zero-padding of $p=1$ for a filter width of $a=3$. In order to obtain the corresponding result without padding, one can just remove the rows and columns at the border of the obtained output.

out=ndi.correlate(channel1,filter1, output=np.float64, mode='constant')+ \
ndi.correlate(channel2,filter2, output=np.float64, mode='constant')+ \
ndi.correlate(channel3,filter3, output=np.float64, mode='constant')
featureMap=out[1:4,1:4] #ignore the outer rows and columns in order to obtain the non-padded result
print("Feature Map: \n %s"%featureMap)

Feature Map:
[[ 2.   2.1 -0.2]
[ 1.1 -1.1  0.9]
[ 1.   0.5  0.7]]
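As a worked check, the top-left element of this feature map can be calculated by hand: for each channel the scalar product of the filter and the top-left $(3 \times 3)$ region is computed, and the three results are summed. The sketch below copies these regions and the filters from the code cell above; `patch1` to `patch3` are names introduced for this illustration.

```python
import numpy as np

# Top-left (3 x 3) region of each channel from the example above
patch1 = np.array([[8, 9, 9], [3, 2, 5], [2, 3, 4]])
patch2 = np.array([[2, -2, 0], [3, 2, 0], [-1, 1, -1]])
patch3 = np.array([[0, 1, 1], [-1, 0, 1], [-1, -1, 0]])

filter1 = np.array([[0.1, -0.1, 0], [0, 0.2, 0], [-0.2, -0.1, 0.3]])
filter2 = np.array([[0, 0, 0.3], [-0.1, 0.3, -0.2], [-0.3, -0.1, 0.1]])
filter3 = np.array([[0.2, 0, 0.1], [-0.1, 0.2, 0.3], [-0.2, -0.1, 0.3]])

# One scalar product per channel, each with its own filter
contributions = [np.sum(p * f) for p, f in
                 ((patch1, filter1), (patch2, filter2), (patch3, filter3))]
top_left = sum(contributions)  # sum over the channels
print(np.round(contributions, 2), round(top_left, 2))
```

The three channels contribute 0.8, 0.4 and 0.8, which sums to the value 2.0 in the upper left corner of the feature map shown above.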


#### Multiple Feature Maps at the output of the Convolutional Layer

The previous subsection demonstrated the calculation of a single feature map from multiple channels at the input of the convolutional layer. In a CNN there are multiple channels at the input and multiple feature maps at the output of a convolutional layer. For the sake of simplicity, and because we already know how multiple channels at the input influence a single feature map at the output, the picture and the code cells below demonstrate how multiple feature maps at the output are calculated from a single channel. Note that each feature map has its own set of shared weights (filter coefficients) for each channel at the input of the convolutional layer:

The configuration sketched in the picture above is implemented in the code cells below.

channel1=np.array([[2,-2,0,1,-1],
[3,2,0,0,2],
[-1,1,-1,2,-2],
[1,0,3,4,-2],
[0,0,0,1,1],])
filter1=np.array([
[0,0.4,0],
[0,0.1,0],
[-0.2,0.2,0]
])
filter2=np.array([
[0,0,0],
[0.3,0,0.4],
[0.2,0.3,0.1]
])
filter3=np.array([
[0.2,0,0.4],
[0.2,0,0.3],
[0.5,-0.1,0]
])

out1=ndi.correlate(channel1,filter1, output=np.float64, mode='constant')
featureMap1=out1[1:4,1:4]
print("Feature Map 1: \n %s"%featureMap1)

Feature Map 1:
[[-0.2 -0.4  1. ]
[ 0.7  0.5  0.4]
[ 0.4 -0.1  1.4]]

out2=ndi.correlate(channel1,filter2, output=np.float64, mode='constant')
featureMap2=out2[1:4,1:4]
print("Feature Map 2: \n %s"%featureMap2)

Feature Map 2:
[[ 0.9  0.7  1. ]
[-0.2  2.4  0.5]
[ 1.5  1.7  0.5]]

out3=ndi.correlate(channel1,filter3, output=np.float64, mode='constant')
featureMap3=out3[1:4,1:4]
print("Feature Map 3: \n %s"%featureMap3)

Feature Map 3:
[[ 0.4  1.  -0.5]
[ 0.6  0.9  1.1]
[ 0.5  2.2 -1.1]]


The obtained feature-map values are processed element-wise by an activation function. These processed feature maps constitute the input channels for the following layer, either another convolution-layer, a pooling-layer or a fully-connected layer.

#### Activation function

Activation functions operate element-wise on the values of feature maps. For CNNs the most common activation functions are

• ReLU (Rectified Linear Unit)
• sigmoid
• tanh (hyperbolic tangent)
• linear

These functions are defined and visualized below. In addition, the softmax function is defined, which is applied later at the output of the fully connected layer.

def sigmoid(input):
    return 1/(1+np.exp(-input))

def tanh(input):
    return np.tanh(input)

def linear(input):
    return input

def relu(input):
    return np.maximum(np.zeros(input.shape),input)

def softmax(z):
    """Softmax activation function."""
    return np.exp(z)/np.sum(np.exp(z),axis=0)

R=6
x=np.arange(-R,R,0.01)
plt.figure(figsize=(10,10))
###Sigmoid#################
plt.subplot(2,2,1)
plt.grid()
ysig=sigmoid(x)
plt.plot(x,ysig,'b-')
plt.title("Sigmoid")
###Tanh####################
plt.subplot(2,2,2)
plt.grid()
ytan=tanh(x)
plt.plot(x,ytan,'g-')
plt.title("Tanh")
###Linear##################
ylin=linear(x)
plt.subplot(2,2,3)
plt.grid()
plt.plot(x,ylin,'m-')
plt.title("Linear")
###Relu####################
yrelu=relu(x)
plt.subplot(2,2,4)
plt.grid()
plt.plot(x,yrelu,'r-')
plt.title("ReLU")
plt.show()


In CNN convolutional layers the ReLU activation function is frequently applied. In the code-cells below ReLU is applied on the 3 feature maps from above.

featureMap1re=relu(featureMap1)
print("Feature Map 1:\n%s"%featureMap1)
print("After applying ReLU on Feature Map 1:\n%s"%featureMap1re)

Feature Map 1:
[[-0.2 -0.4  1. ]
[ 0.7  0.5  0.4]
[ 0.4 -0.1  1.4]]
After applying ReLU on Feature Map 1:
[[ 0.   0.   1. ]
[ 0.7  0.5  0.4]
[ 0.4  0.   1.4]]

featureMap2re=relu(featureMap2)
print("Feature Map 2:\n%s"%featureMap2)
print("After applying ReLU on Feature Map 2:\n%s"%featureMap2re)

Feature Map 2:
[[ 0.9  0.7  1. ]
[-0.2  2.4  0.5]
[ 1.5  1.7  0.5]]
After applying ReLU on Feature Map 2:
[[ 0.9  0.7  1. ]
[ 0.   2.4  0.5]
[ 1.5  1.7  0.5]]

featureMap3re=relu(featureMap3)
print("Feature Map 3:\n%s"%featureMap3)
print("After applying ReLU on Feature Map 3:\n%s"%featureMap3re)

Feature Map 3:
[[ 0.4  1.  -0.5]
[ 0.6  0.9  1.1]
[ 0.5  2.2 -1.1]]
After applying ReLU on Feature Map 3:
[[ 0.4  1.   0. ]
[ 0.6  0.9  1.1]
[ 0.5  2.2  0. ]]


## Pooling Layer in CNNs

Convolutional layers extract spatial features from their input. The filters (sets of shared weights) define which features are extracted. In the training phase the filters are learned, such that after learning they represent patterns, which frequently appear in the training data. Since each element in a feature map corresponds to a unique region of the input, feature maps do not only represent if the feature is contained in the current input, but also where it is contained (spatial information).

The benefits of pooling layers are:

• they reduce the size of the input channels and therefore reduce complexity
• they provide a certain degree of shift-invariance, in the sense that if a certain feature appears in two inputs not in exactly the same position but in nearby positions, they can yield the same pooled output.

Similar to a convolution layer, in pooling layers a filter is shifted across a 2D-input and calculates a single value for each position. However, in pooling layers

• the filter weights are not learned. Instead, the operation performed is a fixed and often non-linear operation. Common pooling operations are:
  • max-pooling: the filter outputs the maximum of its current input region
  • min-pooling: the filter outputs the minimum of its current input region
  • mean-pooling: the filter outputs the arithmetic mean of its current input region

  The most frequent type is max-pooling.
• the common stepsize $s$ in a pooling layer is equal to the width of the filter $w$, i.e. the pooling filter operates on non-overlapping regions. Even though $s=w$ is the common configuration, it is not mandatory, and there exist some good CNNs whose pooling layers operate in an overlapping manner with $s<w$.
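To illustrate the difference between non-overlapping ($s=w$) and overlapping ($s<w$) pooling, the following sketch applies max-pooling to a made-up $(5 \times 5)$ input. `max_pool2d` is a hypothetical helper for this illustration; it assumes a square input and uses the output size $o=\lfloor (r-w)/s \rfloor + 1$.

```python
import numpy as np

def max_pool2d(x, w, s):
    """Max-pooling of a square 2D array with filter width w and stepsize s."""
    r = x.shape[0]
    o = (r - w) // s + 1  # side length of the pooled output
    out = np.zeros((o, o))
    for row in range(o):
        for col in range(o):
            # the region always spans w rows and w columns, also when s < w
            out[row, col] = np.max(x[row * s:row * s + w, col * s:col * s + w])
    return out

# Made-up (5 x 5) input
x = np.array([
    [1, 2, 0, 1, 3],
    [0, 5, 1, 2, 0],
    [2, 1, 0, 4, 1],
    [0, 3, 2, 1, 0],
    [1, 0, 6, 0, 2],
])

non_overlapping = max_pool2d(x, w=2, s=2)  # regions do not overlap
overlapping = max_pool2d(x, w=3, s=2)      # neighbouring regions share one row or column
print(non_overlapping)
print(overlapping)
```

Note that with $w=2$ and $s=2$ the last row and column of the $(5 \times 5)$ input are not covered by any pooling region, whereas with $w=3$ and $s=2$ the whole input is covered.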

In the picture below max-pooling with a filter width of $w=s=2$ is shown. In this case pooling reduces the side length of the 2D-input by a factor of 2.

The configuration sketched in the picture above is implemented in the code cells below.

r=8 # number of rows and columns
Input=np.array([
[2,5,3,4,5,9,6,9],
[4,1,5,5,9,8,7,8],
[5,4,8,0,8,6,7,0],
[2,4,3,5,1,8,9,8],
[3,0,7,0,8,8,6,7],
[4,2,3,2,6,6,2,2],
[1,0,2,0,3,1,6,0],
[4,2,0,4,3,2,9,5],
])
w=2 # width of pooling filter
s=2 # stepsize of pooling filter
o=int(np.floor((r-w)/s)+1) # size of pooling output
print(o)
Output=np.zeros((o,o))
for row in range(o):
    for col in range(o):
        # maximum over the (w x w) region that starts at position (row*s, col*s)
        Output[row,col]=np.max(Input[row*s:row*s+w,col*s:col*s+w])
print(Output)


4
[[ 5.  5.  9.  9.]
[ 5.  8.  8.  9.]
[ 4.  7.  8.  7.]
[ 4.  4.  3.  9.]]
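The shift-invariance mentioned among the benefits of pooling can be illustrated with a small sketch. It uses made-up $(6 \times 6)$ inputs that contain a single feature response at different positions. `max_pool_2x2` is a hypothetical helper that implements non-overlapping $(2 \times 2)$ max-pooling by reshaping, assuming even side lengths.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping (2 x 2) max-pooling for arrays with even side lengths."""
    rows, cols = x.shape
    # axes 1 and 3 run over the two rows and two columns inside each pooling region
    return x.reshape(rows // 2, 2, cols // 2, 2).max(axis=(1, 3))

# Feature response at position (2, 2)
a = np.zeros((6, 6))
a[2, 2] = 1.0

# Same response shifted by one pixel, still inside the same pooling region
b = np.zeros((6, 6))
b[3, 3] = 1.0

# Shift that crosses the border to the neighbouring pooling region
c = np.zeros((6, 6))
c[4, 4] = 1.0

pooled_a, pooled_b, pooled_c = max_pool_2x2(a), max_pool_2x2(b), max_pool_2x2(c)
print(np.array_equal(pooled_a, pooled_b), np.array_equal(pooled_a, pooled_c))
```

The first two inputs yield the same pooled output, whereas the third does not. The invariance therefore holds only for shifts that stay within one pooling region.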


## Concatenation of Convolution and Pooling

As shown by the example of the AlexNet architecture, in a CNN after the input layer usually a cascade of convolution followed by pooling is applied. Each convolutional layer extracts meaningful features and their location. Each pooling layer reduces the size of the channels and thus the spatial resolution of features.

The image below shows a single sequence of convolution and pooling.

In subsections Multiple Channels and Multiple Feature Maps it was already shown how to calculate a single feature map from multiple input channels and how to calculate multiple feature maps from a single input channel. Now, we have a combination of both of them: Multiple feature maps are calculated from multiple channels. For a convolutional Layer with $C$ input channels and $F$ feature maps at the output, the entire operation is defined by an array of $F$ rows and $C$ columns. Each element of this array is a 2-dim array $W_{ij}$, which is the filter applied for calculating feature map $i$ from channel $j$. Hence, the entire convolutional operation is defined by a 4-dim filter array, whose

• first dimension is the number of featuremaps $F$ at the output of the convolutional layer
• second dimension is the number of channels $C$ at the input of the convolutional layer
• third and fourth dimensions are given by the size of the convolutional filter $W_{ij}$.

Correspondingly, the $C$ input channels can be arranged in a 3-dimensional input array, whose

• first dimension is the number of channels $C$ at the input of the convolutional layer
• second dimension is the number of rows in each input channel
• third dimension is the number of columns in each input channel

Applying the 4-dimensional filter array on the 3-dimensional input array yields a 3-dimensional feature array, whose

• first dimension is the number of featuremaps $F$ in the output of the convolutional layer
• second dimension is the number of rows in each featuremap at the output of the convolutional layer
• third dimension is the number of columns in each featuremap at the output of the convolutional layer
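The shape bookkeeping described above can be checked with a compact sketch that computes all feature maps in a single `np.einsum` call. This is an alternative formulation for illustration, not the implementation used in this notebook. It assumes NumPy 1.20 or newer for `sliding_window_view`, and it uses its own randomly generated arrays `inputs` and `filters`. The result is compared with per-channel filtering by `ndi.correlate`.

```python
import numpy as np
import scipy.ndimage as ndi
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
C, F, r, a = 3, 4, 6, 3                      # channels, feature maps, input size, filter size
inputs = rng.uniform(-1, 1, (C, r, r))       # 3-dim input array
filters = rng.uniform(-1, 1, (F, C, a, a))   # 4-dim filter array

# All (a x a) regions of every channel: shape (C, r-a+1, r-a+1, a, a)
windows = sliding_window_view(inputs, (a, a), axis=(1, 2))

# Multiply each region with the filter W_fc and sum over channels and filter positions
features = np.einsum('cijkl,fckl->fij', windows, filters)

# Reference: per-channel filtering with zero-padding, summed over the channels
# and trimmed to the region that does not depend on padding
p = (a - 1) // 2
reference = np.array([
    sum(ndi.correlate(inputs[c], filters[f, c], mode='constant') for c in range(C))[p:-p, p:-p]
    for f in range(F)
])
print(features.shape, np.allclose(features, reference))
```

The resulting feature array has $F=4$ feature maps of size $(4 \times 4)$, in line with the output size $r-a+1$ for filtering without padding.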

The code cells below demonstrate the entire operation in a convolutional layer. For this demo,

1. an array Input of 3 input channels is randomly generated,
2. a 4-dimensional FilterArray is randomly generated,
3. the function convolution() is defined, which calculates the $F$ feature maps from the Input and the FilterArray.

Define $C=3$ random input channels of size $(6 \times 6)$:

np.random.seed(1234) #random seed is fixed just to obtain always the same random result
Input=1.0/10*np.random.randint(-10,10,(3,6,6))
print(Input)

[[[ 0.5  0.9 -0.4  0.2  0.5  0.7]
[-0.1  0.1  0.2  0.6 -0.5  0.6]
[-0.1  0.5  0.8  0.6  0.2 -0.5]
[-0.8 -0.4 -0.7 -0.3  0.1 -1. ]
[-0.1  0.1  0.6 -0.7 -0.8  0.9]
[ 0.2 -0.9  0.1  0.9  0.1  0.7]]

[[ 0.4  0.9 -0.3  0.   0.1  0.4]
[ 0.7  0.3 -1.   0.2 -0.5  0.7]
[-0.5  0.3  0.6 -0.1 -0.2  0.2]
[-0.4  0.2  0.9  0.5  0.7  0.8]
[ 0.4 -0.8 -0.5  0.3 -0.4 -0.3]
[-0.6 -0.7 -0.5  0.4  0.5  0.5]]

[[ 0.5 -0.8  0.  -0.6  0.8 -0.3]
[ 0.1  0.4  0.8 -0.1 -1.  -0.8]
[-0.9  0.8  0.7 -0.3 -0.6 -0.3]
[ 0.7 -1.  -0.1  0.8 -0.1 -0.9]
[ 0.4 -0.7  0.2 -0.1  0.3  0.9]
[-1.  -0.6 -0.6 -1.  -0.2  0.2]]]


Define a random filter array, which calculates $F=4$ feature maps from $C=3$ input channels. Each filter $W_{ij}$ is of size $(3 \times 3)$:

np.random.seed(1234) #random seed is fixed just to obtain always the same random result
FilterArray=1.0/10*np.random.randint(-10,10,(4,3,3,3))
print(FilterArray)

[[[[ 0.5  0.9 -0.4]
[ 0.2  0.5  0.7]
[-0.1  0.1  0.2]]

[[ 0.6 -0.5  0.6]
[-0.1  0.5  0.8]
[ 0.6  0.2 -0.5]]

[[-0.8 -0.4 -0.7]
[-0.3  0.1 -1. ]
[-0.1  0.1  0.6]]]

[[[-0.7 -0.8  0.9]
[ 0.2 -0.9  0.1]
[ 0.9  0.1  0.7]]

[[ 0.4  0.9 -0.3]
[ 0.   0.1  0.4]
[ 0.7  0.3 -1. ]]

[[ 0.2 -0.5  0.7]
[-0.5  0.3  0.6]
[-0.1 -0.2  0.2]]]

[[[-0.4  0.2  0.9]
[ 0.5  0.7  0.8]
[ 0.4 -0.8 -0.5]]

[[ 0.3 -0.4 -0.3]
[-0.6 -0.7 -0.5]
[ 0.4  0.5  0.5]]

[[ 0.5 -0.8  0. ]
[-0.6  0.8 -0.3]
[ 0.1  0.4  0.8]]]

[[[-0.1 -1.  -0.8]
[-0.9  0.8  0.7]
[-0.3 -0.6 -0.3]]

[[ 0.7 -1.  -0.1]
[ 0.8 -0.1 -0.9]
[ 0.4 -0.7  0.2]]

[[-0.1  0.3  0.9]
[-1.  -0.6 -0.6]
[-1.  -0.2  0.2]]]]


Above the generated FilterArray is displayed. The first 3 matrices are the filters $W_{11}, W_{12}, W_{13}$, which act on the $C=3$ channels in order to calculate the first feature map. The next 3 matrices are the filters $W_{21}, W_{22}, W_{23}$, which act on the $C=3$ channels in order to calculate the second feature map, and so on.

Next, we define the function convolution(), which calculates the feature maps from the given Input and FilterArray:

def convolution(Input,FilterArray,padding=False):
    """
    Input is a 3-dim numpy array, where
    * dimension 0 is the number of channels in the input of the convolutional layer
    * dimension 1 is the number of rows in each input channel
    * dimension 2 is the number of columns in each input channel

    FilterArray is a 4-dim numpy array, where
    * dimension 0 is the number of featuremaps at the output of the convolutional layer
    * dimension 1 is the number of channels at the input of the convolutional layer
    * dimension 2 = dimension 3 is the size of the square filter

    If padding is False (default), the border rows and columns, which depend on
    zero-padding, are removed from each feature map.

    This function returns a 3-dimensional array, where
    * dimension 0 is the number of featuremaps in the output of the convolutional layer
    * dimension 1 is the number of rows in each feature map at the output of the convolutional layer
    * dimension 2 is the number of columns in each feature map at the output of the convolutional layer

    """
    numFeats=FilterArray.shape[0] #number of feature maps at the output
    numChannels=FilterArray.shape[1] #number of channels at the input
    filterSize=FilterArray.shape[2] # filter size
    if filterSize != FilterArray.shape[3]:
        raise ValueError("Only square filter sizes are supported")
    p=int(np.floor((filterSize-1)/2)) #padding size
    FeaturesList=[]
    for f in range(numFeats):
        # sum the filter responses of all input channels
        for c in range(numChannels):
            if c==0:
                out=ndi.correlate(Input[c],FilterArray[f,c], output=np.float64, mode='constant')
            else:
                out = out + ndi.correlate(Input[c],FilterArray[f,c], output=np.float64, mode='constant')
        if padding or p==0:
            FeaturesList.append(out) # keep the zero-padded result
        else:
            FeaturesList.append(out[p:-p,p:-p]) # remove the border that depends on padding
    return np.array(FeaturesList)


Applying the function convolution(), the feature maps (4 in this example) can be calculated from the input and the FilterArray as follows:

FeatureMaps=convolution(Input,FilterArray)
print(FeatureMaps)

[[[-0.32  2.3  -0.12  1.59]
[-1.03  1.99  1.01  0.96]
[ 0.18 -1.27  1.66  3.41]
[-1.81 -1.27 -1.93 -0.07]]

[[-0.05  0.4   0.9  -0.57]
[ 0.13 -2.28 -3.56 -1.84]
[ 1.01  0.26  0.03 -1.67]
[ 0.08  0.32  1.26  2.01]]

[[ 0.63  1.    0.88 -0.86]
[ 2.39  3.27  1.25  1.5 ]
[-4.45 -2.01  1.03 -2.18]
[-0.41 -2.24 -2.45  0.01]]

[[-0.51 -1.68 -1.95 -0.6 ]
[ 1.    1.81 -1.24 -1.37]
[-1.97 -1.51 -0.95 -0.13]
[ 2.95  1.25 -1.76  0.38]]]


The first matrix in the displayed output is the first feature map, the last matrix is the fourth feature map. As sketched in the image convolution and pooling, the calculated feature maps are processed by an activation function. Below, we apply the ReLU activation to the 4 feature maps. As can be seen in the output, all negative values are mapped to 0, whereas the positive values remain unchanged.

FeatureMapsRelu=relu(FeatureMaps)
print(FeatureMapsRelu)

[[[ 0.    2.3   0.    1.59]
[ 0.    1.99  1.01  0.96]
[ 0.18  0.    1.66  3.41]
[ 0.    0.    0.    0.  ]]

[[ 0.    0.4   0.9   0.  ]
[ 0.13  0.    0.    0.  ]
[ 1.01  0.26  0.03  0.  ]
[ 0.08  0.32  1.26  2.01]]

[[ 0.63  1.    0.88  0.  ]
[ 2.39  3.27  1.25  1.5 ]
[ 0.    0.    1.03  0.  ]
[ 0.    0.    0.    0.01]]

[[ 0.    0.    0.    0.  ]
[ 1.    1.81  0.    0.  ]
[ 0.    0.    0.    0.  ]
[ 2.95  1.25  0.    0.38]]]


As depicted in the image convolution and pooling, the ReLU-processed feature maps are fed to a pooling layer. The function maxpooling() in the code-cell below performs max-pooling. The pooling-filter width $w$ and the pooling step-size $s$ can be configured in the arguments of this function.

def maxpooling(Input,w,s):
    """
    * Input: 3-dim array of square feature maps (feature map, row, column)
    * w: filter width
    * s: pooling step-size (s=w for non-overlapping pooling)
    """
    numFeats=Input.shape[0] #number of feature maps in the input
    r=Input.shape[1] #number of rows in each feature map
    if r != Input.shape[2]:
        raise ValueError("Only square feature maps are supported")
    o=int(np.floor((r-w)/s)+1) # size of pooling output
    Output=np.zeros((numFeats,o,o))
    for f in range(numFeats):
        for row in range(o):
            for col in range(o):
                # the pooling region spans w rows and w columns, also if s != w
                Output[f,row,col]=np.max(Input[f,row*s:row*s+w,col*s:col*s+w])
    return Output


Next, the defined maxpooling function is applied to the ReLU-processed feature maps. The result is displayed below. Before max-pooling, each feature map contained information on the presence of the feature at the corresponding spatial region. After pooling the spatial resolution is reduced, i.e. the location of the features is coarser.

pooledFeats=maxpooling(FeatureMapsRelu,w=2,s=2)
print(pooledFeats)

[[[ 2.3   1.59]
[ 0.18  3.41]]

[[ 0.4   0.9 ]
[ 1.01  2.01]]

[[ 3.27  1.5 ]
[ 0.    1.03]]

[[ 1.81  0.  ]
[ 2.95  0.38]]]


The picture below sketches the sequence of convolution, activation and pooling in an abstract manner: From the $L_i$ input channels of convolutional layer $conv_i$, $L_{i+1}$ feature maps are calculated. These feature maps are processed by an activation function and fed to a pooling layer. The pooled $L_{i+1}$ feature maps are the $L_{i+1}$ input channels for the next convolutional layer. If no further convolutional layer is applied, the pooled feature maps are serialized and fed to a fully connected layer.

## Fully Connected Layers

In CNNs, cascades of convolution and pooling layers learn to extract meaningful features. These learned features are then fed to a classifier or to a regressor, which outputs the estimated class or the estimated numeric value, respectively. The final classifier or regressor is usually implemented by a conventional single- or multilayer perceptron (SLP or MLP). The layers of the SLP or MLP are called fully connected layers, since each neuron in layer $k$ is connected to all neurons in layer $k-1$.

If $\mathbf{x}=(x_1,x_2,\ldots,x_n)$ is the input and $\mathbf{y}=(y_1,y_2,\ldots,y_m)$ is the output of a fully connected layer, then

$$\mathbf{y}=g(W \cdot \mathbf{x^T}),$$

where $g()$ is the activation function and

$$W=\left( \begin{array}{cccc} w_{1,1} & w_{1,2} & \cdots & w_{1,16} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,16} \\ \vdots & \vdots & \ddots & \vdots \\ w_{8,1} & w_{8,2} & \cdots & w_{8,16} \\ \end{array} \right)$$

is the weight matrix, shown here for $n=16$ inputs and $m=8$ outputs. Entry $w_{i,j}$ is the weight from the $j$-th element in $\mathbf{x}$ to the $i$-th element in $\mathbf{y}$.
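A minimal numeric sketch of this formula, with a hypothetical layer of $n=3$ inputs and $m=2$ outputs and made-up weights, may help to see the shapes involved. ReLU is used as activation function $g()$ here only for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical layer with n=3 inputs and m=2 outputs (made-up weights)
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
x = np.array([1.0, 2.0, 3.0])

z = np.dot(W, x)   # z_i = sum_j w_ij * x_j
y = relu(z)        # activation applied element-wise
print(z, y)
```

Row $i$ of $W$ holds all weights that lead to output element $y_i$, so the weight matrix has $m$ rows and $n$ columns.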

As shown in the picture below, the output of the last pooling layer is serialized before it is fed into a fully connected layer. In this example only one fully connected layer is applied, i.e. the classifier is just an SLP. Since there are 8 neurons in the output of the fully connected layer, this example architecture can be applied for a classification into 8 classes. In this case the output is usually processed by a softmax activation function, which is not depicted in the image below.

In the code-cells below the operation of the fully connected layer is demonstrated according to the example in the picture above.

First the output of the maxpooling layer is serialized as follows:

print(pooledFeats)
featurevector=pooledFeats.flatten()
print("Serialized Output of maxpooling:\n",featurevector)

[[[ 2.3   1.59]
[ 0.18  3.41]]

[[ 0.4   0.9 ]
[ 1.01  2.01]]

[[ 3.27  1.5 ]
[ 0.    1.03]]

[[ 1.81  0.  ]
[ 2.95  0.38]]]
Serialized Output of maxpooling:
[ 2.3   1.59  0.18  3.41  0.4   0.9   1.01  2.01  3.27  1.5   0.    1.03
1.81  0.    2.95  0.38]


Then a $(8 \times 16)$-weight matrix for the fully connected layer is generated randomly:

np.random.seed(1234) #random seed is fixed just to obtain always the same random result
FCweights=1.0/10*np.random.randint(-10,10,(8,16))
print("weight matrix:\n",FCweights)

weight matrix:
[[ 0.5  0.9 -0.4  0.2  0.5  0.7 -0.1  0.1  0.2  0.6 -0.5  0.6 -0.1  0.5
0.8  0.6]
[ 0.2 -0.5 -0.8 -0.4 -0.7 -0.3  0.1 -1.  -0.1  0.1  0.6 -0.7 -0.8  0.9
0.2 -0.9]
[ 0.1  0.9  0.1  0.7  0.4  0.9 -0.3  0.   0.1  0.4  0.7  0.3 -1.   0.2
-0.5  0.7]
[-0.5  0.3  0.6 -0.1 -0.2  0.2 -0.4  0.2  0.9  0.5  0.7  0.8  0.4 -0.8
-0.5  0.3]
[-0.4 -0.3 -0.6 -0.7 -0.5  0.4  0.5  0.5  0.5 -0.8  0.  -0.6  0.8 -0.3
0.1  0.4]
[ 0.8 -0.1 -1.  -0.8 -0.9  0.8  0.7 -0.3 -0.6 -0.3  0.7 -1.  -0.1  0.8
-0.1 -0.9]
[ 0.4 -0.7  0.2 -0.1  0.3  0.9 -1.  -0.6 -0.6 -1.  -0.2  0.2  0.7 -0.1
0.5 -0.2]
[-0.8  0.6  0.1 -0.8  0.8  0.5 -0.7  0.4 -0.8 -0.6 -0.9  0.  -0.8  0.3
-0.7  0.8]]


The output of the network is calculated by multiplying the weight matrix FCweights with the serialized featurevector. The result of this matrix multiplication is then processed by a softmax activation function:

z=np.dot(FCweights,featurevector)
y=softmax(z)
print(y)

[  9.93e-01   2.75e-07   3.16e-03   3.57e-03   9.93e-05   1.09e-06
1.54e-05   1.30e-08]


The first component of the output $\mathbf{y}$ has the maximum value. Hence the network would decide on class 1.
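The softmax function defined earlier in this notebook exponentiates its inputs directly, which is fine for the small values in this example. A common variant, sketched below as an assumption and not used in this notebook, subtracts the maximum input before exponentiating. The result is mathematically identical, but overflow for large inputs is avoided. The names `softmax_naive` and `softmax_stable` are introduced for this illustration.

```python
import numpy as np

def softmax_naive(z):
    """Softmax as defined earlier in this notebook."""
    return np.exp(z) / np.sum(np.exp(z), axis=0)

def softmax_stable(z):
    """Softmax with the maximum subtracted first; mathematically the same result."""
    shifted = z - np.max(z, axis=0)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0)

z_small = np.array([2.0, -1.0, 0.5])
z_large = np.array([1000.0, 999.0, 998.0])  # np.exp(1000.0) overflows in float64

y_small = softmax_stable(z_small)
y_large = softmax_stable(z_large)
print(y_small, y_small.sum(), np.argmax(y_small))
print(y_large)
```

In both cases the outputs are positive and sum to 1, so they can be interpreted as class probabilities, and the index of the maximum is the decided class.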

## Final Remarks

In this notebook the basic concepts of CNNs, i.e. convolution-, pooling- and fully-connected layers, have been introduced. Other techniques and layer-types, such as

• normalization
• dropout
• dilation
• deconvolution

will be introduced in a follow-up notebook.

For the sake of simplicity, biases have not been applied in the convolutional and fully connected layers of this notebook. In practical networks biases are applied for these two layer types in the same way as in the context of SLPs and MLPs.