(1) I challenged myself to hand-write the code that verifies the theory, and gained some insights and reflections that AI tools cannot provide; for the questions I could not answer myself, please advise in the comments;

(2) This series involves many details worth clarifying, but considering readers' absorption and the length limit, only the key points are shared here; if anything is unclear or explained incorrectly, please point it out in the comments;

(3) The write-up was originally composed in English and is not translated into Chinese here; I hope readers will understand;

(4) This series is based on Mu Li's textbook Dive into Deep Learning (《动手学深度学习》), available at:

《动手学深度学习》 — 动手学深度学习 2.0.0 documentation

(5) Since the amount of code is large, it is uploaded to my personal space as a free resource, so that readers can run and use it conveniently.

Note: AlexNet is implemented in both PyTorch and MXNet; LeNet is implemented only on the MXNet framework.

The model parameters trained by the original authors can also be downloaded directly through the deep-learning frameworks, but this experiment aims to explore the theoretical foundations and implementation ideas of CNNs, so different versions of "LeNet" and "AlexNet" are trained from scratch.

Different implementation schemes and analysis approaches are also proposed. For models that are expensive to train, the free compute platform provided by Google Colaboratory is recommended; it is essentially an Ubuntu-based server preconfigured with deep-learning frameworks such as PyTorch and TensorFlow.

This article mainly analyzes:

【1】The working principles of the convolutional, pooling, batch-normalization, activation, and dropout layers in a CNN;

The next article will mainly analyze:

【2】The composition of the time cost when training on a single CPU core, its experimental verification, and the speed-up obtained from library function interfaces;

【3】Tuning methods for hyperparameters such as the learning rate, optimization method, batch size, and activation function;

【4】The performance of the convolutional neural network (LeNet, 1998) and the deep convolutional neural network (AlexNet, 2012) on the MNIST, Fashion_MNIST, and CIFAR100 datasets, and a possibly feasible method for adaptively adjusting the parameter size;

【5】Visualization of CNN activation-layer features, intuitive comparison against the filtering effect of hand-designed kernels, and understanding of the CNN information-extraction process;

【6】The role of the confusion matrix, and plotting a custom confusion matrix.

从零开始搭建深度学习大厦系列-3.卷积神经网络基础(5-9)-CSDN博客: https://blog.csdn.net/2302_80464577/article/details/149260898

A Quick Look

LeNet (based on MXNet; textbook: 2019 + GPU, max-pooling; mine: max-pooling + 2 prefetching processes / 5 prefetching processes)

2 prefetching processes

5 prefetching processes

AlexNet (based on PyTorch; textbook: original; mine: parameter size nearly 1/256 of the original design)

2 prefetching processes (batch size = 64)

5 prefetching processes (batch size = 64/32, initial learning rate = 0.01/0.03)

Figure 1 Results of the textbook's implementation vs. mine

Content

Environment Setting
Experiment Goals
1. Edge Detection
1.1 Basic Principle
1.2 Function Design
1.3 Carrying-out Result
2. Shape of layers and kernels in a CNN
2.1 Basic Theories
2.2 Code implementation (numpy, mxnet.gluon.nn, mxnet.nd)
2.3 Result
3. 1x1 Convolution
3.1 Basic Theory
3.2 Code implementation (3 lines)
3.3 Result
4-5 CNN Architecture Implementation and Evaluation
About data loaders
About num_workers and prefetching processes
4. LeNet Implementation (MXNet based)
4.1 Basic Theories
4.2 Code Implementation
4.3 Model Evaluation on Fashion-MNIST dataset
4.3.1 Pooling: Maximum-pooling vs Average-pooling
4.3.2 Optimization: sgd vs sgd+momentum (nag)
4.3.3 Activation Function: ReLU vs sigmoid
4.3.4 Normalization Layer: Batch Normalization vs None
4.3.5 Batch size: 64 vs 128
4.3.6 Textbook Result (Batch Normalization) & Running Snapshot
4.4 LeNet Evaluation on MNIST dataset
4.5 Evaluating LeNet on CIFAR100
4.5.1 Coarse Classification (20 classes)
4.5.2 Fine Classification (100 classes)
4.5.3 Running Snapshot
5. AlexNet Architecture
5.1 Code Implementation
5.2 Fashion_MNIST Dataset (MXNet vs PyTorch)
5.3 MNIST Dataset (PyTorch only)
5.4 CIFAR100 (100 classes, fine labels), PyTorch only
5.4.1 Learning rate setting
6. CNN activation layer characteristics visualization
6.1 MNIST Dataset
6.2 Fashion_MNIST Dataset
7. Confusion Matrix
7.1 MNIST
7.2 Fashion_MNIST
References

Environment Setting

All four experiments are carried out in a virtual environment based on the Python 3.7.0 interpreter. The main packages used are the deep-learning package mxnet 1.7.0.post2 (CPU version), the visualization package matplotlib.pyplot, the image-processing package opencv-python, and the array-manipulation package numpy.

Experiment Goals

  1. Design appropriate kernels with fixed parameters and detect edges of horizontal, vertical, and diagonal orientations separately;
  2. Derive the shape-transformation formula for the forward-propagation process of a CNN (convolutional neural network) and verify the result both by fundamental hand-coding and by calling library scripts;
  3. Understand the effect and principle of 1x1 kernels, then explore different implementations of 1x1 convolution on the 2-dimensional plane, such as cross-correlation calculation and matrix multiplication;
  4. Construct LeNet [2] by hand using mxnet.gluon.nn and explore how different hyperparameter settings impact the training result and model performance;
  5. Construct AlexNet [3] by hand using torch.nn and explore how different hyperparameter settings impact the training result and model performance.

1. Edge Detection

1.1 Basic Principle

According to the corresponding theories in DIP (digital image processing), first-order difference operators, or kernels, such as the Prewitt and Sobel kernels, designed in horizontal, vertical, and two diagonal versions, can be used to detect edges in gray-scale images.

These kernels filter out the transitions between different objects, or between parts of one object, because the intensity levels of the pixels distributed along both sides of an edge change rapidly.

In addition, a comprehensive-orientation algorithm combines the information from all directions; see the 'combimg' implementation for details.

1.2 Function Design

This section employs two tool functions to accomplish the goal: get_data(input_dir) for image loading (similar to building a dataset), and edge_detect(input_dir) for cross-correlation calculation under different settings of kernel shape and layer shape. A sketch of the idea follows.
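The actual bodies of both functions appear in Figure 2 and the uploaded code; below is a minimal sketch of the same pipeline, assuming gray-scale loading with opencv-python. The four Prewitt kernels and the pixel-wise-maximum rule for 'combimg' are my illustrative choices and may differ in detail from the original implementation.

```python
import os
import cv2
import numpy as np

# Four Prewitt kernels: horizontal, vertical, and the two diagonal orientations.
PREWITT = {
    "horizontal": np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=np.float32),
    "vertical":   np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=np.float32),
    "diag_main":  np.array([[0, 1, 1], [-1, 0, 1], [-1, -1, 0]], dtype=np.float32),
    "diag_anti":  np.array([[1, 1, 0], [1, 0, -1], [0, -1, -1]], dtype=np.float32),
}

def get_data(input_dir):
    """Load every image under input_dir as a gray-scale float32 array."""
    imgs = {}
    for name in os.listdir(input_dir):
        img = cv2.imread(os.path.join(input_dir, name), cv2.IMREAD_GRAYSCALE)
        if img is not None:
            imgs[name] = img.astype(np.float32)
    return imgs

def edge_detect(input_dir):
    """Cross-correlate each image with all four kernels, then combine them."""
    results = {}
    for name, img in get_data(input_dir).items():
        # cv2.filter2D computes cross-correlation, not flipped convolution.
        responses = {k: cv2.filter2D(img, -1, kern) for k, kern in PREWITT.items()}
        # 'combimg': merge all orientations via the pixel-wise maximum magnitude.
        combimg = np.max([np.abs(r) for r in responses.values()], axis=0)
        results[name] = (responses, combimg)
    return results
```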

Figure 2 Code implementation

1.3 Carrying-out Result

Six scenery photos with obvious edge information, posted by professional photographers on the web, are chosen as a mini-dataset.

Figure 3 Mini-dataset in Mission 1

Only the 'combimgs' are saved. 'canyon.jpg' illustrates the directional attribute of the Prewitt kernel vividly.

Figure 4 canyon.jpg

Other examples are as follows; the orientation of the textures verifies the DIP theory to some extent.

Figure 5 Galaxy; notice the small dot in the picture with interesting behavior (the combined dot has a black circle within it, while the others contain only rectangular-line-like shapes)

Figure 6 Bungalow lying in the embrace of lake and mountains

Figure 7 grassland and night sky in an estate

Figure 8 Clouds

2. Shape of layers and kernels in a CNN

2.1 Basic Theories

Figure 9 Kernel and Layer in a CNN

Unlike the hidden neurons (intermediate outputs) and the lines (weights) fully connecting them in a multilayer perceptron (MLP), a CNN is mainly characterized by kernels (similar to the weights of an MLP) and feature maps (similar to the nodes of an MLP); together with activation functions, normalization layers, and some other designs, these construct the architecture. Kernels can also be understood as components of certain CNN layers.

Kernels exist mainly to reduce the otherwise overwhelming parameter size and to reuse parameters scientifically, following the spatial locality and adjacency principles of images. Input images are transformed into different feature maps after going through the convolution, or cross-correlation, operations of kernels. Notice that kernels can have either adjustable parameters (convolutional kernels) or non-adjustable ones (pooling kernels).

These feature maps can carry implicit information of any kind, such as the edges of objects. Part 1 of this experiment demonstrates the effect of human-designed edge-detection kernels. For layers near the top of deeper neural networks, the feature maps may encode rather global information (sometimes nothing can be learned, possibly because the input images are small and the network is deep, which is one motivation behind ResNet); AlexNet and LeNet serve as examples.

Figure 10 Characteristics Visualization & Understanding [1]

Figure 11 Primitive CNN architectures proposed (1998, 2012)

A common design problem is to estimate the parameter size (storage amount) and training time (measured in CPU/GPU hours) of a 2D-CNN architecture. The shape of a feature map is fixed in the 'NCHW' format (or 'NHWC'), while the shape of a kernel is denoted 'CoCiKhKw' (or 'KhKwCiCo'). See Figure 9 for a graphic explanation.

Figure 12 Cross-correlation calculation at a 2D-convolutional layer

According to the academic design, and that of the textbook, NCHW and CoCiKhKw should satisfy C == Ci. When Co == 1, Ci different kernels perform convolution (equivalent to cross-correlation operations in implementation) separately, in correspondence with the input feature maps; each kernel aims at one feature map of size Nx1xHxW.

The result is obtained by pixel-wise summation over the Ci different Nx1xH2xW2 maps, giving a composite Nx1xH2xW2 feature map with richer information. Repeating this process Co times yields the final output of size NxCoxH2xW2. Kernel size, padding, and stride are the three basic settings of a convolution operation, and they determine the mappings H->H2 and W->W2.
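Written out explicitly, and assuming ph and pw count the padding added on each side (the convention of nn.Conv2D, and the one consistent with the numbers in Section 2.3):

$$H_2 = \left\lfloor \frac{H + 2p_h - K_h}{s_h} \right\rfloor + 1, \qquad W_2 = \left\lfloor \frac{W + 2p_w - K_w}{s_w} \right\rfloor + 1$$

As a sanity check against Section 2.3: H2 = ⌊(360 + 2 − 3)/1⌋ + 1 = 360 and W2 = ⌊(480 + 2 − 3)/1⌋ + 1 = 480, so a 3x3 kernel with unit padding and unit stride preserves the spatial size.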

A pooling layer has kernels with unlearnable parameters; it is generally divided into max-pooling and average-pooling, as sketched below.
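For concreteness, a minimal numpy sketch of 2x2 max-pooling with stride 2 on a single feature map (the helper name max_pool2d is mine; swapping .max() for .mean() gives average-pooling):

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Naive max-pooling over a 2-D feature map; no learnable parameters."""
    h2 = (x.shape[0] - k) // s + 1
    w2 = (x.shape[1] - k) // s + 1
    out = np.empty((h2, w2), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            out[i, j] = x[i * s:i * s + k, j * s:j * s + k].max()
    return out

x = np.arange(16, dtype=np.float32).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
```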

2.2 Code implementation (numpy, mxnet.gluon.nn, mxnet.nd)

Two ways are used to verify the formula: direct hand-coding and package calling.

The input images are random values generated by numpy that simulate noise; they serve only to verify the shapes of the feature maps at the current layer. The kernels vary across the Ci input channels and are identical across the Co output channels. Five nested loops accomplish the computation, as in the sketch below.
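A hedged reconstruction of the hand-coded check (the body below is my sketch of the convsize_verify() function mentioned in Section 3.2): five nested loops over batch, output channel, input channel, and the two output coordinates, with the window product vectorized.

```python
import numpy as np

def convsize_verify(x, kernels, ph=1, pw=1, sh=1, sw=1):
    """Cross-correlation with explicit loops; x is NCHW, kernels is CoCiKhKw.
    Slow on real image sizes -- intended only as a one-off shape check."""
    n, ci, h, w = x.shape
    co, ci2, kh, kw = kernels.shape
    assert ci == ci2, "kernel Ci must match input channels"
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))  # pad H and W per side
    h2 = (h + 2 * ph - kh) // sh + 1
    w2 = (w + 2 * pw - kw) // sw + 1
    out = np.zeros((n, co, h2, w2), dtype=x.dtype)
    for b in range(n):                        # 1: batch
        for o in range(co):                   # 2: output channels
            for c in range(ci):               # 3: input channels (summed)
                for i in range(h2):           # 4: output rows
                    for j in range(w2):       # 5: output cols
                        window = xp[b, c, i * sh:i * sh + kh, j * sw:j * sw + kw]
                        out[b, o, i, j] += np.sum(window * kernels[o, c])
    return out

x = np.random.uniform(size=(2, 3, 360, 480)).astype(np.float32)
k = np.random.uniform(size=(4, 3, 3, 3)).astype(np.float32)
print(convsize_verify(x, k).shape)  # (2, 4, 360, 480)
```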

2.3 Result

These hand-coded kernels can actually be interpreted as smoothing filters with small variance because of the k_base setting in the code block; the parameters of nn.Conv2D, by contrast, are initialized randomly and carry no meaning at the start of training. By the way, the network layer is not initialized in this section because it is not necessary to do so here.

nn.Conv2D can detect in_channels automatically, in which case the layer goes through deferred initialization. Re-initializing, or assigning in_channels by hand, avoids the deferred initialization.
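A small gluon sketch of that deferred-initialization behavior, using the same shapes as the simulation below:

```python
from mxnet import nd
from mxnet.gluon import nn

conv = nn.Conv2D(channels=4, kernel_size=3, padding=1)  # in_channels omitted
conv.initialize()                  # deferred: weights not allocated yet
x = nd.random.uniform(shape=(2, 3, 360, 480))
y = conv(x)                        # first forward pass infers in_channels=3
print(conv.weight.shape, y.shape)  # (4, 3, 3, 3) (2, 4, 360, 480)

# Supplying in_channels up front avoids the deferred step entirely.
conv2 = nn.Conv2D(channels=4, kernel_size=3, padding=1, in_channels=3)
conv2.initialize()                 # parameters allocated immediately
```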

In the simulation, N=2, Ci=3, Co=4, H=360, W=480, Kh=Kw=3, ph=pw=1 (padding per side), sh=sw=1. The result shows that the shape formula is correct.

Figure 13 H2 and W2 should be floored to integers [1]

3. 1x1 Convolution

3.1 Basic Theory

1x1 convolution is used specifically to compress the channel count C of feature maps and thus contract the number of parameters needed. In this case, ph=pw=0, sh=sw=1, and Co<Ci.

3.2 Code implementation (3 lines)

The NHWC format is used in the matrix-multiplication implementation of 1x1 convolution, sketched below.
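A minimal sketch of the three-line idea: flatten the spatial grid, multiply once by a Ci x Co weight matrix, and reshape back (the function name conv1x1_nhwc is mine):

```python
import numpy as np

def conv1x1_nhwc(x, w):
    """x: (N, H, W, Ci); w: (Ci, Co). 1x1 convolution as a single matmul."""
    n, h, wd, ci = x.shape
    out = x.reshape(-1, ci) @ w        # (N*H*W, Ci) @ (Ci, Co)
    return out.reshape(n, h, wd, -1)   # back to (N, H, W, Co)

x = np.random.uniform(size=(2, 360, 480, 3)).astype(np.float32)
w = np.random.uniform(size=(3, 2)).astype(np.float32)  # Co=2 < Ci=3: compression
print(conv1x1_nhwc(x, w).shape)  # (2, 360, 480, 2)
```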

A much slower implementation of 1x1 convolution simply adjusts the parameters and calls the hand-coded convsize_verify().

3.3 Result

The result indicates that mxnet.gluon.nn implements convolution in the form of matrix multiplication. A similar method generalizes to convolutions of arbitrary kernel size:

Given a layer of N feature maps, first divide the input feature maps into M (= H2xW2) flattened pixel vectors (length = KhxKw);

then take the dot product with the flattened kernels along the two dimensions;

finally, reshape the output feature maps to obtain the result.

A more detailed sketch is as follows.
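The scraped page ends here, so the following is a hedged reconstruction of the three steps above as an im2col-style sketch (per sample, HWC layout, no padding, unit stride for brevity). Note that the common formulation folds the Ci channels into each patch vector, so its length is KhxKwxCi rather than the per-channel KhxKw described above:

```python
import numpy as np

def im2col_conv(x, kernels):
    """x: (H, W, Ci); kernels: (Kh, Kw, Ci, Co). Convolution as one matmul."""
    h, w, ci = x.shape
    kh, kw, _, co = kernels.shape
    h2, w2 = h - kh + 1, w - kw + 1
    # Step 1: gather H2*W2 flattened patches, each of length Kh*Kw*Ci.
    cols = np.empty((h2 * w2, kh * kw * ci), dtype=x.dtype)
    for i in range(h2):
        for j in range(w2):
            cols[i * w2 + j] = x[i:i + kh, j:j + kw, :].ravel()
    # Step 2: one dot product with the flattened kernels.
    out = cols @ kernels.reshape(kh * kw * ci, co)
    # Step 3: reshape the flat result back into a feature map.
    return out.reshape(h2, w2, co)

x = np.random.uniform(size=(8, 8, 3)).astype(np.float32)
k = np.random.uniform(size=(3, 3, 3, 4)).astype(np.float32)
print(im2col_conv(x, k).shape)  # (6, 6, 4)
```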
