Paper Reading 1: ResNet

ResNet

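Paper: Deep Residual Learning for Image Recognition

To address the difficulty of training deep neural networks, ResNet proposes a special network structure, the residual block, which effectively addresses the degradation problem of deep networks and makes deep networks easier to learn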

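Paper structure

Abstract

Follows a problem -> solution -> results logic: it first states the problem ("deep neural networks are difficult to train"), then introduces the solution (the residual net), and finally lists strong results on several datasets (ImageNet, COCO, CIFAR) to show that the solution works

Introduction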

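Same logic as the abstract, and very tightly argued (ps: it reads so smoothly)
1. Why use deep networks? Because depth helps capture features and improves task performance
2. Increasing depth brings two main problems: vanishing/exploding gradients, and the degradation problem of deep networks
3. Vanishing/exploding gradients are largely handled by normalized initialization and intermediate normalization layers
4. How can degradation be solved? This leads to the paper's residual mechanism
5. Finally, another round of experimental results on different datasets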

Related work

Did not read this part closely.

Method (Deep Residual Learning)

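Residual mechanism -> network architecture design -> network implementation
1. Suggests that deep networks degrade because stacked nonlinear layers have difficulty learning an identity mapping (an intuitive argument only; no formal proof is given), which naturally motivates the residual design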
  
  might have difficulties in approximating identity mappings by multiple nonlinear layers.
2. To make it easy to check whether the residual mechanism really helps, two architectures are designed: Plain Networks and Residual Networks

Experiments

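Reports results on different tasks across three groups of datasets: ImageNet, then CIFAR-10, then PASCAL and MS COCO
1. First compares the 34-layer and 18-layer plain networks on ImageNet, showing that the deeper network does degrade, and argues that vanishing gradients are unlikely to be the cause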
  
  We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error
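2. Then compares the 34-layer and 18-layer Residual Networks, showing that residual learning does address the degradation of deep networks and also speeds up convergence early in training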
  
  This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth
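3. Next compares identity shortcuts with projection shortcuts to see whether the choice affects how well the residual mechanism works. Conclusion: identity shortcuts alone are enough to address degradation; projection shortcuts perform slightly better, but at the cost of extra computation
4. Then proposes Deeper Bottleneck Architectures to explore the residual mechanism in even deeper networks on ImageNet
5. Verifies on the small CIFAR-10 dataset that when the task is simple enough not to need a very deep network, the residual functions in the deep network shrink toward zero, so the blocks approximate identity mappings (a deep network with residual blocks then behaves like a shallower network plus several identity layers)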
  
  the residual functions might be generally closer to zero than the non-residual functions.
6. Finally shows off strong results on object detection

Conclusion

There is no conclusion section, possibly because of the CVPR page limit (?)

Residual mechanism

The paper gives two formulas for the residual block. The first uses an "Identity Mapping" shortcut:

$$y = F(x, \{W_i\}) + x \tag{1}$$

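Here $F(x,\{W_i\})$ is the mapping from the input $x$ to the residual inside the block (in Fig. 2 of the paper, two weight layers with a ReLU activation in between). The identity shortcut requires the input and output to have the same dimensions, so that they can be added directly

The second uses a "linear projection", which projects the input to the same dimensions as the output so that the two can be added: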

$$y = F(x, \{W_i\}) + W_s x \tag{2}$$

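The paper's experiments show that the "Identity Mapping" alone is enough to address the degradation problem of deep networks
The "linear projection" performs slightly better than the "Identity Mapping", but the projection adds computation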
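
As a rough numeric illustration of Eq. (1) and Eq. (2), the sketch below applies both shortcut types to plain Python lists. The helper names (`matvec`, `residual_branch`, `identity_block`, `projection_block`) and the toy weights are assumptions made for illustration, not taken from the paper, and the residual branch is simplified to a single linear layer with ReLU instead of the two-layer branch of Fig. 2.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def residual_branch(x, W):
    """Toy F(x, {W_i}): one linear layer followed by ReLU (illustrative only)."""
    return relu(matvec(W, x))


def identity_block(x, W):
    """Eq. (1): y = F(x) + x, which needs F(x) and x to have the same length."""
    f = residual_branch(x, W)
    assert len(f) == len(x), "identity shortcut needs matching dimensions"
    return [a + b for a, b in zip(f, x)]


def projection_block(x, W, W_s):
    """Eq. (2): y = F(x) + W_s x, where W_s maps x to the output dimension."""
    f = residual_branch(x, W)
    s = matvec(W_s, x)
    return [a + b for a, b in zip(f, s)]


x = [1.0, -2.0]

# With all-zero weights F(x) = 0, so the block reduces to the identity.
zero_W = [[0.0, 0.0], [0.0, 0.0]]
y_identity = identity_block(x, zero_W)

# F maps 2 -> 3 dimensions here, so the shortcut must project x as well.
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
y_proj = projection_block(x, W_up, W_s)

print(y_identity, y_proj)
```

The first case shows why an identity is easy to represent with a residual block: the branch only has to output zero. The second case shows the only situation where $W_s$ is strictly needed, namely a change of dimensions.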

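Network architecture design

Two basic design rules

- When the feature map size is halved (e.g. $56*56->28*28$), the number of channels is doubled ($64->128$)
- When the feature map size stays the same, the number of channels stays the same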
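
The paper motivates the first rule as a way to preserve the time complexity per layer. A small sketch that checks this by counting multiply-accumulate operations of a $3*3$ convolution; `conv_macs` is a hypothetical helper, and the sizes are those of the $conv2\_x$ and $conv3\_x$ stages:

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulate count of one conv layer (stride 1, same padding, no bias)."""
    return height * width * in_channels * out_channels * kernel * kernel


before = conv_macs(56, 56, 64, 64)     # 56*56 feature map, 64 channels
after = conv_macs(28, 28, 128, 128)    # halved feature map, doubled channels

print(before, after, before == after)
```

Halving both spatial sides divides the number of positions by 4, while doubling both input and output channels multiplies the per-position cost by 4, so the two counts come out equal.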

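The network composition is shown in the figure

Each pair of brackets is one residual block; each convolutional stage, such as $conv2\_x$ or $conv3\_x$, contains several residual blocks
Moving from one stage to the next halves the feature map and doubles the number of channels
For example, the 18-layer network is structured as follows
1. First a $7*7$ convolution with 64 output channels, followed by a $3*3$ max pooling layer
2. Then the $conv2\_x$ stage, which has two residual blocks, each containing two $3*3*64$ convolution layers
3. Then the $conv3\_x$ stage, which also has two residual blocks. Because the input and output of its first block have different dimensions, that block cannot use a plain identity shortcut; everything else is unchanged. For this dimension-changing shortcut the authors compare two options
  - Zero-padding to increase the dimensions, which avoids the projection computation (A)
  - A $1*1$ convolution, similar to a fully connected layer (B)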
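
The layout described above can be checked with a little arithmetic. The sketch below counts weight layers for the standard ResNet depths and traces the feature map size and channel count per stage for a 224*224 input. The block counts are the ones listed in the paper's architecture table; the function names are hypothetical, and the count follows the usual convention of ignoring the $1*1$ shortcut convolutions.

```python
# Blocks per stage (conv2_x .. conv5_x) and weight layers per block.
CONFIGS = {
    18: ([2, 2, 2, 2], 2),     # basic blocks: two 3*3 convs each
    34: ([3, 4, 6, 3], 2),
    50: ([3, 4, 6, 3], 3),     # bottleneck blocks: three convs each
    101: ([3, 4, 23, 3], 3),
    152: ([3, 8, 36, 3], 3),
}


def count_weight_layers(blocks_per_stage, layers_per_block):
    """conv1 + all convs inside the residual blocks + the final fc layer."""
    return 1 + sum(blocks_per_stage) * layers_per_block + 1


def stage_shapes(input_size=224, base_channels=64, num_stages=4):
    """(name, feature map size, channels) per stage of the basic-block networks."""
    size = input_size // 2 // 2        # conv1 (stride 2), then max pooling (stride 2)
    channels = base_channels
    shapes = []
    for i in range(num_stages):
        if i > 0:                      # each later stage halves the map, doubles channels
            size //= 2
            channels *= 2
        shapes.append((f"conv{i + 2}_x", size, channels))
    return shapes


depths = {name: count_weight_layers(*cfg) for name, cfg in CONFIGS.items()}
shapes = stage_shapes()

print(depths)
print(shapes)
```

Each computed depth matches the network's name, and the stages go from 56*56 with 64 channels down to 7*7 with 512 channels.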

Deeper Bottleneck Architectures

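When designing deeper networks for ImageNet, a Bottleneck block is introduced to reduce computational cost

Unlike the basic residual block, it contains three convolution layers, $1*1 + 3*3 + 1*1$, where the $1*1$ layers first reduce and then restore the dimensions
Comparing the 34-layer and 50-layer networks in the figure above: although 16 layers are added, the 50-layer network built from Bottleneck blocks needs only slightly more computation (3.6 vs 3.8 billion FLOPs according to the paper)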
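
To see why the bottleneck keeps the cost low, the sketch below counts convolution weights for a basic block on 64 channels, a bottleneck block on 256 channels (reduced to 64 inside), and, for contrast, a basic block applied directly to 256 channels. `conv_weights` is a hypothetical helper; at a fixed feature map size the computation is proportional to these counts.

```python
def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in one conv layer (no bias)."""
    return in_channels * out_channels * kernel * kernel


# Basic block on a 64-channel input: two 3*3 convolutions.
basic_64 = 2 * conv_weights(64, 64, 3)

# Bottleneck block on a 256-channel input: 1*1 reduce, 3*3, 1*1 restore.
bottleneck_256 = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

# For contrast: a basic block applied directly to 256 channels.
basic_256 = 2 * conv_weights(256, 256, 3)

print(basic_64, bottleneck_256, basic_256)
```

The bottleneck on 256 channels costs about as much as the basic block on 64 channels, while a basic block on 256 channels would be roughly 17 times more expensive.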

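Training hyperparameters

- SGD with a mini-batch size of 256, trained for up to $60*10^4$ iterations
- Initial learning rate 0.1, divided by 10 whenever the error plateaus
- weight decay: 0.0001, momentum: 0.9, no dropout
- A BN layer is added after each convolution and before the activation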
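
The "divide by 10 when the error plateaus" rule can be sketched in a few lines of plain Python. The paper does not say how a plateau is detected, so `patience` and `min_delta` below are illustrative assumptions, the class name is hypothetical, and the error values are made up. In PyTorch, `torch.optim.lr_scheduler.ReduceLROnPlateau` offers a ready-made scheduler built on the same idea.

```python
class PlateauStepLR:
    """Divide the learning rate by `factor` when the error stops improving.

    `patience` and `min_delta` are illustrative assumptions, not paper values.
    """

    def __init__(self, lr=0.1, factor=10.0, patience=2, min_delta=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, error):
        """Record one validation error and return the learning rate to use next."""
        if error < self.best - self.min_delta:
            self.best = error
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr /= self.factor
                self.bad_checks = 0
        return self.lr


scheduler = PlateauStepLR()
errors = [0.60, 0.45, 0.40, 0.40, 0.40, 0.35]   # made-up error curve
history = [scheduler.step(e) for e in errors]

print(history)
```

With this toy curve the rate stays at 0.1 while the error keeps falling and drops to 0.01 after two checks without improvement.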

PyTorch implementation

Basic residual block:

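from typing import Callable, Optional

import torch.nn as nn
from torch import Tensor


def conv3x3(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
    """3x3 convolution with padding 1 and no bias (helper used by the block below)."""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=1, bias=False)


class BasicBlock(nn.Module):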
    expansion: int = 1

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    ) -> None:
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        # First conv layer: takes the input channels, output channels, and stride
        self.conv1 = conv3x3(inplanes, planes, stride)
        # Batchnorm layer right after the conv layer
        self.bn1 = norm_layer(planes)
        # ReLU activation
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        # If the input and output shapes differ, use the projection shortcut, i.e. Eq. (2)
        if self.downsample is not None:
            identity = self.downsample(x)
        # Residual addition followed by ReLU
        out += identity
        out = self.relu(out)

        return out

Initializing the projection shortcut (simply a $1*1$ convolution followed by batchnorm):

if stride != 1 or self.inplanes != planes * block.expansion:
    downsample = nn.Sequential(
        conv1x1(self.inplanes, planes * block.expansion, stride),
        norm_layer(planes * block.expansion),
    )
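
The condition above decides which blocks get a projection shortcut. The sketch below replays it in plain Python for every block of a network, assuming a torchvision-style layout: the first stage uses stride 1, later stages use stride 2 in their first block only, and `inplanes` is updated to `planes * expansion` after each block. The function name `downsample_plan` is hypothetical.

```python
def downsample_plan(expansion, blocks_per_stage, base_planes=64):
    """Return (stage, block index, needs_projection) for every residual block."""
    plan = []
    inplanes = base_planes                      # channels after conv1 + max pooling
    for stage, num_blocks in enumerate(blocks_per_stage):
        planes = base_planes * 2 ** stage
        stride = 1 if stage == 0 else 2         # the first stage keeps the resolution
        for block in range(num_blocks):
            block_stride = stride if block == 0 else 1
            needs = block_stride != 1 or inplanes != planes * expansion
            plan.append((f"conv{stage + 2}_x", block, needs))
            inplanes = planes * expansion       # output channels of this block
    return plan


# expansion is 1 for BasicBlock and 4 for Bottleneck.
resnet18 = downsample_plan(expansion=1, blocks_per_stage=[2, 2, 2, 2])
resnet50 = downsample_plan(expansion=4, blocks_per_stage=[3, 4, 6, 3])

for entry in resnet18:
    print(entry)
```

In the 18-layer layout only the first block of $conv3\_x$, $conv4\_x$ and $conv5\_x$ needs a projection. With Bottleneck blocks the first block of $conv2\_x$ needs one too, because its output has 256 channels while its input has 64.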

Bottleneck block

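def forward(self, x: Tensor) -> Tensor:
    identity = x

    # 1*1 conv: reduce the number of channels
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    # 3*3 conv on the reduced channels
    out = self.conv2(out)
    out = self.bn2(out)
    out = self.relu(out)

    # 1*1 conv: restore the number of channels
    out = self.conv3(out)
    out = self.bn3(out)

    # Project the input when its shape differs from the output
    if self.downsample is not None:
        identity = self.downsample(x)

    # Residual addition followed by ReLU
    out += identity
    out = self.relu(out)

    return out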

ResNet network structure