Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts
By transferring both features and gradients between different layers, the shortcut connections explored by ResNets allow us to effectively train very deep neural networks of up to hundreds of layers. However, the additional computation costs induced by these shortcuts are often overlooked. For example, during online inference, the shortcuts in ResNet-50 account for about 40 percent of the total memory usage on feature maps, because features from preceding layers cannot be released until the subsequent computation is completed. In this work, for the first time, we consider training CNN models with shortcuts and deploying them without. In particular, we propose a novel joint-training framework that trains a plain CNN by leveraging the gradients of its ResNet counterpart. During the forward pass, the feature maps from the early stages of the plain CNN are passed through the later stages of both the plain CNN itself and its ResNet counterpart to compute the loss. During backpropagation, gradients computed from a mixture of these two parts are used to update the plain-CNN network, alleviating the vanishing gradient problem. Extensive experiments on ImageNet/CIFAR10/CIFAR100 demonstrate that the plain-CNN network without shortcuts generated by our approach achieves the same level of accuracy as the ResNet baseline while delivering about $1.4\times$ speed-up and $1.25\times$ memory reduction. We also verified the feature transferability of our ImageNet-pretrained plain-CNN network by fine-tuning it on MIT 67 and Caltech 101. Our results show that the performance of the plain CNN is slightly higher than that of its baseline ResNet-50 on these two datasets. Code is available at: \href{https://github.com/leoozy/JointRD_Neurips2020}{https://github.com/leoozy/JointRD\_Neurips2020}
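To make the joint-training scheme described above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation (see the repository linked above). Names such as `plain_stages`, `res_stages`, `head_plain`, `head_res`, and the mixing weight `alpha` are hypothetical placeholders, and it assumes each plain-CNN stage produces features shape-compatible with the corresponding ResNet stage. It shows how intermediate plain-CNN features could be routed through the remaining ResNet stages and how the two losses could be mixed so the early plain stages receive a blend of both gradients.

\begin{verbatim}
# Hypothetical sketch of the joint-training idea (not the authors' code).
import torch
import torch.nn as nn

class JointTrainingSketch(nn.Module):
    def __init__(self, plain_stages, res_stages, head_plain, head_res):
        super().__init__()
        self.plain_stages = nn.ModuleList(plain_stages)  # shortcut-free stages
        self.res_stages = nn.ModuleList(res_stages)      # ResNet counterpart
        self.head_plain = head_plain                     # classifier for plain path
        self.head_res = head_res                         # classifier for ResNet path

    def forward(self, x):
        logits_res = []
        feat = x
        for i, stage in enumerate(self.plain_stages):
            feat = stage(feat)
            # Route the intermediate plain feature through the remaining
            # ResNet stages (assumes matching feature-map shapes).
            if i + 1 < len(self.res_stages):
                aux = feat
                for res_stage in self.res_stages[i + 1:]:
                    aux = res_stage(aux)
                logits_res.append(self.head_res(aux))
        logits_plain = self.head_plain(feat)
        return logits_plain, logits_res

def joint_loss(logits_plain, logits_res, target, alpha=0.5):
    # Mix the plain-path loss with the ResNet-path losses so that gradients
    # flowing back into the early plain stages blend both signals.
    loss = nn.functional.cross_entropy(logits_plain, target)
    for lr in logits_res:
        loss = loss + alpha * nn.functional.cross_entropy(lr, target)
    return loss
\end{verbatim}

In this sketch, calling \texttt{joint\_loss(...).backward()} would update the plain CNN with the mixed gradients; the ResNet counterpart is only needed at training time and can be discarded at deployment.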