[Reviews] SNOW: Subscribing to Knowledge via Channel Pooling for Transfer & Lifelong Learning of Convolutional Neural Networks

Minho Ryu
2 min read · Apr 18, 2020


Introduction

Training CNNs from random initialization to high task accuracy generally requires a large amount of data that is expensive to collect. Transfer learning has been a milestone in addressing this problem, attracting much attention since ImageNet pre-trained models became publicly available and easily accessible through multiple deep learning libraries.

However, transfer learning from a source task to many target tasks incurs significant overall training and storage overhead, because multiple large models must be customized and stored. Lifelong learning, on the other hand, enables substantial parameter sharing and can deliver multiple target tasks with less training time and a smaller model size, but it may suffer from catastrophic forgetting or lower accuracy.

In this article, we will discuss SNOW, a framework that tackles these problems in transfer and lifelong learning.

Figure 1. Overview of SNOW architecture.

Model Architecture

In the SNOW architecture, the source model produces intermediate features, and the target models selectively subscribe to them via channel pooling during training and serving. The local features from a target model and the subscribed source features are combined through concatenation. For convenience, the authors use the same architecture for the target models as for the source model, but with a much smaller number of channels. For channel pooling, the authors assign a scalar weight to each channel of the source model and train these weights along with the parameters of the target models. This allows the architecture to selectively pool the channels that improve a target model's performance. The details are visualized in Figure 2, and a minimal sketch of the idea is given below.
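To make the idea concrete, here is a minimal sketch of channel pooling followed by concatenation, assuming a PyTorch-style implementation. The module name, the sigmoid gating, and the tensor shapes are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ChannelPooling(nn.Module):
    """Scale each frozen source channel by a learnable scalar weight and
    concatenate the pooled channels with the target model's own features."""

    def __init__(self, num_source_channels: int):
        super().__init__()
        # one learnable scalar per source feature channel
        self.channel_weights = nn.Parameter(torch.zeros(num_source_channels))

    def forward(self, source_features: torch.Tensor,
                target_features: torch.Tensor) -> torch.Tensor:
        # source_features: (N, C_src, H, W) from the frozen source model
        # target_features: (N, C_tgt, H, W) from the small target model
        gates = torch.sigmoid(self.channel_weights).view(1, -1, 1, 1)
        pooled = source_features * gates       # weight each subscribed channel
        return torch.cat([pooled, target_features], dim=1)
```

Only the per-channel weights and the target model's parameters are trained; the source model stays frozen, which is what allows many target tasks to share the same source features.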

Training Strategy

Figure 2. Training weights for pooling channels.

To give every channel a fair chance of being selected, the authors introduce stochasticity by adding samples from a zero-mean normal distribution with standard deviation σ to the channel weights during training, as shown in Figure 2. The stochastic part is removed at inference time, so the top-K trained weights are selected directly for subscription.
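The sketch below illustrates this training/inference asymmetry under the assumption that selection is a hard top-K over the (optionally noise-perturbed) channel weights; the function name and default σ are hypothetical.

```python
import torch

def select_channels(channel_weights: torch.Tensor, k: int,
                    sigma: float = 0.1, training: bool = True) -> torch.Tensor:
    """Return indices of the K source channels to subscribe to."""
    if training:
        # perturb the learned weights with zero-mean Gaussian noise so that
        # every channel occasionally ranks in the top K during training
        scores = channel_weights + sigma * torch.randn_like(channel_weights)
    else:
        # at inference the noise is dropped and the trained weights are used
        # directly for deterministic top-K selection
        scores = channel_weights
    return torch.topk(scores, k).indices
```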
