Paper Review: SoundNet: Learning Sound Representations from Unlabeled Video (NeurIPS’16)

Peratham Wiriyathammabhum
4 min read · Apr 9, 2019

This paper…

  • leverages the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos.
  • proposes a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video.
  • achieves significant performance improvements over state-of-the-art results on standard benchmarks for acoustic scene/object classification (DCASE Challenge, ESC-50, and ESC-10).
  • has visualizations which suggest that some high-level semantics automatically emerge in the sound network.
  • requires no manual supervision or labeling. The input is just a pairing between RGB frames and raw audio waveforms from videos.

The paper proposes a system that can recognize objects and scenes from sound alone. The model learns a good audio representation from the natural synchronization between RGB frames and waveforms in videos. To do that, they feed RGB frames into two pretrained image CNNs: one trained on ImageNet for objects and one trained on Places for scenes. For the raw audio waveforms, they define an audio ConvNet as a 1D fully-convolutional network that looks very similar to AlexNet, except that it is fully-convolutional and its conv8 layer has two heads, so that each head pairs with the top layer of one image CNN in its own KL loss. They also experiment with a VGG-style audio ConvNet.
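To make the layout concrete, here is a minimal PyTorch sketch of the student network described above (the authors' released code is Torch7). The layer widths, kernel sizes, and strides are illustrative placeholders, not the exact values from the paper; only the overall structure (a 1D fully-convolutional audio network whose final conv layer has one head per visual teacher) follows the description.

```python
import torch
import torch.nn as nn

class AudioStudentNet(nn.Module):
    """Simplified SoundNet-style student: raw waveform in, two teacher-sized heads out."""

    def __init__(self, n_object_classes=1000, n_scene_classes=401):
        super().__init__()

        def block(c_in, c_out, k, s):
            # 1D conv -> batch norm -> ReLU -> temporal max pooling
            return nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=k, stride=s, padding=k // 2),
                nn.BatchNorm1d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool1d(kernel_size=4),
            )

        self.features = nn.Sequential(
            block(1, 16, 64, 2),    # conv1: raw waveform, 1 input channel
            block(16, 32, 32, 2),   # conv2
            block(32, 64, 16, 2),   # conv3
            block(64, 128, 8, 2),   # conv4 (the real model is deeper)
        )
        # Two heads on the final conv layer: one per visual teacher.
        # (Head kernel size is simplified here; the paper's conv8 differs.)
        self.object_head = nn.Conv1d(128, n_object_classes, kernel_size=1)
        self.scene_head = nn.Conv1d(128, n_scene_classes, kernel_size=1)

    def forward(self, waveform):
        # waveform: (batch, 1, num_samples)
        h = self.features(waveform)
        # Average over time so each head outputs one score vector per clip.
        obj_logits = self.object_head(h).mean(dim=-1)
        scn_logits = self.scene_head(h).mean(dim=-1)
        return obj_logits, scn_logits


# Usage: two ~5-second clips at 22.05 kHz -> one object and one scene prediction each.
x = torch.randn(2, 1, 110250)
obj_logits, scn_logits = AudioStudentNet()(x)
```

Feeding a batch of raw waveforms of shape (batch, 1, num_samples) yields one object distribution and one scene distribution per clip, which is exactly what the distillation loss described next consumes.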

The training scheme is student-teacher learning, as in the knowledge distillation paper [3]. The reader can find a good video tutorial in [4]. A simple explanation for this audio-visual setting: the audio CNN adjusts its parameters based on the error between its outputs and the outputs of the image CNNs. In other words, the image CNNs provide the (pseudo-ground-truth) labels, and the audio CNN adjusts itself accordingly. The knowledge is distilled from two image ConvNets into one audio ConvNet across modalities (objects + scenes -> sounds). The loss is simply a KL divergence (yes, it is differentiable), so everything is done via backpropagation. The model is tuned end-to-end with stochastic gradient descent (in fact, they use Adam, as mentioned in Section 3.4 on page 4).
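Here is a sketch of that distillation step, assuming the `AudioStudentNet` above and two frozen, pretrained visual teachers (`imagenet_cnn`, `places_cnn`) that map an RGB frame to class logits. The names are placeholders, not the paper's code; the point is just how the two KL terms and the optimizer fit together.

```python
import torch
import torch.nn.functional as F

def distillation_loss(audio_net, imagenet_cnn, places_cnn, waveform, frame):
    """KL(teacher || student) for both heads; gradients flow only into the student."""
    obj_logits, scn_logits = audio_net(waveform)
    with torch.no_grad():                                  # teachers are frozen
        obj_target = F.softmax(imagenet_cnn(frame), dim=1)  # object distribution
        scn_target = F.softmax(places_cnn(frame), dim=1)    # scene distribution
    loss_obj = F.kl_div(F.log_softmax(obj_logits, dim=1), obj_target,
                        reduction="batchmean")
    loss_scn = F.kl_div(F.log_softmax(scn_logits, dim=1), scn_target,
                        reduction="batchmean")
    return loss_obj + loss_scn

# Training-loop skeleton with Adam, matching the optimizer mentioned above:
# optimizer = torch.optim.Adam(audio_net.parameters(), lr=1e-4)
# loss = distillation_loss(audio_net, imagenet_cnn, places_cnn, waveform, frame)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```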

Then, they take the trained audio ConvNet, remove the top layers, and use a specific layer such as pool5 as a fixed feature extractor for a multi-class SVM classifier (one-versus-all). The reason is that the audio semantic categories at evaluation time are different from the visual categories used during training, so naively transferring visual knowledge will probably not make the audio network work out of the box. It needs this extra supervised step on top of the learned features to map them to the target labels.
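A minimal sketch of that transfer step, assuming the `AudioStudentNet` above: pool an intermediate layer over time to get one fixed-size vector per clip, then fit a linear one-vs-rest SVM on the labeled target dataset (e.g., ESC-50). The layer choice and hyperparameters here are placeholders rather than the paper's exact setup.

```python
import torch
from sklearn.svm import LinearSVC

def extract_features(audio_net, waveforms):
    """Fixed feature extractor: run the conv trunk, then average over time."""
    with torch.no_grad():
        feats = audio_net.features(waveforms)   # (batch, channels, time)
        return feats.mean(dim=-1).numpy()       # (batch, channels)

# X_train: (n_clips, 1, n_samples) waveform tensor, y_train: integer class labels
# train_feats = extract_features(audio_net, X_train)
# clf = LinearSVC(C=1.0)                        # one-vs-rest for multi-class labels
# clf.fit(train_feats, y_train)
# accuracy = clf.score(extract_features(audio_net, X_test), y_test)
```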

Task: acoustic scene/object classification. Datasets: ESC-50, ESC-10, and DCASE (50, 10, and 10 categories, respectively). SoundNet beats all previous methods and baselines at that time (2016).

There are also ablation studies comparing various configurations. The results suggest that the KL loss, using both ImageNet and Places teacher networks, the 8-layer architecture, and video pre-training all help.

They also perform multi-modal recognition on a novel dataset containing 9,478 videos (video + sound) with 44 categories. To begin with, they visualize the embeddings of visual fc7 and audio conv7 features (all from VGG models) and conclude that the sound features alone contain semantic information.
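For readers who want to reproduce that kind of picture, here is a small sketch assuming a 2D t-SNE projection (the review does not pin down the exact projection tool, so treat this as one reasonable choice). `audio_feats` is a placeholder for the (n_videos, dim) conv7 feature matrix and `labels` for the 44 category indices.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding(audio_feats, labels):
    """Project high-dimensional audio features to 2D and color points by category."""
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(audio_feats)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
    plt.title("2D projection of audio conv7 features")
    plt.show()
```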

For object/scene recognition, vision is more informative than sound alone. Still, sound alone carries a lot of information, and combining the two modalities works best.
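One simple way to realize "combining the two modalities" is late fusion: concatenate each video's visual (fc7) and audio (conv7) feature vectors and train a single linear classifier. This is a common recipe and a sketch under that assumption, not necessarily the paper's exact fusion protocol; the array names are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fuse_and_classify(visual_fc7, audio_conv7, labels, train_idx, test_idx):
    """Concatenate per-video visual and audio features, train one linear SVM, report accuracy."""
    fused = np.concatenate([visual_fc7, audio_conv7], axis=1)
    clf = LinearSVC().fit(fused[train_idx], labels[train_idx])
    return clf.score(fused[test_idx], labels[test_idx])
```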

They visualize the learned conv1 filters and find that they are diverse, including low- and high-frequency filters, wavelet-like patterns, and filters with increasing and decreasing amplitudes.

Figure: learned conv1 filters
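Producing such a plot is straightforward because each 1D conv1 filter is just a short weight vector that can be drawn as a waveform. The sketch below assumes the `AudioStudentNet` above; the filters in the paper's figure come from the authors' released model, not this simplified one.

```python
import matplotlib.pyplot as plt

def plot_conv1_filters(audio_net, n=16):
    """Plot the first n conv1 filters as small waveforms in a grid."""
    weights = audio_net.features[0][0].weight.detach()   # (out_channels, 1, kernel_size)
    fig, axes = plt.subplots(4, 4, figsize=(10, 6))
    for i, ax in enumerate(axes.flat[:n]):
        ax.plot(weights[i, 0].numpy())
        ax.axis("off")
    plt.show()
```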

Then, they visualize conv7 units by finding the samples that maximize each hidden unit's activation. The model appears to learn the semantic/class information well.
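The procedure amounts to ranking clips by how strongly they activate one chosen unit and then listening to the top few. A sketch, where `hidden_layer(waveform)` is a placeholder for running the network up to conv7 and `dataset` is a placeholder list of (clip_id, waveform) pairs:

```python
import torch

def top_activating_clips(hidden_layer, dataset, unit_index, k=5):
    """Return the k clips whose peak activation for one hidden unit is largest."""
    scores = []
    with torch.no_grad():
        for clip_id, waveform in dataset:
            activation = hidden_layer(waveform)               # (1, channels, time)
            scores.append((activation[0, unit_index].max().item(), clip_id))
    return sorted(scores, reverse=True)[:k]
```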

Lastly, the authors provide a nice webpage [2] where everyone can download their Flickr dataset and their Torch7 code.

Thoughts: I think this is a milestone paper in the audio CNN literature. It is influential in terms of using CNNs on raw waveforms and framing the problem as transfer learning from visual representations via natural video synchronization. There is a lot of room for improvement if your GPUs allow.

References

[1] Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. “SoundNet: Learning sound representations from unlabeled video.” Advances in Neural Information Processing Systems. 2016.

[2] http://soundnet.csail.mit.edu/

[3] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).

[4] https://www.youtube.com/watch?v=skHpJ-oTi6o
