“Learning Spatiotemporal Features with 3D Convolutional Networks” — Du Tran et al. 2015

Paper: Learning Spatiotemporal Features with 3D Convolutional Networks

Website: C3D

Features:

  • 3D
  • Supervised learning
  • For video datasets

Findings:

  • 3D ConvNets are better suited than 2D ConvNets for spatiotemporal feature learning.
  • 3D ConvNets model temporal information better because both their convolution and pooling operations are 3D.
  • A homogeneous architecture with small 3×3×3 convolution kernels in all layers is among the best-performing 3D ConvNet architectures.
    • Fix the spatial receptive field to 3×3 and vary only the temporal depth of the 3D convolution kernels.
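The findings above can be illustrated with a minimal sketch of a single-channel 3D convolution. This is not the paper's implementation (C3D is a full multi-layer network trained in Caffe); it is just a naive NumPy loop, with an assumed small clip size, showing how a 3×3×3 kernel slides over time as well as space, so the output keeps a temporal dimension instead of collapsing it:

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 3D convolution ('valid' padding) of a single-channel
    clip of shape (T, H, W) with a kernel of shape (t, h, w)."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # slide over time
        for j in range(out.shape[1]):      # slide over height
            for k in range(out.shape[2]):  # slide over width
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Toy clip: 8 frames of 32x32 (C3D itself uses 16-frame 112x112 inputs).
clip = np.random.rand(8, 32, 32)
kernel = np.ones((3, 3, 3)) / 27.0  # homogeneous 3x3x3 kernel
out = conv3d_valid(clip, kernel)
print(out.shape)  # (6, 30, 30): the temporal axis shrinks by 2 but survives
```

A 2D convolution applied to the same clip would treat the 8 frames as input channels and produce a single 2D map, discarding temporal ordering after one layer; the 3D kernel is what lets temporal structure propagate through a deep stack.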

Useful reference:

  • 3D ConvNets: S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE TPAMI, 35(1):221–231, 2013.
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
    • The “slow fusion” model uses 3D convolutions and average pooling in its first three convolution layers, but still loses all temporal information after the third convolution layer.

Dataset:

  • UCF101: medium-scale
    • K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human action classes from videos in the wild. CRCV-TR-12-01, 2012.
  • Sports-1M: the largest video classification benchmark
    • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • ASLAN: 3631 videos from 432 action classes, for action similarity labeling
  • YUPENN: 420 videos of 14 scene categories
    • K. Derpanis, M. Lecce, K. Daniilidis, and R. Wildes. Dynamic scene understanding: The role of orientation features in space and time in scene classification. In CVPR, 2012.
  • Maryland: dynamic scene classification dataset
    • N. Shroff, P. K. Turaga, and R. Chellappa. Moving vistas: Exploiting motion for describing scenes. In CVPR, 2010.
  • Egocentric: 42 types of everyday objects
    • X. Ren and M. Philipose. Egocentric recognition of handled objects: Benchmark and analysis. In Egocentric Vision workshop, 2009.