Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Published 2015 · Computer Science

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224$\times$224) input image. This requirement is “artificial” and may reduce the recognition accuracy for images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102$\times$ faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
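
To make the fixed-length claim concrete, here is a minimal pure-Python sketch of a pyramid pooling step. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the pooling operator, the pyramid levels, or how bin edges are computed, so max pooling, the levels (1, 2, 4), the floor/ceil bin edges, and the helper names `spatial_pyramid_pool` and `make_map` are all chosen here for demonstration.

```python
from typing import List, Sequence

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def spatial_pyramid_pool(fmap: FeatureMap,
                         levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool each channel over an n x n grid of bins for every n in `levels`.

    The output length is channels * sum(n * n for n in levels), which does
    not depend on the spatial size of `fmap`.
    Assumption: max pooling and these bin edges are illustrative choices.
    """
    height = len(fmap[0])
    width = len(fmap[0][0])
    pooled: List[float] = []
    for n in levels:
        for channel in fmap:
            for i in range(n):
                # Bin edges scale with the map size: floor for the start,
                # ceiling for the end, so every bin is non-empty.
                top = (i * height) // n
                bottom = ((i + 1) * height + n - 1) // n
                for j in range(n):
                    left = (j * width) // n
                    right = ((j + 1) * width + n - 1) // n
                    pooled.append(max(
                        channel[r][c]
                        for r in range(top, bottom)
                        for c in range(left, right)
                    ))
    return pooled


def make_map(channels: int, height: int, width: int) -> FeatureMap:
    """Build a deterministic toy feature map (hypothetical test data)."""
    return [[[float(k + r * width + c) for c in range(width)]
             for r in range(height)]
            for k in range(channels)]


# Two feature maps of different spatial sizes yield vectors of equal length.
small = spatial_pyramid_pool(make_map(2, 6, 9))
large = spatial_pyramid_pool(make_map(2, 13, 10))
print(len(small), len(large))  # prints "42 42": 2 channels * (1 + 4 + 16) bins
```

The same function could in principle be applied to a cropped window of a feature map computed once for the whole image, which is the idea the abstract describes for detection; the cropping logic is left out of this sketch.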