Combined feature compression encoding in image retrieval

Features extracted by convolutional neural networks (CNNs) have recently become widely used for image retrieval. In CNN representations, high-level features are usually chosen to represent images in coarse-grained datasets, while mid-level features have been successfully applied to describe images in fine-grained datasets. In this paper, we combine these different levels of features into a joint feature, yielding a robust representation that is suitable for both coarse-grained and fine-grained image retrieval datasets. In addition, because the efficiency of image retrieval is affected by the dimensionality of the index, we apply a unified subspace learning model named spectral regression (SR). We combine SR with the robust CNN representation to form a combined feature compression encoding (CFCE) method. CFCE preserves the information in the features without noticeably reducing image retrieval accuracy. We examine how image retrieval performance varies with the compressed dimensionality of the features, and from this we identify a reasonable indexing dimensionality for image retrieval. Experiments demonstrate that our model provides state-of-the-art performance across datasets.
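The pipeline summarized above (joint mid-level and high-level features, followed by SR-based compression) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption, since the abstract gives no implementation details: the function names are hypothetical, the features are random stand-ins for CNN activations, the two feature levels are simply L2-normalized and concatenated, and the compression step is a supervised spectral-regression-style variant (class-indicator responses followed by ridge regression) rather than necessarily the exact formulation used in the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def joint_feature(mid, high):
    """Concatenate L2-normalized mid-level and high-level descriptors."""
    return np.hstack([l2_normalize(mid), l2_normalize(high)])

def spectral_regression_fit(X, labels, alpha=1.0):
    """Supervised spectral-regression-style projection (assumed variant).

    Step 1: build c-1 response vectors from the class indicators,
            orthogonal to the all-ones vector.
    Step 2: ridge-regress the centered features onto those responses.
    Returns the feature mean and a (d, c-1) projection matrix.
    """
    classes = np.unique(labels)
    n, d = X.shape
    indicators = (labels[:, None] == classes[None, :]).astype(float)
    # QR on [ones, indicators] orthogonalizes the indicators against ones.
    Q, _ = np.linalg.qr(np.hstack([np.ones((n, 1)), indicators]))
    Y = Q[:, 1:len(classes)]
    mean = X.mean(axis=0)
    Xc = X - mean
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Y)
    return mean, W

def compress(X, mean, W):
    """Project joint features into the low-dimensional index space."""
    return (X - mean) @ W

def retrieve(query_code, index_codes, k=5):
    """Return indices of the k nearest index codes by Euclidean distance."""
    dists = np.linalg.norm(index_codes - query_code, axis=1)
    return np.argsort(dists)[:k]

# Random stand-ins for CNN activations: 60 images, 3 classes.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
mid = rng.normal(size=(60, 32))    # hypothetical mid-level descriptors
high = rng.normal(size=(60, 16))   # hypothetical high-level descriptors

X = joint_feature(mid, high)
mean, W = spectral_regression_fit(X, labels, alpha=1.0)
codes = compress(X, mean, W)
top = retrieve(codes[0], codes, k=5)
```

In this variant the compressed dimensionality is fixed at the number of classes minus one; studying how accuracy changes with the indexing dimensionality, as the abstract describes, would require a formulation that exposes the dimensionality as a free parameter.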
