Effect of Spectrogram Parameters and Noise Types on The Performance of Spectro-temporal Peaks Based Audio Search Method

Audio search algorithms locate a queried file in large databases, especially in multimedia applications, and are expected to do so reliably, robustly, and quickly. In this study, an audio fingerprint algorithm based on the spectral peaks method, with a few minor modifications, was developed to detect the matching audio file in a target database. The method has two stages: audio fingerprint extraction and matching. In the first stage, fingerprint features are extracted by hash functions from spectral peaks on the spectrograms of the audio files. This state-of-the-art technique considerably reduces processing load and time compared to traditional methods. In the second stage, the fingerprint data of the queried file are compared with the database fingerprints created in the first stage. The algorithm was demonstrated, and the effect of the spectrogram parameters (window size, overlap, number of FFT points) on reliability and robustness was investigated under different noise sources. The study also aims to contribute to new audio retrieval studies based on the spectral peaks method. Variation in the spectrogram parameters significantly affected the number of matches, reliability, and robustness. Under high noise conditions, the optimal spectrogram parameters were determined to be a window size of 512, an overlap of 50%, and 512 FFT points. With these parameters, the algorithm generally detected the queried file in the database successfully, even under high noise. No significant effect of music genre was observed.

Keywords:

Audio recognition, Reliability, Noise effect, Spectrogram parameters
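
The two stages described in the abstract (fingerprint extraction from spectral peaks, then matching against a database) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes NumPy, a Hann window, a simplified peak picker that keeps the strongest bins of each frame, a plain (f1, f2, dt) tuple as the hash, and synthetic two-tone test tracks with additive white noise. All function names, the sampling rate, `per_frame`, `fan_out`, and `max_dt` are hypothetical choices made for illustration. Only the window size of 512, the 50% overlap, and the 512 FFT points are taken from the abstract.

```python
from collections import Counter, defaultdict
import numpy as np

def spectrogram(x, window_size=512, overlap=0.5, n_fft=512):
    """Magnitude STFT (frames x bins), Hann window; assumes n_fft >= window_size."""
    hop = int(window_size * (1 - overlap))
    win = np.hanning(window_size)
    n_frames = 1 + (len(x) - window_size) // hop
    frames = np.stack([x[i * hop:i * hop + window_size] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

def peaks(spec, per_frame=3):
    """Simplified peak picking: the per_frame strongest bins of every frame."""
    top = np.argsort(spec, axis=1)[:, -per_frame:]
    return [(t, int(f)) for t in range(spec.shape[0]) for f in sorted(top[t])]

def fingerprints(pk, fan_out=5, max_dt=20):
    """Pair each anchor peak with later peaks; yields ((f1, f2, dt), t1)."""
    out = []
    for i, (t1, f1) in enumerate(pk):
        paired = 0
        for j in range(i + 1, len(pk)):
            t2, f2 = pk[j]
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt == 0:
                continue  # skip peaks in the anchor's own frame
            out.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return out

def build_database(tracks, **params):
    """Stage 1: map every hash to the (track name, frame index) where it occurs."""
    db = defaultdict(list)
    for name, x in tracks.items():
        for h, t in fingerprints(peaks(spectrogram(x, **params))):
            db[h].append((name, t))
    return db

def match(query, db, **params):
    """Stage 2: vote for (track, time offset); the most voted pair wins."""
    votes = Counter()
    for h, tq in fingerprints(peaks(spectrogram(query, **params))):
        for name, tdb in db.get(h, ()):
            votes[(name, tdb - tq)] += 1
    if not votes:
        return None, None, 0
    (name, offset), n = votes.most_common(1)[0]
    return name, offset, n

def add_noise(x, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    noise = rng.standard_normal(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + scale * noise

def make_track(rng, fs=8000, n_chunks=16, chunk=2000):
    """Synthetic stand-in for music: a sequence of random two-tone chunks."""
    t = np.arange(chunk) / fs
    parts = [sum(np.sin(2 * np.pi * f * t) for f in rng.uniform(300, 3500, size=2))
             for _ in range(n_chunks)]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
tracks = {"A": make_track(rng), "B": make_track(rng)}
db = build_database(tracks)

# Query: a noisy excerpt of track A starting at frame 40 (40 hops of 256 samples).
start = 40 * 256
query = add_noise(tracks["A"][start:start + 12000], 5.0, rng)
best_track, offset, n_votes = match(query, db)
print(best_track, offset, n_votes)
```

Passing other values, for example `build_database(tracks, window_size=1024, n_fft=1024)` together with the same keyword arguments in `match`, is one way to explore how the spectrogram parameters change the number of matching hashes. The database and the query must be fingerprinted with identical parameters for the hashes to be comparable.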
