Peer-Reviewed

Activation Maximization with a Prior in Speech Data

Received: 30 July 2021     Accepted: 13 August 2021     Published: 31 August 2021
Abstract

Research on neural networks has grown rapidly in recent years. However, what a neural network learns is often opaque to humans, which has motivated the development of feature visualization techniques. Activation Maximization (AM) is one such technique, originally designed for image data. In AM, the input is optimized to find the data that most strongly activates a selected neuron. In this paper, the output of an emotion recognizer is selected as the neuron, and the latent code of a generator (from a Generative Adversarial Network) is optimized instead of the raw input data. The aim of this study is to apply AM to different representations of audio data (waveform-based and mel-spectrogram-based) and to different model structures (CNN, WaveNet, LSTM), and to identify the most suitable conditions for AM on audio-domain data. We also attempt to visualize the features essential to each class in speech emotion classification, using two datasets: the Toronto emotional speech set (TESS) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The mel-spectrogram-based models were found to be superior to the others and showed the distinctive features of the selected emotions. More specifically, the CNN-based mel-spectrogram model gave the best results both qualitatively and quantitatively (FID score). Moreover, as demonstrated in this study, AM can also be employed as an output enhancer for generative models.
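
The latent-code optimization described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tiny fixed "generator" and "classifier" are hypothetical stand-ins for a trained GAN generator and emotion recognizer, the L2 penalty is one common way to keep the latent code near its prior, and finite differences replace the backpropagation a real implementation would use.

```python
import math
import random

# Hypothetical toy models (NOT the paper's networks): a fixed "generator"
# mapping a 2-D latent code to 3 features, and a fixed linear-softmax
# "classifier" over 2 classes.
G_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
C_WEIGHTS = [[1.0, -1.0, 0.5], [-1.0, 1.0, -0.5]]


def generate(z):
    """Map a latent code z to a generated sample x = tanh(G z)."""
    return [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in G_WEIGHTS]


def class_log_prob(x, target):
    """Log-softmax probability that the classifier assigns to `target`."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in C_WEIGHTS]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return logits[target] - log_norm


def objective(z, target, prior_weight):
    """Class log-probability of G(z) minus an assumed L2 penalty on z."""
    return class_log_prob(generate(z), target) - prior_weight * sum(zi * zi for zi in z)


def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference gradient of f at z."""
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((f(z_plus) - f(z_minus)) / (2 * eps))
    return grad


def activation_maximization(target, z_init=None, steps=200, lr=0.1,
                            prior_weight=0.01, seed=0):
    """Gradient ascent on the latent code z; the models stay fixed."""
    rng = random.Random(seed)
    z = list(z_init) if z_init is not None else [rng.gauss(0.0, 1.0) for _ in range(2)]
    for _ in range(steps):
        grad = numerical_grad(lambda v: objective(v, target, prior_weight), z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z
```

Calling `activation_maximization(target=0, z_init=[0.0, 0.0])` returns a latent code whose generated sample the toy classifier assigns to class 0 with much higher probability than the starting point. Only the latent code is updated; the generator and classifier weights are never changed, which is what distinguishes this setup from training.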

Published in American Journal of Computer Science and Technology (Volume 4, Issue 3)
DOI 10.11648/j.ajcst.20210403.13
Page(s) 75-82
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2021. Published by Science Publishing Group

Keywords

Deep Learning, Signal Processing, Feature Visualization, Activation Maximization, GAN

Cite This Article
  • APA Style

    Sho Inoue, Tad Gonsalves. (2021). Activation Maximization with a Prior in Speech Data. American Journal of Computer Science and Technology, 4(3), 75-82. https://doi.org/10.11648/j.ajcst.20210403.13


    ACS Style

    Sho Inoue; Tad Gonsalves. Activation Maximization with a Prior in Speech Data. Am. J. Comput. Sci. Technol. 2021, 4(3), 75-82. doi: 10.11648/j.ajcst.20210403.13


    AMA Style

    Sho Inoue, Tad Gonsalves. Activation Maximization with a Prior in Speech Data. Am J Comput Sci Technol. 2021;4(3):75-82. doi: 10.11648/j.ajcst.20210403.13


  • BibTeX

    @article{10.11648/j.ajcst.20210403.13,
      author = {Sho Inoue and Tad Gonsalves},
      title = {Activation Maximization with a Prior in Speech Data},
      journal = {American Journal of Computer Science and Technology},
      volume = {4},
      number = {3},
      pages = {75-82},
      doi = {10.11648/j.ajcst.20210403.13},
      url = {https://doi.org/10.11648/j.ajcst.20210403.13},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajcst.20210403.13},
     year = {2021}
    }
    


  • RIS

    TY  - JOUR
    T1  - Activation Maximization with a Prior in Speech Data
    AU  - Sho Inoue
    AU  - Tad Gonsalves
    Y1  - 2021/08/31
    PY  - 2021
    N1  - https://doi.org/10.11648/j.ajcst.20210403.13
    DO  - 10.11648/j.ajcst.20210403.13
    T2  - American Journal of Computer Science and Technology
    JF  - American Journal of Computer Science and Technology
    JO  - American Journal of Computer Science and Technology
    SP  - 75
    EP  - 82
    PB  - Science Publishing Group
    SN  - 2640-012X
    UR  - https://doi.org/10.11648/j.ajcst.20210403.13
    VL  - 4
    IS  - 3
    ER  - 


Author Information
  • Sho Inoue, Department of Information & Communication Sciences, Sophia University, Tokyo, Japan

  • Tad Gonsalves, Department of Information & Communication Sciences, Sophia University, Tokyo, Japan
