Research Article | Peer-Reviewed

A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context

Received: 19 November 2025     Accepted: 8 December 2025     Published: 29 December 2025
Abstract

While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality over traditional TTS, its computational cost is high because of the complex neural network architectures involved. Efforts to reduce this cost have already shortened synthesis time considerably. For on-device TTS, the real-time factor must be below 1, and latency should be as small as possible. One way to reduce latency in an E2E TTS system is incremental TTS. Because incremental TTS synthesizes speech in units of sentence segments, however, naturalness tends to degrade at the boundaries between segments. To improve naturalness at these boundaries, context must be taken into account. Yet when the context is incorporated as text, or as an intermediate feature of an attention-based encoder-decoder, the computation in the acoustic model can increase and the synthetic speech can still break at segment boundaries. In other words, incremental TTS is subject to a trade-off between computational cost and the naturalness of the synthetic speech. In this paper, we propose an incremental Korean TTS method that takes intermediate feature-level context into account, based on an analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. Experimental results on the FastSpeech2 model demonstrate the effectiveness of our approach.
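As a rough illustration of the two ideas the abstract leans on (a real-time factor below 1, and carrying intermediate features across sentence-segment boundaries), here is a toy Python sketch of an incremental two-stage pipeline. Everything in it is invented for illustration: the "acoustic model", "vocoder", feature shapes, and context width are hypothetical stand-ins, not the paper's actual method.

```python
import time

def acoustic_model(phonemes, prev_features):
    """Toy acoustic model: one intermediate 'feature' per phoneme.

    Features carried over from the previous sentence segment condition the
    start of this segment, which is the idea behind feature-level context
    at segment boundaries.
    """
    feats = list(prev_features)
    for p in phonemes:
        feats.append((ord(p) + sum(feats[-2:])) % 97)
    new_feats = feats[len(prev_features):]
    mel = [f for f in new_feats for _ in range(5)]  # 5 mel frames per feature
    return mel, new_feats

def vocoder(mel_frames, hop_length=256):
    # Toy vocoder: each mel frame becomes hop_length waveform samples.
    return [0.0] * (len(mel_frames) * hop_length)

def incremental_tts(segments, context_width=2, sample_rate=22050):
    """Synthesize segment by segment, carrying trailing intermediate features."""
    audio, context = [], []
    start = time.perf_counter()
    for seg in segments:
        mel, feats = acoustic_model(seg, context)
        audio.extend(vocoder(mel))
        context = feats[-context_width:]  # feature-level context for the next segment
    # Real-time factor: synthesis time divided by duration of produced audio.
    rtf = (time.perf_counter() - start) / (len(audio) / sample_rate)
    return audio, rtf

audio, rtf = incremental_tts([list("annyeong"), list("haseyo")])
print(len(audio), rtf < 1.0)
```

The key line is the last one in the loop: instead of re-feeding previous text through the encoder, only a short window of already-computed intermediate features is retained, which is what keeps the per-segment computation bounded while still giving the model boundary context.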

Published in American Journal of Engineering and Technology Management (Volume 10, Issue 6)
DOI 10.11648/j.ajetm.20251006.11
Page(s) 94-100
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

TTS, Incremental TTS, Encoder, Decoder, Transformer

References
[1] Pouget, M., Hueber, T., Bailly, G., Baumann, T., "HMM training strategy for incremental speech synthesis," in Proc. Interspeech, pp. 1201-1205, 2015.
[2] Pouget, M., Nahorna, O., Hueber, T., Bailly, G., "Adaptive latency for part-of-speech tagging in incremental text-to-speech synthesis," in Proc. Interspeech, pp. 2846-2850, 2016.
[3] Baumann, T., Schlangen, D., "Evaluating prosodic processing for incremental speech synthesis," in Proc. Interspeech, 2012.
[4] Baumann, T., "Decision tree usage for incremental parametric speech synthesis," in Proc. ICASSP, pp. 3819-3823, 2014.
[5] Yanagita, T., Sakti, S., Nakamura, S., "Incremental TTS for Japanese language," in Proc. Interspeech, pp. 902-906, 2018.
[6] Yanagita, T., Sakti, S., Nakamura, S., "Neural iTTS: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework," in Proc. 10th ISCA Speech Synthesis Workshop, 2019.
[7] Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., Xie, L., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2020.
[8] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.-Y., "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. International Conference on Learning Representations (ICLR), 2021.
[9] Saeki, T., Takamichi, S., Saruwatari, H., "Incremental text-to-speech synthesis using pseudo lookahead with large pretrained language model," arXiv preprint arXiv:2012.12612, 2021.
[10] Du, M., Liu, C., Li, J., "Incremental FastPitch: Chunk-based high quality text to speech," arXiv preprint arXiv:2401.01755, 2024.
[11] Yanagita, T., Sakti, S., Nakamura, S., "Japanese neural incremental text-to-speech synthesis framework with an accent phrase input," IEEE Access, vol. 11, pp. 22355-22363, 2023.
[12] Kayyar, K., Dittmar, C., Pia, N., Habets, E., "Low-resource text-to-speech using specific data and noise augmentation," in Proc. European Signal Processing Conference (EUSIPCO), pp. 61-65, 2023.
[13] Bataev, V., Ghosh, S., Lavrukhin, V., Li, J., "TTS-Transducer: End-to-end speech synthesis with neural transducer," arXiv preprint arXiv:2501.06320, 2025.
[14] Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., Zhao, J., Yu, K., Chen, X., "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv preprint arXiv:2410.06885, 2024.
[15] Shen, K., Ju, Z., Tan, X., Liu, E., Leng, Y., He, L., Qin, T., Zhao, S., Bian, J., "NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers," in Proc. International Conference on Learning Representations (ICLR), 2024.
[16] Peng, P., Huang, P., Li, D., Mohamed, A., Harwath, D., "VoiceCraft: Zero-shot speech editing and text-to-speech in the wild," arXiv preprint arXiv:2403.16973, 2024.
Cite This Article
  • APA Style

    Kim, S., Song, J., Pak, D., Pak, D., Won, M., et al. (2025). A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. American Journal of Engineering and Technology Management, 10(6), 94-100. https://doi.org/10.11648/j.ajetm.20251006.11


    ACS Style

    Kim, S.; Song, J.; Pak, D.; Pak, D.; Won, M., et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am. J. Eng. Technol. Manag. 2025, 10(6), 94-100. doi: 10.11648/j.ajetm.20251006.11


    AMA Style

    Kim S, Song J, Pak D, Pak D, Won M, et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am J Eng Technol Manag. 2025;10(6):94-100. doi: 10.11648/j.ajetm.20251006.11


  • @article{10.11648/j.ajetm.20251006.11,
      author = {Song-Yun Kim and Jin-Hyok Song and Dae-Hun Pak and Dong-Song Pak and Myong-Hyok Won and Hakho Hong},
      title = {A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context},
      journal = {American Journal of Engineering and Technology Management},
      volume = {10},
      number = {6},
      pages = {94-100},
      doi = {10.11648/j.ajetm.20251006.11},
      url = {https://doi.org/10.11648/j.ajetm.20251006.11},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajetm.20251006.11},
      abstract = {While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality over traditional TTS, its computational cost is high because of the complex neural network architectures involved. Efforts to reduce this cost have already shortened synthesis time considerably. For on-device TTS, the real-time factor must be below 1, and latency should be as small as possible. One way to reduce latency in an E2E TTS system is incremental TTS. Because incremental TTS synthesizes speech in units of sentence segments, however, naturalness tends to degrade at the boundaries between segments. To improve naturalness at these boundaries, context must be taken into account. Yet when the context is incorporated as text, or as an intermediate feature of an attention-based encoder-decoder, the computation in the acoustic model can increase and the synthetic speech can still break at segment boundaries. In other words, incremental TTS is subject to a trade-off between computational cost and the naturalness of the synthetic speech. In this paper, we propose an incremental Korean TTS method that takes intermediate feature-level context into account, based on an analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. Experimental results on the FastSpeech2 model demonstrate the effectiveness of our approach.},
     year = {2025}
    }
    


  • TY  - JOUR
    T1  - A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context
    AU  - Song-Yun Kim
    AU  - Jin-Hyok Song
    AU  - Dae-Hun Pak
    AU  - Dong-Song Pak
    AU  - Myong-Hyok Won
    AU  - Hakho Hong
    Y1  - 2025/12/29
    PY  - 2025
    N1  - https://doi.org/10.11648/j.ajetm.20251006.11
    DO  - 10.11648/j.ajetm.20251006.11
    T2  - American Journal of Engineering and Technology Management
    JF  - American Journal of Engineering and Technology Management
    JO  - American Journal of Engineering and Technology Management
    SP  - 94
    EP  - 100
    PB  - Science Publishing Group
    SN  - 2575-1441
    UR  - https://doi.org/10.11648/j.ajetm.20251006.11
    AB  - While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality over traditional TTS, its computational cost is high because of the complex neural network architectures involved. Efforts to reduce this cost have already shortened synthesis time considerably. For on-device TTS, the real-time factor must be below 1, and latency should be as small as possible. One way to reduce latency in an E2E TTS system is incremental TTS. Because incremental TTS synthesizes speech in units of sentence segments, however, naturalness tends to degrade at the boundaries between segments. To improve naturalness at these boundaries, context must be taken into account. Yet when the context is incorporated as text, or as an intermediate feature of an attention-based encoder-decoder, the computation in the acoustic model can increase and the synthetic speech can still break at segment boundaries. In other words, incremental TTS is subject to a trade-off between computational cost and the naturalness of the synthetic speech. In this paper, we propose an incremental Korean TTS method that takes intermediate feature-level context into account, based on an analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. Experimental results on the FastSpeech2 model demonstrate the effectiveness of our approach.
    VL  - 10
    IS  - 6
    ER  - 


Author Information
  • Song-Yun Kim, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

  • Jin-Hyok Song, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

  • Dae-Hun Pak, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

  • Dong-Song Pak, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

  • Myong-Hyok Won, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea

  • Hakho Hong, Institute of Mathematics, State Academy of Sciences, Pyongyang, Democratic People’s Republic of Korea
