While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality over traditional TTS, its computational cost is high because of the complex neural network architectures involved. Efforts to reduce this cost have substantially shortened synthesis time. For on-device TTS, the real-time factor should be less than 1 and latency should be as small as possible. One way to reduce latency in an E2E TTS system is incremental TTS, in which speech is synthesized in units of sentence segments; its drawback is a loss of naturalness at the boundaries between segments. To improve naturalness at these boundaries, we take context into account. However, whether the context is taken as text or as an intermediate feature of an encoder-decoder model with attention, the amount of computation in the acoustic model can increase, and the synthesized speech can still break at segment boundaries. That is, incremental TTS is subject to a trade-off between the amount of computation and the naturalness of the synthesized speech. In this paper we propose an incremental Korean TTS method that takes intermediate feature-level context into account, based on an analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental results conducted on the FastSpeech2 model, which show the effectiveness of our approach.
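The core idea described above, synthesizing segment by segment while carrying an intermediate feature forward as context, can be illustrated with a toy sketch. Everything below (the `encode` stand-in, the boundary-averaging rule) is hypothetical and only mirrors the idea; it is not the paper's actual FastSpeech2-based implementation:

```python
# Illustrative sketch (not the paper's code): chunk-wise incremental TTS
# where each segment is processed with the previous segment's intermediate
# (hidden) feature carried over as context, so the segment boundary is
# smoothed instead of being synthesized in isolation.

def encode(segment, context=None):
    """Stand-in encoder: map characters to 'hidden features' and blend the
    carried-over context into the first frame to smooth the boundary."""
    feats = [float(ord(c)) for c in segment]
    if context is not None:
        feats[0] = 0.5 * (feats[0] + context)
    return feats

def synthesize_incrementally(segments):
    """Synthesize segment by segment, passing the last hidden feature of
    each segment forward as feature-level context for the next one."""
    context = None
    audio = []
    for seg in segments:
        feats = encode(seg, context)
        context = feats[-1]      # feature-level context for the next chunk
        audio.extend(feats)      # stand-in for vocoder output
    return audio
```

For example, `synthesize_incrementally(["ab", "cd"])` blends the first frame of the second segment with the last frame of the first, which is the feature-level analogue of conditioning each segment on its predecessor rather than synthesizing it from scratch.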
| Published in | American Journal of Engineering and Technology Management (Volume 10, Issue 6) |
| DOI | 10.11648/j.ajetm.20251006.11 |
| Page(s) | 94-100 |
| Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
| Copyright | Copyright © The Author(s), 2025. Published by Science Publishing Group |
Keywords: TTS, Incremental TTS, Encoder, Decoder, Transformer
| [1] | Pouget, M., Hueber, T., Bailly, G., Baumann, T., "HMM training strategy for incremental speech synthesis," in Proc. 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), pp. 1201-1205, 2015. |
| [2] | Pouget, M., Nahorna, O., Hueber, T., Bailly, G., "Adaptive latency for part-of-speech tagging in incremental text-to-speech synthesis," in Proc. 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), pp. 2846-2850, 2016. |
| [3] | Baumann, T., Schlangen, D., "Evaluating prosodic processing for incremental speech synthesis," in Proc. 13th Annual Conference of the International Speech Communication Association (Interspeech 2012), 2012. |
| [4] | Baumann, T., "Decision tree usage for incremental parametric speech synthesis," in Proc. ICASSP, pp. 3819-3823, 2014. |
| [5] | Yanagita, T., Sakti, S., Nakamura, S., "Incremental TTS for Japanese language," in Proc. Interspeech, pp. 902-906, 2018. |
| [6] | Yanagita, T., Sakti, S., Nakamura, S., "Neural iTTS: Toward synthesizing speech in real-time with end-to-end neural text-to-speech framework," in Proc. 10th ISCA Speech Synthesis Workshop, 2019. |
| [7] | Yang, G., Yang, S., Liu, K., Fang, P., Chen, W., Xie, L., "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," in Proc. IEEE Spoken Language Technology Workshop (SLT), 2020. |
| [8] | Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., Liu, T.-Y., "FastSpeech 2: Fast and high-quality end-to-end text to speech," in Proc. International Conference on Learning Representations (ICLR), 2021. |
| [9] | Saeki, T., Takamichi, S., Saruwatari, H., "Incremental text-to-speech synthesis using pseudo lookahead with large pretrained language model," arXiv preprint arXiv:2012.12612v2 [cs.SD], 2021. |
| [10] | Du, M., Liu, C., Li, J., "Incremental FastPitch: Chunk-based high quality text to speech," arXiv preprint arXiv:2401.01755v1 [cs.SD], 2024. |
| [11] | Yanagita, T., Sakti, S., Nakamura, S., "Japanese neural incremental text-to-speech synthesis framework with an accent phrase input," IEEE Access, vol. 11, pp. 22355-22363, 2023. |
| [12] | Kayyar, K., Dittmar, C., Pia, N., Habets, E., "Low-resource text-to-speech using specific data and noise augmentation," in Proc. IEEE-SPS European Signal Processing Conference, pp. 61-65, 2023. |
| [13] | Bataev, V., Ghosh, S., Lavrukhin, V., Li, J., "TTS-Transducer: End-to-end speech synthesis with neural transducer," arXiv preprint arXiv:2501.06320v1, 2025. |
| [14] | Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., Zhao, J., Yu, K., Chen, X., "F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching," arXiv preprint arXiv:2410.06885, 2024. |
| [15] | Shen, K., Ju, Z., Tan, X., Liu, E., Leng, Y., He, L., Qin, T., Zhao, S., Bian, J., "NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers," in Proc. International Conference on Learning Representations (ICLR), 2024. |
| [16] | Peng, P., Huang, P., Li, D., Mohamed, A., Harwath, D., "VoiceCraft: Zero-shot speech editing and text-to-speech in the wild," arXiv preprint arXiv:2403.16973, 2024. |
APA Style
Kim, S., Song, J., Pak, D., Pak, D., Won, M., et al. (2025). A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. American Journal of Engineering and Technology Management, 10(6), 94-100. https://doi.org/10.11648/j.ajetm.20251006.11
ACS Style
Kim, S.; Song, J.; Pak, D.; Pak, D.; Won, M., et al. A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context. Am. J. Eng. Technol. Manag. 2025, 10(6), 94-100. doi: 10.11648/j.ajetm.20251006.11
@article{10.11648/j.ajetm.20251006.11,
author = {Song-Yun Kim and Jin-Hyok Song and Dae-Hun Pak and Dong-Song Pak and Myong-Hyok Won and Hakho Hong},
title = {A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context},
journal = {American Journal of Engineering and Technology Management},
volume = {10},
number = {6},
pages = {94-100},
doi = {10.11648/j.ajetm.20251006.11},
url = {https://doi.org/10.11648/j.ajetm.20251006.11},
eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ajetm.20251006.11},
abstract = {While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.},
year = {2025}
}
TY  - JOUR
T1  - A Study on the Incremental Text-to-speech Synthesis Taking into Account Intermediate Feature-level Context
AU  - Song-Yun Kim
AU  - Jin-Hyok Song
AU  - Dae-Hun Pak
AU  - Dong-Song Pak
AU  - Myong-Hyok Won
AU  - Hakho Hong
Y1  - 2025/12/29
PY  - 2025
N1  - https://doi.org/10.11648/j.ajetm.20251006.11
DO  - 10.11648/j.ajetm.20251006.11
T2  - American Journal of Engineering and Technology Management
JF  - American Journal of Engineering and Technology Management
JO  - American Journal of Engineering and Technology Management
SP  - 94
EP  - 100
PB  - Science Publishing Group
SN  - 2575-1441
UR  - https://doi.org/10.11648/j.ajetm.20251006.11
AB  - While end-to-end text-to-speech (E2E TTS) has significantly improved speech quality compared to traditional TTS, the computational cost is very expensive due to the use of complex neural network architectures. The synthetic time has been much reduced by the efforts to reduce the computational cost. In TTS, the real-time factor in the device should be less than 1, and the latency is required to be possibly small. One way to reduce latency in E2E TTS system is incremental TTS. In incremental TTS, there is a disadvantage of the loss of naturalness at the boundary between the sentence segments, as speech is synthesized in units of the sentence segment. To improve naturalness at the boundary between the sentence segments, we take into account the context. Then, taking into account the context as text or the context as an intermediate feature of encoder and decoder containing attention, the amount of computation in acoustic model can increase and the synthetic speech can be broken at the boundary between the sentence segments. That is, incremental TTS is subject to a trade-off between the amount of computation and naturalness of synthetic speech. In this paper we propose an incremental Korean TTS method taking into account the intermediate feature-level context, which is based on the analysis of two-stage E2E TTS consisting of an acoustic model and a vocoder. We present experimental result conducted on the FastSpeech2 model, which shows the effectiveness of our approach.
VL  - 10
IS  - 6
ER  - 