This study presents an interactive AI-driven framework for real-time piano music generation from human body motion, establishing a coherent link between physical gesture and computational creativity. The proposed system integrates computer vision–based motion capture with sequence-oriented deep learning to translate continuous movement dynamics into structured musical output. Human pose is extracted using MediaPipe, while OpenCV is employed for temporal motion tracking to derive three-dimensional skeletal landmarks and velocity-based features that modulate musical expression. These motion-derived signals condition a Long Short-Term Memory (LSTM) network trained on a large corpus of classical piano MIDI compositions, enabling the model to preserve stylistic coherence and long-range musical dependencies while dynamically adapting tempo and rhythmic intensity in response to real-time performer movement. The data processing pipeline includes MIDI event encoding, sequence segmentation, feature normalization, and multi-layer LSTM training optimized using cross-entropy loss and the RMSprop optimizer. Model performance is evaluated quantitatively through loss convergence and note diversity metrics, and qualitatively through assessments of musical coherence and system responsiveness. Experimental results demonstrate that the proposed LSTM-based generator maintains structural stability while producing diverse and expressive musical sequences that closely reflect variations in motion velocity. By establishing a closed-loop, real-time mapping between gesture and sound, the framework enables intuitive, embodied musical interaction without requiring traditional instrumental expertise, advancing embodied AI and multimodal human–computer interaction while opening new opportunities for digital performance, creative education, and accessible music generation through movement.
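As a rough illustration of the pipeline the abstract describes, the sketch below shows (a) how MediaPipe Pose and OpenCV can be used to derive a wrist-velocity feature from webcam frames and map it to a tempo value, and (b) a minimal two-layer LSTM note model compiled with cross-entropy loss and the RMSprop optimizer, mirroring the training setup mentioned above. The choice of landmark, sequence length, vocabulary size, and the velocity-to-tempo rule are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' implementation): wrist-velocity feature
# extraction with MediaPipe/OpenCV, plus a two-layer LSTM note model trained
# with cross-entropy loss and RMSprop. SEQ_LEN, VOCAB_SIZE, and the
# velocity-to-tempo mapping are illustrative assumptions.
import cv2
import numpy as np
import mediapipe as mp
from tensorflow.keras import layers, models, optimizers

SEQ_LEN = 50        # assumed length of each input note sequence
VOCAB_SIZE = 128    # assumed pitch vocabulary (MIDI note numbers 0-127)

# --- (a) Motion feature: per-frame wrist velocity from webcam video --------
mp_pose = mp.solutions.pose

def wrist_speed(prev_xyz, frame, pose):
    """Return (speed, xyz) for the right wrist in normalized image coordinates."""
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks is None:
        return 0.0, prev_xyz
    lm = results.pose_landmarks.landmark[mp_pose.PoseLandmark.RIGHT_WRIST]
    xyz = np.array([lm.x, lm.y, lm.z])
    speed = 0.0 if prev_xyz is None else float(np.linalg.norm(xyz - prev_xyz))
    return speed, xyz

def velocity_to_tempo(speed, base_bpm=90.0, gain=600.0, max_bpm=180.0):
    """Illustrative mapping: faster motion -> faster playback tempo."""
    return min(max_bpm, base_bpm + gain * speed)

# --- (b) Note-sequence model: two-layer LSTM, cross-entropy + RMSprop ------
def build_note_model():
    model = models.Sequential([
        layers.Input(shape=(SEQ_LEN, 1)),          # normalized pitch values
        layers.LSTM(256, return_sequences=True),
        layers.Dropout(0.3),
        layers.LSTM(256),
        layers.Dense(VOCAB_SIZE, activation="softmax"),
    ])
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizers.RMSprop(learning_rate=1e-3))
    return model

if __name__ == "__main__":
    cap = cv2.VideoCapture(0)                      # webcam stream via OpenCV
    prev = None
    with mp_pose.Pose(model_complexity=1) as pose:
        ok, frame = cap.read()
        if ok:
            speed, prev = wrist_speed(prev, frame, pose)
            print("wrist speed:", speed, "-> tempo:", velocity_to_tempo(speed))
    cap.release()
    build_note_model().summary()
```

In the closed-loop system the abstract describes, the tempo and intensity values from part (a) would modulate sampling from the trained model in part (b) in real time; here the two stages are shown in isolation for clarity.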
| Published in | International Journal of Intelligent Information Systems (Volume 14, Issue 6) |
| DOI | 10.11648/j.ijiis.20251406.12 |
| Page(s) | 121-135 |
| Creative Commons | This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
| Copyright | Copyright © The Author(s), 2025. Published by Science Publishing Group |
Keywords: Artificial Intelligence, Human-Computer Interaction, Deep Learning, LSTM Networks, Computer Vision, Real-Time Systems, AI, Musical Expression
APA Style
Bukaita, W., Artiles, N. G., & Pathak, I. (2025). AI-Powered Music Generation from Sequential Motion Signals: A Study in LSTM-Based Modelling. International Journal of Intelligent Information Systems, 14(6), 121-135. https://doi.org/10.11648/j.ijiis.20251406.12
ACS Style
Bukaita, W.; Artiles, N. G.; Pathak, I. AI-Powered Music Generation from Sequential Motion Signals: A Study in LSTM-Based Modelling. Int. J. Intell. Inf. Syst. 2025, 14(6), 121-135. doi: 10.11648/j.ijiis.20251406.12
@article{10.11648/j.ijiis.20251406.12,
author = {Wisam Bukaita and Nestor Gomez Artiles and Ishaan Pathak},
title = {AI-Powered Music Generation from Sequential Motion Signals: A Study in LSTM-Based Modelling},
journal = {International Journal of Intelligent Information Systems},
volume = {14},
number = {6},
pages = {121-135},
doi = {10.11648/j.ijiis.20251406.12},
url = {https://doi.org/10.11648/j.ijiis.20251406.12},
eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.20251406.12},
abstract = {This study presents an interactive AI-driven framework for real-time piano music generation from human body motion, establishing a coherent link between physical gesture and computational creativity. The proposed system integrates computer vision–based motion capture with sequence-oriented deep learning to translate continuous movement dynamics into structured musical output. Human pose is extracted using MediaPipe, while OpenCV is employed for temporal motion tracking to derive three-dimensional skeletal landmarks and velocity-based features that modulate musical expression. These motion-derived signals condition a Long Short-Term Memory (LSTM) network trained on a large corpus of classical piano MIDI compositions, enabling the model to preserve stylistic coherence and long-range musical dependencies while dynamically adapting tempo and rhythmic intensity in response to real-time performer movement. The data processing pipeline includes MIDI event encoding, sequence segmentation, feature normalization, and multi-layer LSTM training optimized using cross-entropy loss and the RMSprop optimizer. Model performance is evaluated quantitatively through loss convergence and note diversity metrics, and qualitatively through assessments of musical coherence and system responsiveness. Experimental results demonstrate that the proposed LSTM-based generator maintains structural stability while producing diverse and expressive musical sequences that closely reflect variations in motion velocity. By establishing a closed-loop, real-time mapping between gesture and sound, the framework enables intuitive, embodied musical interaction without requiring traditional instrumental expertise, advancing embodied AI and multimodal human–computer interaction while opening new opportunities for digital performance, creative education, and accessible music generation through movement.},
year = {2025}
}
TY - JOUR
T1 - AI-Powered Music Generation from Sequential Motion Signals: A Study in LSTM-Based Modelling
AU - Wisam Bukaita
AU - Nestor Gomez Artiles
AU - Ishaan Pathak
Y1 - 2025/12/29
PY - 2025
N1 - https://doi.org/10.11648/j.ijiis.20251406.12
DO - 10.11648/j.ijiis.20251406.12
T2 - International Journal of Intelligent Information Systems
JF - International Journal of Intelligent Information Systems
JO - International Journal of Intelligent Information Systems
SP - 121
EP - 135
PB - Science Publishing Group
SN - 2328-7683
UR - https://doi.org/10.11648/j.ijiis.20251406.12
AB - This study presents an interactive AI-driven framework for real-time piano music generation from human body motion, establishing a coherent link between physical gesture and computational creativity. The proposed system integrates computer vision–based motion capture with sequence-oriented deep learning to translate continuous movement dynamics into structured musical output. Human pose is extracted using MediaPipe, while OpenCV is employed for temporal motion tracking to derive three-dimensional skeletal landmarks and velocity-based features that modulate musical expression. These motion-derived signals condition a Long Short-Term Memory (LSTM) network trained on a large corpus of classical piano MIDI compositions, enabling the model to preserve stylistic coherence and long-range musical dependencies while dynamically adapting tempo and rhythmic intensity in response to real-time performer movement. The data processing pipeline includes MIDI event encoding, sequence segmentation, feature normalization, and multi-layer LSTM training optimized using cross-entropy loss and the RMSprop optimizer. Model performance is evaluated quantitatively through loss convergence and note diversity metrics, and qualitatively through assessments of musical coherence and system responsiveness. Experimental results demonstrate that the proposed LSTM-based generator maintains structural stability while producing diverse and expressive musical sequences that closely reflect variations in motion velocity. By establishing a closed-loop, real-time mapping between gesture and sound, the framework enables intuitive, embodied musical interaction without requiring traditional instrumental expertise, advancing embodied AI and multimodal human–computer interaction while opening new opportunities for digital performance, creative education, and accessible music generation through movement.
VL - 14
IS - 6
ER -