SENTIMENT ANALYSIS MODEL BASED ON ATTENTION MECHANISM MULTIMODAL FUSION AND SPATIO-TEMPORAL GRAPH NEURAL NETWORK

AiQun Zhu; Jin Lu

doi:10.61784/wjit3104

Authors

AiQun Zhu Shenzhen City Polytechnic, Shenzhen 518000, Guangdong, China.
Jin Lu (Corresponding Author) Shenzhen Polytechnic University, Shenzhen 518000, Guangdong, China.

Keywords:

Multimodal sentiment analysis, Attention mechanism, Feature fusion, Spatio-temporal graph neural network, Dynamic modeling, Temporal dependence

Abstract

As a key task in artificial intelligence, sentiment analysis has expanded from the single text modality to multimodal information such as vision and audio. However, existing multimodal sentiment analysis methods often ignore the temporal order and structure of the dynamic interaction between modalities, and fail to fully model how emotional states evolve over time. This study therefore proposes a sentiment analysis model that combines attention-based multimodal fusion with a spatio-temporal graph neural network. The model first extracts high-dimensional features from the text, visual and auditory modalities through dedicated encoders, and then applies a hierarchical attention fusion mechanism that adaptively weights the contributions of the different modalities at both the feature level and the decision level. To capture the dynamic evolution of emotional expression, a spatio-temporal graph neural network module treats the fused features of each time step as graph nodes and constructs dynamic edges based on inter-modal correlation and temporal continuity, thereby modeling the spatio-temporal dependence of cross-modal interaction. Experiments are conducted on three public multimodal emotion datasets: CMU-MOSEI, IEMOCAP and MOSI. The results show that the proposed model significantly outperforms the baseline methods on both sentiment classification and sentiment regression tasks. On CMU-MOSEI, the model reaches an accuracy of 88.7% in binary classification and 53.2% in seven-class classification, both at the level of current advanced methods. Ablation experiments further verify the effectiveness of the hierarchical attention mechanism and the spatio-temporal graph neural network module.
This study provides a new framework for multimodal sentiment analysis that simultaneously models the complex interactions and temporal dynamics between modalities. It has theoretical significance for understanding the mechanisms of human emotional expression and provides technical support for developing more accurate emotional intelligence applications.
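
The pipeline summarized in the abstract (attention-weighted fusion of modality features, fused features per time step as graph nodes, dynamic edges from correlation and temporal continuity) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything named here is an assumption made for illustration: the functions `fuse_modalities`, `build_edges` and `propagate`, the fixed `query` vector standing in for learned attention parameters, cosine similarity standing in for the correlation measure, the `threshold` value, and mean aggregation standing in for the graph neural network layer. Only feature-level fusion is shown; decision-level fusion is omitted.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a, norm_b = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def fuse_modalities(features, query):
    """Feature-level attention: weight each modality by softmax(query . feature).

    `query` is a hypothetical stand-in for learned attention parameters.
    """
    names = sorted(features)
    weights = softmax([dot(query, features[n]) for n in names])
    fused = [
        sum(w * features[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


def build_edges(nodes, threshold=0.5):
    """Dynamic edges between time-step nodes.

    Consecutive steps are always linked (temporal continuity). Other pairs
    are linked when their cosine similarity reaches `threshold` (an assumed
    proxy for the correlation criterion).
    """
    edges = set()
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if j == i + 1 or cosine(nodes[i], nodes[j]) >= threshold:
                edges.add((i, j))
    return edges


def propagate(nodes, edges):
    """One round of mean-aggregation message passing with self-loops."""
    neighbours = {i: [i] for i in range(len(nodes))}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    dim = len(nodes[0])
    return [
        [sum(nodes[k][d] for k in neighbours[i]) / len(neighbours[i])
         for d in range(dim)]
        for i in range(len(nodes))
    ]


# Toy input: three time steps, each with made-up 2-dimensional modality features.
steps = [
    {"text": [1.0, 0.0], "visual": [0.8, 0.2], "audio": [0.0, 1.0]},
    {"text": [0.9, 0.1], "visual": [0.7, 0.3], "audio": [0.1, 0.9]},
    {"text": [0.0, 1.0], "visual": [0.2, 0.8], "audio": [0.3, 0.7]},
]
query = [1.0, 0.0]

nodes = [fuse_modalities(step, query)[0] for step in steps]
edges = build_edges(nodes)
updated = propagate(nodes, edges)
```

In a trained model the attention scores and the graph update would be learned rather than fixed, and the graph would typically be rebuilt for each input sequence.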
