CLASSROOM VIDEO BEHAVIOUR PROPOSAL MODEL BASED ON MULTIMODAL ATTENTION MECHANISMS AND ADAPTIVE SEARCH-Upubscience Publisher

Volume 3, Issue 8, Pp 21-30, 2025

Author(s)

Ji Li¹, Jin Lu²*, MaoLi Wang³

Affiliation(s)

¹Research Management Office, Shenzhen Polytechnic University, Shenzhen 518000, Guangdong, China.

²Guangdong Key Laboratory of Big Data Intelligence for Vocational Education, Shenzhen Polytechnic University, Shenzhen 518000, Guangdong, China.

³Institute for Technical and Vocational Education, Shenzhen Polytechnic University, Shenzhen 518000, Guangdong, China.

Corresponding Author

Jin Lu

ABSTRACT

The analysis of teacher-student behaviour in classroom settings underpins smart education research and application. However, existing general-purpose behaviour detection models often exhibit suboptimal accuracy and efficiency when processing long classroom videos, primarily because they fail to address four key challenges: variable behaviour duration, complex semantic layers, heterogeneous multimodal information, and high background redundancy. To address these challenges, this paper proposes a classroom video behaviour proposal model that combines multimodal attention mechanisms with an adaptive search strategy. First, a multimodal feature extraction backbone network is constructed to extract discriminative features from video, audio, and automatic speech recognition (ASR) transcribed text. Second, a hierarchical multimodal attention fusion module is designed; it captures and integrates behaviour-related key visual segments, audio events, and semantic keywords through two stages of computation: intra-modal attention and cross-modal attention. Building on this foundation, we propose an adaptive boundary search algorithm inspired by reinforcement learning principles. The algorithm dynamically adjusts the search stride and direction according to the contextual semantics and behavioural confidence of the current video segment, enabling efficient and precise boundary localisation for action proposals in lengthy video sequences. To validate model performance, we constructed a large-scale classroom behaviour dataset, ‘Edu-Action’. Experimental results show that our model improves the core evaluation metric for action proposal tasks, average recall at average number of proposals (AR@AN). At a temporal intersection-over-union (tIoU) threshold of 0.5, recall reaches 68.7%, outperforming multiple advanced baseline models. Ablation studies further validate the effectiveness and necessity of each component of the model. This paper presents an effective solution for fine-grained action localisation in long-duration video, with both theoretical implications and practical application prospects.
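
For readers unfamiliar with the reported metric, the following is a minimal sketch of how recall at a fixed tIoU threshold is commonly computed for temporal action proposals. It is not the authors' evaluation code: the function names `tiou` and `recall_at_tiou` and the segment values are hypothetical, chosen only for illustration. AR@AN is usually obtained by averaging such recall values over several tIoU thresholds and proposal budgets; the exact protocol used in the paper is not stated in the abstract.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def tiou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two time segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_tiou(ground_truth: List[Segment],
                   proposals: List[Segment],
                   threshold: float = 0.5) -> float:
    """Fraction of ground-truth segments covered by at least one proposal
    whose tIoU with that segment meets the threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1 for gt in ground_truth
        if any(tiou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


# Made-up example: three annotated behaviours and three proposals.
gt = [(10.0, 20.0), (35.0, 50.0), (70.0, 75.0)]
props = [(11.0, 21.0), (30.0, 40.0), (69.0, 76.0)]

# The first and third behaviours are matched; the second overlaps its
# nearest proposal with tIoU 0.25, below the 0.5 threshold.
print(recall_at_tiou(gt, props, threshold=0.5))
```

In this made-up example, two of the three annotated segments are matched, so recall at tIoU 0.5 is 2/3. A reported figure such as 68.7% is the same kind of ratio computed over a full test set.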

KEYWORDS

Behavioural proposal generation; Multimodal learning; Attention mechanisms; Adaptive search; Classroom video analysis; Smart education; Deep learning

CITE THIS PAPER

Ji Li, Jin Lu, MaoLi Wang. Classroom video behaviour proposal model based on multimodal attention mechanisms and adaptive search. World Journal of Educational Studies. 2025, 3(8): 21-30. DOI: https://doi.org/10.61784/wjes3106.
