
This study constructs an intelligent classroom teaching evaluation system based on multimodal deep learning. The system combines computer vision (ResNet with pose estimation), speech processing (CNN+LSTM), and natural language processing (BERT+Transformer) models to analyze multimodal data, including students' facial expressions, speech emotions, and spoken classroom content, and thereby quantify student focus, a classroom interaction index, and overall teaching quality. A hybrid fusion strategy combining early fusion and late fusion integrates features across the different modalities. Experiments show that the system provides objective, real-time, and comprehensive feedback on the classroom teaching process, delivers data-driven, personalized improvement suggestions to teachers, and offers a feasible path toward building intelligent classroom teaching evaluation systems.
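To make the hybrid fusion idea concrete, the sketch below shows one minimal way such a strategy could be wired up in PyTorch: per-modality features are concatenated and classified jointly (early fusion), while per-modality heads are averaged at the decision level (late fusion), with the two scores blended. The feature dimensions (512 for vision, 256 for speech, 768 for text), the class count, and the blending weight `alpha` are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Minimal sketch of hybrid early/late fusion (assumed design, not the
    paper's exact architecture): feature-level concatenation plus
    decision-level averaging of per-modality predictions."""

    def __init__(self, dims=(512, 256, 768), n_classes=5, alpha=0.5):
        super().__init__()
        # Late fusion: one classifier head per modality (vision, speech, text).
        self.heads = nn.ModuleList(nn.Linear(d, n_classes) for d in dims)
        # Early fusion: classify the concatenated feature vector.
        self.fused_head = nn.Sequential(
            nn.Linear(sum(dims), 256), nn.ReLU(), nn.Linear(256, n_classes)
        )
        self.alpha = alpha  # assumed trade-off between the two fusion scores

    def forward(self, feats):
        # feats: list of per-modality feature tensors, each (batch, dim_i)
        early = self.fused_head(torch.cat(feats, dim=-1))
        late = torch.stack(
            [head(f) for head, f in zip(self.heads, feats)]
        ).mean(dim=0)
        return self.alpha * early + (1 - self.alpha) * late

# Example with dummy features standing in for ResNet, CNN+LSTM, and BERT outputs
model = HybridFusion()
feats = [torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768)]
scores = model(feats)  # (4, n_classes) blended class scores
```

Blending at both the feature and decision levels lets the model exploit cross-modal interactions while remaining robust when one modality is noisy or missing, which is the usual motivation for hybrid over pure early or pure late fusion.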