Data visualization and intelligent text analysis for effective evaluation of English language teaching

Dataset collection
To evaluate the effect of English language teaching, this study collects student data across multiple dimensions, including personal information, academic performance, sentiment analysis results, learning progress, and teaching interaction. Data are collected in two main ways: first, teachers manually record students’ grades, homework feedback, and classroom interaction; second, the online learning platform automatically collects homework submissions, online test scores, and sentiment analysis results.
The dataset used in this study was independently collected by the research team and underwent strict anonymization to ensure that it contains no personally identifiable student information (such as names, student IDs, or contact information). All student participants joined the study voluntarily and signed informed consent forms. To protect privacy, student identity information was replaced with randomly generated unique identifiers (UIDs) at the data entry stage. Text data (such as writing assignments and oral transcriptions) were manually reviewed to remove content that might reveal personal identities (such as descriptions of specific events or home addresses). Original audio files were deleted after the classroom recordings were transcribed, and only the anonymized text was retained. The research process did not involve any leading or sensitive questions, ensuring that the legitimate rights and interests of participants were not infringed.
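The UID replacement step can be illustrated with a minimal sketch; the field names and record layout below are hypothetical, not taken from the study’s actual pipeline:

```python
import uuid

def anonymize_records(records, id_field="student_id"):
    """Replace raw student identifiers with random UIDs.

    `records` is a list of dicts; `id_field` is a hypothetical field name.
    """
    id_map = {}
    for record in records:
        raw_id = record.pop(id_field)
        # Each student keeps one stable, randomly generated UID
        if raw_id not in id_map:
            id_map[raw_id] = uuid.uuid4().hex
        record["uid"] = id_map[raw_id]
    # id_map links UIDs back to identities, so it must be stored
    # separately under access control (or destroyed) and never published
    return records, id_map
```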
The main fields of data collection are shown in Table 2:
Text data preprocessing includes the following steps: (1) Tokenization: standardized tokenization of students’ written and transcribed oral texts using the spaCy English tokenizer. (2) Noise removal: elimination of punctuation, stopwords (from the NLTK English stopword list), and non-English characters. (3) Multilingual content handling: identification and filtering of non-English paragraphs using the LangDetect library. (4) Data balancing: addressing the imbalance in sentiment labels (e.g., disproportionately many neutral samples) by oversampling minority classes with the SMOTE algorithm to reach an approximate 1:1:1 ratio across the three sentiment categories. (5) Text standardization: conversion to lowercase and lemmatization to reduce words to their base forms. These steps ensure the dataset is clean, linguistically consistent, and balanced for effective model training.
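Steps (1)–(3) and (5) can be sketched as a single spaCy-based function, with step (4) applied afterwards on the vectorized features. The library choices follow the text, but the exact configuration is an assumption:

```python
import spacy
from langdetect import detect, LangDetectException
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

nlp = spacy.load("en_core_web_sm")           # spaCy English pipeline
STOPWORDS = set(stopwords.words("english"))  # NLTK English stopword list

def preprocess(text):
    """Tokenize, denoise, filter non-English text, and standardize."""
    try:
        if detect(text) != "en":   # (3) drop non-English paragraphs
            return []
    except LangDetectException:    # raised on empty/ambiguous input
        return []
    doc = nlp(text.lower())        # (5) lowercase before tokenization
    return [
        tok.lemma_                      # (5) lemmatize to base forms
        for tok in doc                  # (1) spaCy tokenization
        if tok.is_alpha                 # (2) drop punctuation/digits
        and tok.text not in STOPWORDS   # (2) drop stopwords
    ]

# (4) Balance sentiment labels on the vectorized features, assuming
# X_features / y_sentiment already exist (e.g., TF-IDF vectors + labels)
from imblearn.over_sampling import SMOTE
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_features, y_sentiment)
```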
Experimental environment and parameters setting
The experiments were conducted on hardware and software configured to support efficient deep learning model training and data processing. The experimental environment is shown in Table 3.
In model training and data analysis, the choice of hyperparameters has an important influence on model performance. All model hyperparameters were determined via grid search combined with fivefold cross-validation. Taking XGBoost as an example, the search covered the learning rate (0.01–0.2), maximum tree depth (3–10), and regularization term λ (0.1–1.0); the parameter combination with the minimum validation-set loss was selected: learning rate 0.05, maximum depth 6, and λ = 0.5. For the BERT model, a layer-wise unfreezing strategy was adopted during training: only the top layers were fine-tuned in the first two training rounds, and all layers were unfrozen in the third round to balance training efficiency and model performance. Additionally, the meta-learner (logistic regression) in Stacking used L2 regularization (C = 1.0) to avoid overfitting. Table 4 shows the main model parameter settings.
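A minimal sketch of the XGBoost grid search described above; the discrete grid points inside the reported ranges are assumptions:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Grid points inside the reported search ranges (assumed discretization)
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],  # 0.01-0.2
    "max_depth": [3, 6, 10],                  # 3-10
    "reg_lambda": [0.1, 0.5, 1.0],            # lambda 0.1-1.0
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="neg_log_loss",  # minimize validation-set loss
    cv=5,                    # fivefold cross-validation
)
search.fit(X_train, y_train)  # assumes preprocessed features and labels
print(search.best_params_)    # reported optimum: 0.05 / 6 / 0.5
```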
Performance evaluation
Figure 4 shows the performance evaluation results of the different models. The ensemble learning methods, especially the Stacking model, show clear advantages over the single models. On the training set, the Stacking model reaches an accuracy of 95.0%, significantly higher than the single models: BERT (93.2%), LSTM (89.7%), random forest (86.3%), and XGBoost (94.5%). Its precision, recall, and F1-score are 94.8%, 95.1%, and 94.9%, respectively, again superior to the other models, particularly in recall and F1-score. By fusing the predictions of several base models (BERT, LSTM, XGBoost, etc.) through weighted combination, the Stacking model compensates for the limitations of any single model on a specific task and improves the stability and accuracy of the overall evaluation. On the test set, the Stacking model again outperforms the single models, with an accuracy of 94.3%: slightly lower than on the training set, but still significantly higher than BERT (91.9%), LSTM (87.9%), XGBoost (93.1%), and random forest (83.4%). Its precision, recall, and F1-score remain high at 94.0%, 94.5%, and 94.2%, respectively, further confirming the advantages of the ensemble approach. Notably, although BERT performs well on the training set, its test-set accuracy drops slightly, indicating weaker adaptability to new data than the Stacking model. LSTM and random forest are comparatively weak; in particular, the LSTM model reaches only 87.9% accuracy on the test set, well below the other models.

The performance evaluation results of different models ((a) training set; (b) test set).
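The weighted fusion in the Stacking model can be sketched as follows: out-of-fold probability outputs of the base models feed the logistic-regression meta-learner (L2, C = 1.0) mentioned earlier. The array names are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# base_probs_* are assumed (n_samples, n_classes) out-of-fold probability
# matrices produced by the base models on the training data
meta_X = np.hstack([base_probs_bert, base_probs_lstm,
                    base_probs_xgb, base_probs_rf])
meta_learner = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
meta_learner.fit(meta_X, y_train)

# At test time, stack the base models' test-set probabilities the same way
meta_X_test = np.hstack([test_probs_bert, test_probs_lstm,
                         test_probs_xgb, test_probs_rf])
stacked_pred = meta_learner.predict(meta_X_test)
```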
Figure 4 demonstrates the significant advantages of the ensemble models (especially Stacking) on the comprehensive evaluation indicators. To examine performance on the key subtask of sentiment analysis, Fig. 5 shows the results of the optimized fusion model combining BERT, LSTM, and CNN with a self-attention mechanism. Its accuracy, precision, recall, and F1-score are all greatly improved on the sentiment analysis task, demonstrating the strength of the multimodal fusion model in handling sentiment analysis.

Sentiment analysis results of the optimized model combining BERT, LSTM, and CNN with a self-attention mechanism.
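One plausible shape of the BERT + LSTM + CNN fusion with self-attention is sketched below in PyTorch; the layer sizes, concatenation scheme, and pooling are illustrative assumptions rather than the paper’s exact architecture:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class FusionSentimentModel(nn.Module):
    def __init__(self, n_classes=3, hidden=128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        d = self.bert.config.hidden_size  # 768 for bert-base
        self.lstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.cnn = nn.Conv1d(d, hidden, kernel_size=3, padding=1)
        fused_dim = 2 * hidden + hidden   # BiLSTM + CNN channels
        self.attn = nn.MultiheadAttention(fused_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids,
                      attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(h)                             # (B, T, 2H)
        cnn_out = self.cnn(h.transpose(1, 2)).transpose(1, 2)  # (B, T, H)
        fused = torch.cat([lstm_out, cnn_out], dim=-1)         # (B, T, 3H)
        attn_out, _ = self.attn(fused, fused, fused)           # self-attention
        return self.classifier(attn_out.mean(dim=1))           # mean-pool
```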
In addition to sentiment analysis, analysis of the thematic structure and semantic depth of students’ writing is another core dimension of teaching effect evaluation. Figure 6 shows the performance of LDA + Word2Vec in teaching content analysis. The combined LDA and Word2Vec model performs particularly well: by using LDA for topic modeling and Word2Vec for lexical semantic analysis, the model reaches an accuracy of 90.1%, a significant improvement over LDA alone (86.5%) and Word2Vec alone (88.3%). This indicates that by extracting the topics students focus on through LDA and analyzing the semantic relations between words in depth with Word2Vec, the model can effectively evaluate students’ understanding and mastery of different teaching contents.

Performance of LDA + Word2Vec in teaching content analysis.
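A gensim-based sketch of the LDA + Word2Vec combination; the topic count, vector size, and feature concatenation are assumptions:

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# docs: list of token lists, e.g. output of the preprocessing pipeline
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# LDA extracts the topics students focus on (topic count assumed)
lda = LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)

# Word2Vec models semantic relations between words in the same texts
w2v = Word2Vec(sentences=docs, vector_size=100, window=5, min_count=2)

def doc_features(doc):
    """Concatenate the topic distribution with the mean word vector."""
    topic_vec = np.zeros(lda.num_topics)
    for tid, p in lda.get_document_topics(dictionary.doc2bow(doc)):
        topic_vec[tid] = p
    vecs = [w2v.wv[w] for w in doc if w in w2v.wv]
    sem_vec = np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
    return np.concatenate([topic_vec, sem_vec])
```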
The improvements in sentiment analysis and teaching content analysis described above ultimately serve the comprehensive evaluation of English teaching effect. Figure 7 shows the evaluation results of English language teaching effect based on the optimized model combining BERT, LSTM, and CNN with a self-attention mechanism. The optimized model clearly outperforms both the traditional model and other deep learning models in all dimensions. In the comprehensive evaluation in particular, the optimized model reaches an accuracy of 92.3%, 8.5 percentage points higher than the traditional LDA + Support Vector Machine (SVM) model (83.8%), demonstrating the marked advantage of the deep learning optimization scheme in evaluating English language teaching effect.

Evaluation results of English language teaching effect based on the optimized model combining BERT, LSTM, and CNN with a self-attention mechanism.
To highlight more clearly the progress of the fusion model proposed in this study on the sentiment analysis task, Fig. 8 compares the performance of the fusion model and a traditional model on students’ sentiment analysis. The direct comparison with a representative traditional model shows that the fusion model’s sentiment analysis accuracy is clearly superior: by introducing BERT, LSTM, CNN, and a self-attention mechanism, accuracy improves from the traditional 78.4% to 91.7%, indicating highly accurate classification of emotional tendencies. This provides teachers and researchers with a powerful tool for accurately understanding students’ emotional states during learning.

Performance of the fusion model and the traditional model in students’ sentiment analysis.
Figure 8 directly shows the performance differences between the models. To quantitatively verify whether these differences, especially the advantages of the Stacking model over the other models (such as XGBoost and BERT), are statistically significant, this study uses paired t-tests to compare accuracies on the training and test sets and calculates 95% confidence intervals (CIs). All experiments are based on five repeated runs with different random seeds, and mean values are reported. The results show that the test-set accuracy difference between the Stacking model and XGBoost is 1.2% (p = 0.003, CI [0.8%, 1.6%]), and the difference from BERT is 2.4% (p < 0.001, CI [1.9%, 2.9%]), indicating that the performance improvement is statistically significant (significance level α = 0.05). The improvements of the Stacking model over the other models all reach significance (p < 0.05), and the confidence intervals do not include zero, further supporting the superiority of the ensemble method. Although the difference between LSTM and random forest is also significant, its absolute value is small, reflecting the limitations of single models on complex tasks. The comparison results between LSTM and random forest are shown in Table 5.
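The significance test can be reproduced with SciPy as follows; the per-run accuracy values are placeholders, not the study’s raw numbers:

```python
import numpy as np
from scipy import stats

# Test-set accuracies over five random-seed runs (placeholder values)
acc_stacking = np.array([0.944, 0.941, 0.945, 0.942, 0.943])
acc_xgboost  = np.array([0.933, 0.929, 0.932, 0.930, 0.931])

t_stat, p_value = stats.ttest_rel(acc_stacking, acc_xgboost)  # paired t-test

# 95% confidence interval for the mean paired difference
diff = acc_stacking - acc_xgboost
ci_low, ci_high = stats.t.interval(0.95, df=len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))
print(f"p = {p_value:.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```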
Beyond statistical significance, the model’s practicality in real teaching scenarios is critical, and its evaluation results should be comparable to the judgments of experienced teachers. To verify this, the study collects teacher evaluation data (N = 200) from the same group of students: three senior English teachers independently score assignments on a 5-point scale, focusing on writing logic, grammatical accuracy, and emotional expression. By aligning teacher scores with model predictions (discretized into high, medium, and low tiers), the study calculates consistency (Cohen’s Kappa coefficient) and accuracy between model outputs and teacher evaluations. The results show that the Stacking model outperforms the teacher assessments in both accuracy and consistency. Notably, inter-rater reliability among the teachers is comparatively low (Kappa = 0.71), highlighting the subjectivity of manual evaluation, while the model’s higher Kappa value (0.82) indicates its effectiveness in supplementing or partially replacing human assessment to enhance objectivity. Specific comparisons are presented in Table 6.
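The tier alignment and agreement computation can be sketched with scikit-learn; the bin cutoffs for the 5-point scale are assumptions, since the paper reports the tiers but not the thresholds:

```python
from sklearn.metrics import cohen_kappa_score

def to_tier(score, low=2.5, high=4.0):
    """Map a 5-point score to a tier (assumed cutoffs)."""
    if score < low:
        return "low"
    return "high" if score >= high else "medium"

# model_scores / teacher_scores: assumed parallel arrays of 5-point scores
model_tiers = [to_tier(s) for s in model_scores]
teacher_tiers = [to_tier(s) for s in teacher_scores]

kappa = cohen_kappa_score(model_tiers, teacher_tiers)  # reported: 0.82
```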
To further verify the model’s advantages over existing work, this study compares it with representative models in educational AI, including the Multimodal Fusion Model (MFM) proposed by Liu et al.37 and the LSTM-CNN self-attention model of Hu et al.38. On the same test set, the Stacking model achieves a significantly higher accuracy (94.3%) than MFM (92.1%) and LSTM-CNN (89.5%), with F1-scores improved by approximately 3.8% and 5.2%, respectively. Compared with existing studies, the proposed model shows clear advantages on all indicators, especially recall (94.5%) and F1-score (94.2%), reflecting the effectiveness of the multi-model integration and optimization strategy. The performance comparison between the proposed model and the educational AI benchmark models is shown in Table 7.
To verify the model’s generalization ability, this study adds student data (N = 600) from three universities in Malaysia, Singapore, and China, covering different English proficiency levels and cultural backgrounds. Stratified cross-validation is adopted, with the training and test sets divided by school. The results show that the Stacking model achieves an average accuracy of 93.2% (standard deviation 1.5%) in cross-institutional tests, significantly higher than the single models (e.g., BERT’s cross-institutional accuracy is 89.7% ± 2.1%). Although accuracy on the PKU dataset is slightly higher (93.3%), possibly related to the distribution of students’ English proficiency, the overall performance is stable. The specific results are shown in Table 8.
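Splitting by school can be implemented with a grouped splitter; the exact splitter and variable names below are assumptions consistent with the described protocol:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

# groups: one label per sample marking its university, e.g. "MY"/"SG"/"CN"
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=groups):
    model.fit(X[train_idx], y[train_idx])  # refit the stacked model
    pred = model.predict(X[test_idx])
    scores.append(accuracy_score(y[test_idx], pred))

print(f"cross-institutional accuracy: "
      f"{np.mean(scores):.3f} ± {np.std(scores):.3f}")
```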
Discussion
Based on the above experimental results (covering overall model performance, key subtasks, statistical significance tests, comparison with teacher evaluations, and cross-institutional generalization), this study shows that ensemble learning and deep learning optimization significantly improve the accuracy and stability of English language teaching effect evaluation. In particular, when the Stacking model is used for multi-dimensional data fusion, it shows clearer advantages than single models (such as BERT, LSTM, and XGBoost). This is consistent with the findings of Liu et al.37, who used ensemble learning and found that fusing the results of multiple base models effectively improves accuracy on sentiment analysis and text classification tasks. By further optimizing the model design (e.g., BERT combined with a self-attention mechanism and CNN), this study improves the performance of the Stacking model in evaluating English language teaching effect, reaching accuracies of 95.0% on the training set and 94.3% on the test set, far exceeding traditional models and single deep learning models. Compared with the sentiment analysis model of Hu et al.38 based on LSTM, CNN, and self-attention, the optimized model not only improves sentiment analysis accuracy (91.7%) but also improves overall performance by fusing multimodal information. In addition, the combined use of LDA and Word2Vec in this study improves the accuracy of teaching content analysis, similar to the work of Qi et al.39, who extracted teaching topics through LDA, analyzed lexical semantics with Word2Vec, and achieved good results. However, Qi et al.39 focused mainly on standalone topic modeling and vocabulary analysis, whereas this study significantly improves the accuracy of teaching content evaluation through more detailed model optimization and multimodal data fusion, especially in assessing students’ mastery of different teaching contents.