Intelligent text analysis for effective evaluation of English language teaching based on deep learning


Dataset collection

This experiment uses the Automated Student Assessment Prize (ASAP) dataset, which contains a large number of English compositions designed primarily to evaluate students’ writing proficiency. The ASAP dataset covers eight distinct topics, each corresponding to a subset of compositions labeled with overall scores. To protect privacy and reduce scoring bias, sensitive information such as specific names and locations is anonymized by uniformly replacing it with the placeholder “@entity”. Text preprocessing also removes non-standard characters and special symbols and converts all letters to lowercase to minimize noise during model training. The NLTK toolkit performs sentence- and word-level tokenization, supporting subsequent hierarchical semantic modeling. Because scoring intervals differ across topics, all original scores are normalized to the range [0,1] to ensure fairness and comparability in cross-topic scoring. To prevent data leakage in cross-topic experiments, prompt words and keywords explicitly related to the composition topics are removed; this keeps the model from “cheating” by learning prompt content directly and helps ensure that the scoring model’s generalization truly reflects writing quality. The dataset is split into training and test sets at a 3:2 ratio. In each cross-topic experiment, one topic serves as the test set while the remaining seven topics form the training set, and model performance is evaluated with five-fold cross-validation.
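As a concrete illustration of this pipeline, the Python sketch below wires together the anonymization, cleaning, NLTK tokenization, and score-normalization steps described above; the `SCORE_RANGES` values, function names, and example call are illustrative assumptions, not the authors’ code.

```python
# A minimal ASAP preprocessing sketch; names and score ranges are illustrative.
import re
import nltk

nltk.download("punkt", quiet=True)

# Hypothetical per-topic score ranges used for min-max normalization to [0, 1].
SCORE_RANGES = {1: (2, 12), 2: (1, 6)}  # ...one entry per ASAP topic

def anonymize_entities(text: str) -> str:
    # Map the dataset's @-prefixed placeholders (e.g. @person1) uniformly to "@entity".
    return re.sub(r"@\w+", "@entity", text)

def preprocess(text: str, topic_id: int, score: float):
    text = anonymize_entities(text.lower())              # lowercase, then anonymize
    text = re.sub(r"[^a-z0-9@.,!?'\s]", " ", text)       # drop non-standard symbols
    sentences = nltk.sent_tokenize(text)                 # sentence-level tokenization
    tokens = [nltk.word_tokenize(s) for s in sentences]  # word-level tokenization
    lo, hi = SCORE_RANGES[topic_id]
    norm_score = (score - lo) / (hi - lo)                # normalize score to [0, 1]
    return tokens, norm_score

print(preprocess("Dear @CAPS1, I think computers are useful.", topic_id=1, score=8))
```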

Experimental environment and parameter settings

Table 3 lists the software, hardware, and development environment used in this experiment, along with key parameter settings. Model parameters are primarily determined using empirical rules and optimized based on validation set performance.

Table 3 Experimental environment and parameter settings.

The Quadratic Weighted Kappa (QWK) is used as the evaluation metric. It measures the consistency between the model and human raters and applies a quadratic penalty to scoring deviations. QWK is calculated as follows:

$$QWK=1-\frac{\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}O_{ij}}{\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}E_{ij}}$$

(34)

Here, N is the total number of rating levels. \(O_{ij}\) is the number of compositions with actual score i and predicted score j. \(E_{ij}\) is the expected frequency computed from the marginal distributions of the human and predicted scores. \(w_{ij}\) is a weight based on the score difference, for which the quadratic difference weight is usually adopted:

$$w_{ij}=\frac{(i-j)^{2}}{(N-1)^{2}}$$

(35)

The value range of QWK is [−1,1], where 1 indicates complete agreement, 0 indicates chance-level agreement, and negative values indicate worse-than-chance agreement.
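For reference, Eqs. (34) and (35) can be implemented in a few lines of NumPy. The function below is a verification sketch that assumes integer score levels 0 to N−1; it is not the evaluation code used in the experiments.

```python
# Quadratic Weighted Kappa per Eqs. (34)-(35); assumes scores in {0, ..., N-1}.
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_levels):
    O = np.zeros((num_levels, num_levels))            # observed counts O_ij
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(y_true)  # expected counts E_ij
    i, j = np.indices((num_levels, num_levels))
    w = (i - j) ** 2 / (num_levels - 1) ** 2          # quadratic weights w_ij
    return 1.0 - (w * O).sum() / (w * E).sum()

# Perfect agreement yields 1.0; chance-level agreement approaches 0.
print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_levels=4))
```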

Performance evaluation

(1) Comparison of model scoring results.

To evaluate the effectiveness of the HFC-AES model, this study compared it with five established AES models: the Hierarchical Attention Model (Hi-att) [27], Co-attention [28], the Two-stage Deep Neural Network (TDNN) [29], the Siamese Enhanced Deep Neural Network (SEDNN) [30], and the Cross-Task Scoring Model (CTS) [31]. Hi-att and Co-attention target single-topic scoring, while TDNN, SEDNN, and CTS address cross-topic scoring. To further strengthen the comparison, two mainstream Transformer-based models were also included: a BERT-based AES model (BERT-AES) and a GPT-based generative AES model (GPT-AES). BERT-AES uses multi-task fine-tuning to emphasize sentence-level semantic consistency, while GPT-AES incorporates prompt information and predicts scores by generating rating sequences. Figure 5 presents the QWK results of all models in cross-topic evaluation.

Fig. 5 Comparison of QWK values of the models in cross-topic scenarios.

Figure 5 shows that in cross-topic evaluation, single-topic models such as Hi-att and Co-attention perform worse than dedicated cross-topic models. Among all models, HFC-AES achieves the highest performance, with an average QWK of 0.856, surpassing the other cross-topic approaches and confirming its effectiveness. GPT-AES and BERT-AES achieve mean QWK scores of 0.810 and 0.791, respectively, outperforming traditional RNN and CNN models but still falling short of HFC-AES. These results indicate that while Transformer architectures excel at feature extraction, HFC-AES gains further advantages through structural optimization and cross-task modeling. Its multi-level semantic modeling and accurate topic-related feature extraction enhance the alignment between compositions and prompts. By integrating the task and shared layers with a cross-task attention mechanism, the model effectively handles semantic differences between topics, improving cross-topic scoring accuracy. Furthermore, the joint optimization mechanism enhances robustness and scoring consistency, enabling HFC-AES to achieve superior performance. To investigate the reasons behind HFC-AES’s performance advantage, it was compared with the two-stage cross-topic AES models TDNN and SEDNN on QWK results for pre-scoring compositions in the topic-independent stage, as shown in Fig. 6.

Fig. 6 QWK comparison of three cross-topic models in the topic-independent stage.

Figure 6 shows that the HFC-AES model achieved a higher QWK than TDNN and SEDNN in pre-scoring compositions during the topic-independent stage. Its average QWK across eight prompts was 0.769, outperforming TDNN (0.546) and SEDNN (0.681). These results confirm that the first stage of HFC-AES is critical for improving pre-scoring quality. Unlike the comparison models, HFC-AES better incorporates prompt information, leading to stronger cross-topic performance. In the topic-related stage, HFC-AES again performed best. Its hierarchical neural network structure effectively captured the complex semantic relationships between compositions and prompts. By extracting general semantic features in the shared layer and emphasizing topic-relevant information in the task layer, the model improved topic alignment and scoring accuracy. The Bi-LSTM and attention mechanisms further enhanced contextual modeling and feature extraction, enabling superior results in cross-topic scoring.

To assess the model’s generalization across different writing types and datasets, supplementary experiments were conducted on the publicly available TOEFL11 and International Corpus of Learner English (ICLE) datasets. TOEFL11 contains compositions from non-native English speakers with 11 first-language backgrounds, and ICLE consists of academic texts from learners in multiple non-English-speaking countries. To ensure robust and unbiased evaluation, repeated tenfold cross-validation was applied to each dataset. Figure 7 reports the QWK scores of all models.

Fig. 7 QWK evaluation results of each model on the TOEFL11 and ICLE datasets.

Figure 7 presents the QWK evaluation results of all models on the TOEFL11 and ICLE datasets. HFC-AES consistently outperformed the comparison models, achieving a QWK of 0.852 on TOEFL11 and demonstrating strong adaptability to non-native English writing. In contrast, traditional models such as Hi-att and Co-attention delivered lower accuracy and weaker consistency, highlighting the superiority of HFC-AES in handling compositions from diverse linguistic backgrounds. These results confirm the model’s robust generalization capability, particularly in scoring tasks involving non-native writers.

To further examine performance differences in real scoring scenarios, a qualitative error analysis was conducted on representative samples. Table 4 lists three compositions with their prompts, human-assigned scores, model predictions, and explanations for scoring discrepancies.

Table 4 Sample analysis of the differences between model and human scoring.

The discrepancies primarily stem from the model’s limited capacity to interpret rhetorical devices, nuanced tone, and deeper reasoning. Essays with complex structures or implicit meaning were more prone to misjudgment. This highlights an area for improvement: integrating advanced discourse reasoning modules or fine-tuning pre-trained language models at the discourse level to enhance recognition of implicit semantics and rhetorical strategies.

(2) Ablation experiments.

Systematic ablation experiments were designed to evaluate the contributions of different features and mechanisms in the HFC-AES model to scoring performance. Two categories were examined: feature-level ablation (discourse structure, topic-independent features, and topic-related features) and mechanism-level ablation (e.g., attention mechanisms). Each feature or mechanism was removed individually and in combination to assess its impact on performance.

For feature-level ablation, four configurations were tested: (i) discourse structure features removed; (ii) all topic-independent features removed; (iii) all topic-related features removed; and (iv) both discourse structure and topic-independent features removed. The results are presented in Fig. 8, and an illustrative ablation harness is sketched after the results discussion below.

Fig. 8 Results of the feature-level ablation experiments.

Figure 8 shows that each feature type contributes differently to model performance. Removing discourse structure features reduces the average QWK to 0.827, confirming their value in capturing overall organization and logical coherence. The impact is greater when topic-independent features are excluded, with the QWK dropping to 0.765, highlighting the importance of basic linguistic indicators such as vocabulary and syntax in modeling text complexity and writing style. Eliminating topic-related features results in a similar decline, with the QWK decreasing to 0.770, underscoring their role in assessing how well a composition aligns with its prompt. The largest drop occurs when both discourse structure and topic-independent features are removed, with the QWK falling to 0.735. This demonstrates that each feature type supports the others: removing one weakens performance, and removing both amplifies the effect. For example, tasks using Prompts 1–4 show the steepest degradation under this combination. Compared to the full HFC-AES model, which achieves an average QWK of 0.856, this ablation produces a 0.121 loss, emphasizing the need for diverse feature inputs. Overall, these results confirm that discourse structure, basic linguistic features, and topic-semantic matching work together to enable accurate scoring.
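The ablation protocol amounts to retraining the model with selected feature groups disabled. The sketch below illustrates such a harness; the group names are shorthand, and the `REPORTED` lookup simply replays the mean QWK values from Fig. 8 in place of actual retraining runs.

```python
# Feature-level ablation harness (illustrative); real runs would retrain HFC-AES.
ALL_GROUPS = {"structure", "topic_independent", "topic_related"}

# Mean QWK values reported in Fig. 8, keyed by the removed feature groups.
REPORTED = {
    frozenset(): 0.856,
    frozenset({"structure"}): 0.827,
    frozenset({"topic_independent"}): 0.765,
    frozenset({"topic_related"}): 0.770,
    frozenset({"structure", "topic_independent"}): 0.735,
}

def train_and_eval(removed: frozenset) -> float:
    # Placeholder: retrain with only ALL_GROUPS - removed enabled and return
    # the mean cross-topic QWK; stubbed here with the reported numbers.
    return REPORTED[removed]

for removed in REPORTED:
    label = " + ".join(sorted(removed)) or "none"
    print(f"removed: {label:35s} mean QWK = {train_and_eval(removed):.3f}")
```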

The next step evaluates the role of the attention mechanism. Three configurations are tested: (i) the attention mechanism removed; (ii) both the attention mechanism and topic-related features removed; and (iii) discourse structure, topic-independent features, and the attention mechanism all removed. The outcomes are shown in Fig. 9.

Fig. 9 Results of the mechanism-level ablation experiments.

Figure 9 shows that removing the attention mechanism alone lowers the model’s average QWK to 0.818, only a slight decrease from the complete model. This indicates that the attention mechanism, though secondary, still contributes meaningfully, especially in handling compositions with complex structures or inter-sentence relationships. Its impact becomes more pronounced when combined with other features. For instance, removing both the attention mechanism and topic-related features reduces the average QWK to 0.792, a much larger drop than removing either alone, highlighting their interdependence. The attention mechanism enhances the modeling of semantic alignment between compositions and prompts, ensuring accurate topic matching. When discourse structure, topic-independent features, and the attention mechanism are all removed, the QWK further falls to 0.778, resulting in a loss of 0.078 compared with the full model (0.856). Performance declines are especially evident in Prompts 4 and 7, which require high-level semantic abstraction and contextual reasoning. Prompt 4 involves balancing ethical concerns and scientific progress, often using metaphors, concessions, and dual-argument structures that demand strong semantic and structural comprehension. Prompt 7 calls for critical analysis of social phenomena and technological impacts, with frequent logical reasoning and subjective expression. Without topic-related features and the attention mechanism, the model struggles to determine whether a composition stays focused on the prompt, reducing scoring consistency.

Overall, the attention mechanism is not the sole determinant of performance, but its synergy with semantic features significantly improves topic understanding and contextual semantic capture, making it a vital component in cross-topic scoring. To further evaluate the HFC-AES model under different feature configurations, additional ablation experiments were conducted on shallow learning (SL) and DL features. By removing each type separately, the SL-only and DL-only models were obtained, and their effects on cross-topic composition scoring are presented in Fig. 10.

Fig. 10 Results of the ablation experiments with different feature types.

Figure 10 shows that the overall scoring performance of the HFC-AES model drops when either SL or DL features are removed. The average QWK decreases to 0.821 without SL features and to 0.812 without DL features, both lower than the complete model’s 0.856. This indicates that both feature types are essential for accurate scoring. Shallow features, such as word frequency, sentence length, and syntactic diversity, provide intuitive and stable indicators of linguistic complexity and writing style, helping the model assess basic language quality. DL features, by contrast, capture richer semantic representations and contextual relationships through neural networks, improving the model’s ability to evaluate semantic coherence and logical flow. Together they form a complementary multi-level semantic representation of each composition, making their joint use a key factor in achieving high-precision scoring. Figures 8 and 9 further show that topic-related features are among the most influential: removing them lowers the average QWK from 0.856 to 0.770, with marked performance drops on Prompts 4 and 7. These features directly model the semantic alignment between compositions and their prompts, an especially challenging aspect of cross-topic scoring, allowing the model to judge topical relevance more accurately. In HFC-AES, this is accomplished through bidirectional LSTM and attention mechanisms in the task layer, substantially improving scoring consistency and accuracy across topics.
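To make the SL feature family concrete, the sketch below computes the kinds of shallow indicators named above (word frequency, sentence length, syntactic diversity) with NLTK; the exact feature set used in HFC-AES may differ.

```python
# Shallow (SL) feature extraction sketch; indicative features only.
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def shallow_features(text: str) -> dict:
    sentences = nltk.sent_tokenize(text)
    tokens = nltk.word_tokenize(text.lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "num_tokens": len(tokens),
        "type_token_ratio": len(Counter(tokens)) / max(len(tokens), 1),  # lexical diversity
        "avg_sentence_len": len(tokens) / max(len(sentences), 1),
        "pos_tag_diversity": len(set(tags)) / max(len(tags), 1),  # crude syntactic diversity
    }

print(shallow_features("College education should be free. The government needs a plan."))
```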

To enhance interpretability, attention weight distributions and feature importance were further analyzed to provide deeper insights into the model’s decision-making. Table 5 presents the attention weights assigned to specific features across different scoring dimensions.

Table 5 Interpretability analysis of attention mechanisms.

To further reveal how the model assigned attention weights within specific texts, the attention distribution for the sentence “College education should be free so that everyone can access knowledge. However, the government needs a sustainable plan to fund it.” was visualized. The visualization is shown in Fig. 11.

Fig. 11 Visualization of the attention weight distribution for the example sentence.

The intra-sentence attention distribution reveals that the model assigns higher weights to phrases like “sustainable plan” and “government needs,” indicating its focus on the practical feasibility issues raised in the essay. This focus is crucial for evaluating the logical completeness of argumentative writing. However, the attention on the phrase “should be free so that everyone can access knowledge” is more dispersed, reflecting the model’s lower sensitivity to idealistic or emotional expressions compared to factual statements. This difference further highlights the model’s limitation in handling subjective stances and shifts in tone. This word-level visualization based on attention aids in explaining specific scoring discrepancies and represents a promising direction for improving model interpretability. The feature importance assessment quantifies the contribution of each feature to the scoring decisions. The results are presented in Table 6.
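A visualization of this kind can be reproduced directly from extracted token weights. The matplotlib sketch below assumes the per-token weights have already been pulled from the model; the numeric values are illustrative, not those behind Fig. 11.

```python
# Word-level attention heatmap sketch; the weights shown are illustrative.
import matplotlib.pyplot as plt
import numpy as np

tokens = ["college", "education", "should", "be", "free", "...", "sustainable", "plan"]
weights = np.array([0.06, 0.08, 0.05, 0.03, 0.09, 0.04, 0.20, 0.18])

fig, ax = plt.subplots(figsize=(8, 1.5))
ax.imshow(weights[np.newaxis, :], cmap="Reds", aspect="auto")  # one row per sentence
ax.set_xticks(range(len(tokens)), labels=tokens, rotation=45, ha="right")
ax.set_yticks([])
plt.tight_layout()
plt.savefig("attention_heatmap.png")
```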

Table 6 Feature importance assessment.

Tables 5 and 6 reveal that, in the interpretability analysis of the attention mechanism, the model places greater emphasis on organizational structure during scoring. This suggests that the HFC-AES model prioritizes logical coherence and structural quality when evaluating compositions. Regarding feature importance, grammatical and semantic features hold significant weight, underscoring their critical role in determining final scores. In contrast, discourse structure shows relatively lower importance, possibly due to its reduced influence in certain composition types. These findings indicate that the model’s scoring decisions largely depend on grammar and semantic quality, while its attention to organizational structure supports effective assessment of coherence and logical consistency.

(3) Influence of the cross-attention mechanism on the scoring model.

The HFC-AES model incorporates a cross-attention mechanism to evaluate both overall composition quality and specific scoring dimensions, including semantics, grammar, vocabulary usage, and organizational structure. The impact of this mechanism on overall scoring performance is assessed, with results presented in Fig. 12.

Fig. 12 Feature weight distribution for Topic 1 when predicting overall and individual scores.

Figure 12 illustrates how the HFC-AES model dynamically adjusts the weights assigned to various features for different scoring tasks after incorporating the cross-attention mechanism. This adjustment notably enhances scoring accuracy and consistency. For the overall composition score, the cross-attention mechanism distributes weight across scoring dimensions: semantic and grammatical features receive weights of 0.159 and 0.168, respectively, highlighting the model’s emphasis on semantic coherence and grammatical accuracy, in line with human scoring criteria. Vocabulary usage is weighted at 0.133, reflecting its importance to scoring, particularly in terms of diversity and precision. When predicting the organizational structure score, the mechanism assigns its highest weight (0.173) to organizational features while significantly down-weighting other aspects. This selective focus enables the model to prioritize the features most relevant to each scoring task, improving the accuracy of individual dimension scores. In summary, the cross-attention mechanism allows the model to flexibly reweight features depending on the scoring task, enhancing the precision and rationale of composition evaluation.
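Mechanically, this reweighting can be realized as scaled dot-product cross-attention in which the scoring task supplies the query and the feature groups supply keys and values. The PyTorch sketch below illustrates the idea under assumed dimensions; it is not the HFC-AES source code.

```python
# Cross-attention over feature groups (sketch); dimensions are assumptions.
import torch
import torch.nn.functional as F

d = 64
num_groups = 4                                # semantics, grammar, vocabulary, structure
task_query = torch.randn(1, d)                # embedding of the current scoring task
feature_keys = torch.randn(num_groups, d)     # one key per feature group
feature_values = torch.randn(num_groups, d)   # one value per feature group

# Softmax weights say how much each feature group contributes to this scoring
# head (cf. the per-dimension weights in Fig. 12).
scores = task_query @ feature_keys.T / d ** 0.5
weights = F.softmax(scores, dim=-1)
fused = weights @ feature_values              # task-specific fused representation
print(weights)
```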

(4) Practical application of the HFC-AES model.

To evaluate the practical utility of the HFC-AES model, its automatic scoring results are compared with human evaluator scores. This comparison helps verify the model’s accuracy and feasibility in real-world settings. Figure 13 presents this comparison, with scores normalized to a maximum of 100 points.

Fig. 13 Comparison of human scores and HFC-AES model scores.

Figure 13 shows that the differences between the HFC-AES model’s scores and human ratings are minimal, with most errors falling within a 3-point range. This indicates that the HFC-AES model closely approximates human scoring standards, making it well-suited for practical automatic composition scoring tasks. While minor discrepancies may occur in individual cases, the model generally performs reliably, effectively supporting automatic scoring needs in real-world applications and demonstrating strong feasibility and potential.

To further assess the model’s practical applicability, its processing time was evaluated by measuring the average scoring time per composition. All experiments were conducted on a consistent hardware and software platform. Comparative models included HFC-AES, TDNN, SEDNN, CTS, BERT-AES, and GPT-AES. The results are summarized in Table 7.

Table 7 Comparison of processing time in the model rating stage (unit: seconds per article).

Table 7 shows that the processing time of the HFC-AES model is slightly longer than that of traditional DL models. This is primarily due to its integration of shallow features, deep semantic representations, discourse structure information, and a multi-module collaborative training mechanism. However, its processing time remains significantly shorter than that of GPT-AES and BERT-AES, which rely on large-scale pre-trained models and suffer from considerable time bottlenecks in practical applications due to their vast parameter sizes and complex inference procedures. Overall, HFC-AES achieves a strong balance between high scoring accuracy and acceptable processing efficiency, making it well-suited for scenarios that demand precise grading. In practical educational settings, this means the HFC-AES model can score approximately 69 essays per minute. For instance, in a medium-sized high school where 3,000 essays need to be graded in a single exam, the model can complete the task within 45 min. This level of performance offers a feasible and effective solution for classroom assessments, online writing platforms, and large-scale standardized testing.
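This claim checks out with back-of-the-envelope arithmetic, assuming the per-essay scoring time of roughly 60/69 ≈ 0.87 s implied by the 69-essays-per-minute figure:

```python
# Sanity check of the grading-time claim from the stated throughput.
per_essay_s = 60 / 69                      # seconds per essay at 69 essays/minute
essays = 3000
total_min = essays * per_essay_s / 60
print(f"{total_min:.1f} minutes for {essays} essays")  # ~43.5 min, within 45 min
```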

Discussion

In summary, the proposed HFC-AES model integrates shallow textual features with DL representations in a two-stage framework that includes both topic-independent and topic-related feature extraction and modeling. This design significantly improves scoring consistency and robustness compared with existing approaches. For example, Li et al. (2023) developed an AES method that combined multi-scale features with Sentence-BERT embeddings and shallow linguistic and topic-related features, achieving a QWK of 0.793 [32]. Wang (2023) extracted semantic features via CNN and LSTM and topic features through TF-IDF, producing a neural network-based AES model with a QWK of 0.816 [33]. Dhini et al. (2023) proposed an AES model based on semantic and keyword similarity using Sentence Transformers; by incorporating the multilingual Paraphrase-Multilingual-MiniLM-L12-V2 and DistilBERT-Base-Multilingual-Cased-V1 models, their approach improved evaluation scores by 0.2 points [34]. In contrast, the proposed model enhances the understanding of composition content and semantics and strengthens the robustness and adaptability of topic information through a cross-task attention mechanism. Consequently, it offers a more comprehensive and effective technical solution for intelligent evaluation in English language teaching.

In practical applications, computational efficiency is crucial for automated scoring systems deployed at scale. This study evaluates the HFC-AES model’s performance in processing thousands of essays in near real-time. On a single GPU machine, the model achieves an inference throughput of approximately 200 compositions per minute, satisfying the demands of most online education platforms. To increase throughput further, distributed computing and data parallelism can be employed to distribute scoring tasks across multiple servers for near-linear acceleration. Additionally, asynchronous batch processing can substantially improve overall system capacity while maintaining scoring latency within seconds. These features meet the low-latency, high-concurrency requirements of large-scale educational environments. To address scenarios with limited computing resources, lightweight model alternatives are explored. Recent advances in DL have produced compressed pretrained models like DistilBERT and TinyBERT, which maintain strong semantic understanding while greatly reducing parameter counts and computational overhead. These distilled models can be efficiently deployed on edge devices or resource-constrained classroom settings. By integrating the HFC-AES multi-stage feature fusion strategy with these lightweight models as substitutes for deep semantic extractors, the system retains high scoring accuracy while lowering latency and computational costs. This makes the scoring system more practical for large-scale real-world education. Future work will focus on systematically evaluating and optimizing these lightweight versions to further enhance the model’s applicability in educational contexts.
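As a sketch of the asynchronous batch processing mentioned above, the following minimal example groups incoming essays into fixed-size batches before invoking a stubbed model call; the queueing scheme, batch size, and `score_batch` placeholder are illustrative assumptions, not a deployed design.

```python
# Asynchronous batched scoring sketch; batching amortizes per-call overhead.
import asyncio

BATCH_SIZE = 32

def score_batch(essays):
    # Placeholder for a single batched forward pass of the scoring model.
    return [0.5 for _ in essays]

async def scoring_worker(queue: asyncio.Queue, results: list):
    batch = []
    while True:
        essay = await queue.get()
        if essay is None:                     # sentinel: flush remainder and stop
            if batch:
                results.extend(score_batch(batch))
            return
        batch.append(essay)
        if len(batch) == BATCH_SIZE:          # full batch: score and reset
            results.extend(score_batch(batch))
            batch = []

async def main(essays):
    queue, results = asyncio.Queue(), []
    worker = asyncio.create_task(scoring_worker(queue, results))
    for e in essays:
        await queue.put(e)
    await queue.put(None)
    await worker
    return results

print(len(asyncio.run(main([f"essay {i}" for i in range(100)]))))  # -> 100
```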

Although the HFC-AES model demonstrates strong efficiency and scoring consistency, deploying automated scoring systems raises important ethical concerns. The model may place excessive emphasis on surface-level features like language fluency and syntactic accuracy, potentially undervaluing creativity and critical thinking. This could lead to a bias favoring style over substance. Moreover, compositions reflecting significant differences in gender, cultural background, or language variants risk being unfairly scored due to imbalances in the training data, which can introduce algorithmic bias. To address these issues, future work should focus on enhancing training mechanisms to promote diversity, inclusiveness, and fairness—for example, by integrating fairness correction modules and improving the recognition and understanding of non-standard linguistic expressions. Additionally, quality control should be enforced through manual audits and human-in-the-loop processes to ensure that automated systems complement rather than fully replace human evaluators, thus mitigating risks of misuse or overreliance on technology.
