A systematic review of AI-driven intelligent tutoring systems (ITS) in K-12 education

Table 1 outlines the included articles, the study location, a sample description, the intervention duration, and the controlled variable. As previously stated, two of the articles each contained two studies. To differentiate between the studies within an article, each was labeled [a] or [b], for example: Cui et al. [a].
It is worth noting that the table does not include effect sizes. Although we recalculated Cohen’s d values from the available information (means, standard deviations, group sizes, eta-squared, etc.), the research designs vary so much from one study to another that presenting these values side by side would be neither clear nor meaningful for comparison purposes.
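For transparency, the conversions involved are standard; the following is a minimal sketch, in Python, of how such recalculations can be made from reported summary statistics (the function names are ours, not from any of the reviewed articles):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d corrected for small-sample bias."""
    return d * (1 - 3 / (4 * (n1 + n2) - 9))

def d_from_eta_squared(eta2):
    """Approximate d from (partial) eta-squared, assuming two equal-sized groups."""
    return 2 * math.sqrt(eta2 / (1 - eta2))
```

Even with these conversions in hand, the heterogeneity of designs (paired vs independent samples, ANCOVA-adjusted means, differing control conditions) is what makes a single comparative table misleading.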
Also, it is important to mention that an additional article by Roscoe and McNamara26 described the same study as another article by Roscoe et al.27, albeit with less detail. For the purposes of this review, both articles were considered as one study.
General overview
Ninety-six percent of the articles were authored by individuals affiliated with educational science, computer science, or both. Only one article was authored by individuals affiliated with the company behind one of the studied ITSs28. Of all the articles included, 62% were authored by individuals with an educational science background29,30,31,32,33,34,35,36,37,38,39,40,41,42,43. Fifteen percent were authored by individuals with a computer science background44,45,46,47,48. The remaining articles (19%) were authored by individuals from both backgrounds27,28,49,50,51.
It is encouraging that the majority of authors are from the fields of education or computer science, with very few affiliated with ITS companies. This enhances the reliability of the results regarding student learning and performance in these studies.
This result differs from the systematic review of ITSs research in higher education conducted by Zawacki-Richter et al.12 In their review, only 8.9% of the included articles were written by authors with an educational science background, while the others were written by authors with computer science and STEM backgrounds. This discrepancy could arise from the use of different databases, namely EBSCO Education Source, Web of Science, and Scopus, or from the fact that Zawacki-Richter’s review covered all AI applications, whereas the current review specifically targeted ITSs. ITSs are primarily employed in traditional and alternative educational settings within the field of AIEd.
The publication rate of the included articles shows signs of a slight increase in recent years. Given that ITSs are regarded as the most prevalent and highly sought-after educational application of AI, attracting considerable investment and interest from technology companies15, we anticipated a higher number of studies meeting our criteria. For comparison, Zawacki-Richter et al. reported 29 studies investigating ITSs in higher education from 2007 to 201812. Similarly, this variation could stem from the difference in databases used. However, it may also result from the current review concentrating solely on the impacts of ITSs on learning and performance rather than on other educational variables such as interest, attitude, or motivation towards learning. Additionally, it could be due to the greater feasibility of conducting studies in higher education compared to K-12 settings involving minors. Nevertheless, the effects of ITSs on learning and performance have remained a subject of interest over the past decade. There has been a consistent publication trend, with one or two articles published annually from 2011 to 2016, followed by an increase to two or three articles each year since 2017.
This review included contributions from eight countries, with a notable concentration in the USA and Asia. Most articles originated from the USA (54%)27,32,33,34,35,38,39,40,44,46,49,50,51. The remaining articles primarily originated from Asia (27%), including Taiwan29,52, China28,41, Thailand37, Korea31, Turkey45, and Saudi Arabia43, while only four (15%) came from Europe (Slovenia, Hungary, Spain, and the Netherlands)30,42,48,53.
None of the articles included in this review mentioned any consideration of AI ethics. This lack of attention to ethical concerns in studies investigating the effects of ITSs on student learning and performance prompts questions regarding the extent to which educators and researchers have addressed the ethical implications associated with the use of AI in education. This oversight highlights the need to thoroughly examine the ethical implications of the widespread use of intelligent tutoring.
What experimental designs are used to evaluate the effects of ITSs?
The effects of ITSs have been studied using various experimental designs across diverse educational contexts, including different school levels and subjects.
Educational context: The school level and the subject were compared to assess differences in educational context. Roughly half of the studies (54%)27,32,33,34,35,36,37,39,40,41,43,45,48,50 involved high school students, and one included both high school and middle school students31. Nearly all the remaining studies (32%) involved middle school students28,29,30,44,46,49,51. Only four studies (14%) involved elementary school students38,42,47,53, and none were conducted with preschool groups.
Most of the studies (82%) were carried out in subjects related to STEM28,29,30,32,33,34,35,36,37,39,40,41,42,43,44,45,46,48,49,50,51,53, while the others focused on language arts, either first27,38,47 or second28,31 language. This finding was consistent with the research conducted by Holmes and Tuomi16 and UNESCO15, which suggested that ITSs are well-suited for subjects with a structured approach, such as mathematics or physics. This may explain why studies have primarily focused on higher education, where these subjects are taught.
Experimental designs: Fig. 1 shows that researchers mainly utilized quasi-experimental methods27,28,29,30,31,32,33,34,35,36,37,38,44,45,46,47,49,50. These methods involved an experimental group using an ITS, while a control group used an alternative intervention to learn the same subject. Effects were measured with a pre- and post-test administered to both groups. In their meta-analysis of ITSs, Kulik and Fletcher observed that the effect measure depended on the nature of the tests, whether they were locally developed or standardized17. This distinction is worth noting: the context and design of the assessment tools, and their alignment with the instructional aims, may influence the measurement of educational effects. As not all studies included in this review have the same type of control group, the studies were categorized into four groups based on their control group type (as listed below and in Fig. 1) in order to analyze the effects of ITSs on students’ performance. The four types of control group were:
- ITS vs Teacher (8 studies)28,29,30,31,38,39,40,53: The control group received traditional, non-digital teaching on the same concepts as the experimental group.
- ITS vs Non-intelligent tutoring system (TS) (5 studies)32,36,37,42,51: The control group used a digital learning environment without artificial intelligence. Several of these studies took place in high school physics classes.
- ITS vs Modified ITS (11 studies)28,33,34,35,41,44,46,48,49,50: The control group used a modified or older version of the ITS tested by the experimental group.
- ITS vs No control (4 studies)27,43,45,47: There was no control group. This category included a qualitative study45, an implementation study27, and a study on gender differences in performance47.

Fig. 1: Bubble size indicates the sample size; an asterisk indicates studies that reported the effect sizes of their results.
Intervention Duration: Fig. 1 illustrates that half of the interventions lasted less than a week28,30,32,33,36,39,41,42,44,46,47,51, with some as brief as a single class period35,46,51. The International Brain Research Organization (IBRO), in partnership with UNESCO’s International Bureau of Education (IBE), suggested that novelty can improve students’ memory and learning19. Considering this, it remains unclear how one can draw conclusions about long-term effects on students’ performance from such brief interventions. Can these effects be attributed to the ITS itself or simply to its novelty? Other studies lasted several weeks27,29,31,34,37,38,40,43,45,49,50,53, limiting the novelty effect of the intervention. The longest intervention lasted 30 weeks and was conducted with a control group39.
What are the effects of ITSs on K-12 students’ learning and performance?
As previously mentioned, studies were categorized into four groups based on experimental design. However, few of the included studies report effect sizes.
ITS vs teacher
Seven out of eight studies comparing an ITS to traditional or usual teaching reported a significant positive effect of ITSs on student performance, with effect sizes ranging from medium to large. One study39 found no significant difference between traditional teaching and ITS use for learning linear graphs.
Cui et al. [a] compared the Yixue Squirrel AI ITS to traditional offline teaching of the Pythagorean theorem over a three-day period, comprising a total of five hours of learning28. A total of 90 students used the ITS in the experimental condition, while 73 were in the control condition. According to Cui et al., the learning gains were 4.19 times greater for the experimental group than for the control group, with a medium-sized effect (Experimental group M = 9.38, SD = 11.08; Control group M = 1.81, SD = 10.91; Hedges’ g = 0.68; F(1, 160) = 16.80, p < 0.001, partial η2 = 0.10)28.
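As a sanity check, the reported Hedges’ g can be reproduced from the published means, standard deviations, and group sizes; a minimal sketch assuming the standard pooled-SD formula:

```python
import math

# Summary statistics reported by Cui et al. [a] (ITS vs traditional teaching)
m_exp, sd_exp, n_exp = 9.38, 11.08, 90
m_ctl, sd_ctl, n_ctl = 1.81, 10.91, 73

pooled_sd = math.sqrt(((n_exp - 1) * sd_exp**2 + (n_ctl - 1) * sd_ctl**2)
                      / (n_exp + n_ctl - 2))
d = (m_exp - m_ctl) / pooled_sd              # ≈ 0.69
g = d * (1 - 3 / (4 * (n_exp + n_ctl) - 9))  # ≈ 0.68, matching the reported value
print(f"d = {d:.2f}, g = {g:.2f}")
```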
In the study by Chen and Huang, 160 computer science students were taught how to use the Internet, Word, and PowerPoint29. In the experimental condition, 81 participants used an unnamed ITS designed specifically for this study, while 79 were assigned to the control condition and received traditional teaching. The ITS was used for seven weeks, but there is no further indication of the actual time devoted to learning with or without the ITS. Learning gains were measured using pre- and post-tests. A one-way analysis of covariance (ANCOVA) showed a significant difference between the experimental and control groups’ test results (Experimental group M = 68.889; Control group M = 64.621; F = 4.272; p < 0.05)29.
In a study conducted by Choi, 32 high school students and 30 middle school students used the iTutor ITS, a tool designed to teach English as a foreign language, for eight ninety-minute sessions over four weeks31. The control group comprised an equal number of students taught the same grammatical concepts in a traditional, teacher-centered, paper-based setting31. The study found a statistically significant difference between the experimental and control groups in the pre- and post-tests, demonstrating the effectiveness of the ITS as shown by a two-way ANOVA (F = 234.344, p < 0.05)31. The ITS had varying effects on middle and high school students, as a statistically significant interaction effect between the experimental groups and the education level was reported (p = 0.013; α = 0.05). This finding suggested that students react differently to the tutoring program depending on their education level31. In particular, middle school students showed greater improvement in the ITS condition and benefited more from it than high school students did. Notably, this type of comparison between levels of education is rare in the literature. This study suggested that educators may need to differentiate ITSs depending on the level of education to effectively increase students’ performance.
Wijekumar et al. conducted an experiment with 5th-grade students from 7 different schools using the We Write ITS over a 6-week period38. The study consisted of two parts, but only the first part, which used an ITS, was considered here. The study aimed to investigate planning skills and writing quality, which were assessed using one pre-test and two post-tests. The experimental group (n = 299) consisted of 194 students who took the writing quality pre-test, 193 who took the planning pre-test, and 145 who completed both post-tests. The control group (n = 165) consisted of 127 students who took the writing quality pre-test, 126 who took the planning pre-test, and only 9 who completed both post-tests. To understand these numbers, it is important to acknowledge that some students took only the pre-test, not the post-test, and vice versa. Although the small control sample may have been due to teachers’ reluctance to allocate instructional time for further assessments, as stated by the authors, it limited the ability to obtain an adequate statistical sample38. The authors reported a significant medium-sized effect (d = 0.77) on the planning skills of the students using the ITS compared to those in the control condition38. The authors mentioned that classes with lower initial writing quality scores seemed to benefit from the ITS more than classes with higher initial scores, although the effect on writing quality was small and not statistically significant. It is noteworthy that the authors of this study emphasize that the ITS enhances but cannot replace teacher-led instruction, and that teachers should receive adequate training in the use of computer tools38.
Dolenc et al. did not directly compare the performance of students who used an ITS with a control group, but rather with national standardized test results30. Fifty-eight students used the TECH8 ITS for two 45-minute sessions to study gears and then underwent a summative assessment of knowledge comparable to the National Assessment of Knowledge (NAK) in Technology and Science for the years 2008 and 2010. Their results were compared to the national results of the 2008 and 2010 NAK, revealing a large effect size of the ITS (2008: d = 0.99; 2010: d = 1.30). The authors suggested that the TECH8 ITS attained better outcomes than traditional teaching methods30. The ITS results were also comparable to those of other ITSs, although the authors did not specify which ones.
Ökördi et al.53 conducted a quasi-experimental study on 2187 students from third and fourth grade. After excluding students with more than 50% missing data on a test or those who did not meet the minimum participation criteria, the final sample comprised 810 pupils who completed a pre-test, a post-test, and a follow-up test three months later on multiplication and division. This included 414 students in Grade 3 and 396 in Grade 4, equally divided between intervention and control groups in a manner that mitigated school-related factors. After the pre-test, both groups received classroom lessons, and the intervention group combined those lessons with sessions on the eDia online platform. The intervention lasted four to six weeks and took place at school during regular school hours, with each online session taking approximately 10–20 min. According to Ökördi et al., students who completed more than half of the online sessions improved their skills by one-third of a standard deviation, while the control group’s progress was only half that amount53.
Nehring et al.40 conducted a study on the ALEKS PPL web-based mathematics learning platform in 12th grade across five different schools, with a total of one hundred students. Students from two schools constituted the control group, receiving only traditional classroom lessons (n = 27). Students from the other three schools formed the intervention group (n = 73), combining traditional classroom lessons with modules on the ALEKS platform. Both groups completed the ALEKS PPL Mathematics Placement Exam in October and again in May, and the data were combined with data from the online platform in a 2 × 2 mixed ANOVA (F(1, 98) = 19.16, η² = 0.16, p < 0.001). The results indicated that students in the intervention group significantly increased their exam scores between October and May (M_diff = 13.55, SE = 1.72, p < 0.001, d = 0.87, 95% CI [10.14, 16.96]), whereas the control group exhibited no statistically significant change in mean performance (M_diff = −0.93, SE = 2.83, p = 0.744).
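The reported confidence interval is internally consistent with the reported mean difference and standard error; a quick check, assuming the error degrees of freedom of the mixed ANOVA (df = 98):

```python
from scipy import stats

# Values reported by Nehring et al. for the intervention group
m_diff, se, df = 13.55, 1.72, 98
t_crit = stats.t.ppf(0.975, df)  # two-sided 95% critical value
low, high = m_diff - t_crit * se, m_diff + t_crit * se
print(f"95% CI = [{low:.2f}, {high:.2f}]")  # ≈ [10.14, 16.96]
```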
In Borchers et al.’s39 study, 82 ninth-grade mathematics students used the Mathtutor ITS to learn three units on linear graphs. The students completed a pre-test on all units on Day 1, learned two units on Day 2, completed a first post-test on those two units on Day 3, learned the last unit on Day 4, and completed a second post-test on that unit on Day 5. Each test was completed in two different formats (paper and ITS); half of the groups answered the paper test first (PT), while the other half did the tutor test first (TP). The students were divided into four conditions, alternating between paper and tutor learning and testing. Paired t-tests showed statistically significant learning gains for both the paper and ITS conditions (tutor: t(281) = 2.76, p < 0.001; paper: t(287) = 7.94, p < 0.001). An ANOVA showed that learning gains were similar in the ITS (M = 0.13, SD = 0.30) and paper conditions (M = 0.15, SD = 0.31) (F(2, 222) = 5.28, p = 0.006). There was a significant interaction between condition and learning unit, favoring paper for one unit and the ITS for another. There was also a significant main effect indicating that learning gains were twice as high when the test format matched the practice environment (M = 0.18, SD = 0.30) compared to when there was no such match and students had to transfer knowledge across formats (M = 0.10, SD = 0.30) (t(567.96) = −3.28, p = 0.001).
ITS vs non-intelligent TS
Five studies compared an ITS to a non-adaptive or non-intelligent tutoring system32,36,37,41,48, mostly in high school physics classes. Two reported better learning outcomes with the ITS37,48, while the other three reported no significant difference in learning gains between the ITS and the non-intelligent tutoring system32,36,41.
Ingkavara et al. conducted a study with an experimental group of 144 students who participated in a self-regulated online learning approach guided by personalized learning, supported by an unnamed ITS37. The control group of 148 students followed a conventional self-regulated online learning approach without guidance from a teacher. Both groups studied electric circuits for a month37. Learning gains were significantly higher in the experimental group (M = 7.37, SD = 2.237) compared to the control group (M = 6.07, SD = 1.908): (t(290) = 5.350, p < 0.05)37. This was one of only two studies in this category to report a significant advantage for the ITS.
Jordan et al. conducted a study with 37 students in the experimental group and 35 in the control group32. Both groups used the Rimac system for one class period, specifically on kinematics32. In the control version, the tutoring system broke down each step regardless of the student’s prior knowledge, while in the experimental version, the ITS only broke down the necessary steps into sub-steps based on the student’s prior knowledge of the content32. The authors reported no significant difference between the two groups, suggesting that students learned the same regardless of their assigned group32.
Katz et al.33 reported two studies: Jordan et al.’s32 study, presented above, and Albacete et al.’s53. In the study by Albacete et al., the Rimac system was used over a four-day period54. The 31 students in the experimental group used an adaptive version of the system, while the 42 students in the control group used a non-adaptive version54. The study found no significant difference in learning gains between the two conditions when controlling for students’ prior knowledge (F(1, 70) = 1.770; p = 0.19)54. An additional independent samples t-test showed no significant difference between mean learning gains in the experimental and control groups (Experimental group M = 0.087, SD = 0.074; Control group M = 0.112, SD = 0.096; t(71) = 1.226, p = 0.22). However, an analysis of variance (ANOVA) revealed that students in the experimental group learned significantly faster, irrespective of their prior knowledge of the content.
In Tang et al.41, 80 volunteer 10th-grade students were arbitrarily divided into two groups and completed a pre-test, a training session, and a post-test over the winter holiday. Sixty-five of them completed the study. The experimental group (n = 28) used Guided and Adaptive Tutoring Tips (GATT) within a Mathematics Intelligent Assessment and Tutoring System (MIATS) created by the authors, which provides step-by-step prompts and immediate personalized feedback during the training sessions on each question answered incorrectly on the pre-test. The control group (n = 37) used another platform that provided only regular answer-based feedback, indicating whether their answers were correct or not. In the pre-test, the control group obtained a significantly higher average score than the experimental group, which aligned with the final examination scores from the previous semester. The mean difference between the treatment and control groups was −11.06, which was statistically significant in an independent samples t-test (p = 0.0023 < 0.01). In the post-test, there was no statistically significant difference between the two groups (p = 0.113 > 0.01). The effect size indicated a small to moderate difference (Cohen’s d = 0.340). Between the tests, both groups made statistically significant progress (p < 0.01) according to paired samples t-tests. The treatment group’s score increased by 15.50%, while the control group’s score increased by 5.13%. Compared to the pre-test, the treatment group showed greater progress in the post-test than the control group, which, according to the authors, indicates that the use of the intelligent teaching system significantly benefited the students in the treatment group.
Uriarte-Portillo48 conducted a study with 106 middle school students to compare an intelligent tutoring system with augmented reality (ARGeoITS) and a system with only augmented reality (ARGeo). Students were randomly assigned to the control group (n = 53) using ARGeo or the experimental group (n = 53) using ARGeoITS. In the first session of the experiment, all students received a lesson on basic geometry, a tutorial on augmented reality, and a pre-test. In the second session, students used a tablet with their respective platform for 50 minutes and then answered a post-test. The ANOVA test revealed that there was no statistically significant difference between the groups on the pre-test (F(1,106) = 0.182, p = 0.670). For the post-test, the mean achievement score was higher in the experimental group (M = 7.47, SD = 1.601) compared to the control group (M = 6.83, SD = 1.424), and the ANOVA showed a statistically significant difference (F(1,106) = 4.752, p = 0.032). This result indicates a better learning outcome for students using the ITS version of the learning platform. The authors also compared the post-test results according to the type of school the students attended (public or private). The ANOVA revealed a statistically significant difference (F(1,106) = 6.675, p = 0.011), favoring students from private schools (M = 7.62, SD = 1.396) compared to those from public schools (M = 6.84, SD = 1.566).
ITS vs modified ITS
This experimental design category includes eleven studies that compared one ITS to another ITS or to a modified version of the same ITS. Identifying general trends in this category is challenging, as the studies compare either ITSs with each other or pedagogical methods for using an ITS in specific contexts. Only three studies in this category provide effect sizes, mostly small35,49,50.
In Huang et al., 60 students used a redesigned ITS, while 69 students used the original version in high school Algebra 1 classes for a month, accumulating a total use of 320 min50. The redesigned ITS estimates the number of opportunities each student is likely to need to master easy and hard fine-grained knowledge contents, so as to avoid under- or over-practice of a given content. According to the authors, an independent samples t-test showed that the redesigned version resulted in significantly higher learning gains, albeit with a small effect size (b = 0.05, p = 0.046; Cohen’s d = 0.31)50. The authors highlighted the relevance of data-driven redesign of ITSs to enhance their effectiveness.
In one of the studies conducted by Cui et al. [b], 46 students used Yixue in the experimental condition, while 58 students used BOXFiSH in the control condition28. The study aimed to investigate the effectiveness of these two language learning apps in teaching basic English grammar to foreign language learners over a two-day period. The results showed that the Yixue users performed significantly better than the control group, with 4.62 greater learning gains (Experimental group M = 5.86, SD = 10.81; Control group M = 1.04, SD = 8.93)28. The authors mention that these results might be due to Yixue’s fine granularity of knowledge contents. However, both groups showed improvement from pre-test to post-test. When interpreting the results of this particular study, it is important to keep in mind that the main author is affiliated with the ITS company.
Long and Aleven conducted a randomized experiment to investigate the effectiveness of Cognitive Tutor in teaching geometry44. The study involved 47 students in the control group who used Cognitive Tutor with a control diary consisting of general questions, and 48 in the experimental group who used Cognitive Tutor with a skill diary for self-assessment. The experiment was conducted over three class periods. The results of the study indicated that there was a significant difference in learning gains between the groups on the reproduction problems section of the post-tests, which were isomorphic to the problems in Cognitive Tutor. This was determined through a one-way ANOVA (F(1, 93) = 3.861, p = 0.052, η² = 0.040). However, there was no significant difference between the two groups on the transfer problems section (F(1, 93) = 0.056, p = 0.814, η² = 0.001)44. Results suggest that self-assessment prompts could also support self-regulated learning for students using the ITS.
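The reported η² values follow directly from the F statistics and their degrees of freedom via the standard identity partial η² = F·df₁ / (F·df₁ + df₂); a quick check:

```python
def eta_squared(f, df_effect, df_error):
    """Partial eta-squared recovered from an F statistic and its degrees of freedom."""
    return (f * df_effect) / (f * df_effect + df_error)

print(f"{eta_squared(3.861, 1, 93):.3f}")  # 0.040 (reproduction problems)
print(f"{eta_squared(0.056, 1, 93):.3f}")  # 0.001 (transfer problems)
```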
Long and Aleven [a] conducted a 2 × 2 experiment with 98 eighth-grade students over three class periods46. The study tested how the following independent factors influenced students’ performance: whether students were shown their skill level and their progress in problem types, and whether students were allowed to select their next problem from an incomplete level49. The experiment used an ITS built with Cognitive Tutor Authoring Tools. No statistically significant differences were found among the four conditions, and there was no significant improvement between the pre- and post-tests46. The authors suspected a ceiling effect and decided to run another experiment, in which Long and Aleven [b] used the same 2 × 2 model with 56 seventh-grade students over five class periods on linear equations46. This second study modified how progress information was displayed to students, with the aim of encouraging reflection, limiting self-assessment biases, and improving the accuracy of skill-level and problem-type-level progress information. Although the sample size was small, with only 14 students per group, the authors observed a positive effect of the modified progress information on learning gains46. Both groups that received progress information outperformed their peers, showing a medium to large effect size (η² = 0.078) according to a one-way ANOVA46. These results suggest that the ITS leads to greater learning gains when it encourages students to reflect on their own progress and abilities.
McCarthy et al. conducted a study in which 118 high school students in science class used the iSTART ITS for three two-hour sessions in the same week33. The students were divided into four groups, following a 2 × 2 design to test how two metacognitive supports implemented within the ITS, performance threshold and self-assessment, influenced students’ performance in understanding complex text33. An additional 116 students who did not receive iSTART training only took the pre- and post-test. Contrary to the results of the three previous studies, which suggested that self-assessment enhanced the effectiveness of the ITS44,46, none of the experimental conditions in this study were reported to influence performance33. Afterward, the 118 iSTART-trained students were compared to the 116 students without any iSTART training. The authors stated that the treatment improved the quality of self-explanations but did not affect test performance33.
In Walkington and Bernacki’s study, three experimental conditions were used to teach mathematics concepts with Cognitive Tutor: a surface personalization condition (n = 35), a deep personalization condition (n = 35), and a control condition with no personalization (n = 36)35. The students were also grouped based on their level of engagement with their interests, which were then used to personalize the exercises. The authors reported t-tests on the number of correct first attempts and correct answers per minute in a post-test, along with their respective effect sizes. The results indicated that students who received deep personalization had more correct first attempts, with a small-sized effect, than those who received surface personalization when their degree of engagement with the interest was high (d = 0.39)35. The study found that students who received personalization had a higher rate of correct answers per minute compared to the control group, with a large-sized effect, but this effect was observed only among those with a higher level of engagement with their interests (d = 0.92)35. Additionally, students who received surface personalization had more correct first attempts, with a small-sized effect, than those who received deep personalization when their level of engagement with their interests was low (d = −0.43)35. The same ITS, Cognitive Tutor, was also tested in a study by Bernacki and Walkington34, in which 150 eleventh-grade students in Algebra I used Cognitive Tutor for four months. Ninety-nine participants used a personalized version based on an interest survey, while 51 participants used the standard version34. Results suggest that this personalization significantly improved students’ performance on a teacher-administered algebra exam (β = 0.062, p = 0.045)34. These two studies’ results suggest that personalization of the ITS can be sufficient to improve students’ performance34,35.
Holstein et al. conducted a three-condition experiment with 286 middle school students who used the Lynnette ITS for a total of 60 min over two days49. The first experimental group used Lynnette along with the complete version of the Lumilo glasses, enabling the teacher to monitor their activities and progress in real time. The second experimental group used Lynnette with a limited version of the Lumilo glasses, which shared less data with the teacher. The control group used only Lynnette. The full version of Lumilo had a small positive effect on student performance compared to the control condition (r = 0.21) and compared to the limited version (r = 0.11). These results corroborated the authors’ hypotheses that combining real-time teaching with AI, supported by the analytics of an ITS, would enhance students’ performance and learning, surpassing the effects of monitoring support alone and of conventional methods in ITS classrooms49. Notably, these were the only authors to acknowledge the potential impact of novelty on the effectiveness of the ITS in facilitating learning.
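Because this study reports its effects as correlations (r) rather than standardized mean differences, a standard conversion (assuming two groups of comparable size; the function name is ours) puts them on the same scale as the d values cited elsewhere in this review:

```python
import math

def d_from_r(r):
    """Convert a (point-biserial) correlation to Cohen's d, equal-group approximation."""
    return 2 * r / math.sqrt(1 - r**2)

print(f"{d_from_r(0.21):.2f}")  # ≈ 0.43 (full Lumilo vs control)
print(f"{d_from_r(0.11):.2f}")  # ≈ 0.22 (full vs limited Lumilo)
```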
Vest et al.51 conducted a study comparing two approaches to problem-solving in basic algebra among 167 middle school students. One group used only an ITS for practice, while the other worked with examples that required selecting self-explanations before problem-solving activities. Participants were recruited via an online database and word of mouth, comprising 57 sixth graders, 73 seventh graders, and 36 eighth graders (one unreported). The study examined different types of worked examples, with or without visual representations and warm-up activities. However, the authors found little impact on student performance. These four conditions were consolidated into a single experimental group (n = 134) for comparison with a control group (n = 33). In both conditions, students received immediate feedback on their responses and could request scaffolded hints from the tutor at any time. Pre-tests and post-tests assessed procedural and conceptual knowledge, with results analyzed accordingly. The findings were similar across both item types, showing no significant effect of the condition: students with higher pre-test scores performed better on post-tests (procedural: F(1, 161) = 48.8, p < 0.001; conceptual: F(1, 161) = 90.62, p < 0.001). However, when the number of problems solved was included as a covariate, students in the experimental group outperformed those in the control group on both procedural and conceptual post-tests (procedural: β = 0.28, F(1, 161) = 5.32, p = 0.022; conceptual: β = 1.23, F(1, 161) = 7.18, p = 0.008). This suggests that generating self-explanations provided greater learning benefits than simply solving a comparable number of problems.
Horvers et al.42 studied the use of an ITS platform already used daily by four schools in 5th grade, involving a total of 114 students. Two schools continued using it as usual for learning fraction simplification (control condition), while the other two added goal-setting prompts via the Learning Path app (experimental condition). The experiment lasted one week, with 55-min lessons each day. On the first day, students completed the pre-test and received their first instruction on simplifying basic fractions. On the second day, they learned to simplify mixed fractions; on the third day, they worked on simplifying complex fractions; and on the fourth day, all three topics were reviewed in an integrated repetition lesson. On the fifth day, students completed the post-test. The ANCOVA revealed that students in the co-regulation condition solved more problems (F(1,111) = 4.26, p = 0.041, partial η² = 0.037) and had higher accuracy (F(1,112) = 45.68, p < 0.001, partial η² = 0.290) than those in the control condition. This suggests that engaging in co-regulation and goal-setting practices can support monitoring in an ITS. Analyses also showed that the control condition had higher learning gains than the experimental condition for complex fractions (F(1,107) = 10.67, p = 0.001, partial η² = 0.091). However, similar learning gains were found for basic fractions (F(1,107) = 0.90, p = 0.345, partial η² = 0.008) and mixed fractions (F(1,107) = 2.24, p = 0.137, partial η² = 0.021). Thus, students in both conditions learned equally well on easy and intermediate topics, but for the most difficult topic, the control condition outperformed the experimental condition.
ITS vs no control
The four studies in this category did not include a control condition in their experimental design.
In Özyurt et al., 81 students used UZWEBMAT for 32 h over eight weeks in their mathematics class45. At the end of the eight weeks, 26 of them were interviewed. UZWEBMAT personalized the students’ learning paths according to their learning styles. Of the interviewed students, 21 expressed that their learning was facilitated. According to the feedback provided, some students found it helpful to be directed to the content of a different learning style when they failed to complete an exercise, as it provided a different perspective. Additionally, 18 students reported that they were able to complete the assignments independently without the need for teacher assistance45. This study’s results echo those previously described, according to which personalization of the ITS can enhance its effectiveness34,35.
In the study conducted by Chen et al., gender differences in cognitive load were compared using pre- and post-tests and a cognitive load questionnaire among a group of 24 fourth-grade students who learned with Zenbo47. The results of an independent samples t-test showed that boys rated significantly lower than girls, with medium-sized effects, for mental effort (t = 2.859, p < 0.05, d = 0.825), mental load (t = 2.335, p < 0.05, d = 0.674), and cognitive load (t = 2.844, p < 0.05, d = 0.872)47. Boys also outperformed girls in the post-tests, but not significantly47. This is the only study specifically investigating gender differences in the effects of ITSs on learning and performance. The authors mention that the gender differences might be due to new technologies being more interesting and engaging to boys, but more distracting to girls47.
In Roscoe et al., 113 students used Writing Pal, which teaches English as a first language, for an average of 16 h over a six-month period27. The participants wrote an essay in November and another in May on two similar SAT prompts. Both pre- and post-study essays were graded, and essay scores increased significantly (t(112) = 5.85, p = 0.001, d = 0.71)27. The authors also noted positive changes in essay structure and lexical sophistication27.
In Khasawneh’s study43, 300 high school students from various grades and three schools used an ITS platform for eight weeks, integrated with their mathematics instruction. The pre-test and post-test instruments, which assessed three cognitive skills, underwent thorough validation by experts to confirm their reliability and validity. After a pilot test with 50 students, internal consistency was assessed using Cronbach’s alpha coefficient (α = 0.80). Paired samples t-tests compared pre-test and post-test scores, revealing significant improvements in problem-solving (pre-test M = 65.4, post-test M = 72.8, t(299) = 4.67, p < 0.001), critical thinking (pre-test M = 68.9, post-test M = 74.3, t(299) = 3.82, p < 0.001), and logical reasoning abilities (pre-test M = 63.2, post-test M = 70.1, t(299) = 3.45, p = 0.001). An ANCOVA, controlling for pre-test scores, also showed a significant positive impact of the intervention on post-test scores in problem-solving abilities (F(1, 298) = 10.21, p < 0.001), critical thinking abilities (F(1, 298) = 8.75, p < 0.001), and logical thinking abilities (F(1, 298) = 7.92, p < 0.001). These results suggest improvement after using adaptive learning technology in mathematics instruction and indicate a beneficial effect of the intervention on students’ cognitive abilities.
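The article does not provide item-level data, but for reference, Cronbach’s alpha is computed from the item and total-score variances; a minimal, self-contained sketch (the function name is ours):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_respondents x k_items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
```

A value of 0.80, as reported here, is conventionally regarded as good internal consistency for a research instrument.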