Deep learning-based multi-criteria recommender system for technology-enhanced learning


This section presents a conceptual summary of the proposed approach. It discusses the datasets used, the experimental framework, and how our system was modeled.

Datasets

To evaluate the performance and generalizability of the proposed DeepFM-SVD++ model, we used two datasets covering different recommendation domains:

  1. ITM-REC dataset: This dataset focuses on multi-criteria ratings relevant to the field of TEL. It includes ratings on aspects such as App, Data, and Ease. These criteria align closely with the challenges in personalized learning environments, making it ideal for validating the model’s application in TEL contexts. The dataset is particularly useful for understanding the optimization of multi-criteria relationships.

  2. Yahoo Movies dataset: This multi-criteria dataset includes user ratings for movies based on aspects such as direction, story, acting, and visuals. It provides a complementary domain for evaluating the proposed model, enabling an analysis of its generalizability beyond TEL. By leveraging this dataset, we demonstrate how the DeepFM-SVD++ model effectively captures complex relationships in user preferences across different domains.

The inclusion of these datasets ensures a comprehensive evaluation of the model’s capabilities in both educational and entertainment contexts. The ITM-REC dataset highlights the model’s relevance to technology-enhanced learning, while the Yahoo Movies dataset showcases its adaptability to other domains, as shown in Table 1.

Table 1 Dataset overview.

TEL multi-criteria dataset analysis

The experiment was conducted using an educational dataset collected through questionnaires over five years, from 2017 to 2022. The ITM-Rec34 dataset was collected from graduate students enrolled in the data management and analytics specialization at the ITM department of the Illinois Institute of Technology34. Students’ individual preferences for final project topics were collected in three courses: Data Analytics (DA), Data Science (DS), and Database (DB). Each student rated their selection by giving an overall rating (c0) and additional ratings for three criteria: App (c1), Data (c2), and Ease (c3). App captures how much the student likes the application domain, Data captures how much the student likes the way the data is processed or stored, and Ease captures how easy the student finds the data to use for the final project. The dataset contains 5,230 ratings given by 476 unique users over 70 unique items. Each criterion is rated on a scale from 1 to 5, with 1 being the lowest preference and 5 the highest. A sample of the dataset is shown in Table 2. While analyzing the data, we observed that a student could rate the same item more than once, depending on the class in which the topic was taken. For example, student A can take item B in both the DA and DB classes and provide different ratings for each class. This led to duplicates in the dataset, as there was no unique user-item interaction. To resolve this, we created a new column named UID, whose values combine the UserID, ItemID, and Class, separated by dashes. The UID uniquely identifies each user-item interaction.
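A minimal pandas sketch of this de-duplication step is shown below; the file name and column names (UserID, ItemID, Class) are assumptions based on the description above, not necessarily those used in the released dataset:

```python
import pandas as pd

# Hypothetical file name; adjust to the actual ITM-Rec ratings file.
ratings = pd.read_csv("itm_rec_ratings.csv")

# Build a unique interaction key from user, item, and class, joined by dashes,
# so that the same student rating the same topic in two classes is kept apart.
ratings["UID"] = (
    ratings["UserID"].astype(str)
    + "-" + ratings["ItemID"].astype(str)
    + "-" + ratings["Class"].astype(str)
)

# Sanity check: UID should now uniquely identify every user-item interaction.
assert ratings["UID"].is_unique
```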

Table 2 Sample dataset for ITM-Rec rating.

Furthermore, the dataset was cleaned to remove inconsistencies and improve its quality. The average ratings for App, Data, Ease, and Overall are 3.421, 3.390, 3.177, and 3.374, respectively. The statistical analysis in Table 3 shows that all aspect ratings are positively correlated with the overall rating; App has the strongest influence on the overall rating, while Ease has the least.
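The criteria-overall correlations summarized in Table 3 can be reproduced with a single pandas call, assuming the rating columns in the cleaned DataFrame are named App, Data, Ease, and Overall:

```python
# Pearson correlation matrix over the four rating columns (cf. Table 3).
corr = ratings[["App", "Data", "Ease", "Overall"]].corr(method="pearson")
print(corr.round(3))
```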

Table 3 Pearson correlation matrix of the ITM-Rec dataset.

Proposed deep learning-based MCRSs framework

DeepFM combines FM for low-order feature interactions with a DNN for high-order dependencies, enabling it to capture both explicit and complex non-linear relationships. In this hybrid structure, the FM component explicitly models pairwise feature interactions, while the DNN component learns intricate, higher-order patterns that traditional methods like SVD++ cannot fully address. By unifying these two approaches, DeepFM-SVD++ gains a better understanding of multi-dimensional learner preferences in TEL, leading to improved predictive performance and greater recommendation accuracy in multi-criteria contexts. Additionally, the ability to learn from limited data (via embeddings) helps mitigate cold-start and data sparsity issues more effectively than purely factorization-based models. Our proposed aggregation function approach is structured into four key phases, as outlined below:

  • Phase 1: Decompose the multi-criteria rating dataset from an n-dimensional rating problem into n single-criterion rating problems. To predict the known ratings from UserID and ItemID, we used two traditional collaborative filtering techniques, SVD and SVD++.

  • Phase 2: In this phase, the aggregation function f is learned using the DeepFM technique. A DeepFM model is trained to relate the known criteria ratings to the known overall rating, thereby capturing f.

  • Phase 3: The overall rating of each unrated item is computed from the n predicted individual criteria ratings using the aggregation function f.

  • Phase 4: This is the recommendation phase. The calculated predictions are used to support the user’s decision. The user is recommended a set of items with the highest predicted ratings. Figure 5 summarizes the working process of our proposed approach.

Fig. 5 Proposed DeepFM-based multi-criteria recommender systems framework.
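To make the four phases concrete, the sketch below shows one possible top-N recommendation loop. The helper names and the interfaces of the per-criterion predictors (Surprise-style SVD/SVD++ models) and the aggregator (any fitted regressor standing in for the DeepFM component) are illustrative assumptions, not a verbatim transcription of our implementation:

```python
import numpy as np

def predict_criteria(user_id, item_id, criterion_models):
    """Phase 1: one single-criterion predictor (e.g. SVD or SVD++) per criterion."""
    return np.array([m.predict(user_id, item_id).est for m in criterion_models])

def recommend_top_n(user_id, candidate_items, criterion_models, aggregator, n=10):
    """Phases 3-4: combine the predicted criteria ratings with the learned
    aggregation function f, then return the n items with the highest scores."""
    scores = []
    for item_id in candidate_items:
        criteria = predict_criteria(user_id, item_id, criterion_models)
        # Phase 2 model: aggregator was trained to map criteria ratings to the overall rating.
        overall = float(np.ravel(aggregator.predict(criteria.reshape(1, -1)))[0])
        scores.append((item_id, overall))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]
```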

Implementation of the proposed approach

The proposed system was implemented in Python, which offers considerable power and versatility for deep learning. In implementing our DeepFM model, we used several Python libraries, including TensorFlow, Keras, and Scikit-learn. These libraries collectively enabled us to build, train, and evaluate the model, harnessing the capabilities of both DNNs and traditional machine learning techniques. For our proposed model, we set the embedding dimension to 10 and applied L2 regularization with a strength of 0.0001. Our Dense layer consisted of 128 units, offering a balance between model complexity and computational efficiency.
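A condensed Keras sketch of this configuration (embedding size 10, L2 strength 0.0001, one 128-unit dense layer) is given below; the layer naming and the exact wiring of the FM part are illustrative rather than a verbatim copy of our implementation:

```python
from tensorflow.keras import Model, layers, regularizers

def build_deepfm(n_users, n_items, embedding_dim=10, l2_strength=1e-4):
    user_in = layers.Input(shape=(1,), name="user_id")
    item_in = layers.Input(shape=(1,), name="item_id")
    reg = regularizers.l2(l2_strength)

    # Latent factor embeddings for users and items.
    user_vec = layers.Flatten()(layers.Embedding(n_users, embedding_dim, embeddings_regularizer=reg)(user_in))
    item_vec = layers.Flatten()(layers.Embedding(n_items, embedding_dim, embeddings_regularizer=reg)(item_in))

    # Bias terms, represented as separate 1-dimensional embedding layers.
    user_bias = layers.Flatten()(layers.Embedding(n_users, 1)(user_in))
    item_bias = layers.Flatten()(layers.Embedding(n_items, 1)(item_in))

    # FM-style second-order interaction: dot product of the latent vectors.
    fm_term = layers.Dot(axes=1)([user_vec, item_vec])

    # Deep component: a 128-unit dense layer over the concatenated embeddings.
    deep = layers.Concatenate()([user_vec, item_vec])
    deep = layers.Dense(128, activation="relu", kernel_regularizer=reg)(deep)
    deep_out = layers.Dense(1)(deep)

    # Combine the parts, squash to [0, 1] with a sigmoid, then rescale to [1, 5].
    logit = layers.Add()([fm_term, deep_out, user_bias, item_bias])
    scaled = layers.Activation("sigmoid")(logit)
    out = layers.Lambda(lambda x: 1.0 + 4.0 * x)(scaled)
    return Model(inputs=[user_in, item_in], outputs=out)
```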

To optimize the model, we employed the Adam optimizer with a learning rate of 0.01. In the training phase, the model was updated using mini-batches of 32 samples. The model was trained for a maximum of 45 epochs, during which we used two callbacks, ‘ReduceLROnPlateau’ and ‘EarlyStopping’, to prevent overfitting and improve convergence. A sigmoid activation function was used for the output layer, which scales the model’s predictions to the range [0, 1]; a Lambda layer then transforms these values to the ITM-Rec rating scale of [1, 5]. For robust evaluation, we conducted K-fold cross-validation, in which the data is partitioned into K subsets that alternately serve for training and validation. Furthermore, we incorporated bias terms, which are crucial for capturing user and item biases; these were represented as separate embedding layers within the model.
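Continuing the sketch above, the training setup (Adam at a learning rate of 0.01, mini-batches of 32, at most 45 epochs, and the two callbacks) could look as follows; the loss function, callback patience values, and the placeholder training arrays are assumptions:

```python
from tensorflow.keras import callbacks, optimizers

# Sizes are illustrative, taken from the ITM-Rec statistics (476 users, 70 items).
model = build_deepfm(n_users=476, n_items=70)
model.compile(optimizer=optimizers.Adam(learning_rate=0.01), loss="mse", metrics=["mae"])

training_callbacks = [
    # Reduce the learning rate when the validation loss plateaus.
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    # Stop early and keep the best weights to limit overfitting.
    callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]

history = model.fit(
    [train_user_ids, train_item_ids], train_ratings,          # placeholder arrays
    validation_data=([val_user_ids, val_item_ids], val_ratings),
    epochs=45, batch_size=32, callbacks=training_callbacks,
)
```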

In the implementation of SVD++ and SVD, we used the Surprise library. We initiated a reader object, explicitly specifying the rating scale within the range of 1 to 5. We then defined a parameter grid for hyperparameter tuning, encompassing three hyperparameters: the number of training epochs (n_epochs: 20, 30, 40), the learning rate (lr_all: 0.005, 0.01, 0.05), and the regularization parameter (reg_all: 0.06, 0.1, 0.2, 0.5). To perform hyperparameter tuning, we employed GridSearchCV, configuring it with the algorithms and the predefined parameter grid. We then fitted the grid search to our training data, allowing it to explore various combinations of hyperparameters, and obtained the best parameters and the corresponding best estimator (model). Taking the best estimator, we trained it on our training set, enabling the model to capture patterns and relationships within the data, and subsequently used it to make predictions on our test set.
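A compact version of this tuning and evaluation procedure with the Surprise library is sketched below; the DataFrame column names and the number of cross-validation folds are assumptions:

```python
from surprise import Dataset, Reader, SVDpp, accuracy
from surprise.model_selection import GridSearchCV, train_test_split

# Rating scale of 1 to 5, as in the ITM-Rec and Yahoo Movies datasets.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[["UserID", "ItemID", "Overall"]], reader)

param_grid = {
    "n_epochs": [20, 30, 40],
    "lr_all": [0.005, 0.01, 0.05],
    "reg_all": [0.06, 0.1, 0.2, 0.5],
}
gs = GridSearchCV(SVDpp, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# Retrain the best estimator on a training split and evaluate on the held-out test set.
best_model = gs.best_estimator["rmse"]
trainset, testset = train_test_split(data, test_size=0.2)
best_model.fit(trainset)
predictions = best_model.test(testset)
accuracy.rmse(predictions)
```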

Performance evaluation metrics

We evaluate the recommendation accuracy and performance based on the top-N recommendation task. We used six evaluation metrics to measure performance, as follows:

1. Mean absolute error (MAE): The MAE was used to measure the rating prediction accuracy of our proposed recommendation approach. MAE estimates the deviation between the predicted ratings and the actual ratings. Equation 11 shows how MAE is calculated, where n is the number of ratings in the test set, and \({x}_{j}^{\prime}\) and \({x}_{j}\) are the predicted and actual ratings, respectively.

$$MAE= \frac{1}{n}\sum_{j=1}^{n}\left|{x}_{j}{\prime}-{x}_{j}\right|$$

(11)

2. Root mean square error (RMSE): RMSE was also used to measure the rating prediction accuracy of our proposed system. RMSE squares the errors before averaging and taking the square root, thereby emphasizing larger errors. Equation 12 shows how RMSE is calculated, where n is the number of ratings in the test set, and \({x}_{j}^{\prime}\) and \({x}_{j}\) are the predicted and actual ratings, respectively.

$$RMSE= \sqrt{\frac{1}{n}\sum_{j=1}^{n}{\left({x}_{j}{\prime}-{x}_{j}\right)}^{2}}$$

(12)
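In code, both prediction-error measures reduce to a few NumPy operations over the predicted and actual test-set ratings:

```python
import numpy as np

def mae(actual, predicted):
    """Equation 11: mean absolute deviation between predicted and actual ratings."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.mean(np.abs(predicted - actual))

def rmse(actual, predicted):
    """Equation 12: square root of the mean squared deviation."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((predicted - actual) ** 2))
```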

3. F1-score: This was used to measure the usage prediction; it is the harmonic mean of precision and recall. Precision is the fraction of true positive predictions over the total number of positively predicted recommendations. Recall is the fraction of a specific user’s relevant items that are correctly predicted as positive25. Equations 13, 14, and 15 show how the precision, recall, and F1-score are calculated, respectively.

$$Precision=\frac{\text{Number of true positives}}{\text{Number of true positives}+\text{Number of false positives}}$$

(13)

$$Recall=\frac{\text{Number of true positives}}{\text{Number of true positives}+\text{Number of false negatives}}$$

(14)

$$F1\text{-}score=2\times \frac{precision \times recall}{precision+recall}$$

(15)
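A minimal sketch of these usage-prediction measures for a single user’s top-N list is given below; representing the recommended and relevant items as sets is an implementation choice, not part of the original formulation:

```python
def precision_recall_f1(recommended, relevant):
    """Equations 13-15 for one user.
    recommended: set of item ids in the top-N list; relevant: set of truly relevant item ids."""
    true_positives = len(recommended & relevant)
    precision = true_positives / len(recommended) if recommended else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```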

4. Mean average precision (MAP): MAP was also used to measure the usage prediction; it computes the Average Precision (AP) across different levels of recall. Equations 16 and 17 show how the AP and MAP are calculated, where N is the total number of relevant items among the list of recommended items.

$$AP= \frac{\sum_{j=1}^{n}precision\left(j\right)\times recall\left(j\right)}{\text{number of relevant items}}$$

(16)

$$MAP= \frac{\sum_{i=1}^{N}{AP}_{i}}{N}$$

(17)

5. Area under the curve (AUC): This was used to measure the ranking accuracy. The AUC of a receiver operating characteristics (ROC) curve measures how accurately the proposed algorithm separates predictions into relevant and irrelevant by plotting sensitivity against specificity. Equation 18 shows how \({AUC}_{u}\) is computed, where \({rank}_{ui}\) is the position of the ith relevant item in the list of N recommended items, and \({tp}_{u}\) is the number of true positives for user u.

$${AUC}_{u}= \frac{1}{N}\left[\left(\sum_{i=1}^{{tp}_{u}}{rank}_{ui}\right)+\left(\frac{{tp}_{u}+1}{2}\right)\right]$$

(18)

6. Fraction of concordant pairs (FCP): This was also used to measure the ranking accuracy of our proposed system. Concordant pairs are pairs of predicted ratings \({r}_{ui}^{\prime}\) and \({r}_{uj}^{\prime}\) for items i and j such that, if \({r}_{ui}^{\prime}>{r}_{uj}^{\prime}\), the corresponding actual ratings from the dataset also satisfy \({r}_{ui}>{r}_{uj}\); otherwise, items i and j form a discordant pair. The number of concordant pairs is \({n}_{c}=\sum_{u\in U}\left|\left\{(i,j):{r}_{ui}^{\prime}>{r}_{uj}^{\prime}\wedge {r}_{ui}>{r}_{uj}\right\}\right|\), and the corresponding number of discordant pairs is \({n}_{d}=\sum_{u\in U}\left|\left\{(i,j):{r}_{ui}^{\prime}>{r}_{uj}^{\prime}\wedge {r}_{ui}\le {r}_{uj}\right\}\right|\). FCP is calculated as shown in Eq. 19.

$$FCP=\frac{{n}_{c}}{{n}_{c}+{n}_{d}}$$

(19)
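The FCP computation amounts to counting, per user, the item pairs whose predicted ordering agrees or disagrees with the actual ordering; the handling of ties below is an assumption, since tied pairs are not covered by the definitions above:

```python
from itertools import combinations

def fcp(per_user_pairs):
    """Equation 19. per_user_pairs maps each user to a list of
    (predicted_rating, actual_rating) tuples over that user's test items."""
    n_c = n_d = 0
    for pairs in per_user_pairs.values():
        for (pred_i, actual_i), (pred_j, actual_j) in combinations(pairs, 2):
            if pred_i == pred_j or actual_i == actual_j:
                continue  # ties counted as neither concordant nor discordant
            if (pred_i > pred_j) == (actual_i > actual_j):
                n_c += 1  # concordant pair
            else:
                n_d += 1  # discordant pair
    return n_c / (n_c + n_d) if (n_c + n_d) else 0.0
```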
