This section explains how to evaluate the models generated by WSO2 ML with regard to their accuracy. The following topics are covered.
Terminology of Binary Classification Metrics
Binary classification metrics are based on the following two formulas, which are used to calculate the reliability of a binary classification model.
| Metric | Formula |
|---|---|
| True Positive Rate (Sensitivity) | TPR = TP / P = TP / (TP + FN) |
| True Negative Rate (Specificity) | SPC = TN / N = TN / (TN + FP) |
The following table explains the abbreviations used in the above formulas.
| Abbreviation | Term | Description |
|---|---|---|
| P | Positives | The total number of positive items (i.e., the total number of items that actually belong to the positive class). |
| N | Negatives | The total number of negative items (i.e., the total number of items that actually belong to the negative class). |
| TP | True Positives | Data items that belong to the positive class and are correctly predicted as positive. |
| FP | False Positives | Data items that belong to the negative class but are incorrectly predicted as positive. |
| TN | True Negatives | Data items that belong to the negative class and are correctly predicted as negative. |
| FN | False Negatives | Data items that belong to the positive class but are incorrectly predicted as negative. |
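The two formulas above translate directly into code. The following is a minimal sketch with hypothetical counts (the function names and numbers are illustrative, not part of WSO2 ML):

```python
def sensitivity(tp, fn):
    """True Positive Rate: TPR = TP / (TP + FN)"""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True Negative Rate: SPC = TN / (TN + FP)"""
    return tn / (tn + fp)

# Hypothetical counts for illustration:
print(sensitivity(tp=40, fn=10))  # 0.8
print(specificity(tn=45, fp=5))   # 0.9
```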
Model evaluation measures
The following methods are used to evaluate the performance of models in terms of accuracy.
Predicted vs Actual
Precision and Recall
The confusion matrix is a table layout that visualises the performance of a classification model by displaying the actual and predicted points in the relevant grid. The confusion matrix for a binary classification is as follows:
The confusion matrix for a multi-class classification (with n classes) is as follows:
This matrix allows you to identify which points are correctly classified and which are incorrectly classified. The cells where the actual and predicted classes match contain the correct predictions, and the number of points in these cells should be maximised for greater accuracy. The green cells in the above images mark the correctly classified points. In an ideal scenario, all other cells would contain zero points.
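A confusion matrix can be tallied directly from the lists of actual and predicted labels. The following is a minimal sketch (the class labels used here are hypothetical):

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) label pairs into a nested dict,
    with actual classes as the outer keys and predicted classes
    as the inner keys."""
    counts = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    return {a: {p: counts[(a, p)] for p in classes} for a in classes}

# Hypothetical labels for illustration:
actual    = ["yes", "yes", "no", "no"]
predicted = ["yes", "no",  "no", "no"]
cm = confusion_matrix(actual, predicted)
print(cm["yes"]["yes"])  # 1 (correctly classified "yes" points)
print(cm["yes"]["no"])   # 1 (a "yes" point misclassified as "no")
```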
The following is an example of a confusion matrix with both correctly classified points as well as incorrectly classified points.
The accuracy of a model can be calculated using the following formula.
Accuracy = Correctly Classified Points / Total Number of Points
For a binary classification model, this can be calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN) = (TP + TN) / (P + N)
For example, the accuracy can be calculated as follows based on the confusion matrix example above.
Correctly classified points = 12 + 16 + 16 = 44
Total number of points = 12 + 16 + 16 + 1 + 1 + 1 = 47
Accuracy = 44 / 47 = 93.62%
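The calculation above can be sketched in code. The matrix below reproduces the worked example (44 correct points on the diagonal, 3 incorrect); the placement of the three off-diagonal points is an assumption for illustration:

```python
def accuracy(confusion):
    """Accuracy = correctly classified points / total points.
    `confusion` is a square matrix (list of rows) with actual
    classes as rows and predicted classes as columns."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Three-class example: 12, 16 and 16 on the diagonal, plus one
# misclassified point per row (off-diagonal placement assumed).
cm = [[12, 1, 0],
      [0, 16, 1],
      [1, 0, 16]]
print(round(accuracy(cm) * 100, 2))  # 93.62
```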
You can find this metric for classification models in the model summary as shown in the above image.
This illustrates the performance of a binary classifier model by plotting the TPR (True Positive Rate) against the FPR (False Positive Rate, which is equal to 1 - SPC) for different threshold values. A completely accurate model would pass through the (0, 1) coordinate (i.e., an FPR of 0 and a TPR of 1) in the upper left corner of the plot. However, this is not achievable in practical scenarios. Therefore, when comparing models, the model with the ROC curve closest to the (0, 1) coordinate can be considered the best performing model in terms of accuracy. The best threshold for the model is the one associated with the point on the ROC curve closest to the (0, 1) coordinate. You can find the ROC curve for a particular binary classification model under the model summary in the WSO2 ML UI.
AUC (Area Under the Curve) is another accuracy metric for a binary classification model that is associated with the ROC curve. A model with greater accuracy has an AUC (the area under the ROC curve) closer to 1. Therefore, when comparing the accuracy of multiple models using the AUC, the one with the highest AUC can be considered the best performing model.
You can find the AUC value for a particular model in its ROC curve in the model summary (see the image of the ROC curve in the previous section, labelled ROC Curve (AUC = 0.619)).
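The AUC can also be interpreted as the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative item. The following minimal sketch computes it directly from that definition by pairwise comparison (the labels and scores are hypothetical; ties count as half):

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison: the fraction of
    (positive, negative) pairs where the positive item
    is scored higher (ties count 0.5)."""
    pos = [s for lbl, s in zip(labels, scores) if lbl == 1]
    neg = [s for lbl, s in zip(labels, scores) if lbl == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores:
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```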
This chart visualizes the importance (weight) of each feature in the model according to its significance in creating the final model. In regression models (numerical predictions), each of these weights represents the amount by which the response variable changes when the respective predictor variable is increased by one unit. By looking at this chart, you can make feature-selection decisions. This chart type is available for both binary classification and numerical prediction models.
Predicted vs Actual
This chart plots the data points according to the correctness of the classification. You can select two dataset features to be visualized and the plot will display data distribution with the classification accuracy (correct/incorrect) for each point.
MSE (Mean Squared Error) is the average of the squared errors of the prediction. An error is the difference between the actual value and the predicted value. Therefore, a better performing model should have a comparatively lower MSE. This metric is widely used to evaluate the accuracy of numerical prediction models. You can find this metric for numerical prediction models in the model summary as shown in the above image.
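The MSE definition above can be sketched as follows (the sample values are hypothetical):

```python
def mean_squared_error(actual, predicted):
    """MSE: the average of the squared differences between
    actual and predicted values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(e * e for e in errors) / len(errors)

# Hypothetical actual vs predicted values:
print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))
# (0.25 + 0 + 0.25) / 3, i.e. about 0.1667
```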
The residual plot shows the residuals on the y-axis and a predictor variable (feature) on the x-axis. A residual is the difference between the observed (actual) value and the predicted value of the response variable. A model can be considered accurate when the residual points are:
Randomly distributed (do not form a pattern)
Centered around zero on the vertical axis (indicating that there are equal numbers of positive and negative values)
Closely distributed around zero on the vertical axis (indicating that there are no very large positive or negative residual values)
If the above conditions are not satisfied, it is possible that some missing or hidden factors (predictor variables) have not been taken into account. The residual plot is available for numerical prediction models. You can select a dataset feature to be plotted against its residuals.
Precision and Recall
Precision and Recall are performance measures used to evaluate search strategies. They are typically used in document retrieval scenarios.
When a search is carried out on a set of records in a database, some of the records are relevant to the search and the rest are irrelevant. However, the set of records actually retrieved may not perfectly match the set of records that are relevant to the search. Based on this, Precision and Recall can be described as follows.
| Measure | Description | Formula |
|---|---|---|
| Precision | The fraction of selected items that are relevant. | TP / (TP + FP) |
| Recall | The fraction of relevant items that are selected. This is the same as the TPR. | TP / (TP + FN) |
The F1 Score is the harmonic mean of Precision and Recall. It is expressed as a value between 0 and 1, where 0 indicates the worst performance and 1 indicates the best performance.

F1 = 2TP / (2TP + FP + FN)
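The three measures can be sketched together as follows; the counts used here are hypothetical:

```python
def precision(tp, fp):
    """Fraction of selected items that are relevant."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant items that are selected (same as TPR)."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); equivalent to the harmonic
    mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical counts for illustration:
print(precision(tp=8, fp=2))        # 0.8
print(recall(tp=8, fn=8))           # 0.5
print(f1_score(tp=8, fp=2, fn=8))   # 16/26, about 0.615
```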