Research Article  Open Access
Cuiqing Zhang, Maojun Zhang, Xijun Liang, Zhonghang Xia, Jiangxia Nan, "Perceptron Ranking Using Interval Labels with Ramp Loss for Online Ordinal Regression", Mathematical Problems in Engineering, vol. 2020, Article ID 8866257, 15 pages, 2020. https://doi.org/10.1155/2020/8866257
Perceptron Ranking Using Interval Labels with Ramp Loss for Online Ordinal Regression
Abstract
Due to its wide applications and learning efficiency, online ordinal regression using perceptron algorithms with interval labels (PRIL) has been increasingly applied to solve ordinal ranking problems. However, it is still a challenge for the PRIL method to handle noise labels, in which case the ranking results may change dramatically. To tackle this problem, in this paper, we propose noiseresilient online learning algorithms using ramp loss function, called PRILRAMP, and its nonlinear variant KPRILRAMP, to improve the performance of PRIL method for noisy data streams. The proposed algorithms iteratively optimize the decision function under the framework of online gradient descent (OGD), and we justify the algorithms by showing the order preservation of thresholds. It is validated in the experiments that both approaches are more robust and efficient to noise labels than stateoftheart online ordinal regression algorithms on realworld datasets.
1. Introduction
Ordinal regression, also called ranking learning, plays a central role in the learning task where the labels of data samples need to be ordered. It has been routinely used in social ranking tasks, e.g., collaborative filtering [1], ecology [2], and detecting the severity of Alzheimer disease [3], to name a few. In contrast to other types of regression analysis, ordinal regression describes the relationship between variables where the order matters. For example, the label of age category can be sorted as (0–9, 10–19, , 90–99) and the levels of bond credits can be sorted as “B” “A” “AA” “AAA” [4]. In general, ordinal regression requires a linear (nonlinear) function and a set of thresholds, where each threshold corresponds to a class. However, ordinal regression also distinguishes multiclassification in the sense that the output labels have a natural order.
A seminal work of ordinal regression can be traced back to the proportional odds model (POM) proposed by McCullagh [5], in which a general class of regression models for ordinal data was studied. Later, large margin formulations for ordinal regression were proposed in [6, 7]. Both approaches outperformed existing benchmarks when they were applied to ranking and multiclass classification. Furthermore, to accelerate the training, Gu et al. [8] proposed asynchronous parallel coordinate descent algorithms. Besides, Berg et al. [9] showed that training a classifier can improve neural network accuracy and proposed that using several discrete data representations simultaneously can improve neural network learning compared to a single representation. In many situations, interval labels have to be used instead of the exact label. As an example of predicting product ratings, we get an entire range of scores (e.g., 1–3 and 4–7) from different customers. In [10], a large margin batch algorithm was proposed by using interval labels. The approaches discussed so far are batch algorithms. Recently, in order to reduce computational time, Crammer et al. [11] proposed an online learning algorithm, passive aggressive (PA) algorithm, for ordinal regression, which chooses an appropriate step size to find the new parameters in every trial. Later, Manwani and Chandra [12] employed the PA algorithm to solve problems with interval labels. Perceptron is another principled method of learning classifiers in online fashion. The authors in [13, 14] proposed an online algorithm (with similar principles to the classic perceptron used for 2class separation) for finding the set of parallel hyperplanes which would comply with the separation rule. In [15], perceptronbased approach was proposed for ordinal regression using interval labels.
Although existing works have made huge progress in reducing memory requirement and computational time, training models on the dataset with poor quality, e.g., having a lot of noise, remains challenging. Existing scene classification algorithms predominantly focus on static data and are designed to learn discriminant information from clean data. They, however, suffer from two major shortcomings, i.e., the noisy label may negatively affect the learning procedure, and learning from scratch may lead to a huge computational burden. The authors in [16] propose a noiseresilient online classification algorithm, which is scalable and robust to noisy labels and applied it to peptide identification [17]. To reduce the negative influence of outliers, the authors in [18] propose a more robust algorithm termed as ramp loss for twin Kclass support vector classification (RampTKSVC) where the ramp loss function was used to substitute the Hinge loss function.
In this work, we aim to study a noiseresilient online learning algorithm to promote the online ordinary regression problem for noisy data streams. Particularly, we propose perceptron ranking algorithms using interval labels with ramp loss (PRILRAMP) and its kernel variant (KPRILRAMP) for ordinal regression in an online learning manner. Our key contributions are as follows:(1)Propose linear and nonlinear noiseresilient online algorithms, i.e., PRILRAMP and KPRILRAMP, and design a procedure to update the model parameters. The optimal parameters are obtained by online gradient descent (OGD) procedure.(2)It is theoretically proved that the proposed the PRILRAMP algorithm keeps the thresholds sequentially.(3)Practical experimental studies on various datasets demonstrate the effectiveness of the proposed algorithms by comparing them with stateoftheart algorithms.
The paper is organized as follows. In Section 2, we discuss a generic framework of ordinal regression using interval labels. Furthermore, we introduce the proposed PRILRAMP and KPRILRAMP algorithms and discuss the order preservation of thresholds of the proposed algorithms. In Section 3, we present experiments and illustrate the comparison results between PRILRAMP and the stateoftheart algorithms. We conclude the whole paper in Section 4.
2. Method
2.1. Learning to Rank in Online Ordinary Regression
Ordinary regression has been successfully applied to a great number of realworld problems. Typical ordinary regression trains models in a batch manner, i.e., feeding all data into training model at once [6–8]. The batch training manner demands huge computation and memory for largescale problems, and it is also not adaptable to streaming data. Comparing with the traditional batch learning framework, online learning (shown in Figure 1) learns samples in a streaming fashion so that it can be scaled and respond in realtime. We first introduce the online ordinary regression of the perceptron algorithm and its variant, and then present noiseresilient online learning algorithms PRILRAMP and KPRILRMAP in the next section.
Let be the instance space, the label space, and a sequence of instancerank pairs, where and are its corresponding rank, . We define with “” as the order relation. Every instance is assigned an interval label . Particularly, it becomes the exact label scenario when . Let be the training set. In ordinary regression, our objective is to learn model , which also defined as the upper interval endpoint:where function : is defined as , and thresholds are.
The lower interval endpoint can be defined similarly. Let denote the implicit constraints for ordering of thresholds in , and stands for the interval [8]. Then, we have the loss function defined as follows [13]:
As the loss function is convex for each , the sum of them is convex. Similarly, the sum of , is also convex. Note that the loss tends to if there is a large amount of noise. Loss function is primarily designed to learn information from clean data as the overall performance of the learning model degrades with interference of label noise.
2.2. Perceptron Ranking Using Interval Label Algorithms with Ramp Loss
We start the presentation of noiseresilient online learning algorithm with the PRIL algorithm and its variants.
2.2.1. PRIL Algorithm
Perceptron ranking using interval labels (PRILs) was first introduced in [13] for ordinal regression in online fashion. Let be the observed example, , and , which are the parameters of the online ordinary at time . Let . Then, we definewhere is an indicator function which takes value 1 if A is true. Thus and . Let . For given sampleinterval pair, the loss only when
Therefore, requires that . The perceptron ranking loss function in linear situation can be rewritten as
2.2.2. PRILRAMP and KPRILRAMP Algorithms
Now, we present the PRILRAMP procedure to minimize the estimated risk using the OGD structure [14] of ramp loss in online ordinary regression.
(1) Ramp Loss. To offset the influence of noise data, we apply ramp loss as a surrogate of the perceptron loss function. The surrogate of perceptron loss function is ramp loss is defined as
Notice that the ramp loss is a noiseresilient function, which makes the contribution of noise loss zero when tends to . In classification methodologies, robustness to noise is always an important issue. The effect of noise samples can be significantly large since the penalty given to the outliers by the perceptron loss is quite huge, as any convex loss is unbounded. On the other hand, the ramp loss has an upper bound, and hence, it can control the effect of noisy samples and remove the effect of noise. Plots of perceptron loss and ramp loss in Figure 2 show the robustness (noiseresilient) of the ramp loss.
In addition, the ramp loss is a difference of convex (DC) function and can be represented as the difference of two convex functions:where is perceptron loss function, as illustrated in Figure 3.
(2) PRILRAMP Algorithm. In the setting of ordinal regression, instead of considering all the categories to contribute errors to threshold, we allow the ramp loss function to update the thresholds. Ordinal inequalities on thresholds are satisfied automatically at the optimal solution.
Figure 4 uses an example to illustrate the update rule of PRILRAMP algorithm. In this example, we set . Note that is omitted from all the plots in Figure 4. The correct rank of the instance is , and thus, the value of should fall in the last interval, between and . However, as you can see in Figure 4, the value of falls below interval , and the predicted rank is . Threshold values of , and are errors since the values of , and are higher than . To mend the mistake, the algorithm decreases the values of by a unit and replace them with and . It also modifies to be since . Thus, the innerproduct increases by . This update is illustrated at the middle plot of Figure 4. The updated prediction rule is sketched on the righthand side. Note that after the update, predicted rank of is which is closer to the true rank . We allow a small error of one unit. Note that the data in the interval label (in red) are noise. By replacing perceptron loss with ramp loss, PRILRAMP guarantees that the parameters of this data with scoring are not updated, which significantly controls the effect of noisy data.
Next, we describe the procedure in PRILRAMP to update the parameters for predicting the label:where
Note that is a convex function while is a concave function. Let , then we have
The concaveconvex procedure (CCCP) [19] can be used to obtain the optimal solution to the DC function. However, as is mentioned above, this batch learning manner cannot meet the realtime requirement when dealing with streaming data. In this work, we use the OGD method to find a nearoptimal solution. It is a tradeoff between the accuracy and scalability. Particularly, we have
To estimate and , we initialize and . Let and be the estimates of the parameters in the beginning of trial , be the example observed, and be its label interval. and are estimated as follows:
Let , then we have
Note that sets are known at every trial . The complete description of the PRILRAMP algorithm is as given in Algorithm 1.

(3) KPRILRAMP Algorithm. For nonlinear ordinary regression, we choose the suitable kernel function as proposed by Manwani [15]. That is,where :
The Kernel PRIL algorithm with ramp loss (KPRILRAMP) for PRILRAMP is discussed in Algorithm 2.

2.3. Theoretical Analysis of PRILRamp Algorithm
Considering the property of PRIL algorithms, we now show that the PRILRAMP algorithm inherently maintains the ordering of thresholds at each iteration.
Theorem 1. Order preservation of thresholds in the PRILRAMP algorithm: let and be the thresholds at trial . Let be the instance at trial and be its corresponding interval label. Let be the updated thresholds using PRILRAMP. Then, .
Proof. We need to analyse four different cases as follows:(1)We know that . PRILRAMP does not update thresholds in this range. Thus, .(2): in this case, is in the update rule. Thus, using the fact that , we get(3): we see that Thus, there can be two cases only.(a): we simply get .(b): since , we get . That means However, . Thus,(4): we see that(a): we simply get .(b): since , we get . That meansHowever, . Thus,This completes proof.
Similar demonstrations can be given for the KPRILRAMP algorithm, and hence, we skip its proof. Theorem 1 shows that the KPRILRAMP algorithm can keep the thresholds in sequential.
3. Experiments
We evaluate performance of the PRILRAMP algorithm by comparing it with other benchmark methods on datasets with various ratios of noise. All experimental studies are performed in MATLAB R2016a environment on a PC with 2.5 GHz Intel Core i5 processors and 8 GB RODRAM running under the Windows 10 operating system.
3.1. Dataset
All datasets are obtained from UCI (http://archive.ics.uci.edu/ml/) machine learning repository [20] and LIBSVM website, and data features are normalized to zero mean and unit variance coordinatewise. More details about the datasets are provided as follows: Abalone: the dataset contains 4177 instances with 8 attributes, related to the physical measurement of Abalone found in Australia. A typical task based on the dataset is to predict the age of the Abalone using the “Rings” attribute which varies from 1–29. We divide the target variable 1–29 into 4 intervals as 1–7, 8–9, 10–12, and 13–29 [12]. Parkinsonsupdrs: the dataset consists of a range of biomedical measurements by voice with earlystage Parkinson’s disease. There are 5847 instances with 21 features. The target variable of this dataset is “total UPDRS” (Clinician’s total UPDRS score) for the instance which varies from 7 to 54.992. There are 5 groups of target variable 7–54.992: = (17, 17, 27, 37, 54.992) [12]. Real estate valuation: the real estate valuation is collected from Sindian District, New Taipei City, Taiwan. This dataset has 414 instances with 7 attributes. We focus on predicting the house price of unit area which ranges from 0 to 200. The created 4 intervals are 0–20, 21–40, 41–60, and 61–200.
Table 1 shows the characteristics of the 5 datasets including the number of patterns, attributes, and classes, and also the number of patterns per class. To show the statistical properties of different datasets, we drew a box plot as shown in Figure 5. The xaxis denotes the dimensions of examples, and the yaxis corresponds to the outliers. It indicates that there exists noise in the datasets itself, especially in Parkinsonsupdrs dataset.

(a)
(b)
(c)
3.2. Generating Noise
Noises are generated by choosing instances from the dataset randomly. Firstly, for covariance variables , the error term follows a mixture normal distribution [21]. The visualization of the noise datasets is proved in Figure 6. The red points denote the original examples, and the blue circles are corresponding to the outliers. Then, the label is generated in the target variables. For each of these examples, we randomly assign one of the following intervals, i.e., , where is the actual label. Finally, we consider and 50.
(a)
(b)
(c)
3.3. Kernel Functions
We use the following kernel functions for different datasets [15] for Kernel PRIL. Abalone: Parkinsonsupdrs: Real Estate Valuation:
3.4. Result Comparison
We compare noise tolerance performance of PRILRAMP and KPRILRAMP algorithms with basic PRIL and Kernel PRIL proposed in [15], respectively, where PRIL is perceptronbased approach for ordinal regression using interval labels in online. The difference between PRIL and Kernel PRIL is that the PRIL is linear while Kernel PRIL uses nonlinear kernel function in the experiments.
3.4.1. Accuracy
We use average accuracy as the metrics to evaluate the performance of all algorithms in the experimental studies. Accuracy, the short form of the average accuracy, is defined as (mean zeroone error) [22], where is the true label and is the predicted label. We run all algorithms 10 times and average the instantaneous accuracy across the 10 runs. The values of accuracy output by four algorithms running on different noisy data are reported in Figures 7–9. In the figures, the xaxis represents the number of examples, and the yaxis represents the accuracy on test examples. We select appropriate ramp loss parameter for PRILRAMP and KPRILRAMP and set .
(a)
(b)
(c)
(a)
(b)
(c)
(a)
(b)
(c)
The results of experimental studies are summarized as follows:(1)We see that for all the datasets, accuracy will gradually increase and stabilize as increases. The accuracy in all algorithms decreases as the level of noise increases. The interference of label noise in the datasets does degrade the accuracy of prediction.(2)For different datasets, linear function algorithms (PRIL and PRILRAMP) and nonlinear function kernel algorithms (Kernel PRIL and KPRILRAMP) perform not the same. That is determined by each dataset. Nevertheless, kernel algorithms need to solve a more complex model for ranking problems.(3)In general, PRILRAMP outperforms PRIL. Particularly, KPRILRAMP shows consistently higher test accuracy than Kernel PRIL over all datasets with different noise levels. Also, noiseresilient algorithms (PRILRAMP and KPRILRAMP) and noisesensitive algorithms (PRIL and Kernel PRIL) work equally well in terms of the accuracy in the environment without noise added. Moreover, the gap of accuracy between noisetolerances (PRILRAMP and KPRILRAMP) and noisesensitive (PRIL and Kernel PRIL) algorithms increases along with the level of noise.(4)On Abalone and Parkinson datasets, linear algorithms perform better than kernel methods when noise is added. The prediction accuracies of the PRIL and Kernel PRIL algorithms are relatively low and unstable in datasets with 25% and 50% noise, and this result shows that the two algorithms are very sensitive to noise. Interestingly, the noiseresilient PRILRAMP and KPRILRAMP algorithms show better noisetolerant performance. On the real estate valuation dataset, the accuracies of four algorithms are close when the dataset has 0% noise. However, when noise is added, PRILRAMP and KPRILRAMP outperform Kernel PRIL across all levels of noise. In other words, PRILRAMP and KPRILRAMP are more robust to noise.
Overall, the noiseinsensitive PRILRAMP and KPRILRAMP algorithms significantly outperform basic PRIL and Kernel PRIL, which indicates that ramp lossbased algorithms can effectively handle noisy data.
The major difference between linear algorithms (PRIL and PRILRAMP) and nonlinear (Kernel PRIL and KPRILRAMP) is that PRIL and PRILRAMP use a constant step size whereas nonlinear algorithms choose a more complex step size to find the new and in every trial. In PRILRAMP and KPRILRAMP, the step size is determined by solving an optimization problem of OGD whose ramp loss is a noiseresilient function, which makes the contributes of noise loss become 0 when there is a lot of noise, whereas PRIL and Kernel PRIL take all samples to ensure that the loss on the current example becomes 0. Thus, we see that the proposed PRILRAMP and KPRILRAMP perform better or comparable to other methodologies and are noise resistance.
3.4.2. Noise Statistical Study
We investigate the proposed noiseresilient online ordinal regression algorithm in the case of covariance variable noisy data by considering the order. Specifically, we attempt to apply the statistical measures to answer the question of how effective the proposed PRILRAMP method is in handling data with noise input. In this subsection, we have reported a number of simulation studies on finitesample performance evaluation . Considering order and statistical measure, we estimate the influence relation of prediction MAE (mean square error) [22], RMSE (root mean square error), Spearman’s correlation coefficient, discarded sample, discarded rate, and run time between PRIL, PRILRAMP, and PAI [10] for different noisy data. and , where is predicted rank and is true one. The value ranges from 0 to (maximum deviation in the number of categories). It suggests that the larger the values are, the worse the prediction accuracy is. In ordinal matching, a useful and common measure is Spearman’s rank correlation which measures the correspondence in rank terms of the two distributions. It measures the structural similarity and values range from 0 to 1. The higher the value, the closer it is. The discard sample denotes the samples in the model that has no updating effect, namely, samples that make in each algorithm iteration. We calculate the number of discard samples by tracking the value of . The discard rate is the percentage of the discard sample and the total. For each dataset, we have four noise scenarios to discuss: 0%, 25%, 50%, and 75%.
The results are summarized in Tables 2–5, and as can be seen from the columns of MAE and RMSE, the proposed PRILRAMP method outperforms the competing PRIL and PAI in four noisy data cases. Especially, in the case of a high level (50% and 75%), PRILRAMP significantly outperforms based on the algorithm PRIL. Spearman’s correlation coefficient shows that the correspondence of PRILRAMP in ranking terms of the two distributions is greater in three cases of noise. Empirically, as we increase the ratio of noise, the difference between the two methods in each evaluation index becomes apparent. Due to the intrinsic flaw of hinge loss, the PAI method is highly sensitive to noise. Ramp loss can effectively reduce the impact of noise data in some sense, though except for individual data. Besides, as we increase our sample, the number of discarded samples increases with the increase of noise, compared with PRIL and PAI. Most of the samples are discarded when the noise level is 75%. Not surprisingly, the percentage of the noisy data significantly affects the discarded rate and the results of the algorithms. Empirically, the discarded rate is approximately equal to the noise ratio. Moreover, the estimation of small samples can lead to the instability of the model because the prediction accuracy depends on how much proportion of noise points is used to train the model. It indicates that PRILRAMP is a better noiseresilient method than PAI to deal with noisy data, though the latter is more faster.
