
A kernel method for multi-labelled classification

Abstract

This article presents a Support Vector Machine (SVM) like learning system to handle multi-label problems. Such problems are usually decomposed into many two-class problems, but the expressive power of such a system can be weak [5, 7]. We explore a new direct approach, based on a large margin ranking system that shares many properties with SVMs.

For multi-label linear models, we need to define a way of minimizing the empirical error measured by the appropriate loss and, at the same time, to control the complexity of the resulting model. A direct method would be to use the binary approach and thus take the benefit of good two-class systems. However, as has been pointed out in [5, 7], the binary approach does not take into account the correlations between labels and therefore does not capture the structure of some learning problems. We propose instead to focus on the ranking approach. This will be done by introducing notions of margin and regularization, as has been done for the two-class case in the definition of SVMs.

3 Ranking based system

Our goal is to define a linear model that minimizes the Ranking Loss while having a large margin. For systems that rank the values of $\langle w_k, x \rangle + b_k$, $k = 1, \dots, Q$ ($Q$ being the number of labels), the decision boundaries for $x$ are defined by the hyperplanes whose equations are $\langle w_k - w_l, x \rangle + b_k - b_l = 0$, where $k$ belongs to the label set $Y$ of $x$ and $l$ does not. So the margin of $(x, Y)$ can be expressed as

$$\min_{k \in Y,\ l \in \bar{Y}} \frac{\langle w_k - w_l, x \rangle + b_k - b_l}{\|w_k - w_l\|}$$

where $\bar{Y}$ denotes the complement of $Y$ in $\{1, \dots, Q\}$. It represents the signed $\ell_2$ distance of $x$ to the decision boundary. Considering that all the data in the learning set $S$ are well ranked, we can normalize the parameters $w_k$ such that

$$\langle w_k - w_l, x \rangle + b_k - b_l \geq 1$$

with equality for some $x \in S$ and $(k, l) \in Y \times \bar{Y}$. Maximizing the margin on the whole learning set can then be done via the following problem:

$$\max_{w_j, b_j}\ \min_{(x_i, Y_i) \in S}\ \min_{k \in Y_i,\ l \in \bar{Y}_i}\ \frac{1}{\|w_k - w_l\|^2}$$
$$\text{subject to: } \langle w_k - w_l, x_i \rangle + b_k - b_l \geq 1,\quad (k, l) \in Y_i \times \bar{Y}_i$$

In the case where the problem is not ill-conditioned (no two labels always co-occur), the objective can be replaced by the minimization of $\max_{k,l} \|w_k - w_l\|^2$. In order to get a simpler optimization procedure, we approximate this maximum by the sum and, after some calculations (see [3] for details), we obtain

$$\min_{w_j, b_j}\ \sum_{k=1}^{Q} \|w_k\|^2$$
$$\text{subject to: } \langle w_k - w_l, x_i \rangle + b_k - b_l \geq 1,\quad (k, l) \in Y_i \times \bar{Y}_i,\ i = 1, \dots, m$$

To generalize this problem to the case where the learning set cannot be ranked exactly, we follow the same reasoning as for the binary case: the ultimate goal would be to maximize the margin and at the same time to minimize the Ranking Loss. The latter can be expressed quite directly by extending the constraints of the previous problems. Indeed, if we have $\langle w_k - w_l, x_i \rangle + b_k - b_l \geq 1 - \xi_{ikl}$ for $(k, l) \in Y_i \times \bar{Y}_i$, then the Ranking Loss on the learning set $S$ is

$$\sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l) \in Y_i \times \bar{Y}_i} \theta(-1 + \xi_{ikl})$$

where $\theta$ is the Heaviside function. As for SVMs, we approximate the functions $\theta(-1 + \xi_{ikl})$ by $\xi_{ikl}$ alone, and this gives the final quadratic optimization problem:

$$\min_{w_j, b_j}\ \sum_{k=1}^{Q} \|w_k\|^2 + C \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l) \in Y_i \times \bar{Y}_i} \xi_{ikl}$$
$$\text{subject to: } \langle w_k - w_l, x_i \rangle + b_k - b_l \geq 1 - \xi_{ikl},\quad \xi_{ikl} \geq 0,\quad (k, l) \in Y_i \times \bar{Y}_i$$

In the case where all $|Y_i| = 1$, we find the same optimization problem as has been derived for multi-class Support Vector Machines [8]. For this reason, we call the solution of this problem a ranking Support Vector Machine (Rank-SVM). Another property shared with SVMs is the possibility to use kernels rather than linear dot products. This can be achieved by computing the dual of the former optimization problem. We refer the reader to [3] for the dual formulation and to [2] and references therein for more information about kernels and SVMs.
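For small learning sets, the primal problem above can be handed directly to a generic quadratic programming solver. The following sketch makes the formulation concrete for the linear case; it is our illustration only (cvxpy is an assumed off-the-shelf solver and all function and variable names are ours), not the authors' implementation, which works on the dual.

```python
# A minimal sketch of the linear primal Rank-SVM problem above, solved with a
# generic QP solver via cvxpy. Illustration only; names are ours.
import numpy as np
import cvxpy as cp

def rank_svm_primal(X, Y, Q, C=1.0):
    """X: (m, d) array of inputs; Y: list of m label sets, subsets of range(Q)."""
    m, d = X.shape
    W = cp.Variable((Q, d))          # rows are the w_k
    b = cp.Variable(Q)
    constraints, slacks = [], []
    for i in range(m):
        labels = sorted(Y[i])
        others = [l for l in range(Q) if l not in Y[i]]
        if not labels or not others:
            continue                 # no ranking constraints for this point
        xi = cp.Variable((len(labels), len(others)), nonneg=True)  # xi_ikl >= 0
        for a, k in enumerate(labels):
            for c, l in enumerate(others):
                # <w_k - w_l, x_i> + b_k - b_l >= 1 - xi_ikl
                constraints.append(
                    (W[k] - W[l]) @ X[i] + b[k] - b[l] >= 1 - xi[a, c])
        slacks.append(cp.sum(xi) / (len(labels) * len(others)))
    # sum_k ||w_k||^2 + C * sum_i (1 / (|Y_i||Ybar_i|)) sum_kl xi_ikl
    objective = cp.Minimize(cp.sum_squares(W) + C * sum(slacks))
    cp.Problem(objective, constraints).solve()
    return W.value, b.value

# The learned ranking values are f_k(x) = <w_k, x> + b_k:
# F = X_test @ W_opt.T + b_opt
```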
Solving a constrained quadratic problem like those we just introduced requires an amount of memory that is quadratic in the size of the learning set, and such problems are generally solved in $O(m^3)$ computational steps (we have absorbed the number of labels into the $O$). This complexity is too high to apply these methods to many real datasets. To circumvent this limitation, we propose to use a linearization method in conjunction with a predictor-corrector logarithmic barrier procedure. Details and all the calculations relative to the implementation are described in [3]. The memory cost of the method then becomes $O(mQQ_{\max})$, where $Q_{\max} = \max_i |Y_i|$ is the maximum number of labels; in many applications $Q_{\max}$ is much smaller than $Q$, and the time cost of each iteration is $O(mQQ_{\max})$.

4 Set size prediction

So far we have only developed ranking systems. To obtain a complete multi-label system, we also need to design a set size predictor $s(x)$. A natural way of doing this is to look for inspiration from the binary approach. The latter can indeed be interpreted as a ranking system whose ranks are derived from the real values $(f_1(x), \dots, f_Q(x))$. The predictor of the set size is then quite simple: $s(x) = |\{k : f_k(x) > 0\}|$ is the number of $f_k$ that are greater than 0. The function $s(x)$ is thus computed from a threshold value that differentiates labels in the target set from the others. For the ranking system introduced in the previous section, we generalize this idea by designing a function $s(x) = |\{k : f_k(x) > t(x)\}|$. The remaining problem is to choose $t(x)$, which is done by solving a learning problem. The training data are composed of the values $(f_1(x_i), \dots, f_Q(x_i))$ given by the ranking system and of the target values defined by

$$t(x_i) = \arg\min_t\ \big|\{k \in Y_i \text{ s.t. } f_k(x_i) \leq t\}\big| + \big|\{l \in \bar{Y}_i \text{ s.t. } f_l(x_i) > t\}\big|$$

When the minimum is not unique and the optimal values form a segment, we choose the middle of this segment. We refer to this method of predicting the set size as the threshold based method. In the following, we have used linear least squares, and we applied it not only to Rank-SVM but also to BoosTexter, in order to transform these algorithms from ranking methods into multi-label ones.

Note that we could have followed a much simpler scheme to build the function $s(x)$. A naive method would be to consider the set size prediction as a regression problem on the original training data with the targets $(|Y_i|)_{i=1,\dots,m}$ and to use any regression learning system. This, however, does not provide a satisfactory solution, mainly because it does not take into account how the ranking is performed. In particular, when there are errors in the ranking, it does not learn how to compensate for them, whereas the threshold based approach tries to learn the best threshold with respect to these errors.
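To make the threshold based method concrete, the sketch below computes the target thresholds $t(x_i)$ from the ranking outputs and fits the linear least squares model described above. It is our plain reading of the procedure, and all function names are ours, not code from the paper.

```python
# Sketch of the threshold based set size predictor: compute target thresholds
# t(x_i) from the ranking outputs, then fit them with linear least squares.
import numpy as np

def optimal_thresholds(F, Y):
    """F: (m, Q) ranking outputs f_k(x_i); Y: list of m label sets."""
    m, Q = F.shape
    t = np.empty(m)
    for i in range(m):
        v = np.sort(F[i])
        # Candidate thresholds: midpoints between consecutive sorted outputs,
        # plus one value below the minimum and one above the maximum.
        cands = np.concatenate(([v[0] - 1.0], (v[:-1] + v[1:]) / 2.0, [v[-1] + 1.0]))
        errs = np.array([
            sum(1 for k in Y[i] if F[i, k] <= c) +
            sum(1 for l in range(Q) if l not in Y[i] and F[i, l] > c)
            for c in cands])
        opt = cands[errs == errs.min()]
        t[i] = (opt[0] + opt[-1]) / 2.0  # middle of the optimal segment
    return t

def fit_set_size_predictor(F, Y):
    """Linear least squares from (f_1(x), ..., f_Q(x)) to the target t(x)."""
    A = np.hstack([F, np.ones((len(F), 1))])     # add a bias column
    coef, *_ = np.linalg.lstsq(A, optimal_thresholds(F, Y), rcond=None)
    return coef

def predict_set_size(F, coef):
    t_hat = np.hstack([F, np.ones((len(F), 1))]) @ coef
    return (F > t_hat[:, None]).sum(axis=1)      # s(x) = |{k : f_k(x) > t(x)}|
```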
5 Toy problem

As previously noticed, the binary approach is not appropriate for problems where correlations between labels exist. To illustrate this point, consider figure 2. There are only three labels, and one of them (label 1) is present for all points in the learning set. The binary approach leads to a system that will fail to separate, for instance, points with label 3 from points of label sets not containing 3, that is, from points with labels 1 and 2. We see then that the expressive power of a binary system can be quite low when such simple configurations occur. If we consider the ranking approach instead, one can imagine the following solution: $w_1 = 0$, $b_1 = 0$, $(w_2, b_2)$ is the hyperplane separating class 2 from class 3, and $(w_3, b_3) = -(w_2, b_2)$. By taking the number of labels at point $x$ to be $s(x) = \langle w, x \rangle + b$ with $w = (1, 1)$ and $b = 0$, we obtain a simple multi-label system that separates all the regions exactly.

[Figure 2: Three labels and three regions in the input space. The upper left region is labelled with 1. The bottom right region is partitioned into two sub-regions with labels 1, 2 or 1, 3.]

To make this point more concrete, we sampled 50 points uniformly on $[0, 1]^2$ and solved all optimization problems with $C = \infty$. On the learning set, the Hamming Loss for the binary approach was 0.08, while for the direct approach it was 0, as expected.

6 Experiments on real data

[Figure 3: First level of the hierarchy of the gene functional classes. There are 14 groups. One gene, for instance the gene YAL041W, can belong to different groups (shaded in grey on the figure). Legible group labels in the figure include Metabolism, Transcription, Cell Rescue and Defense, Cell Death, Organization, Signal Transduction and Biogenesis.]

The Yeast dataset is formed by micro-array expression data and phylogenetic profiles, with 1500 genes in the learning set and 917 in the test set. The input dimension is 103. Each gene is associated with a set of functional classes whose maximum size can potentially be more than 190. This dataset has already been analyzed with a two-class approach [6] and is known to be difficult. In order to make it easier, we used the known structure of the functional classes: the whole set of classes is structured into a tree whose leaves are the functional categories (see http://mips.gsf.de/proj/yeast/catalogues/funcat/ for more details). Given a gene, knowing which edge to take from one level to another leads directly to a leaf and thus to a functional class. Here we try to predict which edge to take from the root to the first level of the tree (see figure 3).

Since one gene can have many functional classes, this is a multi-label problem: one gene is associated with several edges. We then have $Q = 14$, and the average number of labels over all genes in the learning set is $4.2 \pm 1.6$.

We assessed the quality of our method from two perspectives. First, as a ranking system, with the Ranking Loss and the precision; in that case, for the binary approach, the real outputs of the two-class SVMs were used as ranking values. Second, the methods were compared as multi-label systems using the Hamming Loss. We computed the latter for the binary approach used in conjunction with SVMs, for the Rank-SVM and for BoosTexter. To measure the Hamming Loss with BoosTexter, we used a threshold based $s(x)$ function in combination with the ranking given by the algorithm.

For Rank-SVMs and for the two-class SVMs in the binary approach, we chose polynomial kernels of degrees two to nine (experiments on two-class problems using the Yeast data in [6] already showed that polynomial kernels are appropriate for this task). BoosTexter was used with the standard stump weak learner and was stopped after 1000 iterations. Results are reported in figures 4, 5 and 6.
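For reference, the measures reported in these figures can be computed as in the following sketch. This is our implementation of the standard definitions from [7], not code from the paper.

```python
# Our implementation of the four evaluation measures, following the standard
# definitions used in [7]. F: (m, Q) real-valued ranking outputs; Y: list of
# m true label sets; Y_pred: list of m predicted label sets (e.g. from the
# set size predictor).
import numpy as np

def hamming_loss(Y_true, Y_pred, Q):
    # Average fraction of labels on which prediction and truth disagree.
    return np.mean([len(set(a) ^ set(b)) / Q for a, b in zip(Y_true, Y_pred)])

def ranking_loss(F, Y):
    # Average fraction of pairs (k in Y, l not in Y) that are ranked wrongly.
    losses = []
    for f, y in zip(F, Y):
        comp = [l for l in range(len(f)) if l not in y]
        if not y or not comp:
            continue
        bad = sum(1 for k in y for l in comp if f[k] <= f[l])
        losses.append(bad / (len(y) * len(comp)))
    return np.mean(losses)

def one_error(F, Y):
    # Fraction of points whose top ranked label is not in the true label set.
    return np.mean([int(np.argmax(f) not in y) for f, y in zip(F, Y)])

def precision(F, Y):
    # Average precision of the ranking over the true labels.
    precs = []
    for f, y in zip(F, Y):
        if not y:
            continue
        order = np.argsort(-f)                       # labels sorted by rank
        rank = {l: r + 1 for r, l in enumerate(order)}
        precs.append(np.mean(
            [sum(1 for l in y if rank[l] <= rank[k]) / rank[k] for k in y]))
    return np.mean(precs)
```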
                       Rank-SVM               |          Binary-SVM
  degree          2      3      4      5      |    2      3      4      5
  Precision     0.703  0.740  0.746  0.762    |  0.692  0.721  0.714  0.753
  Ranking Loss  0.227  0.191  0.190  0.175    |  0.241  0.212  0.196  0.184
  Hamming Loss  0.238  0.217  0.209  0.201    |  0.247  0.224    -      -
  one-error     0.34   0.262  0.255  0.282    |  0.341  0.306  0.267  0.250

Figure 4: Polynomials of degree 2-5. Loss functions for the Rank-SVM and the binary approach based on two-class SVMs. Considering the size of the problem, two values that differ by less than 0.01 are not significantly different. Bold values represent superior performance comparing classifiers with the same kernel.

                       Rank-SVM               |          Binary-SVM
  degree          6      7      8      9      |    6      7      8      9
  Precision     0.765  0.770  0.773  0.769    |  0.760    -      -    0.769
  Ranking Loss  0.170  0.166  0.163  0.163    |  0.176  0.170  0.165  0.164
  Hamming Loss  0.199  0.198  0.196  0.197    |  0.200  0.199  0.195  0.195
  one-error     0.232  0.223  0.217  0.225    |  0.232  0.227  0.218    -

Figure 5: Polynomials of degree 6-9. Loss functions for the Rank-SVM and the binary approach based on two-class SVMs. Considering the size of the problem, two values that differ by less than 0.01 are not significantly different. Bold values represent superior performance comparing classifiers with the same kernel.

  BoosTexter (1000 iterations)
  Precision      0.717
  Ranking Loss   0.298
  Hamming Loss   0.237
  one-error      0.30

Figure 6: Loss functions for BoosTexter. Note that these results are worse than with the binary approach or with Rank-SVM.

Note that BoosTexter performs quite poorly on this dataset compared to the SVM-based approaches. This may be due to the simple decision function realized by BoosTexter. One of the main advantages of the SVM-based approaches is the ability to incorporate prior knowledge into the kernel and to control complexity via the kernel and regularization; we believe this may also be possible with BoosTexter, but we are not aware of any work in this area.

To compare the binary approach and the Rank-SVM, we put in bold the best results for each kernel. For all kernels and for almost all losses, the ranking based SVM approach is better than the binary one. In terms of the Ranking Loss, the difference is significantly in favor of the Rank-SVM, which is consistent with the fact that this system tends to minimize that particular loss function. It is worth noticing that as the kernel becomes more and more complex, the difference between the Rank-SVM and the binary method disappears.

7 Discussion and conclusion

In this paper we have defined a whole system to deal with multi-label problems. The main contribution is the definition of a ranking based SVM that extends the use of SVMs to many problems in the areas of Bioinformatics and Text Mining. We have seen on complex real data that Rank-SVMs lead to better performance than BoosTexter and the binary approach. On its own, this could be interpreted as a sufficient argument to motivate the use of such a system. However, the Rank-SVM system can also be extended to perform feature selection on ranking problems [3]. This application can be very useful in the field of bioinformatics, as one is often interested in the interpretability of a multi-label decision rule. For example, one could be interested in a small set of genes that is discriminative for a multi-condition physical disorder.

We have presented only first experiments using multi-labelled systems applied to bioinformatics. Our future work is to conduct more investigations in this area.

References

[1] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, 1992. ACM.

[2] N. Cristianini and J. Shawe-Taylor. Introduction to Support Vector Machines. Cambridge University Press, 2000.

[3] A. Elisseeff and J. Weston. Kernel methods for multi-labelled classification and categorical regression problems. Technical report, BIOwulf Technologies, 2001. http://www.bht-labs.com/publica

[4] T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nedellec and Celine Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pages 137-142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.

[5] A. McCallum. Multi-label text classification with a mixture model trained by EM. AAAI'99 Workshop on Text Learning, 1999.

[6] P. Pavlidis, J. Weston, J. Cai, and W.N. Grundy. Combining micro-array expression data and phylogenetic profiles to learn functional categories using support vector machines. In RECOMB, page 248, 2001.

[7] R.E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168, 2000.

[8] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report 98-04, Royal Holloway, University of London, 1998.