Early work on multi-task learning grew out of research on an important problem in machine learning: inductive bias. Learning can be viewed as analyzing empirical data related to a problem and inducing from it a model that captures the problem's essential structure. The role of inductive bias is to guide the learning algorithm's search through the model space; the quality of the resulting model is directly affected by the inductive bias, and no learning system lacking an inductive bias can learn effectively. Different learning algorithms (e.g., decision trees, neural networks, support vector machines) embody different inductive biases, and when solving a practical problem one must manually decide which algorithm to use, which in effect amounts to subjectively choosing an inductive bias. A natural question is whether this choice can itself be automated through learning, i.e., the idea of "learning to learn". Multi-task learning offers a feasible path to realizing this idea: the useful information contained in related tasks provides a stronger inductive bias for learning the task of interest.
C. Multitask Sparsity via Maximum Entropy Discrimination
This paper can be regarded as a fairly comprehensive survey. It discusses four settings in total: feature selection, kernel selection, adaptive pooling, and graphical model structure. It also describes four multi-task learning methods in detail, making it a valuable reference.
[1] T. Evgeniou and M. Pontil. Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004.
[2] T. Jebara. Multitask Sparsity via Maximum Entropy Discrimination. Journal of Machine Learning Research, 12:75-110, 2011.
[3] A. Argyriou, T. Evgeniou and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Multi-task learning (MTL) is a subfield of machine learning in which multiple learning tasks are solved at the same time, while exploiting commonalities and differences across tasks. This can result in improved learning efficiency and prediction accuracy for the task-specific models, when compared to training the models separately.[1][2][3]
In a widely cited 1997 paper, Rich Caruana gave the following characterization:
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better.[3]
In the classification context, MTL aims to improve the performance of multiple classification tasks by learning them jointly. One example is spam filtering, which can be treated as distinct but related classification tasks across different users. To make this more concrete, consider that different people have different distributions of features which distinguish spam emails from legitimate ones; for example, an English speaker may find that all emails in Russian are spam, but this does not hold for Russian speakers. Yet there is a definite commonality in this classification task across users, for example one common feature might be text related to money transfer. Solving each user's spam classification problem jointly via MTL can let the solutions inform each other and improve performance.[4] Further examples of settings for MTL include multiclass classification and multi-label classification.[5]
Multi-task learning works because regularization induced by requiring an algorithm to perform well on a related task can be superior to regularization that prevents overfitting by penalizing all complexity uniformly. One situation where MTL may be particularly helpful is when the tasks share significant commonalities and are generally slightly undersampled.[4] However, as discussed below, MTL has also been shown to be beneficial for learning unrelated tasks.[6]
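As a concrete illustration of this regularization effect, the following is a minimal sketch of mean-regularized multi-task learning on synthetic data, in the spirit of regularized multi-task learning but not a reproduction of any cited method: each user's spam classifier gets its own weight vector, yet all weight vectors are pulled toward their shared mean, so undersampled users borrow statistical strength from the others. All data and hyperparameters here are made up.

```python
# Minimal sketch of mean-regularized multi-task learning on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
T, d, n_t = 5, 20, 15              # "users" (tasks), features, samples per user

# Related tasks: each true weight vector is a shared component plus a small
# user-specific deviation, mimicking related spam filters.
w_shared = rng.normal(size=d)
tasks = []
for _ in range(T):
    w_true = w_shared + 0.3 * rng.normal(size=d)
    X = rng.normal(size=(n_t, d))
    y = X @ w_true + 0.1 * rng.normal(size=n_t)
    tasks.append((X, y))

def fit(lam_tie, steps=2000, lr=1e-3):
    """Gradient descent on sum_t ||X_t w_t - y_t||^2 + lam_tie * sum_t ||w_t - w_bar||^2."""
    W = np.zeros((T, d))
    for _ in range(steps):
        w_bar = W.mean(axis=0)
        grad = np.empty_like(W)
        for t, (X, y) in enumerate(tasks):
            # Since sum_t (w_t - w_bar) = 0, the gradient of the tying penalty
            # with respect to w_t reduces to 2 * lam_tie * (w_t - w_bar).
            grad[t] = 2 * X.T @ (X @ W[t] - y) + 2 * lam_tie * (W[t] - w_bar)
        W -= lr * grad
    return W

W_joint = fit(lam_tie=10.0)   # tasks share information through the mean w_bar
W_indep = fit(lam_tie=0.0)    # equivalent to fitting each user separately
```

On held-out data per user, W_joint will typically generalize better than W_indep when n_t is small relative to the number of features, which is exactly the undersampled regime described above.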
Methods
Task grouping and overlap
Within the MTL paradigm, information can be shared across some or all of the tasks. Depending on the structure of task relatedness, one may want to share information selectively across the tasks. For example, tasks may be grouped or exist in a hierarchy, or be related according to some general metric. Suppose, as developed more formally below, that the parameter vector modeling each task is a linear combination of some underlying basis. Similarity in terms of this basis can indicate the relatedness of the tasks. For example with sparsity, overlap of nonzero coefficients across tasks indicates commonality. A task grouping then corresponds to those tasks lying in a subspace generated by some subset of basis elements, where tasks in different groups may be disjoint or overlap arbitrarily in terms of their bases.[7] Task relatedness can be imposed a priori or learned from the data.[5][8]
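As an illustration of the basis view described above, one can write each task's parameter vector as w_t = L s_t, where the columns of L are shared basis elements ("latent tasks") and s_t is a sparse code; tasks whose codes have overlapping nonzero entries then belong to the same group. The numpy sketch below only demonstrates this bookkeeping with made-up values; it is not a learning algorithm from the cited work.

```python
# Sketch of the shared-basis view of task grouping (all values illustrative).
import numpy as np

d, k, T = 6, 4, 5                                    # feature dim, basis size, tasks
L = np.random.default_rng(1).normal(size=(d, k))     # shared basis ("latent tasks")

# Sparse codes: which basis elements each task's parameter vector uses.
S = np.array([
    [1.0, 0.5, 0.0, 0.0],   # task 0 -- group A: supported on basis elements {0, 1}
    [0.8, 1.2, 0.0, 0.0],   # task 1 -- group A
    [0.0, 0.0, 1.0, 0.3],   # task 2 -- group B: supported on basis elements {2, 3}
    [0.0, 0.0, 0.7, 1.1],   # task 3 -- group B
    [0.4, 0.0, 0.6, 0.0],   # task 4 -- overlaps both groups
])

W = S @ L.T          # row t is the task parameter vector w_t = L @ s_t, shape (T, d)

# Tasks whose codes share nonzero coordinates lie in overlapping subspaces of the
# basis; counting shared support gives a crude relatedness measure between tasks.
support = (S != 0).astype(int)
overlap = support @ support.T        # overlap[s, t] = number of shared basis elements
print(overlap)
```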
Exploiting unrelated tasks
One can attempt learning a group of principal tasks using a group of auxiliary tasks, unrelated to the principal ones. In many applications, joint learning of unrelated tasks which use the same input data can be beneficial. The reason is that prior knowledge about task relatedness can lead to sparser and more informative representations for each task grouping, essentially by screening out idiosyncrasies of the data distribution. Novel methods which build on a prior multitask methodology by favoring a shared low-dimensional representation within each task grouping have been proposed. The programmer can impose a penalty on tasks from different groups which encourages the two representations to be orthogonal. Experiments on synthetic and real data have indicated that incorporating unrelated tasks can result in significant improvements over standard multi-task learning methods.[6]
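One simple way to realize the orthogonality penalty mentioned above is the squared Frobenius norm of the product of the two groups' representation matrices. The snippet below illustrates just that penalty term; U and V stand in (as an assumption, not code from the cited paper) for the learned low-dimensional projections of the principal and auxiliary task groups.

```python
# Illustrative orthogonality penalty between two task-group representations.
import numpy as np

rng = np.random.default_rng(2)
d, k1, k2 = 30, 5, 4
U = rng.normal(size=(d, k1))   # assumed projection for the principal task group
V = rng.normal(size=(d, k2))   # assumed projection for the auxiliary task group

# Penalty encouraging the two subspaces to be orthogonal: ||U^T V||_F^2 is zero
# exactly when every column of U is orthogonal to every column of V.
penalty = np.linalg.norm(U.T @ V, ord="fro") ** 2

# In training, this term would be added to the sum of the groups' task losses,
# weighted by a trade-off hyperparameter:
#   loss = loss_principal + loss_auxiliary + gamma * penalty
```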
Transfer of knowledge
Related to multi-task learning is the concept of knowledge transfer. Whereas traditional multi-task learning implies that a shared representation is developed concurrently across tasks, transfer of knowledge implies a sequentially shared representation. Large scale machine learning projects such as the deep convolutional neural network GoogLeNet,[9] an image-based object classifier, can develop robust representations which may be useful to further algorithms learning related tasks. For example, the pre-trained model can be used as a feature extractor to perform pre-processing for another learning algorithm. Or the pre-trained model can be used to initialize a model with similar architecture which is then fine-tuned to learn a different classification task.[10]
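A minimal sketch of both reuse patterns, assuming PyTorch and torchvision (>= 0.13 for the weights API) are available; ResNet-18 is used here purely as a stand-in for a pre-trained image classifier such as GoogLeNet, and the 10-class head is hypothetical.

```python
# Minimal sketch of sequential knowledge transfer with a pre-trained image model.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1: use the network as a fixed feature extractor -- freeze its weights.
for p in backbone.parameters():
    p.requires_grad = False

# Replace the final classification layer with a new head for the target task
# (a hypothetical 10-class problem); only this head is trained.
backbone.fc = nn.Linear(backbone.fc.in_features, 10)
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)

# Option 2: fine-tuning -- keep all weights trainable and update them with a
# smaller learning rate instead of freezing the backbone, e.g.:
# optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
```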
Mathematics
Reproducing Hilbert space of vector-valued functions (RKHSvv)
The MTL problem can be cast within the context of RKHSvv (a complete inner product space of vector-valued functions equipped with a reproducing kernel). In particular, recent focus has been on cases where task structure can be identified via a separable kernel, described below. The presentation here derives from Ciliberto et al., 2015.[5]
RKHSvv concepts
Suppose the training data set is $\mathcal{S}_t=\{(x_i^t,y_i^t)\}_{i=1}^{n_t}$, with $x_i^t\in\mathcal{X}$ and $y_i^t\in\mathcal{Y}$, where $t\in\{1,\dots,T\}$ indexes the tasks. Let $n=\sum_{t=1}^{T}n_t$. In this setting there is a consistent input and output space and the same loss function $\mathcal{L}:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R}_{+}$ for each task. This results in the regularized machine learning problem:
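A sketch of that regularized problem, following the standard formulation for this setting (the per-task averaging and the single regularization parameter $\lambda$ are assumptions; variants exist):

$$\min_{f\in\mathcal{H}}\;\sum_{t=1}^{T}\frac{1}{n_t}\sum_{i=1}^{n_t}\mathcal{L}\big(y_i^t,\,f_t(x_i^t)\big)\;+\;\lambda\,\|f\|_{\mathcal{H}}^{2},$$

where $\mathcal{H}$ is a reproducing kernel Hilbert space of vector-valued functions $f:\mathcal{X}\rightarrow\mathbb{R}^{T}$ whose component $f_t$ is the predictor for task $t$, and $\lambda>0$ controls the trade-off between data fit and regularization.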