{"title": "Sparse probabilistic projections", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 80, "abstract": "We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptoanalysis as a pre-processing tool for the construction of template attacks.", "full_text": "Sparse probabilistic projections\n\nC\u00b4edric Archambeau\n\nDepartment of Computer Science\n\nFrancis R. Bach\n\nINRIA - Willow Project\n\nUniversity College London, United Kingdom\n\nEcole Normale Sup\u00b4erieure, Paris, France\n\nc.archambeau@cs.ucl.ac.uk\n\nfrancis.bach@mines.org\n\nAbstract\n\nWe present a generative model for performing sparse probabilistic projections,\nwhich includes sparse principal component analysis and sparse canonical corre-\nlation analysis as special cases. Sparsity is enforced by means of automatic rele-\nvance determination or by imposing appropriate prior distributions, such as gener-\nalised hyperbolic distributions. We derive a variational Expectation-Maximisation\nalgorithm for the estimation of the hyperparameters and show that our novel prob-\nabilistic approach compares favourably to existing techniques. 
We illustrate how the proposed method can be applied in the context of cryptanalysis as a pre-processing tool for the construction of template attacks.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is widely used for data pre-processing, data compression and dimensionality reduction. However, PCA suffers from the fact that each principal component is a linear combination of all the original variables. It is thus often difficult to interpret the results. In recent years, several methods for sparse PCA have been designed to find projections which retain maximal variance, while enforcing many entries of the projection matrix to be zero [20, 6]. While most of these methods are based on convex or partially convex relaxations of the sparse PCA problem, [16] has looked at using the probabilistic PCA framework of [18] along with ℓ1-regularisation. Canonical correlation analysis (CCA) is also commonly used in the context of dimensionality reduction. The goal here is to capture features that are common to several views of the same data. Recent attempts at constructing sparse CCA include [10, 19].\n\nIn this paper, we build on the probabilistic interpretation of CCA outlined by [2] and further extended by [13]. We introduce a general probabilistic model, which allows us to infer, from an arbitrary number of views of the data, both a shared latent representation and individual low-dimensional representations of each one of them. Hence, the probabilistic reformulations of PCA and CCA fit this probabilistic framework. Moreover, we are interested in sparse solutions, as these are important for interpretation purposes, denoising or feature extraction. We consider a Bayesian approach to the problem. 
A proper probabilistic approach allows us to treat the trade-off between the modelling accuracy (of the high-dimensional observations by low-dimensional latent variables) and the degree of sparsity of the projection directions in a principled way. For example, we do not need to estimate the sparse components successively, using, e.g., deflation, but we can estimate all sparse directions jointly, as we are taking the uncertainty of the latent variables into account.\n\nIn order to ensure sparse solutions we propose two strategies. The first one, discussed in Appendix A, is based on automatic relevance determination (ARD) [14]. No parameter needs to be set in advance. The entries in the projection matrix which are not well determined by the data are automatically driven to zero. The second approach uses priors from the generalised hyperbolic family [3], and more specifically the inverse Gamma. In this case, the degree of sparsity can be adjusted, eventually leading to very sparse solutions if desired. For both approaches we derive a variational EM algorithm [15].\n\nFigure 1: (a) Graphical model (see text for details). Arrows denote conditional dependencies. Shaded and unshaded nodes are respectively observed and unobserved random variables. Plates indicate repetitions. (b) Marginal prior on the individual matrix entries (b = 1).\n\n2 Generative model\n\nWe consider the graphical model shown in Figure 1(a). For each observation, we have P independent measurements x_1, . . . , x_P in different measurement spaces or views. 
The measurement x_p ∈ R^{D_p} is modelled as a mix of a common (or view independent) continuous latent vector y_0 ∈ R^{Q_0} and a view dependent continuous latent vector y_p ∈ R^{Q_p}, such that\n\nx_p = W_p y_0 + V_p y_p + μ_p + ε_p,    W_p ∈ R^{D_p×Q_0},  V_p ∈ R^{D_p×Q_p},    (1)\n\nwhere {μ_p}_{p=1}^P are the view dependent offsets and ε_p ∼ N(0, τ_p^{-1} I_{D_p}) is the residual error in view p. We are interested in the case where y_0 and y_p are low-dimensional vectors, i.e., Q_0, Q_p ≪ D_p for all p. We impose Gaussian priors on the latent vectors:\n\ny_0 ∼ N(0, Φ_0^{-1}),    y_p ∼ N(0, Φ_p^{-1}),  p ∈ {1, . . . , P}.    (2)\n\nThe resulting generative model comprises a number of popular probabilistic projection techniques as special cases. If there is a single view (and a single latent cause) and the prior covariance is diagonal, we recover probabilistic factor analysis [9]. If the prior is also isotropic, then we get probabilistic PCA [18]. If there are two views, we recover probabilistic CCA [2].\n\nWe seek a solution for which the matrices {W_p}_{p=1}^P and {V_p}_{p=1}^P are sparse, i.e. most of their entries are zero. One way to achieve sparsity is by means of ARD-type priors [14]. In this framework, a zero-mean Gaussian prior is imposed on the entries of the weight matrices:\n\nw_{i_p j} ∼ N(0, 1/α_{i_p j}),    i_p ∈ {1, . . . , D_p},  j ∈ {1, . . . , Q_0},    (3)\nv_{i_p k_p} ∼ N(0, 1/β_{i_p k_p}),    i_p ∈ {1, . . . , D_p},  k_p ∈ {1, . . . , Q_p}.    (4)\n\nType II maximum likelihood then leads to a sparse solution when considering independent hyperparameters. The updates arising in the context of probabilistic projections are given in Appendix A. Since marginalisation with respect to both the latent vectors and the weights is intractable, we apply variational EM [15]. 
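For concreteness, the generative process in equations (1)-(2) can be sketched as follows. This is a hypothetical NumPy illustration with made-up dimensions (two views, identity prior precisions), not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper): P = 2 views.
D = [5, 4]           # observed dimensions D_p per view
Q0, Q = 2, [2, 2]    # shared and view-specific latent dimensions

# Model parameters: weights W_p, V_p, offsets mu_p, noise precisions tau_p.
W = [rng.standard_normal((D[p], Q0)) for p in range(2)]
V = [rng.standard_normal((D[p], Q[p])) for p in range(2)]
mu = [rng.standard_normal(D[p]) for p in range(2)]
tau = [4.0, 4.0]

def sample_views():
    """Draw one observation (x_1, ..., x_P) from equations (1)-(2),
    taking the prior precisions Phi_0 and Phi_p to be the identity."""
    y0 = rng.standard_normal(Q0)                           # shared latent vector
    xs = []
    for p in range(2):
        yp = rng.standard_normal(Q[p])                     # view-specific latent vector
        eps = rng.standard_normal(D[p]) / np.sqrt(tau[p])  # residual noise, variance 1/tau_p
        xs.append(W[p] @ y0 + V[p] @ yp + mu[p] + eps)
    return xs

x1, x2 = sample_views()
```

With sparse W_p and V_p, each observed coordinate would depend on only a few latent coordinates; here the weights are dense for simplicity.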
Unfortunately, following this route does not allow us to adjust the degree of sparsity, which is important e.g. for interpretation purposes or for feature extraction. Hence, we seek a more flexible approach. In the remainder of this paper, we will assume that the marginal prior on each weight λ_ij, which is either an entry of {W_p}_{p=1}^P or {V_p}_{p=1}^P and will be defined shortly, has the form of an (infinite) weighted sum of scaled Gaussians:\n\np(λ_ij) = ∫ N(0, γ_ij^{-1}) p(γ_ij) dγ_ij.    (5)\n\nWe will choose the prior over γ_ij in such a way that the resulting marginal prior over the corresponding λ_ij induces sparsity. A similar approach was followed in the context of sparse nonparametric Bayesian regression in [4, 5].\n\n2.1 Compact reformulation of the generative model\n\nBefore discussing the approximate inference scheme, we rewrite the model in a more compact way. Let us denote the nth observation, the corresponding latent vector and the means respectively by\n\nx_n = (x_{n1}^T, . . . , x_{nP}^T)^T,    z_n = (y_{n0}^T, y_{n1}^T, . . . , y_{nP}^T)^T,    μ = (μ_1^T, . . . , μ_P^T)^T.\n\nThe generative model can be reformulated as follows:\n\nz_n ∼ N(0, Φ^{-1}),    Φ ∈ R^{Q×Q},  Q = Q_0 + Σ_p Q_p,    (6)\nλ_ij | γ_ij ∼ N(0, γ_ij^{-1}),    i ∈ {1, . . . , D},  j ∈ {1, . . . , Q},  D = Σ_p D_p,    (7)\nx_n | z_n, Λ ∼ N(Λ z_n + μ, Ψ^{-1}),    Λ ∈ R^{D×Q},  Ψ ∈ R^{D×D},    (8)\n\nwhere\n\nΛ = (Λ_1^T, . . . , Λ_P^T)^T = [ W_1 V_1 0 . . . 0 ; W_2 0 V_2 . . . 0 ; . . . ; W_P 0 . . . 0 V_P ],    Ψ = diag{τ_1 I_{D_1}, . . . , τ_P I_{D_P}}.\n\nNote that we do not assume that the latent spaces are correlated, as Φ = diag{Φ_0, Φ_1, . . . , Φ_P}. This is consistent with the fact that the common latent space is modelled independently through y_0. Subsequently, we will also denote the matrix of the hyperparameters by Γ ∈ R^{D×Q}, where we set (and fix) γ_ij = ∞ for all λ_ij = 0.\n\n2.2 Sparsity inducing prior over the individual scale variables\n\nWe impose an inverse Gamma prior on the scale variable γ_ij:\n\nγ_ij ∼ IG(a/DQ, b),    (9)\n\nfor all i and j. The shape parameter a and the scale parameter b are non-negative. The marginal prior on the weight λ_ij is then in the class of the generalised hyperbolic distributions [3] and is defined in terms of the modified Bessel function of the third kind K_ω(·):\n\np(λ_ij) = sqrt(2/π) (b^{a/DQ} / Γ(a/DQ)) (λ_ij^2 / 2b)^{a/2DQ − 1/4} K_{a/DQ − 1/2}(sqrt(2b λ_ij^2)),    (10)\n\nfor λ_ij ≠ 0, and\n\nlim_{λ_ij → 0} p(λ_ij) = sqrt(b/2π) Γ(a/DQ − 1/2) / Γ(a/DQ) if a/DQ > 1/2, and ∞ otherwise.    (11)\n\nThe function Γ(·) is the (complete) Gamma function. The effective prior on the individual weights is shown in Figure 1(b). Intuitively, the joint distribution over the weights is sparsity inducing as it is sharply peaked around zero (and in fact infinite for sufficiently small a). 
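The marginal prior (10) can be evaluated numerically. The sketch below (an illustration under the assumption that SciPy's `kv` implements K_ω; not code from the paper) checks that for a/DQ = 1 the marginal reduces to the Laplace density √(b/2) exp(−√(2b)|λ|), and that the density integrates to one even for a spikier setting a/DQ < 1/2, where it diverges at zero:

```python
import numpy as np
from scipy.special import kv, gamma
from scipy.integrate import quad

def marginal_prior(lam, alpha, b):
    """Marginal prior p(lambda) of equation (10), with alpha = a / (D * Q).
    Valid for lam != 0; the density is symmetric in lam."""
    lam = np.abs(lam)
    return (np.sqrt(2.0 / np.pi) * b**alpha / gamma(alpha)
            * (lam**2 / (2.0 * b))**(alpha / 2.0 - 0.25)
            * kv(alpha - 0.5, np.sqrt(2.0 * b * lam**2)))

# Check 1: for alpha = 1 (and b = 1) the marginal is the Laplace density.
lam = 0.7
laplace = np.sqrt(0.5) * np.exp(-np.sqrt(2.0) * lam)
assert np.isclose(marginal_prior(lam, 1.0, 1.0), laplace)

# Check 2: the density integrates to one; for alpha = 0.3 the spike at
# zero behaves like |lambda|^(2*alpha - 1), which is integrable.
half_mass, _ = quad(marginal_prior, 0.0, np.inf, args=(0.3, 1.0))
assert np.isclose(2.0 * half_mass, 1.0, atol=1e-3)
```

The assertions passing is a useful consistency check on the reconstructed normalising constant in (10).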
It favours only a small number of weights to be non-zero if the scale variable b is sufficiently large. For a more formal discussion in the context of regression we refer to [7]. It is interesting to note that for a/DQ = 1 we recover the popular Laplace prior, which is equivalent to the ℓ1-regulariser or the LASSO [17], and for a/DQ → 0 and b → 0 the resulting prior is the Normal-Jeffreys prior. In fact, the automatic thresholding method described in Appendix A also fits into the framework defined by (5). However, it corresponds to imposing a flat prior on the scale variables over the log-scale, which is a limiting case of the Gamma distribution. When imposing independent Gamma priors on the scale variables, the effective joint marginal is a product of Student-t distributions, which again is sharply peaked around zero and sparsity inducing.\n\n3 Variational approximation\n\nWe view {z_n}_{n=1}^N and the matrix Γ as latent variables, and optimise the parameters θ = {μ, Φ, Λ, Ψ} by EM. In other words, we view the weight matrix Λ as a matrix of parameters and estimate the entries by maximum a posteriori (MAP) learning. The other parameters are estimated by maximum likelihood (ML).\n\nThe variational free energy is given by\n\nF_q(x_1, . . . , x_N, θ) = − Σ_{n=1}^N ⟨ln p(x_n, z_n, Γ | θ)⟩_q − H[q(z_1, . . . , z_N, Γ)],    (12)\n\nwhere ⟨·⟩_q denotes the expectation with respect to the variational distribution q and H[·] is the differential entropy. Since the Kullback-Leibler divergence (KL) is non-negative, the negative free energy is a lower bound on the log-marginal likelihood:\n\nΣ_{n=1}^N ln p(x_n | θ) = −F_q({x_n}, θ) + KL[q({z_n}, Γ) ‖ p({z_n}, Γ | {x_n}, θ)] ≥ −F_q({x_n}, θ).    (13)\n\nInterestingly, it is not required to make a factorised approximation of the joint posterior q to find a tractable solution. Indeed, the posterior q factorises naturally given the data and the weights, such that the posteriors we will obtain in the E-step are exact.\n\nThe variational EM finds maximum likelihood estimates for the parameters by cycling through the following two steps until convergence:\n\n1. The posteriors over the latent variables are computed for fixed parameters by minimising the KL in (13). It can be shown that the variational posteriors are given by\n\nq(z_1, . . . , z_N) ∝ Π_{n=1}^N exp{⟨ln p(x_n, z_n, Γ | θ)⟩_{q(Γ)}},    (14)\nq(Γ) ∝ exp{Σ_{n=1}^N ⟨ln p(x_n, z_n | Γ, θ)⟩_{q(z_1, . . . , z_N)}} p(Γ).    (15)\n\n2. The variational free energy (12) is minimised wrt the parameters for fixed q. This leads in effect to type II ML estimates for the parameters and is equivalent to maximising the expected complete log-likelihood:\n\nθ ← argmax_θ Σ_{n=1}^N ⟨ln p(x_n, z_n, Γ | θ)⟩_q.    (16)\n\nDepending on the initialisation, the variational EM algorithm converges to a local maximum of the log-marginal likelihood. The convergence can be checked by monitoring the variational lower bound, which monotonically increases during the optimisation. The explicit expression of the variational bound is omitted here due to a lack of space.\n\n3.1 Posterior of the latent vectors\n\nThe joint posterior of the latent vectors factorises into N posteriors because the observations are independent. 
Hence, the posterior of each low-dimensional latent vector is given by\n\nq(z_n) = N(z̄_n, S̄_n),    (17)\n\nwhere z̄_n = S̄_n Λ^T Ψ (x_n − μ) is the mean and S̄_n = (Λ^T Ψ Λ + Φ)^{-1} is the covariance.\n\n3.2 Posterior of the scale variables\n\nThe inverse Gamma distribution is not conjugate to the exponential family. However, the posterior over the matrix Γ is tractable. It has the form of a product of generalised inverse Gaussian distributions (see Appendix B for a formal definition):\n\nq(Γ) = Π_{i=1}^D Π_{j=1}^Q p(γ_ij | λ_ij) = Π_{i=1}^D Π_{j=1}^Q N^{-}(ω̄_ij, χ̄_ij, φ̄_ij),    (18)\n\nwhere ω̄_ij = −a/DQ + 1/2 is the index, and χ̄_ij = 2b and φ̄_ij = λ_ij^2 are the shape parameters. The factorised form arises from the scale variables being independent conditioned on the weights.\n\n3.3 Update for the parameters\n\nBased on the properties of the Gaussian and the generalised inverse Gaussian, we can compute the variational lower bound, which can then be maximised. 
This leads to the following updates:\n\nΦ^{-1} ← (1/N) Σ_{n=1}^N diag{z̄_n z̄_n^T + S̄_n},    (19)\n\nμ ← (1/N) Σ_{n=1}^N (x_n − Λ z̄_n),    (20)\n\nλ_i ← (Γ̄_i + Ψ(i, i) Σ_{n=1}^N ⟨z_n z_n^T⟩)^{-1} Ψ(i, i) Σ_{n=1}^N (x_n(i) − μ(i)) z̄_n,    (21)\n\nτ_p^{-1} ← (1/(N D_p)) Σ_{n=1}^N {(x_np − μ_p)^T (x_np − μ_p) − 2 (x_np − μ_p)^T Λ_p z̄_n + ⟨(Λ_p z_n)^T Λ_p z_n⟩},    (22)\n\nwhere the required expectations are given by\n\n⟨z_n z_n^T⟩ = S̄_n + z̄_n z̄_n^T,    Γ̄_i = diag{⟨γ_i1⟩, . . . , ⟨γ_iQ⟩},    ⟨(Λ_p z_n)^T Λ_p z_n⟩ = tr{⟨z_n z_n^T⟩ Λ_p^T Λ_p},\n\n⟨γ_ij⟩ = sqrt(χ̄_ij / φ̄_ij) K_{ω̄+1}(sqrt(χ̄_ij φ̄_ij)) / K_ω̄(sqrt(χ̄_ij φ̄_ij)).    (23)\n\nNote that diag{·} denotes a block-diagonal operation in (19). More importantly, since we are seeking a sparse projection matrix, we do not suffer from the rotational ambiguity problem, as is for example the case for standard probabilistic PCA.\n\n4 Experiments\n\n4.1 Synthetic denoising experiments\n\nBecause of identifiability issues which are the subject of ongoing work, we prefer to compare the various methods for sparse PCA in a denoising experiment. That is, we assume that the data were generated from sparse components plus some noise, and we compare the various sparse PCA methods on the denoising task, i.e., on the task of recovering the original data. 
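Before turning to the experiments, the E-step posterior (17) and the scale expectation (23) above can be made concrete. The following is a hypothetical NumPy/SciPy sketch (assuming a diagonal Ψ passed as a vector, and SciPy's `kv` for K_ω), not the authors' implementation:

```python
import numpy as np
from scipy.special import kv

def e_step(X, Lam, Psi_diag, Phi, mu):
    """Posterior moments of the latent vectors, equation (17):
    S = (Lam^T Psi Lam + Phi)^{-1} and zbar_n = S Lam^T Psi (x_n - mu).
    X has one observation per row; Psi_diag holds the noise precisions."""
    PsiLam = Psi_diag[:, None] * Lam        # Psi @ Lam for diagonal Psi
    S = np.linalg.inv(Lam.T @ PsiLam + Phi)  # shared posterior covariance
    Zbar = (X - mu) @ PsiLam @ S.T           # posterior means, one per row
    return Zbar, S

def gamma_mean(lam, a_over_DQ, b):
    """<gamma_ij> under the GIG posterior (18), via equation (23), with
    index omega = 1/2 - a/DQ and shape parameters chi = 2b, phi = lambda^2."""
    omega = 0.5 - a_over_DQ
    chi, phi = 2.0 * b, lam**2
    s = np.sqrt(chi * phi)
    return np.sqrt(chi / phi) * kv(omega + 1.0, s) / kv(omega, s)
```

Small weights receive a large expected precision ⟨γ_ij⟩, which in the update (21) shrinks them further towards zero; this is the mechanism that drives entries of the projection matrix to (near) zero.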
We generated the data as follows: select uniformly at random M = 4 unit norm sparse vectors in P = 10 dimensions with a known number S = 4 of non-zero entries, then generate i.i.d. values of the random variables Z from three possible distributions (Gaussian, Laplacian, uniform), then add isotropic noise of relative standard deviation 1/2. When the latent variables are Gaussian, our model exactly matches the data and our method should provide a better fit; however, we also consider situations where the model is misspecified, in order to study the robustness of our probabilistic model.\n\nWe consider our two models: SCA-1 (which uses automatic relevance determination type sparsity priors) and SCA-2 (which uses the generalised hyperbolic distribution), where we used 6 latent dimensions (larger than 4) and fixed hyperparameters that lead to vague priors. Those two models thus have no free parameters, and we compare them to the following methods, which all have two regularisation parameters (rank and regularisation): DSPCA [6], the method of Zou [20] and the recent method of [16], which essentially considers a probabilistic PCA with an ℓ1-penalty on the weights.\n\nIn Table 1 we report the mean-squared reconstruction error averaged over 10 replications. It can be seen that the two proposed probabilistic approaches perform similarly and significantly outperform the other sparse PCA methods, even when the model is misspecified.\n\n4.2 Template attacks\n\nPower consumption and electromagnetic radiation are among the most extensively used side-channels for analysing physically observable cryptographic devices. A common belief is that the useful information for attacking a device is hidden at times where the traces (or time series) have large variance. Once the relevant samples have been identified, they can be used to construct templates, which can then be used to assess if a device is secure. 
A simple, yet very powerful approach, recently proposed by [1], is to select time samples based on PCA.\n\nTable 1: Denoising experiment with sparse PCA (we report mean squared errors): (top) Gaussian distributed latent vectors, (middle) latent vectors generated from the uniform distribution, (bottom) latent vectors generated from the Laplace distribution.\n\nN | SCA-1 | SCA-2 | Zou | DSPCA | L1-PCA\n100 | 39.9 | 40.8 | 42.2 | 50.8 | 42.9\n200 | 36.5 | 36.8 | 40.8 | 50.4 | 41.4\n400 | 35.5 | 35.5 | 39.8 | 42.5 | 40.3\n\nN | SCA-1 | SCA-2 | Zou | DSPCA | L1-PCA\n100 | 39.9 | 40.9 | 42.6 | 49.8 | 43.6\n200 | 36.8 | 37.0 | 40.9 | 48.1 | 41.1\n400 | 36.4 | 36.4 | 40.5 | 46.8 | 40.7\n\nN | SCA-1 | SCA-2 | Zou | DSPCA | L1-PCA\n100 | 39.3 | 40.3 | 42.7 | 48.5 | 43.4\n200 | 36.5 | 36.7 | 40.2 | 46.2 | 40.8\n300 | 35.8 | 35.8 | 40.6 | 41.0 | 40.9\n\nFigure 2: Power traces and first three principal directions. (a) Probabilistic PCA. (b) Sparse probabilistic PCA (SCA-2).\n\nFigure 2(a) shows the weight associated to each time sample by the first three principal directions found by PCA. The problem with this approach is that all time samples get a non-zero weight. As a result, the user has to define a threshold manually in order to decide whether the information leakage at time t is relevant or not. Figure 2(b) shows the weight associated to the time samples by SCA-2 when using a Laplace prior (i.e. for a/DQ = 1). It can be observed that one gets a much better picture of where the relevant information is. Clearly, sparse probabilistic PCA can be viewed as being more robust to spurious noise and provides a more reliable and amenable solution.\n\n5 Conclusion\n\nIn this paper we introduced a general probabilistic model for inferring sparse probabilistic projection matrices. Sparsity was enforced either by imposing an ARD-type prior or by means of a Normal-inverse Gamma prior. 
Although the inverse Gamma is not conjugate to the exponential family, the posterior is tractable, as it is a special case of the generalised inverse Gaussian [12], which in turn is a conjugate prior to this family. Future work will include the validation of the method on a wide range of applications, and in particular as a feature extraction tool.\n\nAcknowledgments\n\nWe are grateful to the PASCAL European network of excellence for partially supporting this work.\n\nA Automatic thresholding of the weights by ARD\n\nIn this section, we provide the updates for achieving automatic thresholding of the projection matrix entries in a probabilistic setting. We apply Tipping's sparse Bayesian theory [8], which is closely related to ARD [14]. More specifically, we assume the prior over the scale variables is uniform over a log-scale, which is a limiting case of the Gamma distribution.\n\nLet us view {z_n}_{n=1}^N and Λ as latent variables and optimise the parameters θ = {μ, Φ, Ψ, Γ} by variational EM. The variational free energy is given by\n\nF_q(x_1, . . . , x_N, θ) = − Σ_{n=1}^N ⟨ln p(x_n, z_n, Λ | θ)⟩_q − H[q(z_1, . . . , z_N, Λ)].    (24)\n\nIn order to find a tractable solution, we further have to assume that the approximate posterior q has a factorised form. We can then compute the posterior of the low-dimensional latent vectors:\n\nq(z_n) = N(z̄_n, S̄_n),    (25)\n\nwhere z̄_n = S̄_n Λ̄^T Ψ (x_n − μ) and S̄_n = (Λ̄^T Ψ Λ̄ + Σ_i Ψ(i, i) Σ̄_i + Φ)^{-1}. And the posterior of the weights is given by\n\nq(Λ) = Π_{i=1}^D q(λ_i) = Π_{i=1}^D N(λ̄_i, Σ̄_i),    (26)\n\nwhere λ̄_i = Σ̄_i Ψ(i, i) Σ_n (x_n(i) − μ(i)) z̄_n and Σ̄_i = (Γ_i + Ψ(i, i) Σ_n {S̄_n + z̄_n z̄_n^T})^{-1}. The partially factorised form Π_i q(λ_i) arises naturally. Note also that the update for the mean weights has the same form as in (20). Finally, the updates for the parameters are found by maximising the negative free energy, which corresponds to performing type II ML for the scale variables. This yields\n\nμ ← (1/N) Σ_{n=1}^N (x_n − Λ̄ z̄_n),    Φ^{-1} ← (1/N) Σ_{n=1}^N diag{z̄_n z̄_n^T + S̄_n},    γ_ij ← ⟨λ_ij^2⟩^{-1},    (27)\n\nτ_p^{-1} ← (1/(N D_p)) Σ_{n=1}^N {(x_np − μ_p)^T (x_np − μ_p) − 2 (x_np − μ_p)^T Λ̄_p z̄_n + ⟨(Λ_p z_n)^T Λ_p z_n⟩},    (28)\n\nwhere ⟨λ_ij^2⟩ = λ̄_ij^2 + Σ̄_i(j, j) and ⟨(Λ_p z_n)^T Λ_p z_n⟩ = tr{(z̄_n z̄_n^T + S̄_n) Σ_{i_p} (λ̄_{i_p} λ̄_{i_p}^T + Σ̄_{i_p})}.\n\nB Generalised inverse Gaussian distribution\n\nThe generalised inverse Gaussian distribution is in the class of generalised hyperbolic distributions. It is defined as follows [12, 11]:\n\ny ∼ N^{-}(ω, χ, φ) = (χ^{-ω} (sqrt(χφ))^ω / (2 K_ω(sqrt(χφ)))) y^{ω−1} e^{−(χ y^{-1} + φ y)/2},    (29)\n\nwhere y > 0 and K_ω(·) is the modified Bessel function of the third kind with index ω. (The modified Bessel function of the third kind is known under various names; in particular, it is also known as the modified Bessel function of the second kind, cf. E. W. Weisstein, \"Modified Bessel Function of the Second Kind\", MathWorld: http://mathworld.wolfram.com/ModifiedBesselFunctionoftheSecondKind.html.)\n\nThe following expectations are useful [12]:\n\n⟨y⟩ = sqrt(χ/φ) R_ω(sqrt(χφ)),    ⟨y^{-1}⟩ = sqrt(φ/χ) R_{−ω}(sqrt(χφ)),    ⟨ln y⟩ = (1/2) ln(χ/φ) + d ln K_ω(sqrt(χφ))/dω,    (30)\n\nwhere R_ω(·) ≡ K_{ω+1}(·)/K_ω(·).\n\nInverse Gamma distribution\n\nWhen φ = 0 and ω < 0, the generalised inverse Gaussian distribution reduces to the inverse Gamma distribution:\n\nIG(a, b) = (b^a / Γ(a)) x^{−a−1} e^{−b/x},    a, b > 0.    (31)\n\nIt is straightforward to verify this result by posing a = −ω and b = χ/2, and noting that\n\nlim_{y→0} K_ω(y) = Γ(−ω) 2^{−ω−1} y^ω,    (32)\n\nfor ω < 0.\n\nReferences\n\n[1] C. Archambeau, E. Peeters, F.-X. Standaert, and J.-J. Quisquater. Template attacks in principal subspaces. In L. Goubin and M. Matsui, editors, 8th International Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 4249 of Lecture Notes in Computer Science, pages 1-14. Springer, 2006.\n\n[2] F. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.\n\n[3] O. Barndorff-Nielsen and R. Stelzer. Absolute moments of generalized hyperbolic distributions and approximate scaling of normal inverse Gaussian Lévy processes. Scandinavian Journal of Statistics, 32(4):617-637, 2005.\n\n[4] P. J. Brown and J. 
E. Griffin. Bayesian adaptive lassos with non-convex penalization. Technical Report CRiSM 07-02, Department of Statistics, University of Warwick, 2007.\n\n[5] F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In 25th International Conference on Machine Learning (ICML). ACM, 2008.\n\n[6] A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.\n\n[7] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348-1360, 2001.\n\n[8] A. C. Faul and M. E. Tipping. Analysis of sparse Bayesian learning. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14 (NIPS), pages 383-389. The MIT Press, 2002.\n\n[9] Z. Ghahramani and G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1996.\n\n[10] D. Hardoon and J. Shawe-Taylor. Sparse canonical correlation analysis. Technical report, PASCAL EPrints, 2007.\n\n[11] W. Hu. Calibration of multivariate generalized hyperbolic distributions using the EM algorithm, with applications in risk management, portfolio optimization and portfolio credit risk. PhD thesis, Florida State University, United States of America, 2005.\n\n[12] B. Jørgensen. Statistical Properties of the Generalized Inverse Gaussian Distribution. Springer-Verlag, 1982.\n\n[13] A. Klami and S. Kaski. Local dependent components. In Z. Ghahramani, editor, 24th International Conference on Machine Learning (ICML), pages 425-432. Omnipress, 2007.\n\n[14] D. J. C. MacKay. Bayesian methods for backprop networks. In E. Domany, J. L. van Hemmen, and K. Schulten, editors, Models of Neural Networks, III, pages 211-254, 1994.\n\n[15] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, pages 355-368. The MIT Press, 1998.\n\n[16] C. D. Sigg and J. M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In 25th International Conference on Machine Learning (ICML). ACM, 2008.\n\n[17] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society B, 58:267-288, 1996.\n\n[18] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society B, 61:611-622, 1999.\n\n[19] D. Torres, D. Turnbull, B. K. Sriperumbudur, L. Barrington, and G. Lanckriet. Finding musically meaningful words using sparse CCA. In NIPS Workshop on Music, Brain and Cognition, 2007.\n\n[20] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2):265-286, 2006.\n", "award": [], "sourceid": 984, "authors": [{"given_name": "Cédric", "family_name": "Archambeau", "institution": null}, {"given_name": "Francis", "family_name": "Bach", "institution": null}]}