Machine Learning with Squared-Loss Mutual Information
Abstract
1. Introduction
2. Definition and Estimation of SMI
2.1. Definition of SMI
2.2. Least-Squares Estimation of SMI
2.2.1. SMI Approximation via Direct Density-Ratio Estimation
2.2.2. Convergence Analysis
2.3. Practical Implementation of LSMI
2.3.1. LSMI for Linear-in-Parameter Models
2.3.2. Design of Basis Functions
2.3.3. Model Selection by Cross-Validation
3. SMI-Based Machine Learning
3.1. Independence Testing
3.1.1. Introduction
3.1.2. Independence Testing with SMI
3.2. Supervised Feature Selection
3.2.1. Introduction
3.2.2. Feature Selection with SMI
3.3. Supervised Feature Extraction
3.3.1. Introduction
3.3.2. Sufficient Dimension Reduction with SMI
3.3.3. Gradient-Based Subspace Search
- U is updated so as to ascend the gradient of the SMI estimate with respect to U (see the sketch after this list).
- U is then projected back onto the feasible region specified by the orthonormality constraint U U^T = I.
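A minimal numerical sketch of this two-step iteration is given below, assuming a user-supplied function smi_hat(U, x, y) that returns a scalar SMI estimate (e.g., LSMI) for the projected samples; the gradient is approximated by central finite differences and the projection is carried out by QR re-orthonormalization. This is only an illustration under these assumptions, not the paper's exact update rule, which works along the constraint manifold more carefully.

```python
import numpy as np

def project_orthonormal(U):
    """Re-orthonormalize U so that its rows satisfy U U^T = I (one simple projection)."""
    Q, _ = np.linalg.qr(U.T)     # orthonormal basis of the row space of U
    return Q.T

def subspace_search(smi_hat, x, y, m, step=0.1, n_iter=50, eps=1e-4):
    """Gradient-ascent search for an m-dimensional projection U maximizing SMI(Ux, y).

    smi_hat(U, x, y) is assumed to return a scalar SMI estimate (e.g., LSMI)
    computed from the projected samples {(U x_i, y_i)}; it is not defined here.
    """
    d = x.shape[1]
    U = project_orthonormal(np.random.randn(m, d))        # random orthonormal start
    for _ in range(n_iter):
        grad = np.zeros_like(U)
        for i in range(m):                                # central-difference gradient
            for j in range(d):
                E = np.zeros_like(U)
                E[i, j] = eps
                grad[i, j] = (smi_hat(U + E, x, y) - smi_hat(U - E, x, y)) / (2 * eps)
        U = project_orthonormal(U + step * grad)          # ascend, then project back
    return U
```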
3.3.4. Heuristic Subspace Search
3.4. Canonical Dependency Analysis
3.4.1. Introduction
3.4.2. Canonical Dependency Analysis with SMI
3.5. Independent Component Analysis
3.5.1. Introduction
3.5.2. Independent Component Analysis with SMI
3.5.3. Gradient-Based Demixing Matrix Search
3.5.4. Natural Gradient Demixing Matrix Search
3.6. Cross-Domain Object Matching
3.6.1. Introduction
3.6.2. Cross-Domain Object Matching with SMI
3.7. Clustering
3.7.1. Introduction
3.7.2. Clustering with SMI
3.8. Causal Direction Estimation
3.8.1. Introduction
3.8.2. Dependence Minimizing Regression with SMI
3.8.3. Causal Direction Inference by LSIR
- If the hypothesis that the residual is independent of the input is accepted for the causal model X → Y but rejected for Y → X, the causal model X → Y is chosen.
- If the independence hypothesis is rejected for X → Y but accepted for Y → X, the causal model Y → X is selected.
- If the independence hypothesis is rejected in both directions, perhaps there is no causal relation between X and Y or our modeling assumption is not correct (e.g., an unobserved confounding variable exists).
- If the independence hypothesis is accepted in both directions, perhaps our modeling assumption is not correct or it is not possible to identify a causal direction (i.e., X, Y, and E are Gaussian random variables).
- If the estimated dependence between the input and the residual is smaller for the model X → Y than for Y → X, we conclude that X causes Y (see the sketch after this list).
- Otherwise, we conclude that Y causes X.
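The comparison in the last two items can be sketched as follows. Both helpers are hypothetical placeholders: fit_regression(a, b) stands for any (dependence-minimizing) regressor returning predictions of b from a, and dependence(a, b) for an SMI-based dependence estimate such as LSMI; the significance checks from the preceding items are omitted for brevity.

```python
def infer_causal_direction(x, y, fit_regression, dependence):
    """Compare the residual dependence of the two candidate causal models.

    fit_regression(a, b): returns predictions of b from a (placeholder regressor).
    dependence(a, b):     returns an SMI-style dependence estimate (placeholder).
    """
    e_forward = y - fit_regression(x, y)     # residual of the model X -> Y
    e_backward = x - fit_regression(y, x)    # residual of the model Y -> X
    if dependence(x, e_forward) < dependence(y, e_backward):
        return "X causes Y"
    return "Y causes X"
```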
4. Conclusions
Acknowledgements
References