Bregman divergence

In mathematics, specifically statistics and information geometry, a Bregman divergence or Bregman distance is a measure of difference between two points, defined in terms of a strictly convex function. Bregman divergences form an important class of divergences. When the points are interpreted as probability distributions – notably as either values of the parameter of a parametric model or as a data set of observed values – the resulting distance is a statistical distance. The most basic Bregman divergence is the squared Euclidean distance.

Bregman divergences are similar to metrics, but satisfy neither the triangle inequality (ever) nor symmetry (in general). However, they satisfy a generalization of the Pythagorean theorem, and in information geometry the corresponding statistical manifold is interpreted as a (dually) flat manifold. This allows many techniques of optimization theory to be generalized to Bregman divergences, geometrically as generalizations of least squares.

Bregman divergences are named after Lev M. Bregman, who introduced the concept in 1967.

Definition

Let $F:\Omega \to \mathbb {R}$ be a continuously differentiable, strictly convex function defined on a closed convex set $\Omega$.

The Bregman distance associated with F for points $p,q\in \Omega$ is the difference between the value of F at point p and the value of the first-order Taylor expansion of F around point q evaluated at point p:

D_{F}(p,q)=F(p)-F(q)-\langle \nabla F(q),p-q\rangle .
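The definition translates almost directly into code. The following is a minimal sketch, assuming points are plain sequences of floats and that the caller supplies both F and its gradient; the names `bregman_divergence`, `sq_norm` and `sq_norm_grad` are illustrative and do not come from any library.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman_divergence(
    F: Callable[[Vector], float],
    grad_F: Callable[[Vector], Vector],
    p: Vector,
    q: Vector,
) -> float:
    """Return D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    g = grad_F(q)
    inner = sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))
    return F(p) - F(q) - inner


# Illustrative generator (an assumption for this sketch): F(x) = ||x||^2,
# whose gradient is 2x.
def sq_norm(x: Vector) -> float:
    return sum(xi * xi for xi in x)


def sq_norm_grad(x: Vector) -> List[float]:
    return [2.0 * xi for xi in x]


p, q = [1.0, 2.0], [3.0, 0.0]
d = bregman_divergence(sq_norm, sq_norm_grad, p, q)
# d equals the squared Euclidean distance (1 - 3)^2 + (2 - 0)^2 = 8
```

Passing the gradient explicitly keeps the sketch free of any automatic-differentiation dependency; in practice the gradient could equally be obtained numerically or symbolically.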

Properties

Non-negativity: $D_{F}(p,q)\geq 0$ for all p, q, with equality if and only if $p=q$. This is a consequence of the strict convexity of F.
Convexity: $D_{F}(p,q)$ is convex in its first argument, but not necessarily in the second argument (see ^[1]).
Linearity: If we think of the Bregman distance as an operator on the function F, then it is linear with respect to non-negative coefficients. In other words, for $F_{1},F_{2}$ strictly convex and differentiable, and $\lambda \geq 0$ ,

D_{F_{1}+\lambda F_{2}}(p,q)=D_{F_{1}}(p,q)+\lambda D_{F_{2}}(p,q)

Duality: The function F has a convex conjugate $F^{*}$. The Bregman distance defined with respect to $F^{*}$ equals the Bregman distance defined with respect to F with the two arguments swapped:

D_{F^{*}}(p^{*},q^{*})=D_{F}(q,p)

Here,

p^{*}=\nabla F(p)

and

q^{*}=\nabla F(q)

are the dual points corresponding to p and q.
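The duality relation can be checked numerically. The sketch below assumes the negative entropy $F(p)=\sum _{i}p(i)\log p(i)$ on positive vectors, whose gradient is $\log p(i)+1$ and whose convex conjugate is $F^{*}(y)=\sum _{i}e^{y(i)-1}$; the helper names are illustrative.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman(
    F: Callable[[Vector], float],
    grad_F: Callable[[Vector], Vector],
    p: Vector,
    q: Vector,
) -> float:
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    g = grad_F(q)
    return F(p) - F(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))


def neg_entropy(x: Vector) -> float:
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x: Vector) -> List[float]:
    return [math.log(xi) + 1.0 for xi in x]


def conjugate(y: Vector) -> float:
    # Convex conjugate of the negative entropy on the positive orthant.
    return sum(math.exp(yi - 1.0) for yi in y)


def conjugate_grad(y: Vector) -> List[float]:
    return [math.exp(yi - 1.0) for yi in y]


p = [0.2, 0.5, 0.3]
q = [0.6, 0.1, 0.3]
p_star = neg_entropy_grad(p)  # dual point of p
q_star = neg_entropy_grad(q)  # dual point of q

lhs = bregman(conjugate, conjugate_grad, p_star, q_star)  # D_{F*}(p*, q*)
rhs = bregman(neg_entropy, neg_entropy_grad, q, p)        # D_F(q, p)
```

Up to floating-point rounding, `lhs` and `rhs` agree, and both differ from $D_{F}(p,q)$ in general, which illustrates the lack of symmetry.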

Mean as minimizer: A key result about Bregman divergences is that, given a random vector, the mean vector minimizes the expected Bregman divergence from the random vector. This generalizes the textbook result that the mean of a set minimizes the total squared error to the elements of the set. The result was proved for the vector case by Banerjee et al. (2005) and extended to functions and distributions by Frigyik et al. (2008). It is important because it further justifies using a mean as a representative of a random set, particularly in Bayesian estimation.
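As a small numerical illustration, the sketch below uses the generalized Kullback–Leibler divergence on a hypothetical three-point sample and compares the average divergence $\tfrac{1}{n}\sum _{i}D_{F}(x_{i},c)$ at the arithmetic mean with its value at a few other candidates $c$. It assumes the minimization is over the second argument, as in Banerjee et al.; the sample values and candidate points are made up for the example.

```python
import math
from typing import List, Sequence


def generalized_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Generalized KL divergence, the Bregman divergence of negative entropy."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))


def average_divergence(samples: Sequence[Sequence[float]], c: Sequence[float]) -> float:
    """Empirical expectation of D_F(x, c) over the samples."""
    return sum(generalized_kl(x, c) for x in samples) / len(samples)


# Hypothetical positive sample points.
samples: List[List[float]] = [[1.0, 4.0], [2.0, 1.0], [6.0, 1.0]]
dim = len(samples[0])
mean = [sum(x[i] for x in samples) / len(samples) for i in range(dim)]

# Arbitrary competing representatives.
candidates = [[2.9, 2.0], [3.0, 2.1], [2.5, 1.5], [4.0, 3.0]]
at_mean = average_divergence(samples, mean)
others = [average_divergence(samples, c) for c in candidates]
```

Every entry of `others` exceeds `at_mean`, and the excess for a candidate $c$ equals $D_{F}(\mu ,c)$, where $\mu$ is the mean. This follows by expanding the definition and using the linearity of expectation.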

Law of cosines:^[2]

For any $p,q,z$

D_{F}(p,q)=D_{F}(p,z)+D_{F}(z,q)-(p-z)^{T}(\nabla F(q)-\nabla F(z))
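The identity follows by expanding the three divergences from the definition: the terms $F(z)$ cancel, and the remaining gradient terms combine into the cross term. A quick numerical check, assuming the generator $F(p)=-\sum _{i}\log p(i)$ on positive vectors and arbitrary illustrative points:

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman(
    F: Callable[[Vector], float],
    grad_F: Callable[[Vector], Vector],
    p: Vector,
    q: Vector,
) -> float:
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    g = grad_F(q)
    return F(p) - F(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))


def burg(x: Vector) -> float:
    # F(x) = -sum_i log x_i, defined for positive x.
    return -sum(math.log(xi) for xi in x)


def burg_grad(x: Vector) -> List[float]:
    return [-1.0 / xi for xi in x]


p, q, z = [1.0, 2.0], [0.5, 3.0], [2.0, 1.5]

lhs = bregman(burg, burg_grad, p, q)

# Cross term (p - z)^T (grad F(q) - grad F(z)).
gq, gz = burg_grad(q), burg_grad(z)
cross = sum((pi - zi) * (a - b) for pi, zi, a, b in zip(p, z, gq, gz))

rhs = bregman(burg, burg_grad, p, z) + bregman(burg, burg_grad, z, q) - cross
```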

Generalized Pythagorean Theorem:^[3]

Consider the "Bregman projection" of $q$ onto a convex set $\Omega$: $P_{\Omega }(q)={\text{argmin}}_{\omega \in \Omega }D_{F}(\omega ,q)$. Then, for every $p\in \Omega$, the divergences satisfy an inequality analogous to the one relating the sides of an obtuse triangle:

D_{F}(p,q)\geq D_{F}(p,P_{\Omega }(q))+D_{F}(P_{\Omega }(q),q).
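A concrete instance can be sketched under two assumptions: the divergence is the generalized Kullback–Leibler divergence, and $\Omega$ is the probability simplex. In that case a Lagrange-multiplier argument shows that the Bregman projection of a positive vector $q$ is simply $q$ divided by its sum. Because the simplex is cut out by an affine constraint, the inequality holds with equality here. The function names and the test vectors are illustrative.

```python
import math
from typing import List, Sequence


def generalized_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Generalized KL divergence, the Bregman divergence of negative entropy."""
    return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))


def project_to_simplex_kl(q: Sequence[float]) -> List[float]:
    """Bregman (generalized KL) projection of a positive vector onto the simplex.

    Minimizing D_F(w, q) subject to sum(w) = 1 gives w proportional to q.
    """
    total = sum(q)
    return [qi / total for qi in q]


q = [0.5, 2.0, 1.5]   # positive vector outside the simplex
p = [0.2, 0.3, 0.5]   # a point inside the simplex
proj = project_to_simplex_kl(q)

lhs = generalized_kl(p, q)
rhs = generalized_kl(p, proj) + generalized_kl(proj, q)
# lhs >= rhs in general; for this affine constraint the two sides coincide
# up to floating-point rounding.
```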

Examples

Squared Euclidean distance $D_{F}(x,y)=\|x-y\|^{2}$ is the canonical example of a Bregman distance, generated by the convex function $F(x)=\|x\|^{2}$.
The squared Mahalanobis distance $D_{F}(x,y)={\tfrac {1}{2}}(x-y)^{T}Q(x-y)$ is generated by the convex function $F(x)={\tfrac {1}{2}}x^{T}Qx$, where $Q$ is a symmetric positive-definite matrix. This can be thought of as a generalization of the above squared Euclidean distance.
The generalized Kullback–Leibler divergence

D_{F}(p,q)=\sum _{i}p(i)\log {\frac {p(i)}{q(i)}}-\sum _{i}p(i)+\sum _{i}q(i)

is generated by the negative entropy function

F(p)=\sum _{i}p(i)\log p(i)

The Itakura–Saito distance,

D_{F}(p,q)=\sum _{i}\left({\frac {p(i)}{q(i)}}-\log {\frac {p(i)}{q(i)}}-1\right)

is generated by the convex function

F(p)=-\sum _{i}\log p(i)
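The closed forms above can be compared against the general definition. The sketch below assumes positive vectors and illustrative helper names; it evaluates $F(p)-F(q)-\langle \nabla F(q),p-q\rangle$ for the two generators and sets the results next to the stated formulas.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def bregman(
    F: Callable[[Vector], float],
    grad_F: Callable[[Vector], Vector],
    p: Vector,
    q: Vector,
) -> float:
    """D_F(p, q) = F(p) - F(q) - <grad F(q), p - q>."""
    g = grad_F(q)
    return F(p) - F(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))


# Generator of the generalized Kullback-Leibler divergence.
def neg_entropy(x: Vector) -> float:
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x: Vector) -> List[float]:
    return [math.log(xi) + 1.0 for xi in x]


# Generator of the Itakura-Saito distance.
def burg(x: Vector) -> float:
    return -sum(math.log(xi) for xi in x)


def burg_grad(x: Vector) -> List[float]:
    return [-1.0 / xi for xi in x]


def generalized_kl(p: Vector, q: Vector) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) - sum(p) + sum(q)


def itakura_saito(p: Vector, q: Vector) -> float:
    return sum(pi / qi - math.log(pi / qi) - 1.0 for pi, qi in zip(p, q))


p = [0.7, 1.2, 2.5]
q = [1.1, 0.4, 3.0]

kl_from_definition = bregman(neg_entropy, neg_entropy_grad, p, q)
is_from_definition = bregman(burg, burg_grad, p, q)
kl_closed_form = generalized_kl(p, q)
is_closed_form = itakura_saito(p, q)
```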

Generalizing projective duality

A key tool in computational geometry is the idea of projective duality, which maps points to hyperplanes and vice versa, while preserving incidence and above-below relationships. There are numerous analytical forms of the projective dual: one common form maps the point $p=(p_{1},\ldots ,p_{d})$ to the hyperplane $x_{d+1}=\sum _{i=1}^{d}2p_{i}x_{i}$. This mapping can be interpreted (identifying the hyperplane with its normal) as the convex conjugate mapping that takes the point p to its dual point $p^{*}=\nabla F(p)$, where F defines the d-dimensional paraboloid $x_{d+1}=\sum _{i=1}^{d}x_{i}^{2}$.

If we now replace the paraboloid by an arbitrary convex function, we obtain a different dual mapping that retains the incidence and above-below properties of the standard projective dual. This implies that natural dual concepts in computational geometry like Voronoi diagrams and Delaunay triangulations retain their meaning in distance spaces defined by an arbitrary Bregman divergence. Thus, algorithms from "normal" geometry extend directly to these spaces (Boissonnat, Nielsen and Nock, 2010).

Generalization of Bregman divergences

Bregman divergences can be interpreted as limit cases of skewed Jensen divergences (see Nielsen and Boltz, 2011). Jensen divergences can be generalized using comparative convexity, and the limit cases of these generalized skewed Jensen divergences yield generalized Bregman divergences (see Nielsen and Nock, 2017). The Bregman chord divergence^[4] is obtained by taking a chord instead of a tangent line.

Bregman divergence on other objects

Bregman divergences can also be defined between matrices, between functions, and between measures (distributions). Bregman divergences between matrices include Stein's loss and von Neumann entropy. Bregman divergences between functions include total squared error, relative entropy, and squared bias; see the references by Frigyik et al. below for definitions and properties. Similarly, Bregman divergences have also been defined over sets, through a submodular set function, which is known as the discrete analog of a convex function. The submodular Bregman divergences subsume a number of discrete distance measures, such as the Hamming distance, precision and recall, mutual information and some other set-based distance measures (see Iyer & Bilmes, 2012, for more details and properties of the submodular Bregman divergences).

For a list of common matrix Bregman divergences, see Table 15.1 in.^[5]

Applications

In machine learning, Bregman divergences are used to derive the bi-tempered logistic loss, which is reported to perform better than the softmax function on noisy datasets.^[6]

Bregman divergence is used in the formulation of mirror descent, which includes optimization algorithms used in machine learning such as gradient descent and the hedge algorithm.
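To make the connection concrete: a mirror descent step chooses the next iterate by minimizing a linearized loss plus a Bregman divergence to the current iterate. With $F(x)={\tfrac {1}{2}}\|x\|^{2}$ this yields the ordinary gradient step; with the negative entropy on the probability simplex it yields a multiplicative, hedge-style update. The sketch below shows the latter under stated assumptions: a fixed hypothetical loss vector, a constant step size `eta`, and an illustrative function name.

```python
import math
from typing import List, Sequence


def entropic_mirror_step(w: Sequence[float], grad: Sequence[float], eta: float) -> List[float]:
    """One mirror descent step on the simplex with the negative-entropy generator.

    Solves argmin_v eta * <grad, v> + D_F(v, w) over the simplex, whose
    solution is v_i proportional to w_i * exp(-eta * grad_i).
    """
    unnormalized = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical per-coordinate losses; the objective is the linear loss <w, losses>,
# so its gradient is the loss vector itself.
losses = [0.9, 0.1, 0.5]
w = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]  # start from the uniform distribution
for _ in range(50):
    w = entropic_mirror_step(w, losses, eta=0.5)
# The weights stay on the simplex and concentrate on the lowest-loss coordinate.
```

The normalization step is the Bregman projection back onto the simplex for this generator, which is why no separate Euclidean projection is needed.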

References

[1] "Joint and separate convexity of the Bregman Distance", by H. Bauschke and J. Borwein, in D. Butnariu, Y. Censor, and S. Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Elsevier 2001.
[2] https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf
[3] https://www.cs.utexas.edu/users/inderjit/Talks/bregtut.pdf
[4] Nielsen, Frank; Nock, Richard (2019). "The Bregman Chord Divergence". Geometric Science of Information. Lecture Notes in Computer Science. Vol. 11712. pp. 299–308. arXiv:1810.09113. doi:10.1007/978-3-030-26980-7_31. ISBN 978-3-030-26979-1. S2CID 53046425.
[5] "Matrix Information Geometry", R. Nock, B. Magdalou, E. Briys and F. Nielsen.
[6] Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren (2019). "Robust Bi-Tempered Logistic Loss Based on Bregman Divergences". Conference on Neural Information Processing Systems. pp. 14987–14996.

Banerjee, Arindam; Merugu, Srujana; Dhillon, Inderjit S.; Ghosh, Joydeep (2005). "Clustering with Bregman divergences". Journal of Machine Learning Research. 6: 1705–1749.
Bregman, L. M. (1967). "The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming". USSR Computational Mathematics and Mathematical Physics. 7 (3): 200–217. doi:10.1016/0041-5553(67)90040-7.
Frigyik, Bela A.; Srivastava, Santosh; Gupta, Maya R. (2008). "Functional Bregman Divergences and Bayesian Estimation of Distributions" (PDF). IEEE Transactions on Information Theory. 54 (11): 5130–5139. arXiv:cs/0611123. doi:10.1109/TIT.2008.929943. S2CID 1254. Archived from the original (PDF) on 2010-08-12.
Iyer, Rishabh; Bilmes, Jeff (2012). "Submodular-Bregman divergences and Lovász-Bregman divergences with Applications". Conference on Neural Information Processing Systems.
Frigyik, Bela A.; Srivastava, Santosh; Gupta, Maya R. (2008). An Introduction to Functional Derivatives (PDF). UWEE Tech Report 2008-0001. University of Washington, Dept. of Electrical Engineering. Archived from the original (PDF) on 2017-02-17. Retrieved 2014-03-20.
Harremoës, Peter (2017). "Divergence and Sufficiency for Convex Optimization". Entropy. 19 (5): 206. arXiv:1701.01010. Bibcode:2017Entrp..19..206H. doi:10.3390/e19050206.
Nielsen, Frank; Nock, Richard (2009). "The dual Voronoi diagrams with respect to representational Bregman divergences" (PDF). Proc. 6th International Symposium on Voronoi Diagrams. IEEE. doi:10.1109/ISVD.2009.15.
Nielsen, Frank; Nock, Richard (2007). "On the Centroids of Symmetrized Bregman Divergences". arXiv:0711.3242 [cs.CG].
Nielsen, Frank; Boissonnat, Jean-Daniel; Nock, Richard (2007). "On Visualizing Bregman Voronoi diagrams". Proc. 23rd ACM Symposium on Computational Geometry (video track). doi:10.1145/1247069.1247089.
Boissonnat, Jean-Daniel; Nielsen, Frank; Nock, Richard (2010). "Bregman Voronoi Diagrams". Discrete and Computational Geometry. 44 (2): 281–307. arXiv:0709.2196. doi:10.1007/s00454-010-9256-1. S2CID 1327029.
Nielsen, Frank; Nock, Richard (2006). "On approximating the smallest enclosing Bregman Balls". Proc. 22nd ACM Symposium on Computational Geometry. pp. 485–486. doi:10.1145/1137856.1137931.
Nielsen, Frank; Boltz, Sylvain (2011). "The Burbea-Rao and Bhattacharyya centroids". IEEE Transactions on Information Theory. 57 (8): 5455–5466. arXiv:1004.5049. doi:10.1109/TIT.2011.2159046. S2CID 14238708.
Nielsen, Frank; Nock, Richard (2017). "Generalizing Skew Jensen Divergences and Bregman Divergences With Comparative Convexity". IEEE Signal Processing Letters. 24 (8): 1123–1127. arXiv:1702.04877. Bibcode:2017ISPL...24.1123N. doi:10.1109/LSP.2017.2712195. S2CID 31899023.
