Generalized effective numbers and entropies

This text defines generalized entropies and some of their properties. In what follows, P(I) denotes the space of probability distributions on the set I. For simplicity, we take I to be finite and write its size as |I|. Given a probability distribution p, its value at the point i is denoted p_i.

1 Size estimates and averaging

Given a known probability distribution p on I and a random sample i, we can estimate the size of I by \frac{1}{p_i}. In the case of a uniform distribution, this estimate is exact and equal to |I|. Given a sequence of iid samples i_1,...,i_n, we can obtain a better estimate of the size through a generalized f-mean:

f^{-1}\left[\frac{1}{n}\sum_{k=1}^n f\left( \frac{1}{p_{i_k}} \right) \right]

which, because of the law of large numbers, converges to:

f^{-1}\left[\sum_i p_i f\left( \frac{1}{p_i} \right) \right]

This is well defined for f that are injective and continuous (hence strictly monotone, with an interval as image). The application of f^{-1} in both cases is justified because the arguments are convex combinations of elements in the image of f.
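
As a minimal numerical sketch of this estimator (assuming NumPy; the function and variable names are mine, and f = \ln is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_mean_size_estimate(samples, p, f, f_inv):
    """Estimate via the generalized f-mean of 1/p_{i_k} over iid samples."""
    values = f(1.0 / p[samples])   # f(1/p_{i_k}) for each sample
    return f_inv(values.mean())    # f^{-1} of the empirical average

# A non-uniform distribution on |I| = 4 states.
p = np.array([0.4, 0.3, 0.2, 0.1])
samples = rng.choice(len(p), size=100_000, p=p)

# With f = ln the estimate converges to the limit displayed above,
# which equals |I| only when p is uniform.
print(f_mean_size_estimate(samples, p, np.log, np.exp))
```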

This will be the starting point of our explanation.

2 Effective numbers

We start by defining the f-effective number, which will be our main concept:

Definition: Generalized Effective Number

Let f be a continuous, injective real function. Then the f-effective number of the probability distribution p \in P(I) is defined as:

N_f(p) = f^{-1}\left[ S_f(p) \right] = f^{-1}\left[\sum_i p_i f\left( \frac{1}{p_i} \right) \right]

For any such f, N_f takes the same value on the uniform distribution, where every p_i = 1/|I|:

\begin{align*} N_f(p) &= f^{-1}\left[\sum_i \frac{1}{|I|} f\left( |I| \right) \right] \\ &= f^{-1}\left[f\left( |I| \right) \right] \\ &= |I| \end{align*}

Different f can generate the same effective number. For any constants a \neq 0 and b, if g(x) = a f(x) + b then N_g = N_f.
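
As a small sketch of the definition (assuming NumPy; the helper name is mine), including a check of the affine-invariance claim with g = a f + b:

```python
import numpy as np

def effective_number(p, f, f_inv):
    """N_f(p) = f^{-1}[ sum_i p_i f(1/p_i) ], summing only over p_i > 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return f_inv(np.sum(p * f(1.0 / p)))

p = np.array([0.5, 0.25, 0.25])

# f = ln gives the exponential of the Shannon entropy.
n_f = effective_number(p, np.log, np.exp)

# g = a*f + b (with a != 0) gives the same effective number.
a, b = 3.0, -2.0
g = lambda x: a * np.log(x) + b
g_inv = lambda y: np.exp((y - b) / a)
n_g = effective_number(p, g, g_inv)

print(n_f, n_g)  # both approximately 2.828
```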

3 Product decomposition

If p and q are independent probability distributions in I and J, then pq defined by pq_{ij} = p_i q_j is the resulting probability distribution in I \times J.

For the uniform distribution, we have N_f(pq) = |I||J| = N_f(p) N_f(q). However, this does not hold for arbitrary f and arbitrary p, q. It does hold for a particular family of functions. Besides the example of f = \ln, which gives the classical Gibbs-Boltzmann effective number:

N_f(p) = \exp \left[ -\sum_i p_i \ln p_i \right]

the product rule N_f(pq) = N_f(p) N_f(q) also holds for the power functions f(x) = x^{1-\alpha}, for which:

N_f(p) = \left[ \sum_i p_i^{\alpha} \right]^{\frac{1}{1-\alpha}}

From now on, we will focus on these families of f.
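
A quick numerical check of the product rule for the power family (a sketch assuming NumPy; the value \alpha = 0.5 and the distributions are arbitrary):

```python
import numpy as np

def n_power(p, alpha):
    """N_f for f(x) = x^(1-alpha): (sum_i p_i^alpha)^(1/(1-alpha)), alpha != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.sum(p ** alpha) ** (1.0 / (1.0 - alpha))

alpha = 0.5
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.4])
pq = np.outer(p, q).ravel()  # product distribution on I x J

# N_f(pq) == N_f(p) * N_f(q) for this family
print(n_power(pq, alpha), n_power(p, alpha) * n_power(q, alpha))
```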

4 Logarithms and exponentials

As mentioned before, many functions can define the same effective number, and the same is the case for the power functions. In order to choose canonical members from these families of power functions, we define the \alpha-logarithms and \alpha-exponentials:

Definition: Generalized Logarithm

Let \alpha be a real number. Then the \alpha-logarithm is defined as:

l_\alpha(x) = \frac{x^{1-\alpha}-1}{1 - \alpha}

A first important property of this logarithm is that:

\lim_{\alpha \to 1} l_\alpha(x) = \ln(x)

meaning that by defining l_1 = \ln we can unify all of these functions in a single family.

In general, these logarithms behave differently from the standard logarithm. First, they do not transform products into sums:

l_\alpha(xy) = l_\alpha(x) + l_\alpha(y) + (1-\alpha) \, l_\alpha(x) \, l_\alpha(y)

and their relation to inverses is different, creating a connection between \alpha and 2 - \alpha:

l_\alpha\left(\frac{1}{x}\right) = - l_{2 - \alpha}(x)

Their derivatives are also power functions, but with different exponents:

\frac{d l_\alpha}{d x}(x) = x^{-\alpha}

For all of these logarithms, l_\alpha(1) = 0 and l'_\alpha(1) = 1. On the domain [1,\infty), the image of l_\alpha is [0,\infty) for \alpha \leq 1, and [0,\frac{1}{\alpha - 1}) for \alpha > 1. For \alpha > 0 these logarithms are concave. These are their graphs for a few selected \alpha:
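
A small sketch of l_\alpha with numeric checks of the product and inverse identities above (assuming NumPy; the test values are arbitrary):

```python
import numpy as np

def l_alpha(x, alpha):
    """Generalized logarithm; reduces to ln in the limit alpha -> 1."""
    if alpha == 1.0:
        return np.log(x)
    return (x ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

alpha, x, y = 1.5, 2.0, 3.0

# product rule: l_a(xy) = l_a(x) + l_a(y) + (1 - a) l_a(x) l_a(y)
lhs = l_alpha(x * y, alpha)
rhs = (l_alpha(x, alpha) + l_alpha(y, alpha)
       + (1 - alpha) * l_alpha(x, alpha) * l_alpha(y, alpha))

# inverse rule: l_a(1/x) = -l_{2-a}(x)
inv_lhs = l_alpha(1.0 / x, alpha)
inv_rhs = -l_alpha(x, 2.0 - alpha)

print(np.isclose(lhs, rhs), np.isclose(inv_lhs, inv_rhs))
```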

The inverses of these \alpha-logarithms are the \alpha-exponentials:

Definition: Generalized Exponential

Let \alpha be a real number. Then the \alpha-exponential is defined as:

e_\alpha(x) = [(1-\alpha)x + 1]^{\frac{1}{1-\alpha}}

Similarly to the logarithm we have that:

\lim_{\alpha \to 1} e_\alpha(x) = \exp(x)

meaning that by defining e_1 = \exp we unify all these effective numbers that preserve products into a single family.

The behavior of these exponentials with respect to products differs from that of the standard exponential:

e_\alpha(x)e_\alpha(y) = e_\alpha(x + y + (1 - \alpha)xy)

and their relation to inverses is different, creating a connection between \alpha and 2 - \alpha, like for the logarithms:

\frac{1}{e_\alpha(x)} = e_{2 - \alpha}(-x)

As for derivatives, they are powers of the function itself:

\frac{d e_\alpha}{d x}(x) = e_\alpha(x)^\alpha

Conversely, we have that:

\frac{1}{2-\alpha}\frac{d\, e_\alpha^{2-\alpha}}{d x}(x) = e_\alpha(x)

showing yet another connection between \alpha and 2-\alpha.

For all of these exponentials, e_\alpha(0) = 1 and e'_\alpha(0) = 1. For \alpha \leq 1, the domain of e_\alpha is [0,\infty), and for \alpha > 1 the domain is [0,\frac{1}{\alpha - 1}). For \alpha > 0, these exponentials are convex. These are their graphs for a few selected \alpha:

If we define the \alpha-sum as:

x \oplus_\alpha y = l_\alpha(e_\alpha(x)e_\alpha(y)) = x + y + (1 - \alpha)xy

which implies that \oplus_1 = +, then we can find additivity-like properties:

\begin{align*} l_\alpha(xy) &= l_\alpha(x) \oplus_\alpha l_\alpha(y) \\ e_\alpha(x)e_\alpha(y) &= e_\alpha(x \oplus_\alpha y) \end{align*}

which will be useful later.
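
A sketch of e_\alpha and the \oplus_\alpha identities (assuming NumPy; helper names and test values are mine):

```python
import numpy as np

def e_alpha(x, alpha):
    """Generalized exponential, inverse of l_alpha; reduces to exp for alpha -> 1."""
    if alpha == 1.0:
        return np.exp(x)
    return ((1.0 - alpha) * x + 1.0) ** (1.0 / (1.0 - alpha))

def oplus(x, y, alpha):
    """alpha-sum: x + y + (1 - alpha) x y."""
    return x + y + (1.0 - alpha) * x * y

alpha, x, y = 0.5, 0.7, 1.2

# e_a(x) e_a(y) = e_a(x oplus_a y)
print(np.isclose(e_alpha(x, alpha) * e_alpha(y, alpha),
                 e_alpha(oplus(x, y, alpha), alpha)))

# 1 / e_a(x) = e_{2-a}(-x)
print(np.isclose(1.0 / e_alpha(x, alpha), e_alpha(-x, 2.0 - alpha)))
```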

5 Hill numbers

In the literature, the effective numbers N_\alpha = N_{l_\alpha} are called Hill numbers:

Definition: Hill numbers

The Hill number of order \alpha is defined as:

N_\alpha(p) = e_\alpha\left[\sum_i p_i l_\alpha\left( \frac{1}{p_i} \right) \right] = \left( \sum_i p_i^{\alpha} \right)^{\frac{1}{1-\alpha}}

for \alpha \neq 1 and otherwise as:

N_1(p) = \exp\left[\sum_i p_i \ln\left( \frac{1}{p_i} \right) \right] = \prod_i p_i^{-p_i}
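
A minimal sketch computing Hill numbers of a few orders (assuming NumPy; the helper handles the \alpha = 1 Shannon limit separately):

```python
import numpy as np

def hill_number(p, alpha):
    """Hill number N_alpha(p); alpha = 1 is the Shannon limit."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if alpha == 1.0:
        return np.exp(-np.sum(p * np.log(p)))
    return np.sum(p ** alpha) ** (1.0 / (1.0 - alpha))

# Bernoulli(0.9): far from uniform, so the effective numbers sit well below 2.
bern = np.array([0.9, 0.1])
for alpha in (0.0, 0.5, 1.0, 2.0):
    print(alpha, hill_number(bern, alpha))

# Uniform distribution on 4 states: every order gives exactly 4.
print(hill_number(np.full(4, 0.25), 2.0))
```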

For Bernoulli distributions they look like this:

As you can see, all of them agree on the value for the uniform distribution.

The higher \alpha, the smaller the estimate. So for \alpha > 1, we expect the effective state space to actually be smaller, indicating “attractive correlations”. For \alpha < 1 we see the opposite, “repulsive correlations” and a bigger effective state space.

The function N_\alpha is only concave for 0 < \alpha < 2, so this will be the interval of \alpha that we will use for applications.

6 Entropies

Usually the entropy is taken as the primary object; here, instead, it is defined in terms of effective numbers:

Definition: Generalized Entropy

Let f be a continuous, injective real function. Then the f-entropy of the probability distribution p \in P(I) is defined as:

S_f(p) = f\left[ N_f(p) \right] = \sum_i p_i f\left( \frac{1}{p_i} \right)

In general, different f will generate different entropies, unlike effective numbers, where many f give the same N_f. The main advantage of entropies is that they are simpler to compute, because they do not involve f^{-1}.

7 Tsallis entropies

In the literature, the entropies S_\alpha = S_{l_\alpha} are called Tsallis entropies:

Definition: Tsallis entropy

The Tsallis entropy of order \alpha is defined as:

S_\alpha(p) = \sum_i p_i l_\alpha\left( \frac{1}{p_i} \right) = \frac{\sum_i p_i^{\alpha} - 1}{1-\alpha}

for \alpha \neq 1 and otherwise as:

S_1(p) = \sum_i p_i \ln\left( \frac{1}{p_i} \right)
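
A sketch of the Tsallis entropy and a check of the relation S_\alpha(p) = l_\alpha(N_\alpha(p)) (assuming NumPy; helper names and the test distribution are mine):

```python
import numpy as np

def tsallis_entropy(p, alpha):
    """S_alpha(p) = sum_i p_i l_alpha(1/p_i) = (sum_i p_i^alpha - 1)/(1 - alpha)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if alpha == 1.0:
        return -np.sum(p * np.log(p))  # Shannon entropy
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

p = np.array([0.5, 0.3, 0.2])
alpha = 1.5

n = np.sum(p ** alpha) ** (1.0 / (1.0 - alpha))        # Hill number N_alpha(p)
l_of_n = (n ** (1.0 - alpha) - 1.0) / (1.0 - alpha)    # l_alpha(N_alpha(p))

print(np.isclose(tsallis_entropy(p, alpha), l_of_n))
```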

For Bernoulli distributions they look like this:

S_\alpha is concave for all \alpha > 0.

Now we are ready to study how entropies decompose. Using the identities S_\alpha(p) = l_\alpha(N_\alpha(p)) and N_\alpha(p) = e_\alpha(S_\alpha(p)) we have:

\begin{align*} S_\alpha(pq) &= l_\alpha [ N_\alpha(pq) ] \\ &= l_\alpha [ N_\alpha(p) N_\alpha(q) ] \\ &= l_\alpha [ e_\alpha(S_\alpha(p)) e_\alpha(S_\alpha(q)) ] \\ &= S_\alpha(p) \oplus_\alpha S_\alpha(q) \\ &= S_\alpha(p) + S_\alpha(q) + (1 - \alpha)S_\alpha(p) S_\alpha(q) \end{align*}

which reduces to the traditional sum decomposition formula for the regular entropy, but means that this family of entropies is non-extensive in general.
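
A quick numerical confirmation of this pseudo-additive decomposition (a self-contained sketch assuming NumPy; \alpha and the distributions are arbitrary):

```python
import numpy as np

def s_alpha(p, alpha):
    """Tsallis entropy for alpha != 1: (sum_i p_i^alpha - 1) / (1 - alpha)."""
    p = np.asarray(p, dtype=float).ravel()
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

alpha = 0.7
p = np.array([0.6, 0.4])
q = np.array([0.5, 0.3, 0.2])
pq = np.outer(p, q)  # product distribution on I x J

# S(pq) = S(p) + S(q) + (1 - alpha) S(p) S(q)
lhs = s_alpha(pq, alpha)
rhs = (s_alpha(p, alpha) + s_alpha(q, alpha)
       + (1.0 - alpha) * s_alpha(p, alpha) * s_alpha(q, alpha))
print(np.isclose(lhs, rhs))
```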

8 Conditional decomposition

Let p be a probability distribution on I \times J. Let p^I be the marginal distribution on I, and for each i let p^i be the conditional distribution on J. Then:

\begin{align*} N_\alpha(p) &= \left[ \sum_{ij} p_{ij}^{\alpha} \right]^{\frac{1}{1-\alpha}} \\ &= \left[ \sum_{ij} (p^i_j p^I_i)^{\alpha} \right]^{\frac{1}{1-\alpha}} \\ &= \left[ \sum_i (p^I_i)^{\alpha} \sum_j (p^i_j)^{\alpha} \right]^{\frac{1}{1-\alpha}} \\ &= \left[ \sum_i (p^I_i)^{\alpha} N_\alpha(p^i)^{1-\alpha} \right]^{\frac{1}{1-\alpha}} \\ &= \left[ \sum_i p^I_i \left(\frac{ N_\alpha(p^i)}{p^I_i}\right)^{1-\alpha} \right]^{\frac{1}{1-\alpha}} \\ &= N_\alpha(p^I,N_\alpha(p^\square)) \end{align*}

\begin{align*} S_\alpha(p) &= \sum_{ij} p_{ij} l_\alpha\left(\frac{1}{p_{ij}}\right) \\ &= \sum_{ij} p^i_j p^I_i l_\alpha\left(\frac{1}{p^i_jp^I_i}\right) \\ &= \sum_{i} p^I_i l_\alpha\left(\frac{1}{p^I_i}\right) + \sum_{ij} p^i_j p^I_i l_\alpha\left(\frac{1}{p^i_j}\right) + (1-\alpha)\sum_{ij} p^i_j p^I_i l_\alpha\left(\frac{1}{p^i_j}\right) l_\alpha\left(\frac{1}{p^I_i}\right) \\ &= S_\alpha(p^I) + \sum_i p^I_i S_\alpha(p^i) + (1-\alpha)\sum_{i} p^I_i l_\alpha\left(\frac{1}{p^I_i}\right) S_\alpha(p^i) \\ &= S_\alpha(p^I) + S_\alpha(p|p^I) + (1-\alpha)\sum_{i} p^I_i l_\alpha\left(\frac{1}{p^I_i}\right) S_\alpha(p^i) \end{align*}
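
A sketch verifying the last form of this decomposition on a random joint distribution (assuming NumPy; helper names and \alpha = 1.3 are mine):

```python
import numpy as np

def s_alpha(p, alpha):
    """Tsallis entropy for alpha != 1."""
    p = np.asarray(p, dtype=float).ravel()
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

def l_alpha(x, alpha):
    return (x ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

alpha = 1.3
rng = np.random.default_rng(1)
p = rng.random((3, 4))
p /= p.sum()                   # joint distribution on I x J

p_I = p.sum(axis=1)            # marginal on I
p_cond = p / p_I[:, None]      # conditional distributions p^i on J

s_cond = np.array([s_alpha(row, alpha) for row in p_cond])
lhs = s_alpha(p, alpha)
rhs = (s_alpha(p_I, alpha)
       + np.sum(p_I * s_cond)
       + (1.0 - alpha) * np.sum(p_I * l_alpha(1.0 / p_I, alpha) * s_cond))
print(np.isclose(lhs, rhs))
```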

9 Gibbs’ inequality and divergence

For any f that is concave and such that f(1) = 0 and f'(1) = 1, we can derive Gibbs’ inequality, since concavity gives f(x) \leq f(1) + f'(1)(x - 1) = x - 1. Summing only over i where p_i > 0 (and assuming q_i > 0 for those i, so that the terms are defined):

\sum_i p_i f\left( \frac{q_i}{p_i} \right) \leq \sum_i p_i \left( \frac{q_i}{p_i} - 1\right) \leq 0

Applying this to the \alpha-logarithms together with the inverse relation l_\alpha(1/x) = -l_{2-\alpha}(x), we have, for \alpha < 2 (so that l_{2-\alpha} is concave), that:

\sum_i p_i l_\alpha\left( \frac{p_i}{q_i} \right) \geq 0

Therefore, we can generalize Kullback-Leibler divergences:

Definition: Generalized divergence

The generalized divergence of order \alpha is defined as:

D_\alpha(p||q) = \sum_i p_i l_\alpha\left( \frac{p_i}{q_i} \right)

As with entropies, the divergence of a product distribution does not decompose into a plain sum, although the decomposition is simple:

\begin{align*} D_\alpha(pq||rs) &= \sum_{ij} p_iq_j l_\alpha\left( \frac{p_iq_j}{r_is_j} \right) \\ &= \sum_i p_i l_\alpha\left( \frac{p_i}{r_i} \right) + \sum_j q_j l_\alpha\left( \frac{q_j}{s_j} \right) + (1 - \alpha) \sum_{ij} p_i q_j l_\alpha\left( \frac{p_i}{r_i} \right)l_\alpha\left( \frac{q_j}{s_j} \right) \\ &= D_\alpha(p||r) + D_\alpha(q||s) + (1-\alpha)D_\alpha(p||r)D_\alpha(q||s) \\ &= D_\alpha(p||r) \oplus_\alpha D_\alpha(q||s) \end{align*}
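
A sketch of the generalized divergence together with a numeric check of this decomposition (assuming NumPy; the distributions and \alpha are arbitrary):

```python
import numpy as np

def l_alpha(x, alpha):
    return (x ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

def d_alpha(p, q, alpha):
    """D_alpha(p||q) = sum_i p_i l_alpha(p_i/q_i), summing over p_i > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * l_alpha(p[mask] / q[mask], alpha))

alpha = 0.8
p, r = np.array([0.6, 0.4]), np.array([0.5, 0.5])
q, s = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.4, 0.2])

lhs = d_alpha(np.outer(p, q).ravel(), np.outer(r, s).ravel(), alpha)
d1, d2 = d_alpha(p, r, alpha), d_alpha(q, s, alpha)
rhs = d1 + d2 + (1.0 - alpha) * d1 * d2  # D(p||r) oplus_alpha D(q||s)
print(lhs, rhs)  # equal, and nonnegative by the generalized Gibbs inequality
```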

10 Evaluating independence

We can measure D_\alpha(p||p^I \times p^J); this is called the mutual information, and it is 0 if the distribution is a product of independent variables.
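
A small sketch of this independence measure (assuming NumPy; the joint distribution below is made up for illustration):

```python
import numpy as np

def d_alpha(p, q, alpha):
    """D_alpha(p||q) = sum_i p_i l_alpha(p_i/q_i), summing over p_i > 0."""
    p, q = np.asarray(p, dtype=float).ravel(), np.asarray(q, dtype=float).ravel()
    mask = p > 0
    ratio = p[mask] / q[mask]
    return np.sum(p[mask] * (ratio ** (1.0 - alpha) - 1.0) / (1.0 - alpha))

alpha = 0.8
p = np.array([[0.30, 0.10],
              [0.15, 0.45]])                        # joint distribution on I x J
product = np.outer(p.sum(axis=1), p.sum(axis=0))    # p^I x p^J

# Positive: the two variables are correlated; it would be 0 for a product.
print(d_alpha(p, product, alpha))
```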

11 Stuff

I noticed that if the process is deterministic, effectively |X^N| = |X| so |X^N|^{1/N} goes to 1. The idea is to use instead |X^N|^{1/N^\alpha}, and then \alpha = 0 would be the right choice here. I still don’t know how to reconcile this with Hölder means.

Jackson q-derivative