VII. Ergodic Theory

version of 3 June 1992

By POINCARÉ's theorem, the interesting property of recurrence is generic for many dynamical systems, including the motions of the molecules of a gas in an insulated bottle (assuming the validity of classical physics), which has a state space of huge dimensionality, say 10²⁵ or much more. One of the most basic and familiar physical facts is that when left alone, macroscopic matter tends to thermodynamic equilibrium, and this happens largely independently of initial conditions or even of the composition of the matter. In the early days of the kinetic theory of gases, J. CLERK MAXWELL and L. BOLTZMANN hoped to explain this striking phenomenon by appealing to recurrence. BOLTZMANN enunciated the ergodic hypothesis, the original form of which posits that the trajectory of every initial condition passes through every point of the surface in phase space having the same total energy as the initial condition. This portion of phase space is known as the energy shell, and because of conservation of energy, we may treat the energy shell as the state space. If the ergodic hypothesis were true, BOLTZMANN felt, then the time averages of physical observables would tend, as t → ∞, to a spatial average, the average of the observable over the energy shell. That average is what we would recognize as the thermodynamic equilibrium value of the observable. (Caution: Equilibrium has a different meaning here than in section II. Thermodynamic equilibrium certainly does not mean that the molecules are at rest, which would only happen at absolute zero temperature.) EHRENFEST later realized that the original form of the ergodic hypothesis was probably impossible for physical systems, and modified it to something more in the spirit of POINCARÉ recurrence.

Recall one of our simplest models, the circle rotation φ(θ) = θ + α mod 2π, with α an irrational multiple of 2π. This model is known, from a theorem due to P. BOHL, W. SIERPIŃSKI, and H. WEYL [ARNOLD-AVEZ, 1968], to have an equidistribution property of the form desired by BOLTZMANN:

Theorem VII.1. Consider the circle rotation with α/2π irrational. Given any arc A and any initial angle θ₀, let N(A,k) denote the number of iterates φ^m(θ₀) lying in the arc A for m ≤ k. Then

lim_{k→∞} N(A,k)/k = length(A)/2π. (7.1)
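A quick numerical check of (7.1) is easy to carry out; here is a sketch in Python, in which the rotation number, the arc, and the initial angle are all arbitrary choices:

```python
import math

# Numerical check of (7.1): iterate theta -> theta + alpha (mod 2*pi)
# and count the visits to an arc A.  Here alpha/(2*pi) = 1/sqrt(2), irrational.
alpha = 2 * math.pi / math.sqrt(2)
a, b = 1.0, 2.5                  # the arc A = [a, b), an arbitrary choice
theta = 0.3                      # initial angle theta_0, also arbitrary
hits = 0
K = 100_000
for _ in range(K):
    theta = (theta + alpha) % (2 * math.pi)
    if a <= theta < b:
        hits += 1
print(hits / K, (b - a) / (2 * math.pi))   # N(A,k)/k vs length(A)/(2*pi)
```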

Exercise VII.1. The usual proof of this theorem is indirect, using FOURIER series. Outline a direct proof of this theorem by deriving a contradiction as follows: Show that if the theorem were false, then there would be two arcs A₁ and A₂ with length(A₁) < length(A₂) but

lim_{k→∞} N(A₁,k)/k > lim_{k→∞} N(A₂,k)/k .

Next use JACOBI's theorem (Exercise III.2) to show that for some fixed finite M, φ^M(A₁) ⊂ A₂, and argue that this implies that

lim_{k→∞} N(A₁,k)/k ≤ lim_{k→∞} N(A₂,k)/k .

It follows that the limit depends only on the length of the arc. Finish the proof by showing that the dependence is a strict proportionality.

Exercise VII.2. How does the equidistribution theorem extend to free motion on a torus?

Now consider what happens when we average some continuous function f(θ) over time, i.e., allowing θ to evolve into φ^k(θ). Because the sampling points φ^k(θ) are equidistributed, we discover that:

lim_{N→∞} (1/N) Σ_{k=0}^{N-1} f(φ^k(θ)) = (1/2π) ∫_{S¹} f(θ) dθ. (7.2)

In other words, the time-average equals the phase-average. This is the property that BOLTZMANN hoped to derive from his hypothesis for systems of many degrees of freedom.

While initially attractive, the ergodic hypothesis leads to some troubling paradoxes, notably ZERMELO's paradox: If the gas is initially released from one corner of an insulated bottle, POINCARÉ's recurrence theorem predicts that, according to classical physics, at some time all the molecules will suddenly rush back into the corner of the bottle. According to the ergodic hypothesis, some analogous phenomenon should eventually happen for almost every trajectory. But this has never happened in all of history! The difficulty is that the time required for recurrence in a typical dynamical system of more than very low dimensionality is vast. Following [THIRRING, 1980], imagine ten clocks running at a frequency of roughly 1 HERTZ, i.e., 1 revolution per second, but suppose that the clocks are not quite accurate, and the precise frequencies are incommensurate (have irrational ratios). This dynamical system is mathematically identical to free motion on a ten-dimensional torus, with irrational speeds, and, as should have been discovered in Exercise VII.2, every trajectory is recurrent and has an equidistribution property. Let us estimate the time it takes for all the clocks to return to the initial position in synchronism within 1% of a second. The state space can be divided into (1%)⁻¹⁰ = 10²⁰ cells of this margin of error, so it should take about 10²⁰ seconds to sample them all and return. The universe is believed to be less than 10¹⁸ seconds old.
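The scaling behind this estimate is easy to see in a scaled-down simulation; the sketch below uses three clocks and a crude 5% tolerance (both choices mine, so that the run finishes quickly):

```python
import math

# Scaled-down recurrence estimate: n clocks = free motion on an n-torus.
# The first return of all phases to within eps of the start takes on the
# order of (2*eps)**(-n) steps; ten clocks at 1% give the 1e20 of the text.
freqs = [math.sqrt(2), math.sqrt(3), math.sqrt(5)]   # incommensurate
eps = 0.05
t = 0
while True:
    t += 1
    phases = [(f * t) % 1.0 for f in freqs]
    if all(p < eps or p > 1 - eps for p in phases):
        break
print("recurred at t =", t, "; predicted order:", (2 * eps) ** -len(freqs))
```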

Exercise VII.3. Estimate the recurrence time for a mole of hydrogen atoms at room temperature confined to a one-liter bottle. Feel free to make simplifying assumptions and crude approximations.

Paradoxes like this have led to the recognition that the important fact is not the original form of BOLTZMANN's hypothesis, but rather that time-averages can be replaced by phase averages. Recurrence and ergodicity (the ability to equate time-averages and phase averages) depend on the existence of invariant measures. Normally, we shall assume that there is an invariant measure u on the state space S, i.e., as in (4.5),

u(U) = u(φ_t⁻¹(U)), (7.3)

and that u(S) < ∞. For convenience let us normalize the measure so that u(S) = 1, i.e., u is a probability measure. Many state spaces have infinite measure, but it may be possible to restrict to finite subregions; for instance, if conservation of energy applies, motions take place on subsurfaces of fixed energy, which are typically of finite measure.

Digression on invariant measures. In the abstract, it is usually not difficult to produce invariant measures. A theorem of KRYLOV (Крылов) and BOGOLIUBOV (Боголюбов) states that if S is a compact metric space and F maps S to itself continuously, then there exists an invariant measure on S for F. Unfortunately, this measure can be unpleasantly singular. A dynamical system may, and usually does, have more than one invariant measure. For example, a point measure (= delta function) supported at a fixed point of an iterated mapping is clearly invariant under the dynamical system, as is a sum of equal point measures supported at the points of an n-cycle. Thus, for the cat map, not only is dx dy an invariant probability measure, so are δ(x)δ(y) dx dy and the measure assigning the weight 1/4 to each of the points (1/3,1/3), (2/3,0), (2/3,2/3), and (1/3,0), viz.,

(1/4) [δ(x−1/3)δ(y−1/3) + δ(x−2/3)δ(y) + δ(x−2/3)δ(y−2/3) + δ(x−1/3)δ(y)] dx dy.
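The invariance of this four-point measure can be checked directly. The sketch below assumes the cat map in the form (x,y) ↦ (x+y, x+2y) mod 1; if an earlier section used a different matrix, the check is analogous. The four points are permuted cyclically:

```python
from fractions import Fraction as Fr

# The cat map, taken here in the form (x, y) -> (x + y, x + 2y) mod 1,
# permutes the four points cyclically, so the measure putting weight 1/4
# on each of them is invariant.
def cat(p):
    x, y = p
    return ((x + y) % 1, (x + 2 * y) % 1)

pts = {(Fr(1, 3), Fr(1, 3)), (Fr(2, 3), Fr(0)),
       (Fr(2, 3), Fr(2, 3)), (Fr(1, 3), Fr(0))}
print({cat(p) for p in pts} == pts)        # True: the set maps onto itself
```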

Often, different invariant measures will be mutually singular, meaning that the sets of positive measure according to one of them are sets of measure zero according to the other, but there are degenerate cases where this is not true. For the identity mapping, all measures are obviously preserved. For a less trivial example, consider the simple harmonic oscillator. In appropriate units its motion is a uniform rotation in the phase plane, so every measure that is independent of the angular variable in the phase plane will be invariant. For many purposes, it is desirable to know that the invariant measure is smooth at least along the unstable directions. Such measures are called SRB measures after SINAI (Синай), RUELLE, and BOWEN. Under many circumstances, an SRB measure can be generated directly from the dynamics by looking at the trajectories beginning in an initial open set (weighted according to the LEBESGUE measure); the measure so generated is known as the physical measure.

For a differentiable dynamical system

dz/dt = V(z), (7.4)

a formal calculation in the spirit of LIOUVILLE's Theorem shows that the invariant measures of the form du = w(z)dz satisfy

div(wV) = 0. (7.5)

Of course the invariant measure might easily be singular. When the coefficients are nice, equation (7.5) has many independent solutions for the density, or weight, w(z). A partial differential equation of this sort is usually solved by the method of characteristics, and, of course, the characteristic equations for it are precisely (7.4). The invariant measures are, roughly speaking, arbitrary transversely to the trajectories.
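For instance, for the simple harmonic oscillator mentioned above, (7.5) can be checked in one line; the radial profile w below is an arbitrary smooth function, which illustrates the arbitrariness transverse to the trajectories:

```latex
% Check of (7.5) for the harmonic oscillator: V(x,y) = (y, -x).
% Take any smooth density of the form w = w(x^2 + y^2).
\[
  \operatorname{div}(wV)
    = \partial_x\bigl(w(x^2+y^2)\,y\bigr)
    + \partial_y\bigl(-\,w(x^2+y^2)\,x\bigr)
    = 2xy\,w' - 2xy\,w' = 0,
\]
% so every measure w(x^2+y^2)\,dx\,dy is invariant, one for each choice
% of the profile w along the transverse (radial) direction.
```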

More abstractly, an absolutely integrable invariant density is a fixed point of a certain operator P, called the FROBENIUS-PERRON operator, i.e., it satisfies Pw = w, where P is defined by the condition that, for any measurable set U,

∫_U (Pw)(z) dz = ∫_{φ⁻¹(U)} w(z) dz. (7.6)

The discussion above raises the hope that along trajectories, or in one dimension, the dynamics determines a natural invariant measure, and this is frequently the case.

Example. Let φ₁ = F₄ = 4x(1−x). Then the FROBENIUS-PERRON operator can be calculated by taking the sets U to be intervals of the form [0,x) and differentiating with respect to x: the preimage F₄⁻¹([0,x)) consists of the two intervals [0, (1−√(1−x))/2) and ((1+√(1−x))/2, 1], so

(Pw)(x) = [w((1−√(1−x))/2) + w((1+√(1−x))/2)] / (4√(1−x)).

An invariant density must thus satisfy the functional equation w = Pw, which was analyzed by ULAM and VON NEUMANN, who found the solution

w(x) = 1/(π√(x(1−x))).

This is the same invariant density that is obtained by transforming F₄ to the dyadic map and then "pulling back" the LEBESGUE measure, which is invariant under the latter.
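As a numerical check, the orbit of a typical point should distribute itself according to this density; here is a sketch (bin count and orbit length are arbitrary choices, and floating-point roundoff is being trusted, as it usually can be for histograms):

```python
import math, random

# Histogram check: a typical orbit of F4(x) = 4x(1-x) distributes itself
# according to w(x) = 1/(pi*sqrt(x(1-x))), whose integral is
# W(x) = (2/pi)*arcsin(sqrt(x)).
x = random.random()
N, bins = 200_000, [0] * 10
for _ in range(N):
    x = 4 * x * (1 - x)
    bins[min(int(10 * x), 9)] += 1
for j, count in enumerate(bins):
    a, b = j / 10, (j + 1) / 10
    exact = (2 / math.pi) * (math.asin(math.sqrt(b)) - math.asin(math.sqrt(a)))
    print(f"[{a:.1f},{b:.1f}): observed {count / N:.4f}, predicted {exact:.4f}")
```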

In more general situations, (7.2) is replaced by

lim_{N→∞} (1/N) Σ_{k=0}^{N-1} f(φ^k(z)) = (1/u(S)) ∫_S f(z) du(z), (7.7)

for a discrete dynamical system, and by

lim_{T→∞} (1/T) ∫_0^T f(φ_t(z)) dt = (1/u(S)) ∫_S f(z) du(z), (7.8)

for a continuous dynamical system.

If f is a function of z ∈ S, let us refer to its time mean, i.e., the left side of (7.7) or (7.8), as f*, and its phase mean, i.e., the right side of (7.7) or (7.8), as f^. Their general relationship is described by a theorem of BIRKHOFF and KHINCHIN (Хинчин):

Theorem VII.2. Let S be a finite-measure space with invariant measure u, and assume that f is absolutely integrable. Then

a) f*(z) exists for a.e. z, and f*(z) is absolutely integrable;

b) For a.e. z, and all times t, f*(φ_t(z)) = f*(z), i.e., the time mean is time-invariant apart from a null set of z.

c) (f*)^ = f^. (7.9)

In this theorem, the condition of absolute integrability can be replaced by square-integrability. The variant of this theorem where the convergence of the time-mean is asserted in the integrated (L1 or L2) sense is due to VON NEUMANN.

Definition. A dynamical system with u(S) < ∞ is ergodic iff f* = f^ a.e. for every absolutely integrable f.

The use of the word ergodic has evolved somewhat since BOLTZMANN's day, and on occasion it is still used, for example, to refer to the density of a trajectory (cf. section V). In addition, several modern sources use apparently different definitions which are in fact equivalent to the one used here, according to the following theorem:

Theorem VII.3. A dynamical system with u(S) = 1 is ergodic if and only if

a) the only invariant subsets I ⊂ S have measure u(I) = 0 or 1; or, equivalently,

b) the only invariant real-valued functions f on S are constant a.e.

The value of this theorem is that properties a) and b) are often easier to check than the equality of time and phase means. In this theorem, invariant means invariant in time n or t, while constant means constant in z. Think of real-valued functions on S as physical observables of the system.

Proof. a) implies b): Assume a) and that a function f is time-invariant. If f is invariant, then the set U_α = {z: f(z) ≥ α} is invariant for any real number α, and therefore has measure 0 or 1. Now, as α → −∞, u(U_α) → 1, whereas as α → +∞, u(U_α) → 0. Since U_α ⊂ U_β for α > β, there is a single critical value α_c where the measure changes from 1 to 0, and, clearly, f(z) = α_c a.e.

b) implies a): Assume b) and that a set U is time-invariant. Then the indicator function for U, i.e.,

χ_U(x) := 1 if x ∈ U, 0 if x ∉ U,

must also be invariant. The only way this function can be constant a.e. is for u(U) to be 0 or 1.

Ergodic implies b): Suppose that f is invariant under an ergodic transformation. Invariance implies that f(x) = f*(x), but ergodicity implies that f*(x) is a constant, viz., f^, a.e.

b) implies ergodic: By the BIRKHOFF-KHINCHIN theorem, for any f(x), f*(x) is time-invariant. Assuming b), f*(x) is constant a.e. This constant must be its average, i.e., f* = (f*)^ = f^ by (7.9).

∎

Properties a) and b) are pretty easy to see for simple models such as the circle rotations with [[alpha]]/2[[pi]] irrational. There are often nontrivial invariant sets of zero measure; for instance an n-cycle is a finite number of points, and thus normally has measure zero. We shall see that the cat map is ergodic, and, as we know, the points with rational coördinates are periodic. Although the rational points form a dense set, the measure of the rationals is zero, so that fact does not contradict the theorem.

An area with perhaps surprisingly many applications of ergodic dynamics is number theory [FURSTENBERG, 1981]. To get the flavor of this, let's investigate the distribution of first digits of the numbers 2ⁿ, n = 0, 1, ... [following ARNOLD-AVEZ, 1968]. These digits are 1, 2, 4, 8, 1, 3, 6, 1, 2, 5, 1, 2, 4, 8, 1, .... For large n, what is the probability that the first digit is, say, 7? Is it less than that of 8? The calculator reveals that the first 7 occurs for 2⁴⁶. The condition that the first digit of 2ⁿ is k is that

k·10ʳ < 2ⁿ < (k+1)·10ʳ,

or, equivalently,

r + log₁₀(k) < nα < r + log₁₀(k+1), (7.10)

where α is the irrational number log₁₀(2) ≈ .30103. The interval [0,1) is the non-overlapping union of the intervals [log₁₀(1), log₁₀(2)), [log₁₀(2), log₁₀(3)), ..., [log₁₀(9), log₁₀(10)), so we can unambiguously calculate mod 1:

log₁₀(k) < nα mod 1 < log₁₀(k+1).

The numbers nα mod 1 correspond to irrational rotations of the point 0 on a circle, scaled to have circumference 1 rather than 2π, and are thus equidistributed in [0,1) by the theorem of BOHL et al. It follows that the limiting probability of a first digit being k is

log₁₀(k+1) − log₁₀(k) = log₁₀((k+1)/k).
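This prediction is easy to test; a minimal sketch (the cutoff of 10,000 powers is an arbitrary choice):

```python
import math
from collections import Counter

# First-digit statistics of 2**n versus the predicted weights log10((k+1)/k).
counts = Counter(int(str(2 ** n)[0]) for n in range(10_000))
for k in range(1, 10):
    print(k, counts[k] / 10_000, math.log10((k + 1) / k))
```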

Exercise VII.4. Find the distribution of the first digits of aⁿ in base-b arithmetic, where a and b are fixed integers.

In 1938, BENFORD observed that the first digits of many disparate tables of data are distributed according to this same law. BENFORD checked the surface areas of rivers, street addresses of prominent citizens, etc. It is something of a mystery why BENFORD's rule should hold so universally, but ergodicity is probably a large part of the explanation.

Ergodicity has important consequences for the notion of a LYAPUNOV exponent. As we know, in one dimension, by Equation (6.2), the LYAPUNOV exponent is a time average,

λ(x) = lim_{N→∞} (1/N) Σ_{k=0}^{N-1} ln|F'(x_k)| .
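Numerically, this time average is easy to estimate; here is a sketch for F₄ = 4x(1−x) (my choice of example; the exact value ln 2 is the subject of the next exercise):

```python
import math, random

# Time-average estimate of the exponent for F4(x) = 4x(1-x), whose
# derivative is F4'(x) = 4 - 8x; compare with the exact value ln 2.
x = random.random()
N, total = 100_000, 0.0
for _ in range(N):
    total += math.log(abs(4 - 8 * x))
    x = 4 * x * (1 - x)
print(total / N, math.log(2))
```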

Exercise VII.5. Show that the LYAPUNOV exponent for the quadratic mapping F₄ is ln 2, using a direct calculation with the invariant measure du = dx/(π√(x(1−x))).

If the derivative is bounded away from 0 and ∞, then the BIRKHOFF-KHINCHIN theorem guarantees that the LYAPUNOV exponent is defined for a.e. x. If the dynamical system is ergodic, then λ(x) is independent of x a.e. These facts have been extended to the many-dimensional case by the multiplicative ergodic theorem of OSELEDEC (Оселедец):

Theorem VII.4. Suppose that <S, φ_t, u> is an ergodic dynamical system, and T(z) is an n×n matrix-valued function of z satisfying

∫_S log⁺‖T(z)‖ du(z) < ∞.

Let T_k(z) := T(φ^{k−1}(z)) T(φ^{k−2}(z)) ... T(z). Then for a.e. z, the limit

lim_{k→∞} (T_k(z)* T_k(z))^{1/2k}

exists. The logarithms of its eigenvalues are the LYAPUNOV exponents.

In this theorem, log⁺(x) = max(0, ln(x)). A variant of the theorem states that for a.e. z there is a nested sequence of subspaces Rⁿ = E⁽¹⁾(z) ⊃ E⁽²⁾(z) ⊃ ... of the tangent space Rⁿ at z, such that if v ∈ E⁽ʲ⁾(z)\E⁽ʲ⁺¹⁾(z), v ≠ 0, then

lim_{k→∞} (1/k) ln‖T_k(z)v‖ = λ⁽ʲ⁾.

The notation is such that the LYAPUNOV exponents are enumerated in decreasing order: λ⁽¹⁾ > λ⁽²⁾ > ....

The example of the circle rotations shows that there is nothing particularly chaotic about a merely ergodic motion, and since I criticized POINCARÉ recurrence earlier because of the extraordinarily long times required for the recurrence to take place, the alert reader will have been asking for some comparable estimates in the context of ergodic theory. The notion of a correlation function makes it possible to speak of a characteristic time in this sense:

Definition. Given an ergodic probability measure u, the correlation function for two sets of positive measure U and V is

C(t;U,V) := u(φ_t⁻¹(U) ∩ V) − u(U)u(V).

The word autocorrelation function is often used when U=V. If a system is topologically transitive, it often suffices to consider only the autocorrelation functions. A related quantity is the time autocorrelation function,

C(t,T;U) := u(φ_{t+T}(U) ∩ φ_t(U)) − u(φ_t(U))².

The term power spectrum refers to the FOURIER transform of a correlation function.

We can speak of a characteristic time τ for the decay of correlations if, as is frequently the case, |C(t;U,V)| ≤ const. exp(−t/τ). The existence of a τ of the right order of magnitude is actually what is required for thermodynamic equilibrium. For the circle rotations, there is no exponential upper bound, so in effect τ = ∞.

The minimal desideratum for a statistical understanding of a dynamical system as t → ∞ is that the correlations tend to 0 in that limit. This is known as the mixing property, a term which also originated in statistical mechanics, in the work of GIBBS. It can be thought of as a measure-theoretic replacement for the concept of topological transitivity. We suppose without loss of generality that u(S) = 1, in order to simplify matters slightly:

Definition. A dynamical system is mixing if for any pair of measurable sets U and V,

lim_{t→∞} u(φ_t⁻¹(U) ∩ V) = u(U) u(V). (7.11)

Again, if the dynamical system is invertible, (7.11) can be replaced with

lim_{t→∞} u(U ∩ φ_t(V)) = u(U) u(V). (7.12)

Mixing can be thought of as a statement of statistical independence after long times. The conditional probability of an event U occurring at time t given event V at time 0 is

u(φ_t⁻¹(U) ∩ V) / u(V).

According to (7.11) this probability is asymptotically the same as u(U), the probability of event U. Since the flow is measure-preserving, u(U) = u(φ_t⁻¹(U)), and

u(φ_t⁻¹(U) ∩ V) / u(V) → u(φ_t⁻¹(U)).

In other words, if a system is mixing, information about initial conditions becomes worthless for making predictions after long times.

Suppose that a dynamical system is mixing and U = V is invariant. Then we find that u(U) = u(U)², so u(U) = 0 or 1. The theorem stated above then shows that the dynamical system is ergodic. Mixing implies ergodicity, but not vice versa: for instance, circle rotations are not mixing.

Exercise VII.6. Show that the circle rotation is not mixing.

The cat map is mixing, as is visually demonstrated by looking at the image of some figure in the rectangle 0 ≤ x, y ≤ 1 after many iterations, and that fact can be proved by using the Markov partition and symbolic dynamics for the cat map [e.g., ARNOLD-AVEZ, 1968, section 10]. The situation is a bit easier to understand for some of our other models, such as the baker's transformation.

Let us begin with the dyadic mapping, F(x) = 2x mod 1. As we know, in binary arithmetic this mapping is equivalent to the left shift on the digits of x. The usual measure dx is invariant: if I is an interval in [0,1), then F⁻¹(I) consists of two intervals, one to the left of 1/2 and one to the right. Both intervals are exactly half as long as I. Here it is important to use the definition with inverse images (7.3), since the dyadic map is not uniquely invertible. We also observe that the set of x ∈ [0,1) such that r of its binary digits have been specified (0 or 1) is a set of intervals of total length 2⁻ʳ.

Exercise VII.7. Prove this observation.

Suppose now we take two intervals A = [k·2⁻ⁿ, (k+1)·2⁻ⁿ) and B = [j·2⁻ᵐ, (j+1)·2⁻ᵐ). A represents a set of numbers where the first n digits have been specified, while B represents a set of numbers where the first m digits have been specified. If p > m, then F⁻ᵖ(A) (i.e., (F^p)⁻¹(A)) is a set of numbers with digits between p+1 and p+n specified, so F⁻ᵖ(A) ∩ B has n+m digits specified, and thus has measure 2⁻ⁿ⁻ᵐ = u(A)u(B). The definition of mixing is satisfied for such intervals. Since all BOREL measurable sets can be generated from these basic intervals by complementation and countable union and intersection, it is not hard to see that (7.11) must hold for all A and B.
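This digit-counting argument is easy to confirm by Monte Carlo; a sketch, with the intervals and sample size chosen arbitrarily:

```python
import random

# Monte Carlo check of (7.11) for F(x) = 2x mod 1 with A = [0, 1/4) and
# B = [1/2, 5/8): the fraction of samples lying in both B and F^{-p}(A)
# should tend to u(A)*u(B) = 1/32 once p exceeds the digits fixed by B.
N = 200_000
for p in range(1, 7):
    hits = sum(1 for _ in range(N)
               if 0.5 <= (x := random.random()) < 0.625   # x in B
               and (x * 2 ** p) % 1.0 < 0.25)              # F^p(x) in A
    print(p, hits / N, 1 / 32)
```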

Example. The transformation (x,y) -> (2x mod 1, 2y mod 1) on the unit square is likewise mixing, by a similar argument.

Example. The mapping x → frac(1/x) (except 0 → 0) on [0,1) is mixing with respect to the invariant measure

du = (1/ln 2) · dx/(1+x)

(cf. Exercise IV.11).

Exercise VII.8. Is the baker's transformation mixing? Answer the question by considering the case of rectangles whose corners have coördinates of the form k·2⁻ⁿ, and arguing that general sets can be approximated by combinations of such rectangles.

Hint. Why would a baker use this transformation if it weren't mixing? Consider the case of two rectangles of the form

A = {(x,y): j·2⁻ᵏ < x < (j+1)·2⁻ᵏ, l·2⁻ᵐ < y < (l+1)·2⁻ᵐ}

and

B = {(x,y): n·2⁻ᵖ < x < (n+1)·2⁻ᵖ, q·2⁻ʳ < y < (q+1)·2⁻ʳ},

and think about intersections of B with φ^N(A) for large N, which will consist of a large number of horizontal bands of height 2⁻ᵐ⁻ᴺ.

Example. Let us return to the BERNOULLI shift σ of section V, choosing for S the symbol sequences Σ₂ (with no restrictions), and calling the two symbols H and T, for heads and tails. There is a natural probability measure on U ⊂ Σ₂ defined as the probability that flipping coins will produce a sequence in U. For example, if U is the set of all sequences with s₅ = H, s₇ = H, and s₉₉₉₇₃₅₄₆₃₈ = T, then the probability is 2⁻³ = 1/8, since there were three specifications. This should be reminiscent of Exercise VII.7, and in fact the proof that the system <σ, Σ₂> is mixing is just the same as for the dyadic map. It makes no difference, by the way, whether we work with the unilateral or bilateral version of Σ₂.

Suppose, however, that the coin is biased, so that the probability of heads is p ≠ .5 and the probability of tails is q = 1−p. Flipping this coin will also define a probability measure on Σ₂, and it is not hard to see that it is invariant, ergodic, and even mixing with respect to σ, by a slight modification of the argument given above. This is disturbing, since if f is any reasonable real-valued function on Σ₂, we expect the time average of f, which makes no reference to a measure, to be equal to all the different phase averages of f with different choices of p. Obviously those different phase averages cannot always be the same. For example, suppose that we equate H with the numerical value 1 and T with the value 0, and that f(s) = s₀·s₁ (the product of the values of s₀ and s₁). With the fair coin, p = .5, the expectation value of f is obtained by giving equal weights to the four values H·H = 1, H·T = 0, T·H = 0, and T·T = 0, i.e., ∫ f(s) du_.5(s) = .25. With a biased coin favoring H with p = .9, we get (.81)(1) + (.09)(0) + (.09)(0) + (.01)(0) = .81.
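The two computations are easy to reproduce numerically, together with the corresponding time averages along typical coin sequences (the sequence length below is an arbitrary choice):

```python
import random

# Time average of f(s) = s0*s1 along a coin sequence, computed by shifting
# the sequence: a p-coin sequence reproduces the phase average p**2
# belonging to its own measure.
def time_average(p, N=200_000):
    seq = [1 if random.random() < p else 0 for _ in range(N + 1)]
    return sum(seq[k] * seq[k + 1] for k in range(N)) / N

print(time_average(0.5), 0.25)   # fair coin: phase average .25
print(time_average(0.9), 0.81)   # biased coin: phase average .81
```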

Exercise VII.9. What's going on here?

A useful observation is that if two dynamical systems are topologically conjugate, and one of them has an invariant measure, then so does the other. If the one system is ergodic, or mixing, then so is the other, with respect to the right measure. Thus, suppose that h is a homeomorphism between S and T and the dynamical systems for x ∈ S and y ∈ T are related by

φ_t(x) = h⁻¹(ψ_t(h(x))).

If there is a measure dν on T, it can be pulled back to define a measure dρ on S by the following formula. (The ability to pull back a measure does not actually require a homeomorphism.)

ρ(U) := ν(h(U)). (7.13)

In particular, the quadratic mapping for coupling μ > 2+√5 is mixing with respect to the invariant measure corresponding to assigning probability p to the interval I_R to the right of the critical point 1/2 and the probability q := 1−p to the interval I_L to the left of it. The resulting measures are singular with respect to LEBESGUE measure and are supported on the fractal set Λ.

Another use of this observation is to point the way to a proof that the cat map is mixing. Once we have established a nontrivial symbolic dynamics for a system, it will be mixing with respect to any measure that is topologically conjugate to one of the mixing measures on the BERNOULLI system. As a general rule, in the absence of a symbolic dynamics it is extremely difficult to prove that a realistic system is mixing.

It is fairly easy, by the way, to generate an impression of measures like the balanced measure on Λ for the quadratic mapping on a computer screen. The function μx(1−x) has two inverses, one assigning values in I_R and the other values in I_L to x ∈ Λ. If one begins at a randomly chosen point and iterates the two inverses randomly with probabilities p and q, and plots the iterates, they will almost certainly distribute themselves according to the desired measure. Problems will obviously occur if one starts off on an n-cycle, but it is not actually necessary to start off in Λ, since the inverses will attract other points onto Λ. In order to get a reasonable picture, one should not plot the first hundred iterates, or however many iterates it takes to get points within a small error of Λ.

Exercise VII.10. Carry out this computer analysis.
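As a starting point, here is a minimal skeleton of the random-iteration scheme just described, with sample choices μ = 4.5 > 2+√5 and p = q = 1/2; plotting is left to the reader:

```python
import math, random

# Random iteration of the two inverse branches of F(x) = mu*x*(1-x):
# solving mu*y*(1-y) = x gives y = (1 -/+ sqrt(1 - 4x/mu))/2, the left
# and right branches.
mu, p = 4.5, 0.5                 # sample parameter and branch probability
x = random.random()
points = []
for n in range(10_000):
    s = math.sqrt(1 - 4 * x / mu)
    x = (1 - s) / 2 if random.random() < p else (1 + s) / 2
    if n >= 100:                 # discard the transient, as suggested above
        points.append(x)
print(len(points), min(points), max(points))   # these points sample Lambda
```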

Complicated dynamical systems typically have structures exhibiting one or more of the following signs of chaos:

a) A positive LYAPUNOV exponent;

b) The mixing property;

c) Fractal geometry.

Ordinarily, a dynamical system regarded as chaotic will exhibit at least two of these signs, and different scientists often use the word "chaotic" in different senses: examples of [GREBOGI et al., 1984] have convinced people that fractal geometry can occur in a non-chaotic setting, whereas many examples regarded as chaotic, such as the cat map, do not have obvious fractal structures. A working definition of chaos, in the spirit of DEVANEY's definition, might be a) and b) together. To see that a) is independent of b), and unconvincing by itself, observe that the dynamical system on R defined by the equation of motion dx/dt = 2x has a positive LYAPUNOV exponent (= 2). An example of GIRSANOV (Гирсанов) [ARNOLD-AVEZ, 1968, p. 45] shows that a system can be mixing without having a positive LYAPUNOV exponent. One problem with using the mixing property as the criterion for chaos is that it is as a rule very difficult to prove for realistic systems.

Exercise VII.11. Consider the dynamical system on S = {0 ≤ z < ∞} defined by

F(z) = 2z/(1−z²) for 0 ≤ z < 1,  (z²−1)/(2z) for 1 ≤ z < ∞.

This is ergodic with respect to the measure du = (2/π) dz/(1+z²). (Take this as given.)

a) Find the LYAPUNOV exponent for some point z₀ ≠ 0 of your own choosing.

b) Does almost every point have the same LYAPUNOV exponent? Justify your answer, and, if possible, find λ.

c) Verify or disprove that this transformation is chaotic (in the sense of Devaney).

Hint. tan(2θ) = 2 tan θ / (1 − tan²θ).

The entropy of a physical system is another notion having its origins in thermodynamics and statistical mechanics, which was used as the central concept of information theory by SHANNON and WEAVER [1949] and was brought into general dynamical systems theory by KOLMOGOROV (Колмогоров). Suppose that we wish to quantify the amount of information obtained by measuring a system. Intuitively, information is additive; the information gained by making two independent measurements is the sum of the information gained by making each one. Moreover, if we make a measurement and are lucky enough to obtain an improbable outcome, we have gained more information than if we obtain a more probable outcome; the police have an easier time if they are certain the criminal is a seven-foot-tall left-handed woman with flaming red hair than if the description matches dozens of people in any K-Mart. Suppose now that we measure two independent events with probabilities P₁ and P₂, so that their joint probability is P₁P₂. What familiar function could describe the information gained by these measurements? The answer is −ln(P), where P is the probability of the event measured. Since P ≤ 1, this is positive, and since −ln(P₁P₂) = −ln(P₁) + (−ln(P₂)), it is additive.

Now suppose that an experiment has k possible distinct outcomes, with probabilities p₁, ..., p_k. Without knowing the outcome of the experiment in advance, the expected increase of information would be −Σ_j p_j ln(p_j).

Definition. Let z(t) := −t ln(t) for 0 < t ≤ 1, and z(0) := 0, and let α denote a finite measurable partition of S, i.e., a collection of nonintersecting subsets A_j whose union is all of S. The entropy of the partition is

H(α) := Σ_j z(u(A_j)) = −Σ_j u(A_j) ln(u(A_j)). (7.14)

Remarks. Think of the partition as an exhaustive specification of the possible outcomes of an experiment. Many sources prefer to use log₂ rather than ln, so that the rate of information produced is measured in binary bits. Historically, the entropy as defined in thermodynamics is an integral of the reciprocal of the temperature with respect to the heat content, and appears to have little to do with this function. When trying to justify thermodynamics on the basis of the theory of statistical physics, however, BOLTZMANN showed that the entropy was equivalent to his H function, which is an integral over velocity space of f(v,t) ln(f(v,t)).

The definition given above has no dynamics in it. If we want to quantify the information gained by watching the outcome of an experiment again and again as the dynamical system runs, we define an entropy as follows:

If α and β are two partitions of S, we let α∨β be the partition consisting of all possible intersections of the form A_j ∩ B_k, A_j ∈ α, B_k ∈ β. This is a refinement of either original partition. The information value of such a refined partition, corresponding to multiple experiments, is greater than that of either individual experiment.

Definition. The entropy of the partition α with respect to the dynamics φ₁ is defined as

h(α, φ₁, u) := lim_{n→∞} (1/n) H(α ∨ φ₁⁻¹(α) ∨ ... ∨ φ₁⁻⁽ⁿ⁻¹⁾(α)). (7.15)

In words, this is the long-time average of the rate of increase of information as one continues to observe the dynamics. A theorem in [ARNOLD-AVEZ, 1968] guarantees the convergence of this limit. One final improvement can be made, viz., to make an optimal choice of the experimental set-up:

Definition. The entropy h([[phi]]1,u) of the dynamical system is defined as the supremum of h([[alpha]],[[phi]]1,u) over all finite measurable partitions.

In some cases the supremum may be avoided, if one can find a special sort of partition, known as a generating partition.

Example. Consider the BERNOULLI system <σ, Σ_N, u>, where u is the measure independently assigning probability p_k to the N possible values of each coördinate s_j, p₁+...+p_N = 1. Let us call the N symbols the numbers 1...N, and suppose that our initial partition {A_k} consists of all elements of Σ_N with s₁ = 1; s₁ = 2; etc. The entropy of this partition is

H({A_k}) = −Σ_k p_k ln(p_k),

whereas if the partition is refined by taking intersections with σ⁻¹(A_k), we get

H({A_k} ∨ σ⁻¹{A_k}) = −Σ_{j,k} p_j p_k ln(p_j p_k) = −2 Σ_k p_k ln(p_k).

When the partition is dynamically refined n times, we get exactly n times −Σ_k p_k ln(p_k), so the entropy of the partition with respect to the dynamics is −Σ_k p_k ln(p_k). It turns out that this partition is the optimal one, so the entropy of the system has this same value.
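This bookkeeping can be confirmed by brute force; a sketch, with N = 2 symbols and p = .9 as arbitrary choices, in which the n-fold refined partition is enumerated as cylinders on the first n symbols:

```python
import math
from itertools import product

# Entropy of the n-fold dynamical refinement for a Bernoulli (p, 1-p)
# measure: the refined cells are cylinders on n symbols, and H comes out
# to exactly n times the single-symbol entropy.
def H(probs):
    return -sum(q * math.log(q) for q in probs if q > 0)

p = 0.9
base = [p, 1 - p]
for n in range(1, 5):
    cylinders = [math.prod(w) for w in product(base, repeat=n)]
    print(n, H(cylinders), n * H(base))
```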

Notice that if all p_k = 1/N, then h = ln(N); in particular, the entropy of Σ₂ with the usual fair-coin measure is ln(2). Entropy is an invariant under topological conjugacy, provided that one measure is the pull-back of the other as in (7.13). One interesting consequence of this is that different BERNOULLI systems, such as <σ, Σ₂> with the fair-coin measure and, say, <σ, Σ₃> with the three symbols likewise equally probable, are not topologically equivalent.

We have seen that the quadratic mapping F₄ is topologically conjugate to the dyadic map, with an invariant measure which is the pull-back of the usual LEBESGUE measure. From the measure-theoretic point of view, the dyadic map is just like <σ, Σ₂, u_fair coin>, so the entropy is ln(2). This is, curiously, the same as the LYAPUNOV exponent. Indeed, a formula due to PESIN (Песин),

h(φ₁, u) = Σ_{λ⁽ʲ⁾ > 0} m_j λ⁽ʲ⁾, the sum of the positive LYAPUNOV exponents counted with their multiplicities m_j, (7.16)

has been proved to hold under wide circumstances. Specifically, suppose that u is an SRB measure:

Definition. Suppose that u is an ergodic measure and U ⊂ S with u(U) > 0 can be written as a disjoint union of pieces U_α of unstable manifolds V_α. If all of the conditional probability measures obtained by restricting u to U_α are absolutely continuous with respect to the LEBESGUE measure on V_α, then u is an SRB measure.

If there is more than one choice of U, it turns out that this definition does not depend on the choice.

Theorem VII.5 (LEDRAPPIER and YOUNG). If F is a smooth (C2) diffeomorphism on a compact set S and u is an ergodic measure, then (7.16) holds if and only if u is an SRB measure.

Unfortunately, there are some reasonable examples where no SRB measure exists.

In addition to being related to the LYAPUNOV exponents via (7.16), the entropy of a system is also connected with the notion of fractal dimension. If u is an invariant probability measure for a dynamical system, the information dimension can be defined as the infimum of the HAUSDORFF dimensions of all sets A of full measure, u(A) = 1.