*Advanced course « Economics of
Inequality » (Master APE, M2 year)*

Thomas Piketty

Academic year 2009-2010

**Course Notes D:**

**Pareto interpolation techniques for
income and wealth distribution**

__1. The interpolation problem__

Very often, available income or
wealth distribution data takes the form of “grouped data”:

Income (or wealth or wage…) brackets: [s_{1}; s_{2}], ..., [s_{i}; s_{i+1}], ..., [s_{p}; +∞)

N_{i} = number of individuals (or households or taxpayers…) between s_{i} and s_{i+1}

N* = total population

f_{i} = N_{i}/N* = proportion of population between s_{i} and s_{i+1}

N_{i}* = N_{i}+N_{i+1}+...+N_{p} = total number of individuals above s_{i}

p_{i} = N_{i}*/N* = proportion of population above s_{i}

Y_{i} = total income of population between s_{i} and s_{i+1}

y_{i} = Y_{i}/N_{i} = average income of population between s_{i} and s_{i+1}

y_{i}* = (Y_{i}+...+Y_{p})/N_{i}* = average income of population above s_{i}
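The definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_stats` and the three-bracket tabulation are made up for the example and do not come from any actual tax table.

```python
def grouped_stats(counts, totals, population):
    """Return f_i, N_i*, p_i, y_i and y_i* for each bracket i (lowest bracket first)."""
    rows = []
    n_above = 0.0      # running N_i* = N_i + ... + N_p
    y_sum_above = 0.0  # running Y_i + ... + Y_p
    # Accumulate from the top bracket downwards, then restore the original order.
    for n_i, y_tot_i in zip(reversed(counts), reversed(totals)):
        n_above += n_i
        y_sum_above += y_tot_i
        rows.append({
            "f": n_i / population,             # f_i
            "N_above": n_above,                # N_i*
            "p": n_above / population,         # p_i
            "y": y_tot_i / n_i,                # y_i
            "y_above": y_sum_above / n_above,  # y_i*
        })
    rows.reverse()
    return rows


# Hypothetical tabulation (illustrative numbers only).
thresholds = [50_000, 100_000, 200_000]        # s_1, s_2, s_3 (last bracket is open-ended)
counts = [800, 150, 50]                        # N_i
totals = [56_000_000, 19_500_000, 20_000_000]  # Y_i
population = 10_000                            # N* (includes people below s_1)

for s_i, row in zip(thresholds, grouped_stats(counts, totals, population)):
    print(s_i, row)
```

Note that N* is passed separately because the tabulation typically covers only part of the population.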

Sometimes information is available
only on the N_{i}, not on the Y_{i}.

Even when
micro data is available, sample size is very often insufficient, especially for
the top of the distribution.

In both
cases, one needs to make assumptions about the functional form of the
distribution f(y) in order to make proper use of the available data: this is the
interpolation problem.

__2. Lognormal
distributions__

y follows a lognormal distribution if and only if x = log(y) follows a
normal distribution

I.e. if the density function g(x) can be written:

g(x) = [1/(σ√(2π))] exp(-(x-µ)^{2}/(2σ^{2}))

The density function f(y) of the
lognormal distribution can thus be written:

f(y) = g(log(y))/y = [1/(yσ√(2π))] exp(-(log(y)-µ)^{2}/(2σ^{2}))

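Where µ and σ are the mean and standard deviation of x = log(y). The mean m, standard deviation s, median me and mode mo of y are then:

m = exp(µ + σ^{2}/2)

s^{2} = m^{2} (exp(σ^{2}) - 1)

me = exp(µ)

mo = exp(µ - σ^{2})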

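Conversely, one can express µ and
σ as functions of the mean m and standard deviation s of y:

σ^{2} = log(1 + s^{2}/m^{2})

µ = log(m) - σ^{2}/2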

(there is no closed-form solution for the distribution functions G(x) and F(y),
but they are available in all statistical software)

Example:

With mean m = 30 000€ and standard deviation s =
20 000€, one gets median me = 24 962€, mode mo = 17 281€, P90 =
54 297€

= fairly reasonable shape for the
bottom 90% of the distribution

But P90-100 (average above P90) = 74 931€, P99 =
102 315€, P99-100 (average above P99) = 126 646€

= not reasonable for the top 10% of
the distribution (and especially the top 1%)

>>> the problem of the
lognormal distribution is that the upper tail is not fat enough

(i.e. the density function declines
too fast toward 0)
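The example can be reproduced approximately with the Python standard library. This is a sketch that assumes the usual lognormal moment formulas (σ² = log(1 + s²/m²), µ = log(m) - σ²/2) and the standard expression for the mean above a quantile. The function names are illustrative. The printed values should land close to the figures quoted above, with small gaps possibly due to rounding or to the numerical method used.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lognormal_params(mean, sd):
    """Recover (mu, sigma) of log(y) from the mean and standard deviation of y."""
    sigma2 = log(1 + (sd / mean) ** 2)
    mu = log(mean) - sigma2 / 2
    return mu, sqrt(sigma2)


def quantile(mu, sigma, q):
    """q-th quantile of y (e.g. q = 0.9 gives P90)."""
    return exp(mu + sigma * STD_NORMAL.inv_cdf(q))


def mean_above_quantile(mu, sigma, q):
    """Average of y above its q-th quantile (e.g. q = 0.9 gives P90-100)."""
    z = STD_NORMAL.inv_cdf(q)
    mean = exp(mu + sigma ** 2 / 2)
    return mean * STD_NORMAL.cdf(sigma - z) / (1 - q)


mu, sigma = lognormal_params(30_000, 20_000)
median = exp(mu)
mode = exp(mu - sigma ** 2)
p90 = quantile(mu, sigma, 0.90)
p99 = quantile(mu, sigma, 0.99)
p90_100 = mean_above_quantile(mu, sigma, 0.90)
p99_100 = mean_above_quantile(mu, sigma, 0.99)

print(f"median={median:.0f} mode={mode:.0f} P90={p90:.0f} P99={p99:.0f}")
print(f"P90-100={p90_100:.0f} P99-100={p99_100:.0f}")
print(f"ratio P99-100/P99 = {p99_100 / p99:.2f}")  # thin upper tail: average barely above threshold
```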

__3. Pareto distributions for top income & top
wealth__

The
density and distribution functions f(y) and F(y) of a Pareto-distributed variable y can
be written as follows:

G(y) = 1-F(y) = (k/y)^{a} for y ≥ k (k>0, a>1)

f(y) = ak^{a}/y^{(1+a)}

__Key
property n°1 : ratio average/threshold = constant__

Denote by
y*(y) the average income (or wealth, or wage) of the population above threshold
y. Then y*(y) can be expressed as follows:

y*(y) = [ ∫_{z>y} z f(z)dz ] / [ ∫_{z>y} f(z)dz ] =
[ ∫_{z>y} dz/z^{a} ] / [ ∫_{z>y} dz/z^{(1+a)} ] = ay/(a-1)

(since ∫_{z>y} dz/z^{a} = y^{(1-a)}/(a-1) and ∫_{z>y} dz/z^{(1+a)} = y^{-a}/a)

I.e.

y*(y)/y = b = a/(a-1)

(and a =
b/(b-1) )

In
practice :

For top
incomes:

For top
wealth:

Higher b
coefficients = fatter upper-tail = higher concentration
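As a rough numerical check of property n°1, the sketch below approximates the average above several thresholds by a midpoint rule on the quantile function y(u) = k u^{-1/a}, where u is the share of the population above y. The parameter values a = 2 and k = 10 000, and the function names, are illustrative assumptions.

```python
def b_from_a(a):
    """Inverted Pareto coefficient b = a/(a-1)."""
    return a / (a - 1)


def a_from_b(b):
    """Pareto coefficient a = b/(b-1)."""
    return b / (b - 1)


def mean_above(threshold, a, k, n=100_000):
    """Midpoint-rule approximation of the average of y above `threshold`."""
    p = (k / threshold) ** a  # share of the population above the threshold
    h = p / n                 # width of each slice of the top share
    # y(u) = k * u**(-1/a) is the income level with a share u of the population above it
    total = sum(k * ((j + 0.5) * h) ** (-1 / a) for j in range(n))
    return total * h / p


a, k = 2.0, 10_000
for threshold in (20_000, 50_000, 200_000):
    ratio = mean_above(threshold, a, k) / threshold
    print(threshold, round(ratio, 3), b_from_a(a))  # ratio stays close to b at every threshold
```

The approximation is slightly below the exact value because the integrand is unbounded near u = 0. The gap shrinks as `n` grows.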

__Key
property n°2 : log-linearity__

If one
plots log(1-F(y)) against log(y), one gets a straight line with slope = -a

log(1-F(y)) = a log(k) - a log(y)

Note:
if one plots log(f(y)) against log(y), one gets a straight line with slope
-(1+a): log(f(y)) = log(ak^{a}) - (1+a) log(y)
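A minimal sketch of property n°2: for a Pareto distribution, the slope of log(1-F(y)) against log(y) between any two points equals -a, and the intercept is a log(k). The values a = 2 and k = 10 000 and the function names are illustrative.

```python
from math import log


def survival(y, a, k):
    """Share of the population above y: 1 - F(y) = (k/y)^a, for y >= k."""
    return (k / y) ** a


def loglog_slope(y1, y2, a, k):
    """Slope of log(1 - F(y)) against log(y) between y1 and y2."""
    rise = log(survival(y2, a, k)) - log(survival(y1, a, k))
    run = log(y2) - log(y1)
    return rise / run


a, k = 2.0, 10_000
print(loglog_slope(20_000, 40_000, a, k))    # same slope, -a, ...
print(loglog_slope(50_000, 500_000, a, k))   # ... whichever pair of points is used
# Intercept of the line: log(1 - F(y)) + a*log(y) should equal a*log(k)
print(log(survival(30_000, a, k)) + a * log(30_000), a * log(k))
```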

Intuitive
meaning of coefficient a: if one raises y by 1%, by how many % does the
proportion above y decline? In practice this coefficient rises from the bottom
of the distribution (where it is usually much below 1) to the upper-middle part of the
distribution. The point is that it tends to stabilize around 1.8-2.2 over a
large plateau, P90-99.99. This is why Pareto is needed to complement lognormal: with a full
lognormal, coefficient a rises continuously, and b declines continuously to very
low levels. Coefficient a does of course go to infinity (and b goes to 1) for the very last
top incomes.
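To illustrate the claim about the full lognormal, the sketch below computes the ratio b = (average above a quantile) / (quantile) at several quantiles of a lognormal distribution. It assumes the σ implied by m = 30 000€ and s = 20 000€, and the function name `lognormal_b` is illustrative. The ratio does not depend on µ.

```python
from math import exp, log, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def lognormal_b(q, sigma):
    """b = (average of y above its q-th quantile) / (q-th quantile) for a lognormal y."""
    z = STD_NORMAL.inv_cdf(q)
    # average above quantile = exp(mu + sigma^2/2) * Phi(sigma - z) / (1 - q)
    # quantile               = exp(mu + sigma * z); mu cancels in the ratio
    return exp(sigma ** 2 / 2 - sigma * z) * STD_NORMAL.cdf(sigma - z) / (1 - q)


sigma = sqrt(log(1 + (20_000 / 30_000) ** 2))
for q in (0.5, 0.9, 0.99, 0.999):
    b = lognormal_b(q, sigma)
    a = b / (b - 1)  # implied local Pareto coefficient
    print(f"q={q}: b={b:.3f}, a={a:.2f}")  # b falls and a rises as q increases
```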

__4. Estimating Pareto parameters from grouped
data__

Grouped data:

Income (or wealth or wage…) brackets: [s_{1}; s_{2}], ..., [s_{i}; s_{i+1}], ..., [s_{p}; +∞)

N_{i} = number of individuals (or households or taxpayers…) between s_{i} and s_{i+1}

N* = total population

f_{i} = N_{i}/N* = proportion of population between s_{i} and s_{i+1}

N_{i}* = N_{i}+N_{i+1}+...+N_{p} = total number of individuals above s_{i}

p_{i} = N_{i}*/N* = proportion of population above s_{i}

Y_{i} = total income of population between s_{i} and s_{i+1}

y_{i} = Y_{i}/N_{i} = average income of population between s_{i} and s_{i+1}

y_{i}* = (Y_{i}+...+Y_{p})/N_{i}* = average income of population above s_{i}

Sometimes information is available
only on the N_{i}, not on the Y_{i}: then we say
that all we have is “frequency information”.

Otherwise
we say that we have both frequency and income information.
__Methodology
M1: Average income interpolation methodology__

(Piketty
2001, 2003, Piketty-Saez 2003)

This
methodology uses both frequency and income information, and relies on property
1.

In order
to estimate P99.5 (say), pick (s_{i},p_{i}) and (s_{i+1},p_{i+1})
such that p_{i+1} < 0.5% < p_{i}, and compute b, a and k
using the following formulas:

b = y_{i}* / s_{i}

a = b / (b-1)

k = s_{i} p_{i}^{1/a}

Then use
these coefficients to estimate P99.5 = k/(0.005^{1/a}) and
P99.5-100 = (a/(a-1)) P99.5 = b P99.5
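The steps of methodology M1 can be sketched as follows. The function name `pareto_m1` and the bracket values are hypothetical, and the target here is P99 (top share 1%) rather than P99.5, simply because it falls inside the made-up bracket.

```python
def pareto_m1(s_i, p_i, y_above_i, target_share):
    """Estimate a fractile threshold and the average above it from one bracket.

    s_i: lower threshold of the bracket; p_i: share of population above s_i;
    y_above_i: average income above s_i (y_i*); target_share: e.g. 0.01 for P99.
    """
    b = y_above_i / s_i                      # inverted Pareto coefficient (property 1)
    a = b / (b - 1)                          # Pareto coefficient
    k = s_i * p_i ** (1 / a)                 # scale parameter, from p_i = (k/s_i)^a
    threshold = k / target_share ** (1 / a)  # e.g. P99
    average_above = b * threshold            # e.g. P99-100
    return {"a": a, "b": b, "k": k, "threshold": threshold, "average_above": average_above}


# Hypothetical bracket: 2% of the population above 100 000, with average income 197 500.
est = pareto_m1(s_i=100_000, p_i=0.02, y_above_i=197_500, target_share=0.01)
print(est)
```

The estimated threshold lies above s_{i}, as it should when the target share is smaller than p_{i}.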

__Methodology
M2 : Standard log-linear interpolation methodology__

(Pareto (1896), Kuznets (1953), Feenberg and
Poterba (1993))

This
methodology uses solely frequency information, and relies on property 2.

In order
to estimate P99.5 (say), pick (s_{i},p_{i}) and (s_{i+1},p_{i+1})
such that p_{i+1} < 0.5% < p_{i}, and compute a, b and k using
the following formulas:

a = log(p_{i}/p_{i+1}) / log(s_{i+1}/s_{i})

b = a / (a-1)

k = s_{i} p_{i}^{1/a}

Then use
these coefficients to estimate P99.5 = k/(0.005^{1/a}) and
P99.5-100 = (a/(a-1)) P99.5 = b P99.5
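Methodology M2 can be sketched in the same way, using only the two (threshold, share) pairs that surround the target. The function name `pareto_m2` and the numbers are hypothetical, and the target is P99 (top share 1%) because it lies between the two made-up shares of 2% and 0.5%.

```python
from math import log


def pareto_m2(s_i, p_i, s_next, p_next, target_share):
    """Log-linear interpolation between (s_i, p_i) and (s_next, p_next)."""
    a = log(p_i / p_next) / log(s_next / s_i)  # minus the slope of log(p) against log(s)
    b = a / (a - 1)                            # inverted Pareto coefficient
    k = s_i * p_i ** (1 / a)                   # scale parameter, from p_i = (k/s_i)^a
    threshold = k / target_share ** (1 / a)    # e.g. P99
    average_above = b * threshold              # e.g. P99-100
    return {"a": a, "b": b, "k": k, "threshold": threshold, "average_above": average_above}


# Hypothetical data: 2% of the population above 100 000 and 0.5% above 200 000.
est = pareto_m2(s_i=100_000, p_i=0.02, s_next=200_000, p_next=0.005, target_share=0.01)
print(est)
```

By construction the fitted Pareto line passes through both tabulated points, so the estimate necessarily falls between s_{i} and s_{i+1}.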

__Which of these two
methodologies should be used?__

If one has both frequency and income
information, then it is better to use methodology M1

If one has solely frequency information, then
use methodology M2

There also exist more complicated methodologies
that explicitly take into account the fact that Pareto parameters are not
constant (see e.g. Cowell 2000, Zucman 2008). See also Atkinson 2007 for an
interpolation method with explicit upper and lower bounds and confidence
intervals.

Some references on interpolation
techniques:

Atkinson (2007), chapter

D. G. Champernowne and F. Cowell,
“Economic Inequality and Income Distribution”,

Cowell, Measuring Inequality, electronic
manuscript 2000

Cowell-Mehta, “The estimation and interpolation
of inequality measures”, Review of Economic Studies 1982, vol.49 pp.273-290

Piketty (2001, Appendix B)

__5. Pareto
distributions and top shares formulas__

TO BE COMPLETED