National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Columbia Genome Center, Columbia University, 1150 St. Nicholas Avenue, Unit 109, New York, NY 10032, USA

Department of Mathematics, Howard University, 2400 Sixth Str., Washington D.C., 20059, USA

Abstract

Background

Power distributions appear in numerous biological, physical and other contexts, which appear to be fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interactions partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to pre-existing hubs. Exploration of different types of evolutionary models in an attempt to determine which of them lead to power law distributions has the potential of revealing non-trivial aspects of genome evolution.

Results

A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a

- b

- d

- i

- m

Conclusions

We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotic. However, for the model to be compatible with the data, there has to be a precise balance between domain birth, death and innovation rates, and this is likely to be maintained by selection. The developed approach is oriented at a mathematical description of evolution of domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.

Background

Sequencing of numerous genomes from all walks of life, including multiple representatives of diverse lineages of bacteria, archaea and eukaryotes, creates unprecedented opportunities for comparative-genomic studies

Lineage-specific expansions and gene loss events detected as the result of comparative analysis of the domain compositions of different proteomes have been examined mostly at a qualitative level, in terms of the underlying biological phenomena, such as adaptation associated with expansion or coordinated loss of functionally linked sets of genes ^{-γ} where

Power law distributions are scale-free, i.e. the shape of the distribution remains the same regardless of scaling of the analyzed variable. In particular, scale-free behavior has been described for networks of diverse nature, e.g. the metabolic pathways of an organism or infectious contacts during an epidemic spread

However, a recent thorough study suggested that many biological quantities claimed to follow power laws, in fact, are better described by the so-called generalized Pareto function: ^{-γ} where

The importance of the analysis of frequency distributions of domains or proteins lies in the fact that distinct forms of such distributions can be linked to specific models of evolution. Therefore, by exploring the distributions, inferences potentially can be made on the mode and parameters of genome evolution. For this purpose, the connections between domain frequency distributions and evolutionary models need to be explored theoretically within a maximally general class of models. In this work, we undertake such a mathematical analysis using simple models of evolution, which include duplication (birth), elimination (death) and

Results and Discussion

Mathematical theory and model

Fundamental definitions and assumptions

A genome is treated as a "bag" of coding sequence for protein domains, which we simply call ** domains** for brevity. Domains are treated as independently evolving units disregarding the dependence between domains that tend to belong to the same multidomain protein. Each domain is considered to be a member of a

In a finite genome, the maximal number of domains in a family cannot exceed the total number of domains and, in reality, is probably much smaller; let _{i} be the number of domain families in _{i} and increase of _{i+1} by 1). Conversely, death of a domain in a family of class

Domain dynamics and elementary evolutionary events under BDIM.

Domain dynamics and elementary evolutionary events under BDIM.

The formulation of the model

The simple BDIM

Let us formulate the following **independence assumption**: i) all elementary events are independent of each other; ii) the rates of individual domain birth (λ) and death (δ) do not depend on

d_{1}(_{1}(_{2}(

d_{i}(_{i-1}(_{i}(_{i+1}(

d_{N}(_{N-1}(_{N}(

Similar models have been considered previously in several different contexts [33 v. 1, ch. 17, 34]. We will see in 3.2 that the solution of model (2.1) evolves to equilibrium, with a unique distribution of domain family sizes, _{i}~(λ/δ)^{i}/_{i}~1/

The Master BDIM

A more general BDIM emerges when the independence assumption is abandoned. Instead of constructing specific hypotheses regarding the dependence between the elementary events, let us simply suppose that the domain birth and death rates for a family of class _{i} and δ_{i}; in the specific case of the simple BDIM (2.1), λ_{i} = λ_{i} = δ** master BDIM**:

d_{1}(_{1} + δ_{1})_{1}(_{2}_{2}(

d_{i}(_{i-1}_{i-1}(_{i} + δ_{i})_{i}(_{i+1}_{i+1}(

d_{N}(_{N-1}_{N-1}(_{N}_{N} (

Let _{i}(

d_{1}_{1}(

The system (2.2) has an equilibrium solution _{1},..._{N} defined by the equality d_{i}(_{eq} (the total number of domain families at equilibrium). At equilibrium, ν = δ_{1}_{1}, i.e. the processes of innovation and death of

We can rewrite the model (2.2) in terms of the frequency of a domain family of class _{i}(_{i}(

d

Applying this identity to _{i}(

[d_{1}_{1}(

we obtain the following model for frequencies of the domain family (

d_{1}(_{1} + δ_{1})_{1}(_{2}_{2}(_{1}_{1}(_{1}(

d_{i}(_{i-1}_{i-1}(_{i} + δ_{i})_{i}(_{i+1}_{i+1}(_{1}_{1}(_{i}(

d_{N}(_{N-1}_{N-1}(_{N}_{N} (_{1}_{1}(_{N}(

System (2.4) should be solved together with equation (2.3).

The Master BDIM and Markov processes

Let us note that system (2.4) for frequencies is _{eq}. This does not imply d_{i}(_{i}(_{eq}, the master system (2.4) turns into

d _{1}(_{1} + δ_{1}) _{1}(_{2}_{2}(_{eq} (2.5)

d _{i}(_{i-1}_{i-1}(_{i} + δ_{i}) _{i}(_{i+1}_{i+1}(

d _{N}(_{N-1}_{N-1}(_{N}_{N} (

System (2.5) can be rewritten as a matrix equation

** p**(

where ** p**(

_{11} = -(λ_{1} + δ_{1}) + ν/_{eq}, _{21} = δ_{2} + ν/_{eq}, _{s1} = ν/_{eq} for all

_{i-1,i} = λ_{i-1}, _{i,i} = -(λ_{i} + δ_{i}), _{i+1,i} = δ_{i+1}, _{k,i} = 0 for all

_{N-1,N} = λ_{N-1}, _{N,N} = -δ_{N}.

It is easy to see that the sum of elements of each row (except for the first one) of the matrix _{eq} > 0. Therefore the matrix

Thus, neither the initial BDIMs (2.1) or (2.2) nor the equilibrium model (2.5) can be described by any Markov process with continuous time.

**Remark**. If, in system (2.5), ν = 0, then this system turns into a system of state probabilities for a Markov birth and death process with continuous time.

Equilibrium in BDIMs

Equilibrium sizes and frequencies of the domain family system

Let us suppose that the genome had ample time to arrive at a complete equilibrium state, in which not only d_{i}(_{i} satisfy the system

-(λ_{1} + δ_{1}) _{1} + δ_{2}_{2} + ν = 0,

λ_{i-1}_{i-1} - (λ_{i} + δ_{i})_{i} + δ_{i+1}_{i+1} = 0 for 1<

λ_{N-1}_{N-1} - δ_{N}_{N} = 0.

It should be emphasized that the master model does not assign _{eq}; this value has to be computed depending on the model parameters.

The following statement is central for further analysis.

**Proposition 1**. _{1},..._{N}),

_{1} = ν/δ_{1}

_{i} = ν _{j} / _{j} for all

_{j} = 1 for

_{eq} = ν (_{j} / _{j} (3.3)

This proposition ascertains that all evolutionary trajectories of the system (2.2) exponentially (with respect to time) approach the equilibrium state (3.2). The proof is given in the Mathematical Appendix.

**Remark**. Let us denote the ratio of the birth rate to the innovation rate

_{i}_{i}/ν,

and the ratio of the death rate to the innovation rate

_{i}_{i}/ν.

Then, according to Proposition 1, for any BDIM in equilibrium,

_{j} / δ_{j} - _{j}/δ_{j} - 1 = -1.

The principal goal of the treatment that follows is the analysis of the asymptotic behavior of equilibrium frequencies and sizes of domain families (_{1},..._{N}) at large

**Definition**. Let {_{i}}, {_{i}} be sequences of real numbers; let us denote _{i}_{i} if lim _{i}/_{i} = 1 and _{i} ~ _{i} if lim _{i}/_{i} =

Equilibrium frequencies for the simple BDIM

Let us apply Proposition 1 to the simple BDIM (2.1) with λ_{i} = λ_{i} = δ

**Definition**. A simple BDIM is

Let us recall that a random discrete variable ξ has the

^{i}/^{-1},

A random variable ξ has the

_{n} θ^{i} / _{n} = 1/^{j}/

Then, we have

**Proposition 2.**

1)

_{i} = (ν/δ)θ^{i-1}/^{i}/

_{eq} = _{i} = ν/δ ^{j-1}/

_{i} = (1/_{eq})(ν/δ)θ^{i-1}/^{i}/^{j}/

2)

_{eq} = ν/δ

_{i} = ν/δ_{eq}/^{-1} /

The proof is given in the Mathematical Appendix.

Thus, a simple BDIM can have equilibrium frequencies only of the form _{i} = ^{i}/

Simple methods exist for preliminary graphical estimation of the single distribution parameter θ [36 ch. 7, s. 7]. We will prove in the following section that, if we observe a power asymptotic for empirically observed equilibrium frequencies, then (assuming that the system can be described by a BDIM), the rates λ_{i} and δ_{i} should be asymptotically equal at large

Asymptotic behavior of equilibrium frequencies of a Master BDIM: Main Theorems

Let us consider the master BDIM (2.2); we showed in 3.1 that its equilibrium frequencies are the solution of the system

-(λ_{1} + δ_{1})_{1} + δ_{2}_{2} + ν/_{eq} = 0, (3.9)

λ_{i-1}_{i-1} - (λ_{i} + δ_{i})_{i} + δ_{i+1}_{i+1} = 0 for 1<

λ_{N-1}_{N-1} - δ_{N}_{N} = 0.

The following theorem gives all possible types of asymptotic behavior of the equilibrium frequencies and defines the connections between these asymptotics depending on model parameters. In particular, if there is no information on the exact form of dependence of the rates of birth and death of domains on the size of a domain family, the theorem can be used to qualitatively describe the dynamics of the asymptotic behavior of the equilibrium frequencies.

We will prove that the asymptotic behavior of a solution of system (3.9) is completely defined by the asymptotic relation between λ_{i} and δ_{i}. More precisely, let us define a function χ (_{i-1}/δ_{i}; we consider only functions of power growth, i.e. χ (^{s} at

χ (_{i-1}/δ_{i} = ^{s} θ (1+^{2})) (3.10)

where

**Definition**. Let us refer to a BDIM (2.2), (3.10) as

λ_{i-1}/δ_{i} = θ (1+^{2})) at large

λ_{i-1}/δ_{i} = 1 + ^{2})) for large

λ_{i-1}/δ_{i} = 1 + ^{2})) for large

We will show that the first three coefficients, _{i-1}/δ_{I} exactly specify all possible asymptotic behaviors of BDIM equilibrium frequencies.

**Theorem 1. **
_{
i
}

_{i} ~ Γ (^{s}θ^{i}^{a}, where Γ (

_{i} ~ θ^{i}^{a};

_{i} ~ ^{a};

_{i} ~ 1

The proof is given in the Mathematical Appendix. The classification of BDIM according to the order of balance is illustrated in Fig.

Different orders of balance in BDIMs.

Different orders of balance in BDIMs.

Asymptotics of equilibrium distributions for balanced BDIMs of different orders.

Asymptotics of equilibrium distributions for balanced BDIMs of different orders.

It follows from this theorem that, if a BDIM is non-balanced, then its equilibrium frequencies _{i} (and equilibrium family sizes _{i}) increase or decrease extremely fast (hyper-exponentially) with the increase of

Let us recall that a random discrete variable ξ has the ^{r}^{k}_{i}} follows (or asymptotically has) a discrete probabilistic distribution {_{i}} if _{i} ~ _{i} for large enough

**Corollary 1**.

_{i}

The following implication of Theorem 1 is of principal interest.

**Corollary 2**.

**Corollary 3**. _{i-1}/δ_{i} = 1 _{i} = _{i-1}/δ_{i} = 1 + ^{2}),

Rational BDIM

The hierarchy of BDIM types.

The hierarchy of BDIM types.

Let us suppose that the birth and death rates are of the form

λ_{i} = λ _{k})^α_{k}, (4.1)

δ_{i} = δ _{k})^β_{k}

for _{k}, β_{k} are real and _{k}, _{k} are non-negative for all

We will refer to BDIM (2.2.), (4.1) as

It is known that a wide class of mathematical functions can be well approximated by rational functions of the form (4.1) (see, e.g.

Specific cases of the rational BDIM are _{1}, _{1}, where _{1}, _{1} are constants, and polynomial BDIM, if

The following theorem describes all possible asymptotic behaviors of the equilibrium frequencies of a rational BDIM. Let us denote

θ =λ/δ,

η = _{k} - _{k},

ρ = _{k}α_{k} - _{k}β_{k},

β = _{k}.

**Theorem 2**.

_{i}^{η} θ^{i}^{ρ-β} (4.2)

_{k})^β_{k}/_{k})^α_{k}. (4.3)

The proof is given in the Mathematical Appendix.

**Corollary 1**. _{i}}

_{i}^{i}^{ρ-β}. (4.4)

_{i} follow the Pascal distribution with parameters

_{i}

_{i}

**Corollary 2**. _{i} and equilibrium frequencies p_{i} for a rational BMID have the power asymptotics if and only if η = 0 and λ = δ, i.e. the BDIM is second-order balanced, in which case

_{i}^{ρ-β}. (4.5)

Formula (4.4) gives the asymptotics for the equilibrium sizes of domain families _{i} and, accordingly, for the total number of families _{eq}. The exact expressions for these quantities are given in the proofs of Theorem 2 and Lemma (see Mathematical Appendix).

**Proposition 3**.

_{i} of a balanced (first or higher order) rational BDIM are

_{i} = ^{i-1}_{k}))^α_{k}]/_{k}))^β_{k}] for all

_{k}))^β_{k}]/_{k})^α_{k}].

_{eq} = ^{j-1}_{k}))^α_{k}/_{k}))^β_{k}).

For the rational, second-order balanced BDIM, the ratio of the birth rate to the innovation rate is

^{i}_{k})/Γ(1 + _{k})]^α_{k} / [Γ(_{k}) / Γ(1 + _{k})]^β_{k}.

The asymptotic formulas for equilibrium frequencies of rational BDIM could be considered as particular cases of the corresponding formulas of general theorem 1. Proposition 3 allows one to calculate the constants in the corresponding asymptotic formulas for the sizes of domain families for a rational BDIM. If only equilibrium frequencies are analyzed, the values of these constants become irrelevant because they contract. However, if the actual values of _{i} and _{eq} are of interest, the values of the constants are required.

Properties of the main types of rational BDIM

Simple BDIM

As shown above, a simple BDIM can have equilibrium frequencies only of the form _{i} = ^{i}/

We can extract from Proposition 2 some additional information, which could be helpful for estimating the model parameters. It is known that

_{E} + _{E} is the _{E} = 0.5772157...

More precisely, the approximation _{E} + ^{-1}/2 - ^{-2}/12 has an error less then 10^{-6} for

_{eq}_{E}] (5.1)

This means that, in the equilibrium state of the system, the total number of domain families grows only slowly (~ln

Furthermore, according to equation (2.3), in the equilibrium state of a simple BDIM ν/δ = _{1}, so we have

_{eq} / _{1}_{E} (5.2)

Formula (5.1) can be used for estimating the model parameters on the basis of empirical data.

In the more general case λ ≠ δ, we can also obtain an estimate of the rate of innovation ν. If λ < δ (θ < 1), then the series in the right part of (3.5) quickly converges,

^{i-1}/

so -ln(1-θ)/θ is a good approximation for the sum ^{i-1}/

_{eq} = (ν/δ) ^{i-1}/^{i}/

and

ν/δ = _{eq} θ/(-ln(1-θ)). (5.3)

Taking into account that ν/δ = _{1} (2.3), we have a relation

_{eq}/_{1}

which allows the parameter θ to be estimated on the basis of empirical data.

If

^{i}/^{-1}/2 + ^{-2}/12.

where the function

Further, if (1-θ)

^{i}/_{E} -

has an error less then [^{2}/4 and, in this case,

_{eq}/_{1}_{E} -

If (1 - θ)_{eq}/_{1} = ^{i-1}/

-(ln(1-θ)/θ-θ^{N}/[(^{i-1}/^{N}[1/(

For the simple BDIM, the ratio of the rate of duplications to the innovation rate is

_{i}_{i}/ν = ^{i} = θ(1-θ^{N-1})/(1-θ),

so

If the simple BDIM is the 2^{nd} order balanced, θ = 1, then

Thus, for the simple, second-order balanced BDIM, the number of duplications per time unit is

The total number of domains in the equilibrium state for the simple BDIM is

_{i} = ν/λθ(1-θ^{N})/(1-θ).

If a simple BDIM is second-order balanced, then

Linear BDIM

We saw that the assumption of independence of birth and death rates of individual domains on each other and on the size of domain families is incompatible with any power distribution of the equilibrium frequencies with the degree not equal to -1. The simplest case of a BDIM, which can have, depending on the parameters, three types of asymptotic behavior described by Theorem 1 (excluding the first one, hyper-exponential, which corresponds to a non-balanced BDIM; all linear BDIMs are balanced) and, in particular, any power asymptotics, is a model with

λ_{i} = λ (_{i} = δ (

The parameters _{i}/

Dependence of per domain birth and death rates on the domain family size for the second-order balanced linear BDIM.

Dependence of per domain birth and death rates on the domain family size for the second-order balanced linear BDIM.

Corollary 1 of Theorem 2 implies that equilibrium frequencies _{i} of a linear BDIM have asymptotics

_{i} ~ θ^{i}^{a-b-1}, where θ = λ/δ. (5.8)

In particular, if λ ≠ δ and _{i} follow the logarithmic distribution (in this case, the linear BDIM is asymptotically equivalent to the simple BDIM). If λ = δ, the linear BDIM is second-order balanced and the equilibrium frequencies _{i} follow the power distribution

_{i} ~ ^{a-b-1}. (5.9)

Thus, the dependence of the domain frequency on the family size is actually determined by the difference

More detailed information can be obtained using Proposition 4:

_{i} of domain families are

_{i} = ^{i-1}Γ(

_{eq} = ^{j-1}Γ(

_{i} = _{1}ν/δ Γ (

According to (2.3), in the equilibrium state of a linear BDIM, _{1} = ν/δ_{1} = ν/(δ(1 +

Suppose that equilibrium frequencies obtained from empirical data follow the power distribution _{i} ~ ^{-γ}; in this case, -γ is the slope of the empirical curve (ln_{i} versus ln

_{i} =

and

where

For the linear second-order balanced BDIM, the ratio of the birth rate to the innovation rate is

if 1 +

if 1 +

The case 1 +

In this case,

_{1}/(

Accordingly,

The total number of domains in the equilibrium state for a second-order balanced linear BDIM is

If the slope of the asymptote γ = -1, the linear second-ordered BDIM shows the same asymptotic behavior as a simple BDIM (2.1), but behaves differently at small

Quadratic BDIM

The linear BDIM takes into account the dependence of average birth and death rates of individual domains on the size of domain family, but does not imply a specific form of interaction between domains. Let us consider the simplest, pairwise interaction, which leads to λ_{i} ~ ^{2} and/or δ_{i} ~ ^{2}, i.e. one or both rates are polynomials on _{i} ~ _{i} ~ ^{2}), then the corresponding BDIM is non-balanced and equilibrium frequencies have hyper-exponential asymptotics. Thus, let

λ_{i} = λ (^{2} + _{1}_{2}), δ_{i} = δ (^{2} + _{1}_{2}), (5.13)

where _{k}, _{k}, _{i}, δ_{i} are positive for all

λ_{i} = λ (_{1})(_{2}),

δ_{i} = δ (_{1})(_{2})

Then, _{1} = _{1} + _{2}, _{1} = _{1} + _{2}, and

χ (_{i-1}/δ_{i} = θ (1 + (_{1}-_{1}-2)/^{2})),

where θ = λ/δ.

According to theorem 3 and Proposition 3, the quadratic BDIM with rates (5.13) has equilibrium sizes of domain families

_{i} = _{2} ν/δ θ^{i-1} Γ (_{1}) Γ (_{2}) / (Γ (_{1}) (Γ (_{2})) _{2}ν/δ θ^{i-1}^{ρ-2} (5.14)

where ρ = _{1} - _{1} and the constant _{2} = [(Γ (1+_{1}) Γ (1+_{2})] / [Γ (1+_{1}) Γ (1+_{2})], and the total number of domain families at equilibrium

_{eq} = _{2}ν/δ (^{j-1} Γ(_{1}) Γ(_{2}) / (Γ(_{1}) (Γ(_{2})). (5.15)

Note that the asymptotic behavior of frequencies _{i} does not depend on free coefficients _{2}, _{2} in (5.13), but only on θ and _{1}-_{1} (as follows from (5.14)), although the values of _{i} are proportional to the constant _{2}, which could depend on the free coefficients _{2}, _{2}. Let us consider the case _{2} = _{2} = 0 in more detail.

If only square terms are present in the expressions for the birth and death rates, λ_{i} = λ^{2}, δ_{i} = δ^{2}, then _{k} = _{k} = 0, _{2} = 1, _{i} = ν/δ θ^{i-1}/^{2} and _{eq} = ν/δ ^{j-1}^{2}. So at

_{eq}^{j-1} / ^{2} = ν/λ Polylog(2,θ) (5.16)

where Polylog is a special function, Polylog(^{j}/^{k}.

According to (3.2), _{1} = ν/δ_{1}; for this particular case of quadratic BDIM, _{1} = ν/δ and

_{eq}/_{1}

Formula (5.17) allows estimating parameter θ from empirical data if

More precisely, _{eq} = ν/λ ^{j}/^{2} = ν/λ (Polylog(2,θ)-θ^{1+N} LerchPhi(θ,2,1+

If, additionally, θ = 1 (the BDIM is second-order balanced), then

_{i} = (ν/δ)/^{2} = _{1}/^{2} (5.18)

and, at large

_{eq}^{2}/6 _{1}. (5.19)

From formulas (5.8), (5.15), we can extract some additional information, which could be helpful for estimating the model parameters at relatively small

The digamma function φ(z) is logarithmic derivative of Γ(

The function PolyGamma(^{th} derivative of φ(z), PolyGamma(^{n}φ(z)/dz^{n}, such that φ(z)= PolyGamma(0,z).

It is known that

^{2} = π^{2}/6-PolyGamma(1,1+

Thus we have

_{eq} = ν/δ ^{2} = ν/δ [π^{2}/6-PolyGamma(1,1+

_{eq}/_{1} = π^{2}/6-PolyGamma(1,1+

which can be used for estimating unknown parameters of the model.

The values of PolyGamma(1,^{2}+O(1/^{4}).

If linear terms are also present in the quadratic BDIM, λ_{i} = λ (^{2}+_{1}_{i} = δ (^{2}+_{1}

_{i} = _{2}ν/δ θ^{i-1}/_{1})/Γ (_{1})

where _{2} = Γ (1+_{1})/Γ (1+_{1}); _{eq} = Σ_{i} can be computed using special functions. In particular, if the BDIM is second-order balanced, θ = 1, then

_{i} = _{2}ν/δ Γ (_{1}) / (_{1})).

For this variant of the model, _{1} = ν/δ_{1} = ν/(δ(1_{1})), and so

Polynomial BDIMs

The quadratic models take into account the dependence of birth and death rates of individual domains on the simplest, pairwise interactions. If interactions of higher orders are postulated, λ_{i} ~ _{n}(_{i} ~ _{m}(_{n}(_{m}(

λ_{i} = λ_{k}^{m-k}, δ_{i} = δ_{k}^{m-k} (5.21)

where _{k}, _{k} are constants and _{0} = _{0} = 1. We suppose, of course, that _{i-1}/δ_{i} = θ (1+(_{1} - _{1} - ^{2})), where θ = λ/δ. We will suppose that θ ≤ 1.

According to Theorem 3, the polynomial BDIM with rates (5.21) has equilibrium sizes of domain families with

_{i} ~ θ^{i}^{ρ-m} (5.22)

where ρ = _{1} - _{1}.

In particular, if ρ - _{i} follow the Pascal distribution with parameters (ρ -

if ρ - _{i} follow the (truncated) logarithmic distribution;

if ρ - _{i} follow the geometric distribution;

if λ = δ, the polynomial BDIM is second-order balanced and the equilibrium frequencies _{i} follow the power distribution

_{i} ~ ^{ρ-m}. (5.23)

Note that the degree of the power distribution (5.23) depends only on _{1} and _{1}, and does not depend on other coefficients. In particular, if _{1} = _{1}, then _{i} ~ ^{-m}. This relation could be interpreted as follows: if the first two coefficients of polynomial rates λ_{i} and δ_{i} are equal, then the degree of the power distribution (5.19) is equal to the "order of interactions" of domains.

Formula (5.22) can be refined. Let _{k}), _{k}).

Then (see Proposition 3) the equilibrium numbers of domain families _{i} of the polynomial BDIM (5.18) are

_{i} = ^{i-1}_{k})/Γ(_{k})]

where _{k})/Γ(1+_{k})], and the equilibrium total number of domain families

_{eq} = ^{j-1}_{k})/Γ(_{k})].

For the polynomial model _{1} = ν/δ_{1} = ν/(δ _{k}), so

_{eq}/_{1} = ^{j-1}_{k})/Γ(_{k}))/_{k}.

This formula can be used for estimating the model parameters.

For the polynomial second-order balanced BDIM, the ratio of the death rate to the innovation rate is

_{i}_{i}/ν = (_{k})/Γ(1+_{k})) _{k})/Γ(_{k}) =

_{k})/Γ(1+_{k})]/[Γ(_{k})/Γ(1+_{k})].

Approximation of the observed domain family size distributions in prokaryotic and eukaryotic genomes with different BDIMs

Having developed the mathematical theory of BDIMs, we sought to determine which of these models, if any, adequately described the empirical data on domain family size distribution. To identify the domain sets of domains encoded in each of the genomes, the CDD library of position-specific scoring matrices (PSSMs), which includes the domains from the Pfam and SMART databases, was used in RPS-BLAST searches

The results of RPS-BLAST searches against the sets of protein sequences from individual genomes were interpreted as follows: non-overlapping hits to the same protein were treated independently; among overlapping hits, only the strongest one (lowest

Domain families in sequenced genomes and parameters of the best-fit second-order balanced linear BDIM

No. of ORFs in genome

No. of detected domain families

No. of detected domains

No. of ORFs with RPS-BLAST hits

Maximum size of a family

_{1} (_{1})

ν/δ = ν/λ

_{i}_{i}/ν

Sce^{b}

6340

1080

4575

3331

130

420 (436)

1.55

3.27

-2.72

1861.8

[3.28..3.53]

Dme

13605

1405

11734

7262

335

426 (435)

1.62

2.79

-2.17

1648.2

[8.44..15.50]

Cel

20524

1418

17054

11090

662

423 (421)

1.13

2.03

-1.89

1273.0

[16.18..∞ ]

Ath

25854

1405

21238

15006

1535

270 (277)

3.80

4.98

-2.18

1657.7

[17.09..26.89]

Hsa

39883

1681

27844

16755

1151

298 (288)

5.16

6.43

-2.27

2136.2

[17.14..22.88]

Tma

1846

772

1683

1268

97

501 (499)

0.14

2.22

-3.08

1606.4

[1.04..1.06]

Mth

1869

693

1480

1150

43

438 (436)

0.12

2.00

-2.88

1305.3

[1.18..1.27]

Sso

2977

695

1950

1614

81

386 (385)

0.36

2.04

-2.68

1167.8

[1.83..2.00]

Bsu

4100

1002

3413

2502

124

507 (510)

0.48

2.01

-2.53

1534.6

[2.46..2.79]

Eco

4289

1078

3624

2765

140

523 (519)

0.84

2.54

-2.70

1837.0

[2.45..2.61]

a: _{1}(_{1}), the observed (predicted) number of domains in class 1 (represented only once in the genome);

To enable statistical analysis using the χ^{2}-method for the entire range of the data, including the sparsely distributed classes corresponding to large families, the data needed to be combined. For each genome, the observed domain family frequencies were grouped into bins, each containing at least 10 families; typically, bins corresponding to families with small number of members included a single size class (e.g. all single-member families, two-member families etc), whereas bins corresponding to large families may span hundreds of size classes, most of them empty. Theoretical distribution values for a bin combining observed data from _{i} is the predicted number of families in the _{max} is the number of domains in the most abundant of the detected families), was normalized to equal the total number of families detected in the given genome (a requirement for the χ^{2} analysis). χ^{2} values were computed to measure the quality of fit between the observed and the theoretical distributions. The distribution parameters (θ for the simple BDIM, ^{2} value.

The simplest model that resulted in a good fit to the observed domain family size distributions was the second-order balanced linear BDIM (Fig. ^{2}) for this model was >0.05, i.e. no significant difference between the model predictions and the observed data was detected. Considering the first-order balanced linear BDIM, which involves varying the parameter θ, did not result in a significant improvement of fit for any of the analyzed genomes (data not shown). In contrast, a fit to a truncated logarithmic distribution (prediction of a simple BDIM) failed for all genomes (^{2}) < 10^{-5}; Fig. ^{2}) = 0.0013 for ^{2}) < 10^{-5} for the rest of the genomes; Fig.

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the yeast

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the yeast _{i} = 11521Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the fruit fly

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the fruit fly _{i} = 5258Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the nematode worm

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the nematode worm _{i} = 2453Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thale cress

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thale cress _{i} = 10750Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM:

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: _{i} = 22030Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic bacterium

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic bacterium _{i} = 4256Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thermophilic euryarchaeon

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thermophilic euryarchaeon _{i} = 2753Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic crenarchaeon

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic crenarchaeon _{i} = 2714Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium _{i} = 3489Γ(

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium _{i} = 6776Γ(

Comparison of different approximations of the empirical domain family size distribution:

Comparison of different approximations of the empirical domain family size distribution: _{i} = 6776Γ(_{i} = 528 × 0.87^{i}/_{i} = 602^{-1.76}.

Comparison of different approximations of the empirical domain family size distribution:

Comparison of different approximations of the empirical domain family size distribution: _{i} = 10750Γ(_{i} = 344 × 0.98^{i}/_{i} = 516^{-1.36}.

Fitting the observed domain family size distribution with the second-order balanced linear BDIM resulted in positive values of the parameters

The data used to fit the BDIM typically included 50–60% of the proteins encoded in a given genome (Table

General discussion and conclusions

Here, we presented a complete mathematical description of the size distribution of protein domain families encoded in genomes for simple but not unrealistic models of evolution, which include three types of events: domain duplication (birth), domain elimination (death), and domain innovation. In biological terms, innovation could involve gene acquisition via horizontal gene transfer, emergence of a new domain from a non-coding sequence or a non-globular protein sequence, or major modification of a domain obliterating its connection with a pre-existing family. Innovation via horizontal gene transfer appears to be particularly common in prokaryotes

We showed that birth-death-innovation models (BDIMs) with different levels of complexity lead to readily distinguishable predictions regarding the distribution of domain family sizes in genomes. In particular, we defined the exact analytic conditions that lead, exactly or asymptotically, to power law distributions, which have recently received ample attention, as they were uncovered in various biological and social contexts

Three groups of observations made in this work seem to have the greatest potential of enhancing our understanding of genome evolution and, perhaps, evolution of other complex systems. First, we proved that, within the BDIM framework, there is a unique equilibrium state of the system, which is approached exponentially, with respect to time, from any initial state. In this equilibrium state, the number of domain families in each size class remains constant and follows a unique distribution depending on the type and parameters of the BDIM. In particular, power asymptotics emerges when and only when a BDIM is second-order balanced, i.e. the rates of domain birth and death are asymptotically equal. Since we showed that the observed size distributions of domain families in all analyzed genomes indeed tend to power law asymptotics, the results are compatible with the notion that the genomes are close to a steady state with respect to the domain diversity (_{eq}, the number of distinct domain families at equilibrium, under the using the BDIM convention) and distribution (_{i}). Taking a broader biological perspective, this result might indicate that evolving lineages go through lengthy periods of relative stasis when the level of genomic complexity remains more or less the same. Under this view, the stasis epochs are punctuated by relatively short periods of dramatic changes when the complexity either greatly increases (the emergence of eukaryotes is the most obvious case in point) or decreases (e.g. evolution of parasites). These bursts of evolution might be described as transitions between different BDIMs in the parameter space, with some of the trajectories potentially involving non-balanced BDIMs. The analogy between this emerging picture of genome evolution and the punctuated equilibrium concept of species evolution, which has been developed through analysis of the paleontological record

Second, we showed that the simplest model that adequately describes the observed domain family size distributions is the second-order balanced linear BDIM; in contrast, simple BDIMs do not show a good fit to the observations. This has potentially important implications for the mode of domain family evolution. Simple BDIMs are based on the notion that the likelihood of duplication (birth) or elimination (death) of a domain is uniform across the genome and does not depend on the size or other characteristics of domain families (the independence assumption). Clearly, under the independence assumption, a duplication (birth) as well as elimination (death) of a domain is more likely to occur in a large family than in a small one, but only because the overall probability of such an event is proportional to the number of family members, whereas the birth (death) rate

Third, the BDIM equilibrium condition with respect to the total number of domain families, ν = δ_{1}_{1} (ν is the innovation rate, δ_{1} is the domain death rate for class 1 families, and _{1} is the number of domain families in class 1) allows us to estimate the ratio between domain innovation rate and the domain death and birth rates. Indeed, according to the above and the definition of a second-order linear BDIM, which is the best fit for the actual data, λ = δ = ν/_{1}(1+

As already indicated, BDIMs do not explicitly incorporate selection. However, the present analysis shows that only models with precisely balanced domain birth, death and innovation rates can account for the observed distribution of domain family size in each of the analyzed genomes. It seems likely that the balance between these rates is itself a product of selection. There is little doubt that BDIMs will be eventually replaced by more sophisticated formalisms that will more realistically capture the mechanisms of genome evolution. Nevertheless, even the crude modeling described here seems to reveal several potentially interesting and non-trivial aspects of the evolutionary process.

Mathematical Appendix. Proofs of some statements

Proof of Proposition 1

When system (3.1) is solved consecutively from the last equation to the second one, it becomes obvious that the solution is unique up to a constant multiplier.

Next, if _{i} = _{i-1}λ_{i-1}/δ_{i}, _{i+1} = _{i-1}λ_{i-1}λ_{i}/(δ_{i}δ_{i+1}), then the substitution shows that (_{i-1},_{i},_{i+1}) satisfy the _{2} = _{1}λ_{1}/δ_{2} in the first equation, we get _{1} = ν/δ_{1} and, consequently, _{eq} = _{i}, so we have (3.3).

Since system (2.2) is linear, the equilibrium state (_{1},..._{N}) is asymptotically stable if the real parts of all characteristic values of the matrix

are negative.

The following theorem (see _{ij}|, ^{k}_{k} > 0 for all _{k} is the main minor of the matrix

To apply this theorem, let us consider the

It is easy to see that

det _{n} = -(λ_{n} + δ_{n})det _{n-1} - λ_{n-1}δ_{n} det _{n-2}, (A1)

det _{n} = -δ_{n}det _{n-1} - λ_{n-1}δ_{n} det _{n-2}.

Using these equalities, we can prove that for any

det _{n} = (-1)^{n}δ_{n}δ_{n-1...} δ_{2}δ_{1}.

Indeed,

det _{n} = -δ_{n} det _{n-1}-λ_{n-1}δ_{n} det _{n-2}=

δ_{n}((λ_{n-1}+δ_{n-1}) det _{n-2} + λ_{n-2}δ_{n-1} det _{n-3}) - λ_{n-1}δ_{n} det _{n-2}=

δ_{n}δ_{n-1} (det _{n-2} + λ_{n-2} det _{n-3})= (subsequently using (A1))=

(-1)^{n-2}δ_{n}δ_{n-1...} δ_{3}(det _{2} + λ_{2} det _{1}) = (-1)^{n}δ_{n}δ_{n-1...} δ_{2}δ_{1}.

Further, it is easy to see that for any

det _{n} = det _{n} - λ_{n} det _{n-1}.

Taking into account that _{1} = -(λ_{1} + δ_{1}) < 0 and that the sign of det _{n} coincides with (-1)^{n}, it is easy to prove that

det _{n} > det _{n} if det _{n} > 0 and det _{n} < det _{n} if det _{n} < 0.

Thus, the sign of det _{n} coincides with the sign of det _{n} and so (-1)^{n}_{n} > 0 for all _{N} are negative and so the single equilibrium is asymptotically stable, QED.

Proof of Proposition 2

For simple BDIM (2.1) _{i} = ν _{k} / _{k} = (ν/δ)θ^{i-1}/^{i}/

_{eq} = _{i} = ν/λ ^{i}/

_{i} = _{i}/_{eq} = (θ^{i}/^{j}/

If a simple BDIM is balanced, then θ = 1 and so

_{eq} = ν/λ ^{j}/

_{i} = ν/λ _{eq}/^{-1}.

Proof of Theorem 1

The condition (3.10) can be rewritten as λ_{i-1}/δ_{i} = ^{s}θ(1+^{2})) = ^{s}θ (1+^{2})). Thus, we can choose ^{2})) converge, 0 <^{2})) < ∞. It follows that

_{s-1}/δ_{s}) ~ Γ(^{s} θ^{j}

According to Proposition 1, _{i} = _{i}/_{eq} ~ _{k} / _{k}. So

_{i} ~ _{s-1}/δ_{s}) ~ Γ(^{s} θ^{i}^{s}θ^{i}(

Applying the main asymptotic property of Γ-function, i.e. Γ (^{c} at large

Γ (^{a}, and so _{i} ~ Γ (^{s} θ^{i}^{a}.

Proofs of Corollaries 1–3

If a discrete random variable ξ has the Pascal distribution, then

^{r}^{r-1}^{i} ~ ^{i}^{r-1} for large

and it becomes evident that, for _{j} of the first-order balanced BDIM follow the Pascal distribution with parameters (

If _{i} ~ θ^{i}_{i} follows the truncated logarithmic distribution with parameter θ. If _{j} ~ θ^{i} and _{i} follows the geometric distribution.

Further, _{i} ~ ^{a}, that is the sequence _{i} follows the power distribution with the power

Finally, if λ_{i-1}/δ_{i} = 1 + ^{2}), that is, if θ = 1 and _{i} ~ const; in particular, if λ_{i-1} = δ_{i} for all _{i} = ν for all _{i} = 1/

Proof of Theorem 2

According to Proposition 1, system (3.1) has the unique solution:

_{1} = νδ_{1}, _{i} = ν _{s} / _{s} for all

_{i} = ν/λθ^{i}

Applying the Lemma (see below), we get

_{i}^{i} Γ (^{η}^{ρ-β}, as

where the constant _{k}))^β_{k}] / _{k})^α_{k}].

**Lemma.** Let _{k})^α_{k}, _{k})^β_{k}, where _{k}, _{k} are positive. Let us denote

η = _{k} - _{k}, ρ = _{k} α_{k} - _{k}β_{k}, β = _{k},.

Then with fixed

^{η}

as

_{k}))^β_{k}] / _{k})^α_{k}].

Proof.

_{k})^α_{k} = [Γ(_{k}) / Γ(1+_{k})]^α_{k},

_{k})^β_{k} = [Γ(_{k}) / Γ(1+_{k})]^β_{k}, so

_{k}) / Γ(1+_{k})]^α_{k}}/{_{k})/Γ(1+_{k})]^β_{k}}=

_{k}))^α_{k}]/_{k}))^β_{k}]

where

_{k}))^β_{k}]/_{k})^α_{k}].

Let us use the known asymptotic relation

Γ (^{a} with

Thus we have

_{k}))^α_{k}]/_{k}))^β_{k}]

(Γ(^{η}_{k}) / Γ(_{k}] / _{k}) / Γ(_{k}]

(Γ(^{η}_{k} α_{k}] / _{k}+1)β_{k}]=

(Γ(^{η}

and Lemma is proved.

Proof of Proposition 3

It follows from the proof of the Lemma that

_{i} = ^{i}_{k}))^α_{k}] / _{k}))^β_{k}] for

where _{k}))^β_{k}] / _{k}))^α_{k}].

Let us show that _{1} can be expressed by the same formula if

_{k}))^α_{k}] / _{k}))^β_{k}=

ν/δ (_{k}))^β_{k} / _{k}))^α_{k})) (_{k}))^α_{k} / _{k}))^β_{k}=

ν/δ (_{k}))^β_{k} / _{k}))^β_{k} = ν/δ (_{k}))^β_{k} = _{1}

Thus,

_{eq} = ^{j-1}_{k}))^α_{k}/_{k}))^β_{k}).

**QED.**

Contributions of individual authors

GPK developed most of the mathematical formalism and wrote the draft of the mathematical part of the manuscript; YIW performed the identification of domain in sequenced genomes and the statistical analysis of the resulting distributions and wrote the draft of the corresponding part of the manuscript; FSB proved some of the theorems; AYR largely incepted the work and contributed to the formulation of the models; EVK contributed to the inception of the work and the formulation of the models, gave the biological interpretation of the results, wrote the background and discussion sections and extensively edited the entire manuscript.

Acknowledgements

We thank Alexei Kondrashov, Alexei Ogurtzov, and Vladimir Ponomarev for critical reading of the manuscript and the Koonin group members for helpful discussions.