Asymmetry of Coding Versus Noncoding Strand in Coding Sequences of Different Genomes


Mary Ann Liebert, Inc.


STANISŁAW CEBRAT,¹ MIROSŁAW R. DUDEK,² PAWEŁ MACKIEWICZ,¹ MARIA KOWALCZUK,¹ and MAŁGORZATA FITA¹

ABSTRACT

We have used the asymmetry between the coding and noncoding strands in different codon positions of coding sequences of DNA as a parameter to evaluate the coding probability for open reading frames (ORFs). The method enables an approximation of the total number of coding ORFs in the set of analyzed sequences as well as an estimation of the coding probability for the ORFs. The asymmetry observed in the nucleotide composition of codons in coding sequences has been used successfully for analysis of the genomes completed at the time of this analysis.

INTRODUCTION

There are many methods to discriminate between coding and noncoding DNA sequences (Fickett, 1996, for review). For nondisrupted genes, one of the better criteria is the length of an open reading frame (ORF). In the yeast genome project (SGD, Saccharomyces Genome Database), the lower limit of an ORF length has been set at 100 codons. An additional criterion is the value of the codon adaptation index (CAI) (Sharp and Li, 1987). It has been arbitrarily accepted in SGD that ORFs shorter than 150 triplets with CAI < 0.11 are considered noncoding (Dujon et al., 1994). It has also been accepted in SGD that the longer ORF of a pair of overlapping ORFs is considered coding. Generally, these criteria work well, but some ORFs are shorter than 150 codons with CAI < 0.11 and perform already documented coding functions. Such criteria as CAI and codon bias index (CBI) (Benetzen and Hall, 1982) are based on the observation that codon usage in protein coding sequences does not correspond to codon frequency expected from the nucleotide composition of the genome.

Two different forces have been suggested to be responsible for this bias. One is translational selection based on relative concentrations of iso-accepting tRNAs (Ikemura, 1982). The second is mutational pressure that forces a change in overall nucleotide composition of DNA and especially influences the third (silent) positions of codons (Sueoka, 1988; Sharp et al., 1993, for review).

We have assumed that a coding sequence should reflect specific construction of the genetic code, nonrandom (biased) amino acid usage, and physical restrictions of the DNA (RNA) molecule. The most fundamental rule of DNA composition is complementarity of the nucleotide bases, A:T and G:C. In a random sequence, this rule implies balance in purine/pyrimidine composition of both DNA strands, which is observed in long DNA stretches (as in yeast chromosomes) but not in coding sequences organized in operons (λ phage, for example). Dujon et al. (1994) observed a relative abundance of purine doublets in coding sequences of yeast chromosome 2. The same can be concluded from the results presented by Karlin and Burge (1995) for other genomes.

¹Institute of Microbiology, and ²Institute of Theoretical Physics, Wrocław University, Wrocław, Poland.

This asymmetry in purine/pyrimidine composition of coding vs noncoding DNA strands has been used to discriminate ORFs in the yeast genome (Cebrat et al., 1997a,b).

The asymmetry in nucleotide composition of both strands of a protein coding sequence is a sum of the specific asymmetry of each position within codons. Thus, the asymmetry in the first, the second, and the third positions could compensate for each other within a coding sequence, and asymmetry for each position in codons should be analyzed separately. In some aspects, this method resembles the CAI approach, but the results of the analysis are not correlated with the results using CAI; this will be discussed later. Rather, our approach encompasses some rules in amino acid composition of proteins and a highly sophisticated structure for the genetic code.

MATERIALS AND METHODS

Databases and software

The Saccharomyces cerevisiae genome sequences were downloaded September 23, 1996, from genome-ftp.stanford.edu. Information on yeast gene function, ORF homology, and their presumed functions was downloaded November 16, 1996, from http://www.mips.biochem.mpg.de. Sequences for all viruses were downloaded May 10, 1997, from ncbi.nlm.nih.gov, sequences for Escherichia coli from http://genom4.aist-nara.ac.jp on May 9, 1997, for Haemophilus influenzae, Mycoplasma genitalium, and Methanococcus jannaschii from http://www.tigr.org on May 8, 1997, for Mycoplasma pneumoniae from http://www.zmbh.uni-heidelberg.de on May 10, 1997, for Synechocystis sp. from ftp://ftp.kazusa.or.jp on May 8, 1997, and for Methanobacterium thermoautotrophicum from ftp://www.genomcorp.com on June 3, 1997. After the retrieval, data were not updated.

In the analyses, we have considered all ORFs found in the completely sequenced genomes longer than 100 codons starting with ATG and ending with one of the three universal stop codons. The software for all the analyses was written by one of the authors (M.R.D.).
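The ORF selection rule stated above can be illustrated with a short Python sketch. It is not the authors' software, which the text does not describe further. The function name `find_orfs`, the choice to start each ORF at the first ATG following the previous in-frame stop, and the choice to count codons without the stop codon are all assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """List ORFs (ATG ... stop) with more than min_codons codons.

    Each hit is (strand, frame, start, end): 0-based coordinates on the
    scanned strand, end exclusive and including the stop codon.
    Assumption: an ORF starts at the first ATG after the previous stop.
    """
    orfs = []
    forward = seq.upper()
    for strand, s in (("+", forward), ("-", reverse_complement(forward))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None:
                    if codon == "ATG":
                        start = i
                elif codon in STOPS:
                    n_codons = (i - start) // 3  # codons before the stop
                    if n_codons > min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

For example, `find_orfs("ATG" + "GCT" * 120 + "TAA")` reports a single ORF on the forward strand in frame 0, whereas the same construct with only 50 internal codons is rejected by the length threshold.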

Graphic representation of sequences

To make a graphic representation of a sequence in two-dimensional space, we analyzed the displacement of a DNA walker that checked each position within codons separately. For the DNA walk, we used a modified method of Berthelsen et al. (1992). For each sequence, we performed three DNA walks, independently, for each nucleotide position in codon triplets. The first walker starts from the first nucleotide position of the first codon and then jumps every third nucleotide until the end of the examined sequence (stop codon) has been reached. Similarly, the second and the third walkers start from the second and third nucleotide positions of the first codon, respectively. Every jump of a walker is associated with a unit shift in two-dimensional space depending on the type of nucleotide visited. The shifts are (0,1) for G, (1,0) for A, (0,-1) for C, and (-1,0) for T. Hence, each DNA walk represents a history of nucleotide composition of the first, the second, or the third position of codons along the DNA sequence. The three walks together have been called a spider and a single walk has been called a spider leg. An example of a spider representing a typical gene in the yeast genome (the multicopy suppressor of sin4, YML109w) is seen in Figure 1a. In Figure 1b, the sequence coding for a hydrophobic protein (vacuolar calcium transporter protein YDL128w) is shown, and in Figure 1c, a spider representing an intergenic sequence of 921 triplets is shown. The spiders depict the nucleotide composition of the three positions in codons, but it is also possible to extract some numerical information from these plots and to characterize whole sets of ORFs by the method.
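The three DNA walks described above can be sketched in a few lines of Python. The shift table is taken from the text; the function name `spider`, the inclusion of the stop codon in the walk, and the treatment of ambiguous bases (they leave the walker in place) are assumptions made for this illustration.

```python
# Unit shifts from the text: G up, A right, C down, T left.
SHIFTS = {"G": (0, 1), "A": (1, 0), "C": (0, -1), "T": (-1, 0)}


def spider(orf):
    """Return the three legs of a spider for an in-frame sequence.

    Leg k (k = 0, 1, 2) is the list of (x, y) points visited by the
    walker that reads codon position k + 1, starting at the origin.
    """
    legs = []
    sequence = orf.upper()
    for position in range(3):
        x = y = 0
        path = [(0, 0)]
        for base in sequence[position::3]:
            # Assumption: bases other than A, C, G, T do not move the walker.
            dx, dy = SHIFTS.get(base, (0, 0))
            x += dx
            y += dy
            path.append((x, y))
        legs.append(path)
    return legs
```

For the toy sequence `ATGGCATAA` the first leg visits A, G, T and therefore ends at (0, 1); plotting the three paths of a real ORF with any plotting library gives a picture of the kind shown in Figure 1.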

Distribution of ORFs in a torus projection

For each ORF, we measured (in degrees) the angles of the vector determined by the origin and the end of the spider legs. In fact, the angles equal to arcus tangent [(G-C)/(A-T)] have positive values for the first two quadrants of the plot and negative values for the third and fourth quadrants. This has enabled us to construct a plot where each ORF is represented by a point whose coordinates are (x) the angle representing the first leg and (y) the angle representing the second leg. It is also possible to use the angle of the third leg as one of the two coordinates or as the third coordinate in three-dimensional space. As both axes of the plot are scaled from -180 degrees to +180 degrees, the plot is, in fact, a projection of torus, and its area is finite. The distributions of different sets of ORFs from different genomes are presented in Figures 2 and 3.

FIG. 1. Two-dimensional representation of DNA walks (spiders) performed for different positions in codons for yeast sequences. (a) An example of a spider representing a typical gene in the yeast genome, the sequence coding for a multicopy suppressor of sin4 (YML109w). (b) The sequence coding for a hydrophobic protein, vacuolar calcium transport protein (YDL128w). (c) A spider representing an intergenic sequence 921 triplets long.
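The angle measurement described in this section can be sketched as follows. The text gives the angle as the arcus tangent of (G-C)/(A-T) with a sign that depends on the quadrant; the sketch interprets this as the two-argument arctangent, which is an assumption, as are the function names `leg_angle` and `torus_point` and the decision to return `None` when a leg ends at the origin.

```python
import math


def leg_angle(bases):
    """Angle in degrees, in (-180, 180], of the end point of one leg.

    `bases` holds the nucleotides of a single codon position. The end
    point of the walk is (A - T, G - C), so only base counts are needed.
    """
    a, t, g, c = (bases.count(b) for b in "ATGC")
    if a == t and g == c:
        return None  # the leg ends at the origin; the angle is undefined
    return math.degrees(math.atan2(g - c, a - t))


def torus_point(orf):
    """(x, y) coordinates of an ORF: angles of the first and second legs."""
    sequence = orf.upper()
    return leg_angle(sequence[0::3]), leg_angle(sequence[1::3])
```

Because both coordinates are angles, the values -180 and +180 describe the same direction, which is why the plot is a projection of a torus rather than an ordinary plane.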

Approximation of total number of coding ORFs in a genome

For three genomes, S. cerevisiae, H. influenzae, and E. coli, we compared the distributions of all ORFs found in the genomes with the distributions of ORFs with already identified functions. To compare the distributions of coding sequences with the distribution of all ORFs in the same genome, we analyzed for ORFs with known functions the average value and the standard deviation (SD) for angles of the first legs and the average value for angles of the second legs. The average values, $\bar{A}_1$ and $\bar{A}_2$, for the set of genes are considered coordinates of the center of the genes' distribution. Next, we normalized parameters by giving each ORF values $\Delta_1$ and $\Delta_2$, which are equal to the differences (expressed in SD) between the average values and the angle of the first leg and the second leg, respectively, for a given ORF. For each individual ORF, we measured the distance from the center, which is equal to $\Delta_i = \sqrt{\Delta_1^2 + \Delta_2^2}$.

FIG. 2. Distribution of ORFs into the projection of torus where (x) is the angle of the first leg of spider and (y) is the angle of the second leg of spider [an exception is i, where (y) represents the angle of the third leg of the spider]. Both axes of each panel run from -180 to 180 degrees. (a) S. cerevisiae, all ORFs >100 codons. (b) S. cerevisiae, ORFs with known function. (c) S. cerevisiae, ORFs coding for transmembrane proteins. (d) H. influenzae, all ORFs >100 codons. (e) H. influenzae, ORFs with known function. (f) S. cerevisiae, intergenic sequences >100 triplets. (g) E. coli, all ORFs >100 codons. (h) E. coli, ORFs with known function. (i) E. coli, ORFs with known function (angles 1 and 3). (j,k,l) Phage T4, vaccinia, and variola viruses, respectively, all ORFs >100 codons.

FIG. 3. Distribution of ORFs into the projection of torus where (x) is the angle of the first leg of spider and (y) is the angle of the second leg of spider. Both axes of each panel run from -180 to 180 degrees. (a) M. pneumoniae, all ORFs >100 codons. (b) M. genitalium, all ORFs, angle 1 vs angle 2. (c) As in (b), but angle 1 vs angle 3. (d) M. jannaschii, all ORFs >100 codons. (e) M. jannaschii, ORFs with known functions, angle 1 vs angle 2. (f) As in (e), but angle 1 vs angle 3. (g) Synechocystis sp., all ORFs >100 codons. (h) Synechocystis sp., ORFs with known function. (i) Synechocystis sp., shorter ORFs from pairs of overlapping ones. (j) M. thermoautotrophicum, all ORFs >100 codons. (k) M. thermoautotrophicum, longer ORFs from pairs of overlapping ones. (l) M. thermoautotrophicum, shorter ORFs from pairs of overlapping ones.

To estimate the number of coding ORFs we used the algorithm

$N_i = \mathrm{ORF}_{in} + (G_{out}/G_{in})(\mathrm{ORF}_{in})$

where $N_i$ is an assumed maximal number of coding ORFs for a given distance $\Delta_i$, $\mathrm{ORF}_{in}$ is the number of all ORFs inside the space with a distance to the distribution center $\le \Delta_i$, $G_{in}$ is the number of genes (ORFs with known function) with a distance to the distribution center $\le \Delta_i$, and $G_{out}$ is the number of genes (ORFs with known function) with a distance to center $> \Delta_i$. In this algorithm, for a given $\Delta_i$, we assumed that all ORFs inside the space determined by $\Delta_i$ are coding and that the fraction of coding ORFs outside the space stays in the same proportion to the number of ORFs inside the space as it is for the distribution of ORFs with the described coding function.

We plotted the $N_i$ values vs distance ($\Delta_i$). To avoid errors, we cut off 5% of ORFs with maximum values of $\Delta_i$ and 10% with minimal values of $\Delta_i$. To estimate the approximated number of coding ORFs, we found the extrapolated value of $N$ for $\Delta_i = 0$. In Figure 4 are shown plots for S. cerevisiae, H. influenzae, and E. coli. For comparison for S. cerevisiae, we used as an estimate for coding ORFs the set of all ORFs >100 codons long (7214 ORFs) and the set of 6200 coding ORFs primarily selected by MIPS (Munich Information Centre for Protein Sequences) (http://www.mips.biochem.mpg.de). For E. coli, we also prepared the distributions of genes and ORFs taking into account the angles of the third legs instead of the second ones. The approximation of total number of coding ORFs done for these distributions is shown in Fig. 4c.
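The procedure of this section (normalized distance from the center of the gene distribution, the estimate N at each distance, and the extrapolation to distance zero) can be sketched in Python. Several details are not specified in the text and are assumptions here: angles are averaged arithmetically without treating wrap-around at ±180 degrees, the sample standard deviation is used, the trimming is applied to the list of distance thresholds, and the extrapolation uses a straight-line least-squares fit. All function names are hypothetical.

```python
import math
from statistics import mean, stdev


def distances(points, is_gene):
    """Distance of every ORF from the center of the gene distribution.

    points: list of (angle1, angle2); is_gene: parallel list of bools.
    Differences are expressed in SD units of the known genes.
    """
    g1 = [p[0] for p, g in zip(points, is_gene) if g]
    g2 = [p[1] for p, g in zip(points, is_gene) if g]
    m1, m2, s1, s2 = mean(g1), mean(g2), stdev(g1), stdev(g2)
    return [math.hypot((a1 - m1) / s1, (a2 - m2) / s2) for a1, a2 in points]


def max_coding_estimates(dist, is_gene):
    """(distance, N) pairs with N = ORF_in + (G_out / G_in) * ORF_in."""
    order = sorted(range(len(dist)), key=dist.__getitem__)
    total_genes = sum(is_gene)
    pairs = []
    orf_in = g_in = 0
    for rank, idx in enumerate(order):
        orf_in += 1
        g_in += is_gene[idx]
        # Emit one estimate per distinct distance value.
        if rank + 1 < len(order) and dist[order[rank + 1]] == dist[idx]:
            continue
        if g_in:
            g_out = total_genes - g_in
            pairs.append((dist[idx], orf_in + g_out / g_in * orf_in))
    return pairs


def extrapolate_to_zero(pairs, drop_low=0.10, drop_high=0.05):
    """Trim the extremes, fit N = a + b * distance, and return a."""
    pairs = sorted(pairs)
    lo = int(len(pairs) * drop_low)
    hi = len(pairs) - int(len(pairs) * drop_high)
    xs, ys = zip(*pairs[lo:hi])
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx
```

A typical call chain would be `extrapolate_to_zero(max_coding_estimates(distances(points, is_gene), is_gene))`, where `points` holds the two leg angles of every ORF and `is_gene` marks the ORFs with known function.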

Estimation of coding probability for an ORF

To estimate the coding probability for an ORF, we divided the whole set of all ORFs into classes according to their distance values $\Delta_i$. For each class, we counted the number of expected coding ORFs (using the method described) and the total number of ORFs found in a given class. The ratio between these values has been assumed as coding probability for ORFs localized in a given class. It is possible to use the coding probability values obtained by this method directly or to plot them against $\Delta_i$, to make the polynomial approximation, and to describe the probability as a function of $\Delta_i$. The results are available at our WWW site (angband.microb.uni.wroc.pl).
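One possible reading of the class-based estimate is sketched below. The text does not say exactly how the expected number of coding ORFs in a class is obtained; the sketch assumes that the estimated total number of coding ORFs is distributed over the classes in proportion to the known genes found in each class. That proportionality rule, the class width, the cap at 1, and the function name `class_probabilities` are all assumptions for illustration only.

```python
def class_probabilities(dist, is_gene, n_coding, width=0.5):
    """Coding probability per distance class.

    dist: distance of each ORF from the gene-distribution center.
    is_gene: parallel list of bools (ORFs with known function).
    n_coding: estimated total number of coding ORFs in the genome.
    Returns {(class_start, class_end): probability}.
    """
    counts = {}
    for d, g in zip(dist, is_gene):
        k = int(d // width)
        total, genes = counts.get(k, (0, 0))
        counts[k] = (total + 1, genes + bool(g))
    n_genes = sum(genes for _, genes in counts.values())
    if n_genes == 0:
        raise ValueError("at least one ORF with known function is required")
    probabilities = {}
    for k, (total, genes) in sorted(counts.items()):
        # Assumption: coding ORFs follow the distribution of known genes.
        expected = n_coding * genes / n_genes
        probabilities[(k * width, (k + 1) * width)] = min(1.0, expected / total)
    return probabilities
```

The resulting values can be used directly or fitted with a polynomial in the distance, as the text suggests.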

RESULTS AND DISCUSSION

Because the representation of both parameters describing the ORFs is in degrees (±180), areas of the plots seen in Figures 2 and 3 are finite (they are the surfaces of the toruses). In the case of the yeast genome, about 6% of this area includes about 75% of all coding ORFs. All these genes have more A than T in the first and second positions of codons, more G than C in the first position, and less G than C in the second position. The ORFs with the lower number of A than T in the second position are localized in the plots below the main cloud of genes. In this latter region are genes coding for transmembrane proteins (information on the set of these genes was kindly supplied by A. Goffeau, Université Catholique de Louvain). The shift of these genes below the main cluster is due to the presence of codons coding for hydrophobic amino acids, which are rich in T in the second position, and underrepresentation of codons for hydrophilic amino acids, which are rich in A in the second position. Because transmembrane proteins possess hydrophobic spans, they are relatively rich in T in the second position.

When comparing different classes of yeast ORFs with annotation 1-6 in the MIPS base, it can be seen that classes 2 and 3 are a little more dispersed than class 1 (identified genes have annotation 1 in MIPS), which suggests that there are only a few noncoding ORFs in classes 1, 2, and 3. In the classes above 4 there are many noncoding ORFs. In fact, in class 6, only a few coding ORFs should be expected (data not shown).

The correlation coefficient between the CAI and the distance from the center is close to 0. One would expect such a result because the CAI is sensitive to the composition of the third positions of codons, whereas we used parameters measuring the asymmetry of the first two positions. We have observed some correlation (about -0.4) between the distance from the center of the distribution and the length of the ORFs for the yeast. As the distance is inversely related to the coding probability, a negative correlation should be expected because of two phenomena: (1) noncoding overlapping ORFs or random ORFs are usually shorter, and they are found far from the center of the distribution, and (2) it seems that very long ORFs could be considered as "averaged smaller ORFs." Thus, the SDs for the class of long ORFs should be smaller. To test the last assumption, we calculated the SDs of the first and the second angles for yeast ORFs longer than 1000 codons and found that they both equal 0.65 of those for the whole set of genes. It is obvious that the correlation between length of ORFs and their coding probability cannot be high because the relation between these parameters is not linear.

FIG. 4. Approximation of the total number of coding ORFs >100 codons in genomes of (a) S. cerevisiae, approximations done for the whole set of ORFs >100 codons and for ORFs published in the SGD database (after preliminary selection), (b) H. influenzae, (c) E. coli, approximations done for distributions prepared for relations between angles of the first and the second legs and relations between angles of the first and the third legs. The x-axis of each panel is the distance from the centre of gene distribution. Values marked on the plots: (a) approximation in database with all ORFs >100 codons, 4718; approximation in published database (after preliminary ORF elimination), 4691; (c) the first and the third positions, 3155; the first and the second positions, 2927.

The number of coding ORFs in yeast estimated by our method is much lower than that proposed by the SGD program. This number could be underestimated by us if (1) the set of already known genes is not statistically representative for the whole set of coding ORFs in the yeast genome, and it is too homogeneous to be considered a statistically significant sample, or (2) some ORFs with an authentic start translation codon farther downstream from the first ATG have been discarded because the noncoding beginning segment of the sequence would shift the whole ORF farther away from the distribution center. The latter may be true for the shortest ORFs because, as we have shown, the longer ORFs are situated closer to the distribution center. On the other hand, it is less probable for the shorter ORF to have its start translation codon very far downstream from the first ATG. This is the reason that there should be only a small number of ORFs eliminated by this procedure.

The results obtained by this method suggest that the overwhelming number of yeast ORFs with well-established homologies are coding, with only a very few being noncoding. Among ORFs with no known homologies, the noncoding fraction is much larger. As, by definition, these presumptive noncoding ORFs would be counted as orphans, we suggest that the class of orphans is actually much smaller than previously assumed (Dujon, 1996; Casari et al., 1996).

We also estimated the number of coding ORFs >100 codons in the genomes of E. coli and H. influenzae. These genomes have different organizations relative to each other. We found that about 85% of the nucleotides of the H. influenzae genome are in ORFs of >100 codons. Only about 1% of nucleotides are shared by overlapping ORFs. In the E. coli genome, there are many overlapping ORFs (>2000 overlapping ORFs, and 11.5% of all nucleotides are within overlaps). Nonoverlapping ORFs cover about 90% of the E. coli genome. Still, when comparing the whole set of ORFs to protein coding ORFs, in E. coli, only 48.4% of these ORFs are expected to be coding. In H. influenzae, about 77% of all ORFs are expected to be coding. We have also found that in the E. coli genome, the composition of the third position in the codons depends strongly on the position of the ORF in the chromosome. Using the first and the third angles as parameters for ORF distribution, followed by approximation of the number of coding ORFs in the E. coli genome, we estimate a slightly higher fraction of coding ORFs (52.2% vs 48.4%).

To estimate which position in the codon is the better predictor for a protein coding function, we examined the distributions of the angles for the three spider legs representing codon positions in S. cerevisiae, E. coli, H. influenzae, and M. jannaschii (Fig. 5). It can be seen that the first position is the best predictor for all the examined genomes. For all examined genomes, the average values of angles for the first positions are between 0 and 90 degrees. Even if the second parameter is not so predictive, the first parameter causes genes to form a narrow ring on a torus. The third position seems to be a better predictor than the second for genomes of E. coli, Mycoplasma (not shown), and M. jannaschii (compare also the pairs of distributions: Fig. 2h and i, Fig. 3b and c, Fig. 3e and f). We also found that for mitochondrial genomes the third position seems to be a better predictor than the second one (data not shown). Note that using the third position as one of the parameters does not correspond to CAI or the method of McInerney (1997) because (1) mutation pressure exploits the transition mechanism, and the changes in [A-T] and [G-C] in the third positions of codons in the coding strand that we measured cannot result from transition but rather transversion, and (2) half of the substitutions in the third positions of the type purine↔pyrimidine are not silent and cannot be subject to a simple translational selection.

FIG. 5. Distribution of ORFs with known function into classes according to the values of angles of legs describing three positions in codons, for S. cerevisiae, E. coli, H. influenzae, and M. jannaschii. (a) Angles of the first legs. (b) Angles of the second legs. (c) Angles of the third legs. The width of classes is 20 degrees.

A coding sequence can generate a noncoding ORF in a specific phase (Cebrat and Dudek, 1996). As two of the three stop codons (TAA, TAG) have the first two nucleotides in a palindromic relationship, they can generate stops in the related phase of the opposite strand. By definition, coding sequences do not have stops in frame. The frequency of stops in the related phase of the opposite strand is lower, and the probability of the ORF appearing is higher.
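The argument about stops on the opposite strand can be checked by a small enumeration. The sketch below takes every pair of adjacent sense codons on the coding strand and asks, for each of the three possible offsets, whether the triplet at that offset reads as a stop codon on the opposite strand. Weighting all 61 sense codons equally is a simplifying assumption (real codon usage is biased), and the offset numbering and the function name `opposite_stop_counts` are choices made for this illustration.

```python
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# The 61 codons that may appear in frame inside a coding sequence.
SENSE = [c for c in ("".join(t) for t in product("ACGT", repeat=3))
         if c not in STOPS]


def opposite_stop_counts():
    """Count codon pairs that put a stop on the opposite strand.

    counts[k] is the number of sense-codon pairs whose triplet starting
    at offset k (0, 1, or 2) is the reverse complement of a stop codon.
    """
    counts = [0, 0, 0]
    for first, second in product(SENSE, repeat=2):
        pair = first + second
        for offset in range(3):
            if reverse_complement(pair[offset:offset + 3]) in STOPS:
                counts[offset] += 1
    return counts
```

Under these assumptions the count is lowest at offset 2, where an opposite-strand TAA or TAG requires the next coding codon to begin with TA; because TAA and TAG themselves cannot occur in frame, only TAT and TAC remain, which is the depletion described in the text.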

To find the coding ORF in a pair of overlapping ORFs, it is necessary to distinguish between the true reading frame and the reading frame of the ORF generated by the coding sequence. Because the method described here shows differences in asymmetry for different positions in codons, it is simple to locate the coding frame. It can be seen readily in the cases of Synechocystis sp. and M. thermoautotrophicum (Fig. 3g-l). The first angles for the overwhelming fraction of ORFs of these genomes have a very narrow distribution, but for a specific class of ORFs, this parameter does not seem to be predictive (ORFs situated in the lower part of the plots). To check this, we selected all overlapping ORFs and divided this set (of overlapping ORFs) into the shorter ORFs in a pair and the longer ones. In Figure 3g-l, we show the distributions for these three sets of ORFs. Assuming that in a pair of overlapping ORFs the longer one is coding (which should be true in most cases), we can conclude that the points distributed horizontally on plots g, i, j, and l in Figure 3 are not coding.


The method presented here seems to be universal, as all genomes, even those of viruses (Fig. 2j-l), show specific asymmetry in coding vs noncoding strands. There are also many other numerical parameters describing spiders that can be used for ORF discrimination. One of these parameters is the normalized length of spider legs.

ACKNOWLEDGMENTS

We thank Prof. A. Goffeau for many long discussions, for encouragement, for supplying the information on transmembrane protein coding ORFs, and for help in understanding the specific shift observed in the distribution of these ORFs. This work was supported by KBN grant number 1016/S/IMi/97.

REFERENCES

BERTHELSEN, Ch.L., GLAZIER, J.A., and SKOLNICK, M.H. (1992). Global fractal dimension of human DNA sequences treated as pseudorandom walks. Phys Rev A 45, 8902-8913.

BENETZEN, J.L., and HALL, B.D. (1982). Codon selection in yeast. J Biol Chem 257, 3026-3031.

CASARI, G., de DRUVAR, A., SANDER, C., and SCHNEIDER, R. (1996). Bioinformatics and the discovery of gene function. Trends Genet 12, 244-255.

CEBRAT, S., and DUDEK, M.R. (1996). Generation of overlapping reading frames. Trends Genet 12, 12.

CEBRAT, S., DUDEK, M.R., and MACKIEWICZ, P. (1997a). The number of coding ORFs in Saccharomyces cerevisiae genome and the mystery of orphans. Yeast 13, P189 (abstract). 18th Int Conf Yeast Genetics Mol Biol, Stellenbosch, South Africa.

CEBRAT, S., DUDEK, M.R., and ROGOWSKA, A. (1997b). Asymmetry in nucleotide composition of sense and antisense strands as a parameter for discriminating open reading frames as protein coding sequences. J Appl Genetics 38, 1-9.

DUJON, B. (1996). The yeast genome project: what did we learn? Trends Genet 12, 263-270.

DUJON, B., and 106 co-authors. (1994). Complete DNA sequence of yeast chromosome XI. Nature 369, 371-378.

FICKETT, J.W. (1996). Finding genes by computer: the state of the art. Trends Genet 12, 316-320.

IKEMURA, T. (1982). Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. J Mol Biol 158, 573-597.

KARLIN, S., and BURGE, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11, 283-290.

McINERNEY, J.O. (1997). Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns. Microb Comp Genom 2, 89-97.

SHARP, P.M., and LI, W.-H. (1987). The codon adaptation index: a measure of directional synonymous codon usage bias and its potential applications. Nucleic Acids Res 15, 1281-1295.

SHARP, P.M., STENICO, M., PEDEN, J.F., and LLOYD, A.T. (1993). Codon usage: mutational bias, translational selection or both? Biochem Soc Trans 21, 835-841.

SUEOKA, N. (1988). Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci USA 85, 2653-2657.

Address reprint requests to:
Stanisław Cebrat
Institute of Microbiology
Wrocław University
ul. Przybyszewskiego 63/77
51-148 Wrocław
Poland

e-mail: cebrat@angband.microb.uni.wroc.pl
