Mary Ann Liebert, Inc.
Asymmetry of Coding Versus Noncoding Strand in Coding Sequences of Different Genomes
STANISŁAW CEBRAT,1 MIROSŁAW R. DUDEK,2 PAWEŁ MACKIEWICZ,1 MARIA KOWALCZUK,1 and MAŁGORZATA FITA1
ABSTRACT
We have used the asymmetry between the coding and noncoding strands in different codon positions of coding sequences of DNA as a parameter to evaluate the coding probability for open reading frames (ORFs). The method enables an approximation of the total number of coding ORFs in the set of analyzed sequences as well as an estimation of the coding probability for the ORFs. The asymmetry observed in the nucleotide composition of codons in coding sequences has been used successfully for analysis of the genomes completed at the time of this analysis.
INTRODUCTION
There are many methods to discriminate between coding and noncoding DNA sequences (Fickett, 1996, for review). For nondisrupted genes, one of the better criteria is the length of an open reading frame (ORF). In the yeast genome project (SGD, Saccharomyces Genome Database), the lower limit of an ORF length has been set at 100 codons. An additional criterion is the value of the codon adaptation index (CAI) (Sharp and Li, 1987). It has been arbitrarily accepted in SGD that ORFs shorter than 150 triplets with CAI < 0.11 are considered noncoding (Dujon et al., 1994). It has also been accepted in SGD that the longer ORF of a pair of overlapping ORFs is considered coding. Generally, these criteria work well, but some ORFs are shorter than 150 codons with CAI < 0.11 and perform already documented coding functions. Such criteria as CAI and codon bias index (CBI) (Benetzen and Hall, 1982) are based on the observation that codon usage in protein coding sequences does not correspond to codon frequency expected from the nucleotide composition of the genome. Two different forces have been suggested to be responsible for this bias. One is translational selection based on relative concentrations of iso-accepting tRNAs (Ikemura, 1982). The second is mutational pressure that forces a change in overall nucleotide composition of DNA and especially influences the third (silent) positions of codons (Sueoka, 1988; Sharp et al., 1993, for review).
We have assumed that a coding sequence should reflect specific construction of the genetic code, nonrandom (biased) amino acid usage, and physical restrictions of the DNA (RNA) molecule. The most fundamental rule of DNA composition is complementarity of the nucleotide bases, A:T and G:C. In a random sequence, this rule implies balance in purine/pyrimidine composition of both DNA strands, which is observed in long DNA stretches (as in yeast chromosomes) but not in coding sequences organized in operons (λ phage, for example). Dujon et al. (1994) observed a relative abundance of purine doublets in coding sequences of yeast chromosome 2. The same can be concluded from the results presented by Karlin and Burge (1995) for other genomes. This asymmetry in purine/pyrimidine composition of coding vs noncoding DNA strands has been used to discriminate ORFs in the yeast genome (Cebrat et al., 1997a,b). The asymmetry in nucleotide composition of both strands of a protein coding sequence is a sum of the specific asymmetry of each position within codons. Thus, the asymmetry in the first, the second, and the third positions could compensate for each other within a coding sequence, and asymmetry for each position in codons should be analyzed separately. In some aspects, this method resembles the CAI approach, but the results of the analysis are not correlated with the results using CAI; this will be discussed later. Rather, our approach encompasses some rules in amino acid composition of proteins and a highly sophisticated structure for the genetic code.
1Institute of Microbiology, and 2Institute of Theoretical Physics, Wrocław University, Wrocław, Poland.
MATERIALS AND METHODS
Databases and software
The Saccharomyces cerevisiae genome sequences were downloaded September 23, 1996, from genome-ftp.stanford.edu. Information on yeast gene function, ORF homology, and their presumed functions was downloaded November 16, 1996, from http://www.mips.biochem.mpg.de. Sequences for all viruses were downloaded May 10, 1997, from ncbi.nlm.nih.gov, sequences for Escherichia coli from http://genom4.aist-nara.ac.jp on May 9, 1997, for Haemophilus influenzae, Mycoplasma genitalium, and Methanococcus jannaschii from http://www.tigr.org on May 8, 1997, for Mycoplasma pneumoniae from http://www.zmbh.uni-heidelberg.de on May 10, 1997, for Synechocystis sp. from ftp://ftp.kazusa.or.jp on May 8, 1997, and for Methanobacterium thermoautotrophicum from ftp://www.genomcorp.com on June 3, 1997. After the retrieval, data were not updated.
In the analyses, we have considered all ORFs found in the completely sequenced genomes longer than 100 codons starting with ATG and ending with one of the three universal stop codons.
The software for all the analyses was written by one of the authors (M.R.D.).
Graphic representation of sequences
To make a graphic representation of a sequence in two-dimensional space, we analyzed the displacement of a DNA walker that checked each position within codons separately. For the DNA walk, we used a modified method of Berthelsen et al. (1992). For each sequence, we performed three DNA walks, independently, for each nucleotide position in codon triplets. The first walker starts from the first nucleotide position of the first codon and then jumps every third nucleotide until the end of the examined sequence (stop codon) has been reached. Similarly, the second and the third walkers start from the second and third nucleotide positions of the first codon, respectively. Every jump of a walker is associated with a unit shift in two-dimensional space depending on the type of nucleotide visited. The shifts are (0,1) for G, (1,0) for A, (0,-1) for C, and (-1,0) for T. Hence, each DNA walk represents a history of nucleotide composition of the first, the second, or the third position of codons along the DNA sequence. The three walks together have been called a spider and a single walk has been called a spider leg.
An example of a spider representing a typical gene in the yeast genome (the multicopy suppressor of sin4, YML109w) is seen in Figure 1a. In Figure 1b, the sequence coding for a hydrophobic protein (vacuolar calcium transporter protein YDL128w) is shown, and in Figure 1c, a spider representing an intergenic sequence of 921 triplets is shown. The spiders depict the nucleotide composition of the three positions in codons, but it is also possible to extract some numerical information from these plots and to characterize whole sets of ORFs by the method.
Distribution of ORFs in a torus projection
FIG. 1. Two-dimensional representation of DNA walks (spiders) performed for different positions in codons for yeast sequences. (a) An example of a spider representing a typical gene in the yeast genome, the sequence coding for a multicopy suppressor of sin4 (YML109w). (b) The sequence coding for a hydrophobic protein, vacuolar calcium transport protein (YDL128w). (c) A spider representing an intergenic sequence 921 triplets long.
For each ORF, we measured (in degrees) the angles of the vector determined by the origin and the end of the spider legs. In fact, the angles equal to arcus tangent [(G-C)/(A-T)] have positive values for the first two quadrants of the plot and negative values for the third and fourth quadrants. This has enabled us to construct a plot where each ORF is represented by a point whose coordinates are (x) the angle representing the first leg and (y) the angle representing the second leg. It is also possible to use the angle of the third leg as one of the two coordinates or as the third coordinate in three-dimensional space. As both axes of the plot are scaled from -180 degrees to +180 degrees, the plot is, in fact, a projection of torus, and its area is finite. The distributions of different sets of ORFs from different genomes are presented in Figures 2 and 3.
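
The three DNA walks and the leg angles described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the names `SHIFTS`, `spider_legs`, `leg_angle`, and `torus_point` are hypothetical, the input is assumed to be an uppercase A/C/G/T string read in frame from the first codon, and `math.atan2` is used so that the angle falls in the -180 to +180 degree range given in the text.

```python
import math

# Unit shifts taken from the text: G up, A right, C down, T left.
SHIFTS = {"G": (0, 1), "A": (1, 0), "C": (0, -1), "T": (-1, 0)}


def spider_legs(orf):
    """Return the three walks (one per codon position) as lists of (x, y) points."""
    legs = []
    for start in range(3):
        x = y = 0
        path = [(0, 0)]
        # Visit every third nucleotide, starting at codon position start + 1.
        for base in orf[start::3]:
            dx, dy = SHIFTS[base]
            x, y = x + dx, y + dy
            path.append((x, y))
        legs.append(path)
    return legs


def leg_angle(path):
    """Angle (degrees) of the vector from the origin to the end of a leg."""
    x, y = path[-1]  # x equals A - T, y equals G - C for this codon position
    return math.degrees(math.atan2(y, x))


def torus_point(orf):
    """Coordinates of an ORF in the plot: (first-leg angle, second-leg angle)."""
    angles = [leg_angle(path) for path in spider_legs(orf)]
    return angles[0], angles[1]
```

For the toy sequence `ATGGCATAA`, the first leg visits A, G, T and ends at (0, 1), so its angle is 90 degrees; the second leg visits T, C, A and ends at (0, -1), giving -90 degrees. Whether the stop codon itself is included in the walk is left to the caller here, since the text does not settle it.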
FIG. 2. Distribution of ORFs into the projection of torus where (x) is the angle of the first leg of spider and (y) is the angle of the second leg of spider [an exception is i, where (y) represents the angle of the third leg of the spider]. (a) S. cerevisiae, all ORFs >100 codons. (b) S. cerevisiae, ORFs with known function. (c) S. cerevisiae, ORFs coding for transmembrane proteins. (d) H. influenzae, all ORFs >100 codons. (e) H. influenzae, ORFs with known function. (f) S. cerevisiae, intergenic sequences >100 triplets. (g) E. coli, all ORFs >100 codons. (h) E. coli, ORFs with known function. (i) E. coli, ORFs with known function (angles 1 and 3). (j,k,l) Phage T4, vaccinia, and variola viruses, respectively, all ORFs >100 codons. Both axes run from -180 to 180 degrees.
FIG. 3. Distribution of ORFs into the projection of torus where (x) is the angle of the first leg of spider and (y) is the angle of the second leg of spider. (a) M. pneumoniae, all ORFs >100 codons. (b) M. genitalium, all ORFs, angle 1 vs angle 2. (c) As in (b), but angle 1 vs angle 3. (d) M. jannaschii, all ORFs >100 codons. (e) M. jannaschii, ORFs with known functions, angle 1 vs angle 2. (f) As in (e), but angle 1 vs angle 3. (g) Synechocystis sp., all ORFs >100 codons. (h) Synechocystis sp., ORFs with known function. (i) Synechocystis sp., shorter ORFs from pairs of overlapping ones. (j) M. thermotrophicum, all ORFs >100 codons. (k) M. thermotrophicum, longer ORFs from pairs of overlapping ones. (l) M. thermotrophicum, shorter ORFs from pairs of overlapping ones. Both axes run from -180 to 180 degrees.
Approximation of total number of coding ORFs in a genome
For three genomes, S. cerevisiae, H. influenzae, and E. coli, we compared the distributions of all ORFs found in the genomes with the distributions of ORFs with already identified functions. To compare the distributions of coding sequences with the distribution of all ORFs in the same genome, we analyzed for ORFs with known functions the average value and the standard deviations (SD) for angles of the first legs and the average value for angles of the second legs. The average values, $A_1$ and $A_2$, for the set of genes are considered coordinates of the center of the genes' distribution. Next, we normalized parameters by giving each ORF values $\Delta_1$ and $\Delta_2$, which are equal to the differences (expressed in SD) between the average values and the angle of the first leg and the second leg, respectively, for a given ORF. For each individual ORF, we measured the distance from the center, which is equal to
\[ \Delta_i = \sqrt{\Delta_1^2 + \Delta_2^2} \]
To estimate the number of coding ORFs we used the algorithm
\[ N_i = ORF_{in} + (G_{out}/G_{in})(ORF_{in}) \]
where $N_i$ is an assumed maximal number of coding ORFs for a given $\Delta_i$, $ORF_{in}$ is the number of all ORFs inside the space with a distance to the distribution center $\le \Delta_i$, $G_{in}$ is the number of genes (ORFs with known function) with a distance to the distribution center $\le \Delta_i$, and $G_{out}$ is the number of genes (ORFs with known function) with a distance to center $> \Delta_i$.
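
The normalization and the estimate of the number of coding ORFs can be sketched as follows. This is only an illustration under assumptions that the text does not settle: the function names are hypothetical, the population standard deviation is used, angles are treated as ordinary numbers without wrap-around at ±180 degrees, the boundary of the space is taken as inclusive, and the estimate is read as N_i = ORF_in + (G_out/G_in) ORF_in.

```python
import math
from statistics import mean, pstdev


def normalized_distances(angles, gene_angles):
    """Distance of each ORF from the center of the gene distribution, in SD units.

    angles and gene_angles are lists of (first_leg_angle, second_leg_angle)
    pairs in degrees; gene_angles holds the ORFs with known function.
    """
    first = [a for a, _ in gene_angles]
    second = [b for _, b in gene_angles]
    a1, a2 = mean(first), mean(second)          # center of the gene distribution
    sd1, sd2 = pstdev(first), pstdev(second)    # assumption: population SD
    return [math.hypot((a - a1) / sd1, (b - a2) / sd2) for a, b in angles]


def estimate_coding_orfs(orf_dist, gene_dist, radius):
    """Estimate N_i for one radius (distance from the center).

    All ORFs inside the radius are counted as coding; ORFs outside are added
    in the same outside/inside proportion as the known genes.
    Assumes at least one known gene lies inside the radius.
    """
    orf_in = sum(d <= radius for d in orf_dist)
    g_in = sum(d <= radius for d in gene_dist)
    g_out = len(gene_dist) - g_in
    return orf_in + (g_out / g_in) * orf_in
```

With invented distances `[0.5, 0.8, 1.2, 1.5, 2.0, 3.0]` for all ORFs and `[0.4, 0.9, 1.1, 2.5]` for known genes, a radius of 1.0 gives two ORFs and two genes inside and two genes outside, so the estimate is 2 + (2/2)·2 = 4. Repeating the call over a range of radii yields the curve that is extrapolated to zero distance.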
In this algorithm, for a given $\Delta_i$, we assumed that all ORFs inside the space determined by $\Delta_i$ are coding and that the fraction of coding ORFs outside the space stays in the same proportion to the number of ORFs inside the space as it is for the distribution of ORFs with the described coding function.
We plotted the $N_i$ values vs distance ($\Delta_i$). To avoid errors, we cut off 5% of ORFs with maximum values of $\Delta_i$ and 10% with minimal values of $\Delta_i$. To estimate the approximated number of coding ORFs, we found the extrapolated value of $N$ for $\Delta_i = 0$. In Figure 4 are shown plots for S. cerevisiae, H. influenzae, and E. coli. For comparison, for S. cerevisiae we used as an estimate for coding ORFs the set of all ORFs >100 codons long (72140 ORFs) and the set of 6200 ORFs primarily selected by MIPS (Munich Information Centre for Protein Sequences) (http://www.mips.biochem.mpg.de). For E. coli, we also prepared the distributions of genes and ORFs taking into account the angles of the third legs instead of the second ones. The approximation of total number of coding ORFs done for these distributions is shown in Fig. 4c.
Estimation of coding probability for an ORF
To estimate the coding probability for an ORF, we divided the whole set of all ORFs into classes according to $\Delta_i$ values. For each class, we counted the number of expected coding ORFs (using the method described) and the total number of ORFs found in a given class. The ratio between these values has been assumed as coding probability for ORFs localized in a given class. It is possible to use the coding probability values obtained by this method directly or to plot them against $\Delta_i$, to make the polynomial approximation, and to describe the probability as a function of $\Delta_i$. The results are available at our WWW site (angband.microb.uni.wroc.pl).
RESULTS AND DISCUSSION
Because the representation of both parameters describing the ORFs is in degrees (±180), areas of the plots seen in Figures 2 and 3 are finite (they are the surfaces of the toruses). In the case of the yeast genome, about 6% of this area includes about 75% of all coding ORFs. All these genes have more A than T in the first and second positions of codons, more G than C in the first position, and less G than C in the second position. The ORFs with the lower number of A than T in the second position are localized in the plots below the main cloud of genes. In this latter region are genes coding for transmembrane proteins (information on the set of these genes was kindly supplied by A. Goffeau, Université Catholique de Louvain). The shift of these genes below the main cluster is due to the presence of codons coding for hydrophobic amino acids, which are rich in T in the second position, and underrepresentation of codons for hydrophilic amino acids, which are rich in A in the second position. Because transmembrane proteins possess hydrophobic spans, they are relatively rich in T in the second position.
When comparing different classes of yeast ORFs with annotation 1-6 in the MIPS base, it can be seen that classes 2 and 3 are a little more dispersed than class 1 (identified genes have annotation 1 in MIPS), which suggests that there are only a few noncoding ORFs in classes 1, 2, and 3. In the classes above 4 there are many noncoding ORFs.
In fact, in class 6, only a few coding ORFs should be expected (data not shown).
FIG. 4. Approximation of the total number of coding ORFs >100 codons in genomes of (a) S. cerevisiae, approximations done for the whole set of ORFs >100 codons and for ORFs published in the SGD database (after preliminary selection), (b) H. influenzae, (c) E. coli, approximations done for distributions prepared for relations between angles of the first and the second legs and relations between angles of the first and the third legs. Horizontal axes: distance from the centre of gene distribution. Values marked in the panels: (a) approximation in dbase with all ORFs >100 codons: 4718; approximation in published dbase (after preliminary ORF elimination): 4691; (c) the first and the third positions: 3155; the first and the second positions: 2927.
The correlation coefficient between the CAI and the distance from the center is close to 0. One would expect such a result because the CAI is sensitive to the composition of the third positions of codons, whereas we used parameters measuring the asymmetry of the first two positions. We have observed some correlation (about -0.4) between the distance from the center of the distribution and the length of the ORFs for the yeast. As the distance is reciprocal to the coding probability, a negative correlation should be expected because of two phenomena: (1) noncoding overlapping ORFs or random ORFs are usually shorter, and they are found far from the center of the distribution, and (2) it seems that very long ORFs could be considered as "averaged smaller ORFs."
Thus, the SDs for the class of long ORFs should be smaller. To check the last assumption, we calculated the SDs of the first and the second angles for yeast ORFs longer than 1000 codons and found that they both equal 0.65 of those for the whole set of genes. It is obvious that the correlation between length of ORFs and their coding probability cannot be high because the relation between these parameters is not linear.
The number of coding ORFs in yeast estimated by our method is much lower than that proposed by the SGD program. This number could be underestimated by us if (1) the set of already known genes is not statistically representative for the whole set of coding ORFs in the yeast genome, and it is too homogeneous to be considered a statistically significant sample, or (2) some ORFs with an authentic start translation codon farther downstream from the first ATG have been discarded because the noncoding beginning segment of the sequence would shift the whole ORF farther away from the distribution center. The latter may be true for the shortest ORFs because, as we have shown, the longer ORFs are situated closer to the distribution center. On the other hand, it is less probable for the shorter ORF to have its start translation codon very far downstream from the first ATG. This is the reason that there should be only a small number of ORFs eliminated by this procedure.
The results obtained by this method suggest that the overwhelming number of yeast ORFs with well-established homologies are coding, with only a very few being noncoding. Among ORFs with no known homologies, the noncoding fraction is much larger. As, by definition, these presumptive noncoding ORFs would be counted as orphans, we suggest that the class of orphans is actually much smaller than previously assumed (Dujon, 1996; Casari et al., 1996).
We also estimated the number of coding ORFs >100 codons in the genomes of E. coli and H. influenzae. These genomes have different organizations relative to each other. We found that about 85% of the nucleotides of the H. influenzae genome is in ORFs of >100 codons. Only about 1% of nucleotides are shared by overlapping ORFs.
In the E. coli genome, there are many overlapping ORFs (>2000 overlapping ORFs, and 11.5% of all nucleotides are within overlaps). Nonoverlapping ORFs cover about 90% of the E. coli genome. Still, when comparing the whole set of ORFs to protein coding ORFs, in E. coli, only 48.4% of these ORFs are expected to be coding. In H. influenzae, about 77% of all ORFs are expected to be coding. We have also found that in the E. coli genome, the composition of the third position in the codons depends strongly on the position of the ORF in the chromosome. Using the first and the third angles as parameters for ORF distribution, followed by approximation of the number of coding ORFs in the E. coli genome, we estimate a slightly higher fraction of coding ORFs, 52.2% vs 48.4%.
To estimate which position in the codon is the better predictor for a protein coding function, we examined the distributions of the angles for the three spider legs representing codon positions in S. cerevisiae, E. coli, H. influenzae, and M. jannaschii (Fig. 5). It can be seen that the first position is the best predictor for all the examined genomes. For all examined genomes, the average values of angles for the first positions are between 0 and 90 degrees. Even if the second parameter is not so predictive, the first parameter causes genes to form a narrow ring on a torus. The third position seems to be a better predictor than the second for genomes of E. coli, Mycoplasma (not shown), and M. jannaschii (compare also the pairs of distributions: Fig. 2h and i, Fig. 3b and c, Fig. 3e and f). We also found that for mitochondrial genomes the third position seems to be a better predictor than the second one (data not shown). Note that using the third position as one of the parameters does not correspond to CAI or the method of McInerney (1997) because (1) mutation pressure exploits the transition mechanism, and the changes in [A-T] and [G-C] in the third positions of codons in the coding strand that we measured cannot result from transition but rather transversion, and (2) half of the substitutions in the third positions of the type purine<=>pyrimidine are not silent and cannot be subject to a simple translational selection.
A coding sequence can generate a noncoding ORF in a specific phase (Cebrat and Dudek, 1996). As two of the three stop codons (TAA, TAG) have the first two nucleotides in a palindromic relationship, they can generate stops in the related phase of the opposite strand. By definition, coding sequences do not have stops in frame. The frequency of stops in the related phase of the opposite strand is lower, and the probability of the ORF appearing is higher.
FIG. 5. Distribution of ORFs with known function into classes according to the values of angles of legs describing three positions in codons. (a) Angles of the first legs. (b) Angles of the second legs. (c) Angles of the third legs. The width of classes: 20 degrees. Horizontal axes: class of slope. Curves are shown for S. cerevisiae, E. coli, H. influenzae, and M. jannaschii.
To find the coding ORF in a pair of overlapping ORFs, it is necessary to distinguish between the true reading frame and the reading frame of the ORF generated by the coding sequence. Because the method described here shows differences in asymmetry for different positions in codons, it is simple to locate the coding frame. It can be seen readily in the cases of Synechocystis sp. and M. thermotrophicum (Fig. 3g-l). The first angles for the overwhelming fraction of ORFs of these genomes have a very narrow distribution, but for a specific class of ORFs, this parameter does not seem to be predictive (ORFs situated in the lower part of the plots). To check this, we selected all overlapping ORFs and divided this set (of overlapping ORFs) into the shorter ORFs in a pair and the longer ones. In Figure 3, g-l, we show the distributions for these three sets of ORFs. Assuming that in a pair of overlapping ORFs the longer one is coding (which should be true in most cases), we can conclude that the points distributed horizontally on plots g, i, j, and l in Figure 3 are not coding.
The method presented here seems to be universal, as all genomes, even those of viruses (Fig. 2j-l), show specific asymmetry in coding vs noncoding strands. There are also many other numerical parameters describing spiders that can be used for ORF discrimination. One of these parameters is the normalized length of spider legs.
ACKNOWLEDGMENTS
We thank Prof. A. Goffeau for many long discussions, for encouragement, for supplying the information on transmembrane protein coding ORFs, and for help in understanding the specific shift observed in the distribution of these ORFs. This work was supported by a KBN grant number 1016/S/IMi/97.
REFERENCES
BERTHELSEN, Ch. L., GLAZIER, J.A., and SKOLNICK, M.H. (1992). Global fractal dimension of human DNA sequences treated as pseudorandom walks. Phys Rev A 45, 8902-8913.
BENETZEN, J.L., and HALL, B.D. (1982). Codon selection in yeast. J Biol Chem 257, 3026-3031.
CASARI, G., de DRUVAR, A., SANDER, C., and SCHNEIDER, R. (1996). Bioinformatics and the discovery of gene function. Trends Genet 12, 244-255.
CEBRAT, S., and DUDEK, M.R. (1996). Generation of overlapping reading frames. Trends Genet 12, 12.
CEBRAT, S., DUDEK, M.R., and MACKIEWICZ, P. (1997a). The number of coding ORFs in Saccharomyces cerevisiae genome and the mystery of orphans. Yeast 13, P189 (abstract). 18th Int Conf Yeast Genetics Mol Biol, Stellenbosch, South Africa.
CEBRAT, S., DUDEK, M.R., and ROGOWSKA, A. (1997b). Asymmetry in nucleotide composition of sense and antisense strands as a parameter for discriminating open reading frames as protein coding sequences. J Appl Genetics 38, 1-9.
DUJON, B. (1996). The yeast genome project: what did we learn? Trends Genet 12, 263-270.
DUJON, B., and 106 co-authors. (1994). Complete DNA sequence of yeast chromosome XI. Nature 369, 371-378.
FICKETT, J.W. (1996). Finding genes by computer: the state of the art. Trends Genet 12, 316-320.
IKEMURA, T. (1982). Correlation between the abundance of yeast transfer RNAs and the occurrence of the respective codons in protein genes. J Mol Biol 158, 573-597.
KARLIN, S., and BURGE, C. (1995). Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11, 283-290.
McINERNEY, J.O. (1997). Prokaryotic genome evolution as assessed by multivariate analysis of codon usage patterns. Microb Comp Genom 2, 89-97.
SHARP, P.M., and LI, W.-H. (1987). The codon adaptation index: a measure of directional synonymous codon usage bias and its potential applications. Nucleic Acids Res 15, 1281-1295.
SHARP, P.M., STENICO, M., PEDEN, J.F., and LLOYD, A.T. (1993). Codon usage: mutational bias, translational selection or both? Biochem Soc Trans 21, 835-841.
SUEOKA, N. (1988). Directional mutation pressure and neutral molecular evolution. Proc Natl Acad Sci USA 85, 2653-2657.
Address reprint requests to:
Stanisław Cebrat
Institute of Microbiology
Wrocław University
ul. Przybyszewskiego 63/77
51-148 Wrocław
Poland
e-mail: cebrat@angband.microb.uni.wroc.pl