
Academic year: 2021


(1)

Clustering and Classification in Gene Expression Data

(2)

Data from Garber et al., PNAS 98, 2001.

(3)
(4)

Clustering

• Clustering is an exploratory tool to see who's running with who: Genes and Samples.

• “Unsupervised”

• NOT for classification of samples.

• NOT for identification of differentially expressed genes.

(5)

Clustering

• Clustering organizes things that are close into groups.

• What does it mean for two genes to be close?

• What does it mean for two samples to be close?

• Once we know this, how do we define groups?

• Hierarchical and K-Means Clustering

(6)

Distance

• We need a mathematical definition of distance between two points

• What are points?

• If each gene is a point, what is the mathematical definition of a point?

(7)

Points

• Gene1 = $(E_{11}, E_{12}, \ldots, E_{1N})'$

• Gene2 = $(E_{21}, E_{22}, \ldots, E_{2N})'$

• Sample1 = $(E_{11}, E_{21}, \ldots, E_{G1})'$

• Sample2 = $(E_{12}, E_{22}, \ldots, E_{G2})'$

• $E_{gi}$ = expression of gene g in sample i

[Figure: DATA MATRIX with rows for genes 1, …, G and columns for samples 1, …, N]
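As a small illustration of the two kinds of points, the sketch below stores a data matrix as a list of rows and reads a gene as a row and a sample as a column. The matrix values and the helper names `gene_point` and `sample_point` are made up for this example and are not from the slides.

```python
# Hypothetical 3-gene x 4-sample expression matrix: data[g][i] holds E_gi.
data = [
    [2.0, 4.0, 6.0, 8.0],  # gene 1
    [1.0, 3.0, 5.0, 7.0],  # gene 2
    [9.0, 9.0, 1.0, 1.0],  # gene 3
]


def gene_point(matrix, g):
    """Gene g (1-based) as a point with N coordinates: one row of the matrix."""
    return list(matrix[g - 1])


def sample_point(matrix, i):
    """Sample i (1-based) as a point with G coordinates: one column of the matrix."""
    return [row[i - 1] for row in matrix]


gene1 = gene_point(data, 1)      # (E_11, E_12, ..., E_1N)
sample2 = sample_point(data, 2)  # (E_12, E_22, ..., E_G2)
```

Clustering genes means working with the rows; clustering samples means working with the columns.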

(8)

Most Famous Distance

• Euclidean distance

– Example: distance between gene 1 and gene 2:

– $d(\text{Gene1}, \text{Gene2}) = \sqrt{\sum_{i=1}^{N} (E_{1i} - E_{2i})^2}$

• When N is 2, this is distance as we know it:

[Figure: straight-line distance between Baltimore and DC]

• When N is 20,000 you have to think abstractly
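A minimal sketch of the Euclidean distance in plain Python. The function name `euclidean_distance` and the expression values are illustrative assumptions. The same function handles N = 2 and larger N without change.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# N = 2: ordinary distance in the plane (a 3-4-5 right triangle).
d_plane = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# N = 5: the same formula with more coordinates (hypothetical expression values).
gene1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene2 = [2.0, 2.0, 1.0, 4.0, 7.0]
d_genes = euclidean_distance(gene1, gene2)
```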

(9)

Correlation can also be used to compute distance

• Pearson Correlation

• Spearman Correlation

• Uncentered Correlation

• Absolute Value of Correlation

(10)

The difference between the standard Pearson (centered) correlation and the uncentered correlation is this: if two vectors X and Y have identical shape but are offset relative to each other by a fixed value, their centered correlation is 1, but their uncentered correlation is not 1.
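The sketch below compares the centered and uncentered correlation on two vectors that differ only by a fixed offset. The function names and the numbers are illustrative. Turning a correlation into a distance as 1 minus the correlation is one common convention and is an assumption here, since the slides do not say which conversion they use.

```python
import math


def pearson(x, y):
    """Centered (standard Pearson) correlation: means are subtracted first."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def uncentered(x, y):
    """Uncentered correlation: same formula, but the means are taken to be 0."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def correlation_distance(r):
    """Assumed convention: distance = 1 - correlation."""
    return 1.0 - r


x = [1.0, 2.0, 3.0, 4.0]
y = [v + 10.0 for v in x]  # same shape as x, shifted by a fixed value

r_centered = pearson(x, y)        # equals 1 up to rounding
r_uncentered = uncentered(x, y)   # strictly less than 1
```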

(11)

The similarity/distance matrices

[Figure: the DATA MATRIX (G genes by N samples) is turned into a G by G GENE SIMILARITY MATRIX]

(12)

The similarity/distance matrices

[Figure: the DATA MATRIX (G genes by N samples) is turned into an N by N SAMPLE SIMILARITY MATRIX]
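A rough sketch of how both similarity matrices can be built from one data matrix. It uses Pearson correlation as the similarity, which is only one of the options listed earlier. The data values and the helper `similarity_matrix` are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Standard (centered) Pearson correlation between two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def similarity_matrix(points):
    """All pairwise similarities between the given points (square, symmetric)."""
    n = len(points)
    return [[pearson(points[a], points[b]) for b in range(n)] for a in range(n)]


# Hypothetical data matrix: G = 3 genes (rows) by N = 4 samples (columns).
data = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]

genes = data                                 # each row is a gene
samples = [list(col) for col in zip(*data)]  # each column is a sample

gene_sim = similarity_matrix(genes)      # G x G
sample_sim = similarity_matrix(samples)  # N x N
```

The only difference between the two matrices is whether the rows or the columns of the data matrix are treated as the points.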

(13)

Gene and Sample Selection

• Do you want all genes included?

• What to do about replicates from the same individual/tumor?

• Genes that contribute noise will affect your results.

• Including all genes: the dendrogram can’t be seen all at once.

• Perhaps screen the genes?
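One possible way to screen genes is to keep only the most variable ones. The slides do not name a screening criterion, so the variance filter, the function `screen_genes`, and the data below are all assumptions made for illustration.

```python
def variance(values):
    """Sample variance (divides by n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def screen_genes(matrix, keep):
    """Return the 0-based row indices of the `keep` most variable genes."""
    ranked = sorted(range(len(matrix)), key=lambda g: variance(matrix[g]), reverse=True)
    return sorted(ranked[:keep])


# Hypothetical data: rows are genes, columns are samples.
data = [
    [5.0, 5.1, 4.9, 5.0],  # nearly flat across samples
    [1.0, 9.0, 2.0, 8.0],  # highly variable
    [3.0, 3.0, 3.1, 2.9],  # nearly flat across samples
    [0.0, 4.0, 8.0, 4.0],  # variable
]

kept = screen_genes(data, keep=2)
screened = [data[g] for g in kept]
```

A filter like this reduces the number of genes that only contribute noise, at the cost of a user-chosen cutoff.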

(14)

Two commonly seen clustering approaches in gene expression data analysis

• Hierarchical clustering

– Dendrogram (red-green picture)

– Allows us to cluster both genes and samples in one picture and see whole dataset “organized”

• K-means/K-medoids

– Partitioning method

– Requires user to define K = # of clusters a priori

– No picture to (over)interpret

(15)

Hierarchical Clustering

• The most overused statistical method in gene expression analysis

• Gives us pretty red-green picture with patterns

• But, pretty picture tends to be pretty unstable.

• Many different ways to perform hierarchical clustering

• Tends to be sensitive to small changes in the data

• Provided with clusters of every size: where to “cut” the dendrogram is user-determined

(16)

Choose clustering direction

• Agglomerative clustering (bottom-up)

– Starts with each gene in its own cluster

– Joins the two most similar clusters

– Then, joins next two most similar clusters

– Continues until all genes are in one cluster

• Divisive clustering (top-down)

– Starts with all genes in one cluster

– Chooses the split so that genes within each of the two clusters are most similar (maximizes “distance” between clusters)

– Finds next split in same manner

– Continues until all genes are in single gene clusters

(17)

Choose linkage method (if bottom-up)

• Single Linkage: join the clusters whose closest genes are the smallest distance apart (tends to give elliptical clusters)

• Complete Linkage: join the clusters whose furthest genes are the smallest distance apart (tends to give spherical clusters)

• Average Linkage: join the clusters whose average distance is the smallest.
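The three linkage rules differ only in how they summarize the pairwise distances between two clusters. Here is a small sketch, assuming Euclidean distance between genes; the function names and the two toy clusters are illustrative.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(cluster_a, cluster_b):
    """All distances between a point in one cluster and a point in the other."""
    return [euclidean(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points."""
    return min(pairwise(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Distance between the furthest pair of points."""
    return max(pairwise(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Average over all between-cluster pairs."""
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)


# Two hypothetical clusters on a line, for easy checking by hand.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[4.0, 0.0], [6.0, 0.0]]

d_single = single_linkage(cluster_a, cluster_b)      # closest pair: 1 and 4
d_complete = complete_linkage(cluster_a, cluster_b)  # furthest pair: 0 and 6
d_average = average_linkage(cluster_a, cluster_b)    # mean of 4, 6, 3, 5
```

At each agglomerative step, the pair of clusters with the smallest value under the chosen rule is joined.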

(18)

Dendrogram Creation + Interpretation

(19)

Dendrogram Creation + Interpretation

(20)

Dendrogram Creation + Interpretation

(21)

Cluster Assignment

(22)

Simulated Data with 4 clusters: 1-10, 11-20, 21-30, 31-40

[Figure panels: 450 relevant genes + 450 “noise” genes; 450 relevant genes]

(23)

Clustering of Microarray Data

1. Clustering of gene expression profiles (rows) => discovery of co-regulated and functionally related genes (or unrelated genes: different clusters)

2. Clustering of samples (columns) => identification of sub-types of related samples

3. Two-way clustering => combined sample clustering with gene clustering to identify which genes are the most important for sample clustering

(24)
(25)
(26)

• Two main types of hierarchical clustering.

• Agglomerative:

– Start with the points as individual clusters

– At each step, merge the closest pair of clusters

– Continue until only one cluster (or k clusters) is left

– This requires defining the notion of cluster proximity.

• Divisive:

– Start with one, all-inclusive cluster

– At each step, split a cluster

– Continue until each cluster contains a single point (or there are k clusters)

– Need to decide which cluster to split at each step.
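The agglomerative procedure can be sketched in a few lines. This is a naive illustration, not an efficient implementation. It assumes Euclidean distance and average linkage as the notion of cluster proximity, and the function names and points are made up for the example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(points, a, b):
    """Average distance between clusters a and b, given as lists of point indices."""
    dists = [euclidean(points[i], points[j]) for i in a for j in b]
    return sum(dists) / len(dists)


def agglomerative(points, k=1):
    """Merge the closest pair of clusters until only k clusters are left.

    Returns (clusters, merges): clusters are lists of point indices, and
    merges records every merge as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical points: two near the origin, three near (5, 5).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
clusters, merges = agglomerative(points, k=2)
```

The `merges` list holds the information a dendrogram displays: which clusters were joined and at what distance. Stopping at `k` clusters corresponds to cutting the dendrogram at a user-chosen height.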

(27)
(28)
(29)
(30)
(31)
(32)
(33)
(34)
(35)
(36)
(37)

K-means and K-medoids

• Partitioning Method

• Don’t get pretty picture

• MUST choose number of clusters K a priori

• More of a “black box” because output is most commonly looked at purely as assignments

• Each object (gene or sample) gets assigned to a cluster

• Begin with initial partition

• Iterate so that objects within clusters are most similar

(38)

K-means (continued)

• Euclidean distance most often used

• Spherical clusters.

• Can be hard to choose or figure out K.

• Not unique solution: clustering can depend on initial partition

• No pretty figure to (over)interpret

(39)

K-means Algorithm

1. Choose K centroids at random

2. Make initial partition of objects into K clusters by assigning objects to closest centroid

3. Calculate the centroid (mean) of each of the K clusters.

4. a. For object i, calculate its distance to each of the centroids.

b. Allocate object i to cluster with closest centroid.

c. If object was reallocated, recalculate centroids based on new clusters.

5. Repeat 4 for object i = 1, …, N.

6. Repeat 4 and 5 until no reallocations occur.

7. Assess cluster structure for fit and stability
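Below is a minimal K-means sketch in plain Python. Note one deliberate simplification: it is the batch variant (often called Lloyd's algorithm), which recomputes the centroids once after all objects have been reassigned, rather than after each single reallocation as in the steps above. The function names, the `seed` and `max_iter` parameters, and the data are assumptions for illustration.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def assign(points, centroids):
    """For every point, the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Choose K centroids at random (here: K distinct data points).
    centroids = [list(p) for p in rng.sample(points, k)]
    # Initial partition: each object goes to its closest centroid.
    labels = assign(points, centroids)
    for _ in range(max_iter):
        # Recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
        # Reallocate every object to the cluster with the closest centroid.
        new_labels = assign(points, centroids)
        if new_labels == labels:  # no reallocations: stop
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels, centroids = kmeans(points, k=2, seed=0)
```

Because the starting centroids are random, a different `seed` can give a different partition on less clearly separated data. The cluster numbers themselves are arbitrary, so only the grouping is meaningful.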

(40)

K-means

• We start with some data

• Interpretation:

– We are showing expression for two samples for 14 genes

– We are showing expression for two genes for 14 samples

• This is with 2 genes.

Iteration = 0

(41)

K-means

• Choose K centroids

• These are starting values that the user picks.

• There are some data driven ways to do it

Iteration = 0

(42)

K-means

• Make first partition by finding the closest centroid for each point

• This is where distance is used

Iteration = 1

(43)

K-means

• Now re-compute the centroids by taking the middle of each cluster

Iteration = 2

(44)

K-means

• Repeat until the centroids stop moving or until you get tired of waiting

Iteration = 3

(45)

K-means Limitations

• Final results depend on starting values

• How do we choose K? There are methods but not much theory saying what is best.

• Where are the pretty pictures?

(46)

Assessing cluster fit and stability

• Most often ignored.

• Cluster structure is treated as reliable and precise

• Can be VERY sensitive to noise and to outliers

• Homogeneity and Separation

• Cluster Silhouettes: how similar a gene is to the genes in its own cluster compared with genes in other clusters (Rousseeuw, Journal of Computational and Applied Mathematics, 1987)

(47)

Silhouettes

• Silhouette of gene i is defined as:

$s(i) = \dfrac{b_i - a_i}{\max(a_i, b_i)}$

• $a_i$ = average distance of gene i to the other genes in the same cluster

• $b_i$ = average distance of gene i to genes in its nearest neighbor cluster
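A short sketch of the silhouette of a single gene, assuming Euclidean distance. The function name, the toy points, and the cluster labels are illustrative. Returning 0 for a gene that is alone in its cluster is a convention assumed here, as is the guard against dividing by zero.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette(points, labels, i):
    """Silhouette s(i) = (b_i - a_i) / max(a_i, b_i) for the point with index i."""
    own = labels[i]
    same = [
        euclidean(points[i], points[j])
        for j in range(len(points))
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # assumed convention for a singleton cluster
    a = sum(same) / len(same)  # average distance within own cluster
    # b: smallest average distance to the points of any other cluster.
    b = min(
        sum(d) / len(d)
        for d in (
            [euclidean(points[i], points[j]) for j in range(len(points)) if labels[j] == c]
            for c in set(labels)
            if c != own
        )
    )
    top = max(a, b)
    return 0.0 if top == 0 else (b - a) / top


# Hypothetical points: two tight pairs that are far apart.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]

s_good = silhouette(points, [0, 0, 1, 1], 0)  # gene 0 sits with its neighbor
s_bad = silhouette(points, [0, 0, 1, 0], 3)   # gene 3 is put in the far cluster
```

Values near 1 indicate a gene that fits its cluster well, values near 0 a gene between two clusters, and negative values a gene that is closer to a neighboring cluster than to its own.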
