Clustering and Classification In Gene
Expression Data
Data from Garber et al., PNAS 98, 2001.
• Clustering is an exploratory tool to see who's running with whom: genes and samples.
• “Unsupervised”
• NOT for classification of samples.
• NOT for identification of differentially expressed genes.
Clustering
• Clustering organizes things that are close into groups.
• What does it mean for two genes to be close?
• What does it mean for two samples to be close?
• Once we know this, how do we define groups?
• Hierarchical and K-Means Clustering
Distance
• We need a mathematical definition of distance between two points
• What are points?
• If each gene is a point, what is the
mathematical definition of a point?
Points
• Gene1 = (E11, E12, …, E1N)′
• Gene2 = (E21, E22, …, E2N)′
• Sample1 = (E11, E21, …, EG1)′
• Sample2 = (E12, E22, …, EG2)′
• Egi = expression of gene g in sample i
[Figure: DATA MATRIX with G genes (rows 1…G) and N samples (columns 1…N)]
Most Famous Distance
• Euclidean distance
– Example: distance between gene 1 and gene 2:
– Sqrt of Sum of (E1i − E2i)², i = 1, …, N
• When N is 2, this is distance as we know it:
[Figure: map showing the Euclidean distance between Baltimore and DC]
Distance
When N is 20,000 you have to think abstractly
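The Euclidean formula can be sketched in a few lines of Python; the expression values below are invented for illustration, and the same one-liner works whether N is 5 or 20,000:

```python
import numpy as np

# Hypothetical expression values for two genes across N = 5 samples
# (the numbers are invented for illustration).
gene1 = np.array([2.1, 0.5, 3.3, 1.0, 2.8])
gene2 = np.array([1.9, 0.7, 3.0, 1.4, 2.5])

# Euclidean distance: sqrt of the sum of squared per-sample differences.
dist = np.sqrt(np.sum((gene1 - gene2) ** 2))
```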
Correlation can also be used to compute distance
• Pearson Correlation
• Spearman Correlation
• Uncentered Correlation
• Absolute Value of Correlation
The difference: if two vectors X and Y have identical shape but are offset from each other by a fixed value, their standard (centered) Pearson correlation is 1, yet their uncentered correlation is less than 1.
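A minimal sketch of the centered-vs-uncentered distinction, using made-up vectors that share a shape but differ by a constant offset:

```python
import numpy as np

def pearson(a, b):
    # Centered (Pearson) correlation: subtract each vector's mean first.
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def uncentered(a, b):
    # Same formula without the mean-subtraction step.
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Two made-up profiles with identical shape, offset by a constant.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = x + 10.0

r_centered = pearson(x, y)       # exactly 1: the shapes match
r_uncentered = uncentered(x, y)  # below 1: the offset is penalized
```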
The similarity/distance matrices
[Figure: the G × N DATA MATRIX alongside the G × G GENE SIMILARITY MATRIX]
The similarity/distance matrices
[Figure: the G × N DATA MATRIX alongside the N × N SAMPLE SIMILARITY MATRIX]
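Both matrices can be built from the data matrix with SciPy's `pdist`; here is a sketch on a small random matrix (6 hypothetical genes by 4 samples; the metric choices are illustrative, and SciPy is assumed to be available):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix: 6 hypothetical genes (rows) x 4 samples (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

# Gene similarity matrix: pairwise distances between rows -> G x G.
gene_dist = squareform(pdist(X, metric="euclidean"))

# Sample similarity matrix: pairwise distances between columns -> N x N.
sample_dist = squareform(pdist(X.T, metric="correlation"))
```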
Gene and Sample Selection
• Do you want all genes included?
• What to do about replicates from the same individual/tumor?
• Genes that contribute noise will affect your results.
• Including all genes: the full dendrogram can’t be seen at once.
• Perhaps screen the genes?
Two commonly seen clustering approaches in gene expression data analysis
• Hierarchical clustering
– Dendrogram (red-green picture)
– Allows us to cluster both genes and samples in one picture and see the whole dataset “organized”
• K-means/K-medoids – Partitioning method
– Requires user to define K = # of clusters a priori
– No picture to (over)interpret
Hierarchical Clustering
• The most overused statistical method in gene expression analysis
• Gives us pretty red-green picture with patterns
• But, pretty picture tends to be pretty unstable.
• Many different ways to perform hierarchical clustering
• Tend to be sensitive to small changes in the data
• Provides clusters of every size: where to “cut” the dendrogram is user-determined
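With SciPy, one common sketch of this workflow is to build the tree with `linkage` and then cut it at a user-chosen number of clusters with `fcluster` (the toy data with two obvious groups is invented; SciPy is assumed to be available):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two clearly separated groups of 5 genes across 4 samples.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (5, 4)),
               rng.normal(5.0, 0.3, (5, 4))])

# Build the bottom-up tree; it contains clusterings of every size.
Z = linkage(X, method="average", metric="euclidean")

# The "cut" is the user's choice: here we ask for exactly 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```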
Choose clustering direction
• Agglomerative clustering (bottom-up)
– Starts with each gene in its own cluster
– Joins the two most similar clusters
– Then joins the next two most similar clusters
– Continues until all genes are in one cluster
• Divisive clustering (top-down)
– Starts with all genes in one cluster
– Chooses each split so that genes within each resulting cluster are most similar (maximizing the “distance” between the two clusters)
– Finds the next split in the same manner
– Continues until every gene is in its own cluster
Choose linkage method (if bottom-up)
• Single Linkage: join the two clusters with the smallest distance between their closest genes (tends toward elongated, elliptical clusters)
• Complete Linkage: join the two clusters with the smallest distance between their furthest genes (tends toward compact, spherical clusters)
• Average Linkage: join the two clusters with the smallest average gene-to-gene distance
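The three linkage rules differ only in how they turn gene-to-gene distances into a cluster-to-cluster distance; a small worked sketch with two made-up clusters:

```python
import numpy as np

# Two made-up clusters of gene profiles (2-dimensional for simplicity).
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise gene-to-gene distances between the two clusters.
D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single = D.min()    # single linkage: closest pair (1,0)-(4,0) -> 3.0
complete = D.max()  # complete linkage: furthest pair (0,0)-(6,0) -> 6.0
average = D.mean()  # average linkage: mean of 4, 6, 3, 5 -> 4.5
```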
Dendrogram Creation + Interpretation
Cluster Assignment
[Figure: dendrograms for simulated data with 4 clusters (1-10, 11-20, 21-30, 31-40), comparing 450 relevant genes + 450 “noise” genes against the 450 relevant genes alone]
Clustering of Microarray Data
1. Clustering of gene expression profiles (rows) => discovery of co-regulated and functionally related genes (or unrelated genes: different clusters)
2. Clustering of samples (columns) => identification of sub-types of related samples
3. Two-way clustering => combined sample clustering with gene clustering to identify which genes are the most important for sample clustering
• Two main types of hierarchical clustering:
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters
• Continue until only one cluster (or k clusters) remains
• This requires defining the notion of cluster proximity
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster
• Continue until each cluster contains a single point (or there are k clusters)
• Need to decide which cluster to split at each step
K-means and K-medoids
• Partitioning Method
• Don’t get pretty picture
• MUST choose number of clusters K a priori
• More of a “black box” because output is most commonly looked at purely as assignments
• Each object (gene or sample) gets assigned to a cluster
• Begin with initial partition
• Iterate so that objects within clusters are most similar
K-means (continued)
• Euclidean distance most often used
• Spherical clusters.
• Can be hard to choose or figure out K.
• No unique solution: the clustering can depend on the initial partition
• No pretty figure to (over)interpret
K-means Algorithm
1. Choose K centroids at random
2. Make an initial partition of the objects into K clusters by assigning each object to its closest centroid
3. Calculate the centroid (mean) of each of the K clusters
4. a. For object i, calculate its distance to each of the centroids
   b. Allocate object i to the cluster with the closest centroid
   c. If object i was reallocated, recalculate the centroids based on the new clusters
5. Repeat step 4 for objects i = 1, …, N
6. Repeat steps 4 and 5 until no reallocations occur
7. Assess the cluster structure for fit and stability
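The steps above can be sketched as a minimal, illustrative K-means in NumPy (not a production implementation; the two "blobs" of toy data are invented):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K centroids at random (here: K distinct data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Steps 2 and 4a-b: assign every object to its closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Steps 3 and 4c: recompute each centroid as the mean of its cluster
        # (keeping the old centroid if a cluster happens to be empty).
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        # Steps 5-6: stop once no centroid moves (no reallocations occur).
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy data: two well-separated blobs of 10 "genes" in 2 dimensions.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(8, 0.5, (10, 2))])
labels, centroids = kmeans(X, k=2)
```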
K-means
• We start with some data
• Interpretation:
– We are showing expression for two samples across 14 genes
– Or: expression for two genes across 14 samples
• This is with 2 genes. Iteration = 0
K-means
• Choose K centroids
• These are starting values that the user picks.
• There are some data driven ways to do it
Iteration = 0
K-means
• Make the first partition by finding the closest centroid for each point
• This is where distance is used
Iteration = 1
K-means
• Now re-compute the centroids by taking the middle of each cluster
Iteration = 2
K-means
• Repeat until the centroids stop moving (or until you get tired of waiting)
Iteration = 3
K-means Limitations
• Final results depend on starting values
• How do we choose K? There are methods but not much theory saying what is best.
• Where are the pretty pictures?
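Because the result depends on the starting values, a common workaround is to run several random starts and keep the partition with the smallest within-cluster sum of squares (this is what scikit-learn's `n_init` parameter automates); a NumPy-only sketch on made-up data:

```python
import numpy as np

def wss(X, labels, k):
    # Within-cluster sum of squares: lower means a tighter partition.
    return sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
               for j in range(k) if (labels == j).any())

def kmeans_run(X, k, rng, n_iter=50):
    # One K-means run from a random set of initial centroids.
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                              else centroids[j] for j in range(k)])
    return labels

rng = np.random.default_rng(0)
# Three made-up clusters of 8 points each.
X = np.vstack([rng.normal(c, 0.4, (8, 2)) for c in (0.0, 5.0, 10.0)])

# Several random starts; keep the partition with the smallest WSS.
runs = [kmeans_run(X, 3, rng) for _ in range(10)]
best = min(runs, key=lambda lab: wss(X, lab, 3))
```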
Assessing cluster fit and stability
• Most often ignored.
• Cluster structure is treated as reliable and precise
• Can be VERY sensitive to noise and to outliers
• Homogeneity and Separation
• Cluster silhouettes: how similar each gene is to genes in its own cluster, compared to genes in other clusters (Rousseeuw, Journal of Computational and Applied Mathematics, 1987)
Silhouettes
• The silhouette of gene i is defined as:
• ai = average distance of gene i to the other genes in its own cluster
• bi = average distance of gene i to the genes in its nearest neighbor cluster
• si = (bi − ai) / max(ai, bi)
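The silhouette definition can be sketched directly in Python (the toy data is invented, and singleton clusters are not handled in this sketch):

```python
import numpy as np

def silhouette(X, labels):
    # s_i = (b_i - a_i) / max(a_i, b_i); singleton clusters are not handled.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    s = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        a = D[i, same].mean()  # avg distance to genes in the same cluster
        b = min(D[i, labels == c].mean()  # nearest neighbor cluster
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy example: two tight, well-separated clusters -> silhouettes near 1.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
s = silhouette(X, labels)
```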