An algorithm to discover spatial-temporal distributions of physical seawater characteristics and a case study in Turkish seas


DOI 10.1007/s00773-005-0213-2

Journal of Marine Science and Technology


Derya Birant and Alp Kut

Department of Computer Engineering, Dokuz Eylul University, 35100, Izmir, Turkey

Abstract Clustering is one of the major data mining methods to obtain a number of clues about how the physical properties of the water are distributed in a marine environment. It is a difficult problem, especially when we consider the task for spatial-temporal marine data. This study introduces a new clustering algorithm to discover regions that have similar physical seawater characteristics. In contrast to the existing density-based clustering algorithms, our algorithm has the ability to discover clusters according to the nonspatial, spatial, and temporal values of the objects. Our algorithm also overcomes three drawbacks of existing clustering algorithms: problems in the identification of core objects, noise objects, and adjacent clusters. This paper also presents a spatial-temporal marine data warehouse system designed for storing and clustering physical data from Turkish seas. Special functions were developed for data integration, data conversion, querying, visualization, analysis, and management. User-friendly interfaces were also developed, allowing relatively inexperienced users to operate the system. As a case study, we show the spatial-temporal distributions of sea surface temperature, sea surface height residual, and significant wave height values in Turkish seas to demonstrate our algorithm.

Key words Cluster analysis • Spatial-temporal data • Clustering algorithms • Sea surface temperature • Sea surface height residual • Significant wave height

1 Introduction

Clustering is the process of grouping large data sets according to their similarity. It is an important data mining technique used for data segmentation, discretization of continuous attributes, data reduction, outlier detection, noise filtering, pattern recognition, and image processing. In the field of knowledge discovery in databases (KDD), cluster analysis is known as an unsupervised learning process because there is no a priori knowledge about the data set. In this study, we focus on cluster (pattern) analysis on physical data such as sea surface temperature, sea surface height residual, and significant wave height.

Address correspondence to: D. Birant (derya@cs.deu.edu.tr)

Received: September 20, 2005 / Accepted: December 21, 2005

Some of the existing clustering algorithms, such as K-Means¹ and K-Medoid² (partitional clustering algorithms), CURE³ and BIRCH⁴ (hierarchical clustering algorithms), and COBWEB⁵ (a model-based clustering algorithm), focus on discovering clusters from ordinary data (nonspatial and nontemporal data). Some of the existing spatial clustering algorithms, such as DBSCAN⁶,⁷ (a density-based clustering algorithm) and WaveCluster⁸ (a grid-based clustering algorithm), focus on discovering clusters from spatial data. However, seawater data involve both spatial and temporal dimensions. There are currently no density-based clustering algorithms for discovering clusters in spatial-temporal data. The cluster discovery process for spatial-temporal data is more complex than for nonspatial and nontemporal data because spatial-temporal clustering algorithms have to consider the spatial and temporal neighbors of objects to extract useful knowledge. Cluster discovery from spatial-temporal data is a very promising subfield of data mining because increasingly large volumes of spatial-temporal data are collected and need to be analyzed. Clustering algorithms designed for spatial-temporal data can be used in many applications such as geographic information systems, medical imaging, and weather forecasting. In marine science, clustering is useful for understanding seawater characteristics.


algorithm can cluster spatial-temporal data according to its nonspatial, spatial, and temporal attributes. Second, DBSCAN cannot detect some noise points when clusters of different densities exist. Our algorithm solves this problem by assigning to each cluster a density factor. Third, the values of border objects in a cluster may be very different from the values of border objects on the opposite side of the cluster if the nonspatial values of neighbor objects differ only slightly and the clusters are adjacent to each other. Our algorithm solves this problem by comparing the average value of a cluster with new incoming values. We chose the DBSCAN algorithm because it has the ability to discover clusters of arbitrary shape, such as linear, concave, and oval. Furthermore, in contrast to some clustering algorithms, it does not require the predetermination of the number of clusters. DBSCAN has proven its ability to process very large databases.

In addition to the new clustering algorithm, this article also presents a spatial-temporal data warehouse system designed for storing and clustering physical data from Turkish seas. The sea surface temperature, the sea surface height residual, the significant wave height, and the wind speed values of four seas (the Black Sea, the Marmara Sea, the Aegean Sea, and the eastern part of the Mediterranean) were collected from different satellites to discover regions that have similar seawater characteristics. Special functions were developed for data integration, data conversion, querying, visualization, analysis, and management. User-friendly interfaces were also developed, allowing relatively inexperienced users to operate the system.

As in all databases, fast access to the data in spatial databases depends on the availability of suitable indexing methods such as quadtrees¹⁰ and R-trees.¹¹ In this study, the R-tree spatial index structure was used to speed up the processing of queries. In addition to the spatial index structure, some filters should also be used to reduce the search space for spatial data mining algorithms. These filters allow operations on neighborhood paths by reducing the number of paths actually created.

The rest of the article is organized as follows. Section 2 gives the preliminaries and basic concepts of density-based clustering algorithms. Section 3 describes the drawbacks of existing density-based clustering algorithms and our efforts to overcome these problems. Section 4 explains our algorithm in detail. Section 5 presents three applications that are implemented to show the spatial-temporal distributions of physical parameter values in Turkish seas and discusses the cluster analysis results. Finally, a conclusion and some directions for future work are given in Section 6.

2 Preliminaries and basic concepts

2.1 Density-based clustering

The problem of clustering can be defined as follows:

Definition 1: Given a database of n data objects $D = \{o_1, o_2, \ldots, o_n\}$, the process of partitioning D into $C = \{C_1, C_2, \ldots, C_k\}$ based on a certain similarity measure is called clustering; the $C_i$ are called clusters, where $C_i \subseteq D$ $(i = 1, 2, \ldots, k)$, $\bigcap_{i=1}^{k} C_i = \emptyset$ and $\bigcup_{i=1}^{k} C_i = D$.

The density-based notion is a common approach to clustering. Density-based clustering algorithms are based on the idea that objects which form a dense region should be grouped together into one cluster. They use a fixed threshold value to determine dense regions. They search for regions of high density that are separated by regions of lower density in a feature space.

Density-based clustering algorithms such as DBSCAN,⁶ OPTICS,¹² DENCLUE,¹³ WaveCluster,⁸ and CURD¹⁴ are to some extent capable of clustering databases.¹⁵,¹⁶ However, they were developed to discover clusters in ordinary data (nonspatial and nontemporal data) or spatial data, not to discover clusters in spatial-temporal data. The clustering task for spatial-temporal data requires some extensions: it has to consider the spatial and temporal neighbors of objects to find clusters properly. Another drawback of existing density-based clustering algorithms is that they capture only certain kinds of noise points when clusters of different densities exist. The detailed description of these problems and our solutions are given in Sect. 3.

2.2 Basic concepts

DBSCAN was designed to discover arbitrary-shaped clusters in any database D, and at the same time it can distinguish noise points. More specifically, DBSCAN accepts a radius value Eps (ε) based on a user-defined distance measure and a value MinPts for the minimum number of points that should occur within the Eps radius. Some concepts and terms to explain the DBSCAN algorithm are defined as follows:

Definition 2 (neighborhood): Neighborhood is determined by a distance function (e.g., the Manhattan distance or Euclidean distance) for two points p and q, denoted by dist(p, q).

Definition 3 (Eps neighborhood): The Eps neighborhood of a point p is defined as $\{q \in D \mid dist(p, q) \le Eps\}$.

Definition 4 (core object): A core object refers to a


Fig. 1. Basic concepts and terms: a p density reachable from q, b p and q density connected to each other by o, and c border object, core object, and noise. Eps, a radius value; MinPts, the minimum number of points (in the figure, Eps = 1 cm and MinPts = 5)

Definition 5 (directly density reachable): An object p is directly density reachable from the object q if p is within the Eps neighborhood of q, and q is a core object.

Definition 6 (density reachable): An object p is density reachable from the object q with respect to Eps and MinPts if there is a chain of objects $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density reachable from $p_i$ with respect to Eps and MinPts, for $1 \le i < n$, $p_i \in D$ (Fig. 1a).

Definition 7 (density connected): An object p is density connected to object q with respect to Eps and MinPts if there is an object $o \in D$ such that both p and q are density reachable from o with respect to Eps and MinPts (Fig. 1b).

Definition 8 (density-based cluster): A cluster C is a nonempty subset of D satisfying the following "maximality" and "connectivity" requirements:

1. $\forall p, q$: if $q \in C$ and p is density reachable from q with respect to Eps and MinPts, then $p \in C$.

2. $\forall p, q \in C$: p is density connected to q with respect to Eps and MinPts.

Definition 9 (border object): An object p is a border object if it is not a core object but is density reachable from another core object.

The algorithm starts with the first point p in database D and retrieves all neighbors of point p within distance Eps. If the total number of these neighbors is greater than MinPts, i.e., if p is a core object, a new cluster is created. The point p and its neighbors are assigned to this new cluster. Then, the algorithm iteratively collects the neighbors within distance Eps from the core points. The process is repeated until all of the points have been processed.
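The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names `region_query` and `dbscan` are hypothetical, the neighborhood is found by a brute-force scan rather than an R-tree, and a point is treated as a core object when its Eps neighborhood (counting the point itself) holds at least MinPts points, which is an assumption about how the threshold is counted.

```python
import math

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (1, 2, ...) or NOISE."""
    labels = [None] * len(points)              # None = not yet visited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE                  # may later turn out to be a border point
            continue
        cluster_id += 1                        # points[i] is a core object: new cluster
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id         # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:    # j is also a core object: keep expanding
                seeds.extend(j_neighbors)
    return labels
```

For example, calling `dbscan` on two well-separated groups of four points plus one isolated point, with `eps=1.5` and `min_pts=3`, assigns the groups to clusters 1 and 2 and marks the isolated point as noise.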

In the literature, the DBSCAN algorithm has been used in many studies. For example, the other popular density-based algorithm OPTICS (ordering points to identify the clustering structure)¹² is based on the concepts of DBSCAN and identifies nested clusters and the structure of clusters. An incremental version of DBSCAN is also based on the clustering algorithm DBSCAN and is used for incremental updates of a clustering after insertion of a new object to the database and deletion of an existing object from the database. Based on the formal notion of clusters, the incremental algorithm yields the same results as the nonincremental DBSCAN algorithm. The SDBDC (scalable density-based distributed clustering) method also uses DBSCAN on both local sites and global sites to cluster distributed objects. In this method, DBSCAN is first carried out on each local site. Then, based on these local clustering results, cluster representatives are determined, and based on these local representatives, the standard DBSCAN algorithm is carried out on the global site to construct the distributed clustering. This study proposes the usage of different Eps values for each local representative. Wen et al. adopted DBSCAN and incremental DBSCAN as the core algorithms of their query clustering tool. They used DBSCAN to cluster frequently asked questions and the most popular topics on a search engine. Finally, Spieth et al. used DBSCAN to identify solutions for the inference of regulatory networks.

3 Problems of existing approaches

3.1 Problem of clustering spatial-temporal data

To determine whether a set of points is similar enough to be considered a cluster, we need a distance measure dist(i, j) that tells us how far apart points i and j are. The most common distance measures used are the Manhattan distance, Euclidean distance, and Minkowski distance. The Euclidean distance is defined as:

$$dist(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2} \qquad (1)$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects. For example, the Euclidean distance between the two data objects A(1, 2) and B(5, 3) is $\sqrt{4^2 + 1^2} \approx 4.12$.


Fig. 2. Example data set containing clusters with different densities

$B(x_2, y_2)$ are two points (spatial values), and $t_1$, $t_2$ (DayTimeTemperature, NightTimeTemperature) and $t_3$, $t_4$ are four temperature values of these points, respectively (nonspatial values). In this example, Eps1 is used to measure the closeness of two points geographically, whereas Eps2 is used to measure the similarity of temperature values. If $A(x_1, y_1, t_1, t_2)$ and $B(x_2, y_2, t_3, t_4)$ are two points, Eps1 and Eps2 are calculated as:

$$Eps1 = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$

$$Eps2 = \sqrt{(t_1 - t_3)^2 + (t_2 - t_4)^2} \qquad (2)$$
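As a small illustration of the two-distance idea, the following Python sketch computes the spatial and nonspatial Euclidean distances separately and treats two stations as neighbors only when both fall within their thresholds. The dictionary keys and the function names are hypothetical, chosen for this example only.

```python
import math

def spatial_distance(a, b):
    """Euclidean distance between the geographic coordinates of two points."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])

def nonspatial_distance(a, b):
    """Euclidean distance between the (daytime, nighttime) temperature pairs."""
    return math.hypot(a["day_temp"] - b["day_temp"],
                      a["night_temp"] - b["night_temp"])

def is_neighbor(a, b, eps1, eps2):
    """True when b is close to a both geographically and in temperature."""
    return spatial_distance(a, b) <= eps1 and nonspatial_distance(a, b) <= eps2

A = {"x": 1.0, "y": 2.0, "day_temp": 20.0, "night_temp": 15.0}
B = {"x": 5.0, "y": 3.0, "day_temp": 20.3, "night_temp": 15.4}
print(round(spatial_distance(A, B), 2), round(nonspatial_distance(A, B), 2))
```

With these sample values the spatial distance is about 4.12 and the temperature distance is about 0.5, so A and B are neighbors for Eps1 = 5 and Eps2 = 1 but not for Eps1 = 3.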

3.2 Problem of identifying noise objects

From the view of a clustering algorithm, noise is a set of objects not located in clusters of a database. More formally, noise can be defined as follows:

Definition 10 (noise): Let $C_1, \ldots, C_k$ be the clusters of database D. Then the noise is the set of points in the database D not belonging to any cluster $C_i$, where $i = 1, \ldots, k$, i.e., $noise = \{p \in D \mid \forall i : p \notin C_i\}$.

Existing density-based clustering algorithms produce meaningful and adequate results under certain conditions, but their results are not satisfactory when clusters of different densities exist. To illustrate this point, consider the example given in Fig. 2. This is a simple dataset containing 52 objects. There are 25 objects in the first cluster $C_1$, 25 objects in the second cluster $C_2$, and 2 additional noise objects $o_1$ and $o_2$. In this example, $C_2$ forms a denser cluster than $C_1$; in other words, the densities of the clusters are different. The DBSCAN algorithm identifies only one noise object, $o_1$, because, approximately, for every object p in $C_1$, the distance between the object p and its nearest neighbor is greater than the distance between $o_2$ and $C_2$. For this reason, we cannot determine an appropriate value for the input parameter Eps. If the Eps value is less than the distance between $o_2$ and $C_2$, some objects in $C_1$ are assigned as noise objects. If the Eps value is greater than the distance between $o_2$ and $C_2$, the object $o_2$ is not assigned as a noise object.

The example in Fig. 2 shows that the DBSCAN algorithm is not satisfactory when clusters of different densities exist. To overcome this problem, we propose a new concept: density factor. We assign to each cluster a density factor, which is the degree of the density of the cluster. We begin with the notion of density distance.

Definition 11 (density distance): Let density_distance_max of an object p denote the maximum distance between the object p and its neighbor objects within the radius Eps. Similarly, let density_distance_min of an object p denote the minimum distance between the object p and its neighbor objects within the radius Eps.

(i) $\mathit{density\_distance\_max}(p) = \max\{dist(p, q) \mid q \in D \wedge dist(p, q) \le Eps\}$

(ii) $\mathit{density\_distance\_min}(p) = \min\{dist(p, q) \mid q \in D \wedge dist(p, q) \le Eps\}$

The density distance of an object p is defined as density_distance_max(p) / density_distance_min(p). We define the density factor of a cluster as follows.

Definition 12 (density factor): The density factor of a cluster C is defined as:

$$\mathit{density\_factor}(C) = 1 \Big/ \left[ \frac{\sum_{p \in C} \mathit{density\_distance}(p)}{|C|} \right] \qquad (3)$$

The density factor of a cluster C captures the degree of the density of the cluster. If C is a "loose" cluster, density_distance_min would increase and so the density distance in Defn. 11 would be quite small, thus forcing the density factor of C to be quite close to 1. Otherwise, if C is a "tight" cluster, density_distance_min would decrease and so the density distance in Defn. 11 would be quite big, thus forcing the density factor of C to be quite close to 0.
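The following Python sketch shows one way to compute these quantities. It reads the density distance as the ratio of the largest to the smallest neighbor distance and the density factor as the reciprocal of the mean density distance over the cluster. The function names are hypothetical, the point itself is excluded from its own neighborhood so that the minimum distance is not zero (an assumption), and neighbors are found by a brute-force scan.

```python
import math

def density_distance(points, idx, eps):
    """Ratio of the largest to the smallest distance between points[idx]
    and its neighbors within radius eps (the point itself is excluded)."""
    p = points[idx]
    dists = [math.dist(p, q) for j, q in enumerate(points)
             if j != idx and math.dist(p, q) <= eps]
    if not dists or min(dists) == 0:
        raise ValueError("density distance is undefined for this object")
    return max(dists) / min(dists)

def density_factor(points, cluster, eps):
    """Reciprocal of the mean density distance over the objects of a cluster,
    where cluster is a list of indices into points."""
    mean = sum(density_distance(points, i, eps) for i in cluster) / len(cluster)
    return 1.0 / mean

evenly_spaced = [(0, 0), (1, 0), (2, 0)]
uneven = [(0, 0), (1, 0), (4, 0)]
print(density_factor(evenly_spaced, [0, 1, 2], eps=1))
print(density_factor(uneven, [0, 1, 2], eps=4))
```

In this toy data, the evenly spaced points give a density factor of 1, while the unevenly spaced points give a smaller value because the largest and smallest neighbor distances differ.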

3.3 Problem of identifying adjacent clusters


Algorithm ST_DBSCAN (D, Eps1, Eps2, MinPts, Δε)

// Inputs:
// D = {o1, o2, ..., on} : Set of objects
// Eps1 : Maximum geographical coordinate (spatial) distance value.
// Eps2 : Maximum non-spatial distance value.
// MinPts : Minimum number of points within Eps1 and Eps2 distance.
// Δε : Threshold value to be included in a cluster.
// Output:
// C = {C1, C2, ..., Ck} : Set of clusters

Cluster_Label = 0
For i = 1 to n                                          // (i)
    If oi is not in a cluster Then                      // (ii)
        X = Retrieve_Neighbors(oi, Eps1, Eps2)          // (iii)
        If |X| < MinPts Then
            Mark oi as noise                            // (iv)
        Else // construct a new cluster                 // (v)
            Cluster_Label = Cluster_Label + 1
            For j = 1 to |X|
                Mark all objects in X with current Cluster_Label
            End For
            Push(all objects in X)                      // (vi)
            While not IsEmpty()
                CurrentObj = Pop()
                Y = Retrieve_Neighbors(CurrentObj, Eps1, Eps2)
                If |Y| >= MinPts Then
                    For All objects o in Y              // (vii)
                        If (o is not marked as noise or it is not in a cluster) and
                           |Cluster_Avg() - o.Value| <= Δε Then
                            Mark o with current Cluster_Label
                            Push(o)
                        End If
                    End For
                End If
            End While
        End If
    End If
End For
End Algorithm

Fig. 3. Example data set containing adjacent clusters

However, cluster objects should be within a certain distance from the cluster means. We solve this problem by comparing the average value of a cluster with the new incoming value. If the absolute difference between Cluster_Avg() and Object_Value is bigger than the threshold value, Δε, then the new object is not appended to the cluster. Cluster_Avg() refers to the average or mean value of the objects contained in the cluster and Object_Value refers to the nonspatial value of the object such as the temperature value of a location.

4 ST-DBSCAN algorithm

Because of the extensions described above, whereas the DBSCAN algorithm needs two inputs, the new ST-DBSCAN algorithm requires four parameters: Eps1, Eps2, MinPts, and Δε. Eps1 is the distance parameter for spatial attributes (latitude and longitude) and Eps2 is the distance parameter for nonspatial attributes. A distance metric such as the Euclidean, Manhattan, or Minkowski distance metric can be used for Eps1 and Eps2. MinPts is the minimum number of points within the Eps1 and Eps2 distances of a point. If a region is dense, then it should contain more points than MinPts. In Ester et al.,⁶ a simple heuristic is presented that is effective in many cases to determine the parameters Eps and MinPts. The heuristic suggests MinPts ≈ ln(n), where n is the size of the database, and Eps must be picked depending on the value of MinPts. The first step of the heuristic method is to determine the distances to the k nearest neighbors for each object, where k is equal to MinPts, and then these k-distance values should be sorted in descending order. Next, we should determine the threshold point, which is the first "valley" of the sorted graph. We should select Eps to be less than the distance defined by the first valley. The last parameter, Δε, is used to prevent the discovery of combined clusters because of the small differences in nonspatial values of the neighboring locations.
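A minimal Python sketch of the first steps of this heuristic is given below. It assumes that the k-distance of an object is the distance to its k-th nearest neighbor, as in the usual sorted k-dist graph; the function names are hypothetical, and locating the first "valley" is left to visual inspection of the sorted values.

```python
import math

def suggest_min_pts(n):
    """MinPts suggested by the heuristic, rounded to the nearest integer."""
    return max(1, round(math.log(n)))

def sorted_k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, in descending order."""
    k_dists = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_dists.append(dists[k - 1])
    return sorted(k_dists, reverse=True)

points = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(sorted_k_distances(points, 1))
print(suggest_min_pts(5340))
```

In the small example, the isolated point at (10, 0) produces the single large value at the start of the sorted list, and the flat run that follows corresponds to the dense group; Eps would be chosen below the level where the curve first drops.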

The algorithm starts with the first point p in database D and retrieves all points density reachable from p with respect to Eps1 and Eps2. If p is a core object (see Defn. 4), a cluster is formed. If p is a border object (see Defn. 9), no points are density reachable from p and the algorithm visits the next point of the database. The process is repeated until all the points have been processed.

Fig. 4. ST-DBSCAN algorithm

As shown in Fig. 4, the algorithm starts with the first point in database D (i). If the selected object does not belong to any cluster (ii), the Retrieve_Neighbors function is called (iii). A call of Retrieve_Neighbors(object, Eps1, Eps2) returns the objects that have a distance less than the Eps1 and Eps2 parameters to the selected object. In other words, the Retrieve_Neighbors function retrieves all objects density reachable (see Defn. 6) from the selected object with respect to Eps1, Eps2, and MinPts. The result set forms the Eps neighborhood (see Defn. 3) of the selected object. Retrieve_Neighbors(object, Eps1, Eps2) is equal to the intersection of Retrieve_Neighbors(object, Eps1) and Retrieve_Neighbors(object, Eps2). If the total number of returned points in the Eps neighborhood is smaller than the MinPts input, the object is assigned as noise (iv). This means that the selected point does not have enough neighbors to be clustered. The points marked as noise may be changed later if they are not directly density reachable (see Defn. 5) but they are density reachable (see Defn. 6) from some other point of the database. This happens for border points of a cluster.


a new cluster is constructed (v). Then all directly density-reachable neighbors of this core object are also marked with the new cluster label. Next, the algorithm iteratively collects density-reachable objects from this core object by using a stack (vi). The stack is necessary to find reachable objects from directly density-reachable objects. If the object is not marked as noise or it is not in a cluster, and the difference between the average value of the cluster and the new incoming value is smaller than Δε, it is placed in the current cluster (vii). After processing the selected point, the algorithm selects the next point in D and continues iteratively until all the points have been processed.
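The steps (i) to (vii) can be sketched in Python as follows. This is an illustrative reading of Fig. 4 under stated assumptions, not the authors' implementation: the `Obj` class and function names are hypothetical, neighbors are found by a brute-force scan instead of the R-tree, the neighborhood includes the object itself, a single nonspatial value is compared using an absolute difference, temporal neighbors are omitted, and an object joins a cluster only if it is not already in one (objects earlier marked as noise may still be absorbed).

```python
import math
from dataclasses import dataclass

NOISE = -1
UNCLASSIFIED = 0

@dataclass
class Obj:
    x: float
    y: float
    value: float                 # nonspatial value, e.g. a temperature
    label: int = UNCLASSIFIED

def retrieve_neighbors(objs, i, eps1, eps2):
    """Indices of objects within eps1 spatially and eps2 in value of objs[i]."""
    o = objs[i]
    return [j for j, q in enumerate(objs)
            if math.hypot(o.x - q.x, o.y - q.y) <= eps1
            and abs(o.value - q.value) <= eps2]

def st_dbscan(objs, eps1, eps2, min_pts, delta_eps):
    cluster_label = 0
    for i in range(len(objs)):                          # (i)
        if objs[i].label > 0:                           # (ii) already in a cluster
            continue
        x = retrieve_neighbors(objs, i, eps1, eps2)     # (iii)
        if len(x) < min_pts:
            objs[i].label = NOISE                       # (iv)
            continue
        cluster_label += 1                              # (v) construct a new cluster
        members = [j for j in x if objs[j].label <= 0]
        for j in members:
            objs[j].label = cluster_label
        total = sum(objs[j].value for j in members)     # running sum for Cluster_Avg()
        count = len(members)
        stack = list(members)                           # (vi)
        while stack:
            current = stack.pop()
            y = retrieve_neighbors(objs, current, eps1, eps2)
            if len(y) < min_pts:
                continue                                # border object: do not expand
            for j in y:                                 # (vii)
                o = objs[j]
                if o.label <= 0 and abs(total / count - o.value) <= delta_eps:
                    o.label = cluster_label
                    total += o.value
                    count += 1
                    stack.append(j)
    return [o.label for o in objs]
```

With a row of stations whose values rise by a small step from one station to the next, a large `delta_eps` yields one long cluster, whereas a small `delta_eps` stops the growth once a new value drifts too far from the cluster average, which is the behavior the threshold is meant to provide.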

When the algorithm searches the neighbors of any object by using the Retrieve_Neighbors function [line (iii) in the algorithm], it takes into consideration both spatial and temporal neighborhoods. The nonspatial value of an object, such as a temperature value, is compared with the nonspatial values of spatial neighbors and also with the values of temporal neighbors (previous day in the same year, next day in the same year, and the same day in other years). In this way, nonspatial, spatial, and temporal characteristics of data are used in clustering when the algorithm is applied to the table that contains temporal values as well as spatial and nonspatial values.

If two clusters $C_1$ and $C_2$ are very close to each other, a point p may belong to both $C_1$ and $C_2$. In this case, the point p must be a border point in both $C_1$ and $C_2$. The algorithm assigns point p to the cluster discovered first.

The average runtime complexity of the DBSCAN algorithm is $O(n \log n)$, where n is the number of objects in the database. Our modifications do not change the runtime complexity of the algorithm. DBSCAN has proven its ability to process very large databases. Ester et al. (1996, 1998) show that the runtime of other clustering algorithms such as DBCLASD²¹ and CLARANS²² is between 1.5 and 3 times the runtime of DBSCAN. This factor increases with increasing size of the database.

5 Application

To demonstrate the usability of our algorithm, we present three cluster analysis applications by using physical data from Turkish seas collected from satellites between 1992 and 2004. The task of clustering in the first application is to discover the regions that have similar sea surface temperature values. In the second application, the goal is to identify spatially based partitions that have similar sea surface height residual values. The third application includes cluster analysis on significant wave height data.

Fig. 5. Schematic diagram of the system. The diagram links a user interface (application requirements, system interface, database interface) to five components: data integration (environmental data from various satellites: sea surface temperature, sea surface height residual, significant wave height, wind speed); a database (info based: grids, images, info tables; view based: shape files; data warehouse: tables); cluster analysis and distribution modelling (data exploration, spatial analysis and modelling, time-series analysis and modelling, mapping, graphical illustration); data management (queries for tables, queries for display, data conversion); and outputs (tables, maps, images, illustrations, reports)

Figure 5 shows the structure of the system. In the visualization part of the study, remotely sensed data on the historical extent of marine areas were used in a spatial metrics analysis of the geographical form of countries and islands. User-friendly interfaces were developed, allowing relatively inexperienced users to operate the system. Special functions were developed for data integration, data conversion, querying, visualization, analysis, and management.

The process of KDD involves several steps such as data integration and selection, data preprocessing and transformation, data mining, and the evaluation of the data mining results. Our efforts at each step are described below.

5.1 Data integration and selection

We designed a spatial data warehouse system that contains information about four seas: the Black Sea, the Marmara Sea, the Aegean Sea, and the eastern Mediterranean. These seas surround the countries Turkey to the north, west, and south; Greece to the east, south, and west; and Cyprus. The geographical coordinates of our work area are 30°-47.5° north latitude and 17.0°-42.5° east longitude.

As shown in Fig. 6, the data model contains a central fact table, STATIONS, which interconnects the tables Sea_Surface_Temperature, Sea_Surface_Height, Wave_Height, and Sea_Winds. The data size is


Central fact table

STATIONS: StationID, RegionID, Latitude, Longitude

Linked tables

Sea_Surface_Temperature: StationID, Year, Month, Day, DayTime_Temperature, NightTime_Temperature, ClusterID

Sea_Surface_Height: StationID, Year, Month, Day, Sea_Surface_Height_Residual, ClusterID

Wave_Height: StationID, Year, Month, Day, Significant_Wave_Height, ClusterID

Sea_Winds: StationID, Year, Month, Day, Wind_Speed, Zonal_Wind, Meridional_Wind, ClusterID

Fig. 6. Schema of spatial-temporal data warehouse

Sea, Marmara Sea, Aegean Sea, or Mediterranean Sea). The last column, ClusterID, identifies a particular cluster of stations that have similar characteristics.

The Sea_Surface_Temperature table contains weekly daytime and nighttime temperature records from 2001 to 2004. The data were provided by National Oceanic and Atmospheric Administration (NOAA) satellites (NOAA/AVHRR satellite data web site, http://podaac.jpl.nasa.gov/). The table contains approximately 1.5 million rows. Data in the Sea_Surface_Height table were provided by the Topex/Poseidon satellite (Topex/Poseidon sea-level grids description, http://podaac.jpl.nasa.gov/woce/woce3_topex/topex/docs/topex_doc.htm) and were collected over five-day periods between 1992 and 2002. The Wave_Height table contains significant wave height values that were collected over ten-day periods between 1992 and 2002. Similar to the sea surface height values, the significant wave height values were provided by the Topex/Poseidon satellite. The Sea_Winds table contains information about wind speed, zonal wind, and meridional wind. The data were measured daily between 1999 and 2004 and were provided by the QuikSCAT satellite (QuikSCAT seawinds, gridded ocean wind vectors, http://podaac.jpl.nasa.gov/products/product109.html).

As in all databases, fast access to raw data in spatial-temporal databases depends on the structural organization of the stored information and the availability of suitable indexing methods. A well-designed data structure can facilitate the rapid extraction of the desired information from a set of data, and suitable indexing methods can quickly locate single or multiple objects. Well-known spatial indexing techniques include quadtrees,¹⁰ R-trees,¹¹ and others; see Güting for an overview. An R-tree is a spatial indexing technique that stores information about spatial objects such as object identifiers and the minimum bounding rectangles (MBRs) of the objects or groups of objects. Each entry of a leaf node is of the form (R, P), where R is a rectangle that encloses all the objects that can be reached by following the node pointer P. In our study, we made an improvement to the R-tree indexing method to handle spatial-temporal information. We created some nodes in the R-tree for each spatial object and linked them in temporal order. During the application of the algorithm, this tree is traversed to find the spatial or temporal neighbor objects of any object. Two objects are temporal neighbors if the values of these objects are observed on consecutive days in the same year or on the same day in different years.
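The temporal-neighbor rule stated above can be written as a small predicate. The sketch below is only an illustration of that rule using Python's standard `datetime.date`; the function name is hypothetical, and it does not model the linked R-tree nodes used in the study.

```python
from datetime import date

def are_temporal_neighbors(d1: date, d2: date) -> bool:
    """True for consecutive days of the same year, or for the same
    calendar day (month and day) in different years."""
    if d1.year == d2.year:
        return abs((d1 - d2).days) == 1
    return (d1.month, d1.day) == (d2.month, d2.day)

print(are_temporal_neighbors(date(2003, 7, 14), date(2003, 7, 15)))
print(are_temporal_neighbors(date(2003, 7, 14), date(2001, 7, 14)))
```

Under this literal reading, 31 December and the following 1 January are not temporal neighbors, because they are consecutive days that fall in different years.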

5.2 Data preprocessing and transformation

Satellite data generally contain false information and sometimes several values can be missing. We filled missing values with the average of adjacent object values. The missing values were generally located at the coasts of the Aegean Sea, because the Aegean coast is extremely indented with numerous gulfs and inlets.
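One simple way to realize this kind of gap filling is sketched below in Python. It assumes the values lie on a regular grid, that missing cells are stored as `None`, and that "adjacent" means the four directly neighboring cells; these are assumptions for the example, since the exact neighborhood used in the study is not specified here.

```python
def fill_missing(grid):
    """Return a copy of grid in which each None cell is replaced by the mean of
    its non-missing 4-connected neighbors (left as None if it has none)."""
    rows, cols = len(grid), len(grid[0])
    filled = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            neighbors = [grid[rr][cc]
                         for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                         if 0 <= rr < rows and 0 <= cc < cols
                         and grid[rr][cc] is not None]
            if neighbors:
                filled[r][c] = sum(neighbors) / len(neighbors)
    return filled

sst = [[20.0, 21.0, 22.0],
       [20.0, None, 22.0],
       [None, 21.0, 22.0]]
print(fill_missing(sst))
```

Averages are taken from the original grid only, so a filled value never feeds into the estimate for another missing cell.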

The maps derived from the NOAA-AVHRR (polar-orbiting advanced very high resolution radiometer) are used to compute sea surface temperatures (SSTs) by applying the multichannel sea surface temperature algorithm (MCSST). The latest version of this algorithm uses the following formula in the calculation of the SST:

$$SST = a\,T_4 + b\,(T_4 - T_5)\,T_f + c\,(\sec(q) - 1)(T_4 - T_5) - d \qquad (4)$$

where q is the satellite zenith angle or the incidence angle of the incoming radiation based on the horizontal plane of the satellite, and $T_4$ and $T_5$ are the brightness temperatures from AVHRR channels 4 and 5, respectively. $T_f$ is a first-guess SST estimate (obtained from the 1 km MCSST AVHRR mosaic SSTs) and a, b, c, and d are empirically derived coefficients. These coefficients are predetermined by comparing AVHRR radiance values to temperature measurements taken from moored and drifting buoys. For example, the nighttime and daytime equations for the NOAA-14 satellite are:

$$\mathrm{SST}_{\text{day}} = 0.9506\,T_4 + 0.0760\,(T_4 - T_5)\,T_f + 0.6839\,(\sec q - 1)(T_4 - T_5) - 258.0968$$

$$\mathrm{SST}_{\text{night}} = 0.9242\,T_4 + 0.0755\,(T_4 - T_5)\,T_f + 0.6040\,(\sec q - 1)(T_4 - T_5) - 250.4284 \qquad (5)$$
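The NOAA-14 equations can be evaluated directly, as in the sketch below. The coefficients are those of Eq. (5). The units are an assumption not stated in the text: brightness temperatures T4 and T5 in kelvin, with the first-guess Tf and the result in degrees Celsius, which is what the size of the offset d suggests. The function and dictionary names are hypothetical.

```python
import math

# Coefficients (a, b, c, d) for NOAA-14, copied from Eq. (5).
NOAA14_COEFFS = {
    "day": (0.9506, 0.0760, 0.6839, 258.0968),
    "night": (0.9242, 0.0755, 0.6040, 250.4284),
}


def mcsst(t4: float, t5: float, tf: float, zenith_deg: float,
          period: str = "day") -> float:
    """Evaluate Eq. (4) with the NOAA-14 coefficients of Eq. (5).

    t4, t5     -- AVHRR channel 4 and 5 brightness temperatures
                  (assumed to be in kelvin)
    tf         -- first-guess SST estimate (assumed to be in deg C)
    zenith_deg -- satellite zenith angle q, in degrees
    period     -- "day" or "night", selecting the coefficient set
    """
    a, b, c, d = NOAA14_COEFFS[period]
    sec_q = 1.0 / math.cos(math.radians(zenith_deg))
    return a * t4 + b * (t4 - t5) * tf + c * (sec_q - 1.0) * (t4 - t5) - d
```

At nadir (q = 0) the third term vanishes because sec q = 1, so the estimate depends only on T4, the channel difference T4 - T5, and the first guess Tf.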

Fig. 7. a The locations of 5340 stations. b The results of cluster analysis on sea surface temperature data

$$\mathrm{SSHR} = \mathrm{SSH} - \mathrm{MSS} - \text{Tide Effects} - \text{Inverse Barometer} \qquad (6)$$

where SSH is the sea surface height value and MSS is the mean sea surface height value. The residual sea surface is defined as the sea surface height minus the mean sea surface and minus known effects, i.e., tides and inverse barometer effects.

The significant wave height from Topex is calculated from altimeter data based on the shape of a radar pulse after it bounces off the sea surface. A calm sea with low waves returns a sharply defined pulse, whereas a rough sea with high waves returns a stretched pulse. The significant wave height is the average height of the highest one-third of all waves in a particular time period.
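The definition in the last sentence can be illustrated with a short sketch. It works on a list of individual wave heights, which is not how the altimeter derives the quantity (Topex infers it from the pulse shape); the code only shows the statistical definition. Rounding the one-third count upward is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def significant_wave_height(heights: Sequence[float]) -> float:
    """Mean height of the highest one-third of the waves in a record.

    The size of the "highest one-third" is rounded up (an assumption),
    so at least one wave is always averaged.
    """
    if not heights:
        raise ValueError("at least one wave height is required")
    ordered = sorted(heights, reverse=True)   # highest waves first
    top_n = math.ceil(len(ordered) / 3)       # size of the top third
    return sum(ordered[:top_n]) / top_n
```

For nine waves with heights from 0.5 m to 4.5 m in steps of 0.5 m, the three highest are 4.5, 4.0, and 3.5 m, so the function returns 4.0 m.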

5.3 Spatial-temporal clustering

In this step of the study, our clustering algorithm is applied three times to discover the spatial-temporal distributions of three physical parameters. The first application uses the Sea_Surface_Temperature data to find the regions that have similar sea surface temperature characteristics. The input parameters are set to Eps1 = 3, Eps2 = 0.5, and MinPts = 15. The second application uses the Sea_Surface_Height data to find regions that have similar sea surface height residual values, with input parameters Eps1 = 3, Eps2 = 1, and MinPts = 4. The third application uses the Wave_Height table to find the regions that have similar significant wave height values. For this, the input parameters are set to Eps1 = 1, Eps2 = 0.25, and MinPts = 15. These values for the input parameters were determined by using the heuristics given in Ester et al.

5.4 Evaluation of the results
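As a rough illustration of how the input parameters of Section 5.3 act, the sketch below lists the three parameter sets and applies them in a neighborhood test. It assumes that Eps1 bounds the spatial distance between two objects, that Eps2 bounds the difference of their non-spatial values, and that an object is a core object when its neighborhood (including itself, as in DBSCAN) holds at least MinPts objects. The temporal filtering and cluster expansion of the full algorithm are omitted, and the names `Station`, `retrieve_neighbors`, and `is_core` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Station:
    x: float      # spatial coordinates (units are an assumption)
    y: float
    value: float  # non-spatial measurement, e.g. SST


# Input parameters of the three runs, as listed in Section 5.3.
RUN_PARAMS = {
    "Sea_Surface_Temperature": {"eps1": 3.0, "eps2": 0.5, "min_pts": 15},
    "Sea_Surface_Height": {"eps1": 3.0, "eps2": 1.0, "min_pts": 4},
    "Wave_Height": {"eps1": 1.0, "eps2": 0.25, "min_pts": 15},
}


def retrieve_neighbors(points: List[Station], center: Station,
                       eps1: float, eps2: float) -> List[Station]:
    """Objects within eps1 in space AND within eps2 in measured value."""
    return [
        p for p in points
        if math.hypot(p.x - center.x, p.y - center.y) <= eps1
        and abs(p.value - center.value) <= eps2
    ]


def is_core(points: List[Station], center: Station,
            eps1: float, eps2: float, min_pts: int) -> bool:
    """Core-object test: the neighborhood holds at least min_pts objects."""
    return len(retrieve_neighbors(points, center, eps1, eps2)) >= min_pts
```

A call such as `is_core(stations, s, **RUN_PARAMS["Sea_Surface_Height"])` then applies the second parameter set. The two thresholds are tested jointly, so a station that is spatially close but differs strongly in its measured value is not counted as a neighbor.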

The example database contains weekly daytime and nighttime sea surface temperature records that were measured at 5340 stations between 2001 and 2004; these stations are shown in Fig. 7a as black dots. The spatial distribution of temperature in surface water (30°-47.5°N and 17°-42.5°E) is shown in Fig. 7b. Each cluster has data points that have similar sea surface temperature characteristics. Cluster number 1 is bordered by Ukraine and Russia; this region is the coldest area. Cluster number 2, bordering Romania and Ukraine, is the second coldest area. The seawater temperatures of other parts of the Black Sea are similar to those of the Marmara Sea. Cluster number 4 covers the north of the Aegean Sea. Cluster number 5 forms a single large cluster covering the eastern Mediterranean. The temperature values of the stations in cluster 6 also have similar characteristics. Cluster number 7 is the hottest region because it is the closest area to the equator. In winter, clusters C5 and C7 can be treated as one cluster because they cannot be clearly distinguished. In summer, cluster C6 becomes smaller. Many factors can affect this distribution of seawater temperature. The temperature varies both latitudinally and depth-wise in response to changes in air-sea interactions. Heat fluxes, evaporation, river inflow, and the movement of water and rain all influence the distribution of seawater temperature.

The Topex/Poseidon satellite provides sea surface height residual data as a two-dimensional grid with a spacing of one degree in latitude and longitude. The SSHR values stored in the database are therefore available at 134 stations, shown in Fig. 8a as black dots. The clusters obtained by using the Sea_Surface_Height table are shown in Fig. 8b. Each cluster has data points that have similar sea surface height residual values. Clusters C1-C4 are located in the Black Sea and cluster C7 is located in the Aegean Sea. The rest of the clusters are located in the Mediterranean Sea. Many factors contribute to changes in sea surface height, including sea eddies, the temperature of the upper layer of seawater, tides, sea currents, and gravity.

Fig. 8. a The locations of 134 stations. b The results of cluster analysis on sea surface height residual data

... approximately 0.5 m, the region that is circled with a dashed line (cluster 11) has a wave height of approximately 3.6 m. This region has the maximum wave height values.

6 Conclusions and future work

Clustering is an important method to clarify how the physical properties of the sea are distributed and how they change. The main objective of this study was to develop an algorithm to obtain the regions (clusters) that have similar physical parameter values in a marine environment. We present a new density-based clustering algorithm, ST-DBSCAN, which is constructed by modifying the DBSCAN algorithm in three ways. The first modification makes it possible to discover clusters in spatial-temporal data. The second is needed to identify noise objects when clusters of different densities exist; for this we introduce the concept of a density factor, assigned to each cluster, which expresses the degree of density of the cluster. The third compares the average value of a cluster with each new incoming value. To demonstrate our algorithm, we showed the spatial-temporal distributions of sea surface temperature, sea surface height residual, and significant wave height values in Turkish seas. Experimental results suggest that our modifications are promising when applied to physical data from the marine environment. Very large databases demand considerable computing power, so in future studies we intend to run the algorithm in parallel to improve performance. In addition, more useful heuristics may be found to determine the input parameters Eps and MinPts.

Acknowledgments. This study was supported by the Scientific Research Projects Directorate of Dokuz Eylul University. We thank the Institute of Marine Sciences and Technology at Dokuz Eylul University for its support.

References

1. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: LeCam LM, Neyman J (eds) Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, vol 1, pp 281-297

2. Vinod H (1969) Integer programming and the theory of grouping. J Am Stat Assoc 64:506-517

3. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas LM, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM, Seattle, WA, pp 73-84

4. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish HV, Mumick IS (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM, Quebec, pp 103-114

5. Fisher D (1987) Knowledge acquisition via incremental conceptual clustering. Machine Learn 2:139-172

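6. Ester M, Kriegel H-P, Sander J, et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) Proceedings of the second international conference on knowledge discovery and data mining. AAAI, Portland, pp 226-231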

7. Ester M, Kriegel H-P, Sander J, et al (1998) Clustering for mining in large spatial databases. Künstliche Intelligenz (special issue on data mining) 12:18-24

8. Ester M, Kriegel H-P, Sander J, et al (1998) Incremental clustering for mining in a data warehousing environment. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of the international conference on very large databases (VLDB'98). Morgan Kaufmann, New York, pp 323-333

9. Sheikholeslami G, Chatterjee S, Zhang A (1998) WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of the international conference on very large databases (VLDB'98). Morgan Kaufmann, New York, pp 428-439

10. Samet H (1990) The design and analysis of spatial data structures. Addison-Wesley, MA

11. Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM, Boston, pp 47-57

12. Ankerst M, Breunig MM, Kriegel H-P, et al (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM SIGMOD international conference on management of data. ACM, Philadelphia, pp 49-60

13. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Agrawal R, Stolorz PE, Piatetsky-Shapiro G (eds) Proceedings of the 4th international conference on knowledge discovery and data mining. AAAI, New York, pp 58-65

14. Ma S, Wang TJ, Tang SW, et al (2003) A new fast clustering algorithm based on reference and density. In: Proceedings of the WAIM conference. Springer-Verlag, Heidelberg, pp 214-225

15. Murray AT, Estivill-Castro V (1998) Cluster discovery techniques for exploratory spatial data analysis. Int J Geogr Inf Sc 12:431-443

16. Qian WN, Zhou AY (2002) Analyzing popular clustering algorithms from different viewpoints. J Software 13:1382-1394

17. Han J, Kamber M (2001) Data mining concepts and techniques. Morgan Kaufmann, San Francisco, pp 335-391

18. Januzaj E, Kriegel H-P, Pfeifle M (2004) Scalable density-based distributed clustering. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD'04). Lecture notes in computer science. Springer, Berlin/Heidelberg, pp 231-244

19. Wen J-R, Nie J-Y, Zhang H-J (2002) Query clustering using user logs. ACM Trans Inf Sys 20:59-81

20. Spieth C, Streichert F, Speer N, et al (2005) Clustering-based approach to identify solutions for the inference of regulatory networks. In: Proceedings of the IEEE congress on evolutionary computation. IEEE, Edinburgh

21. Xu X, Ester M, Kriegel H-P, et al (1998) A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of the IEEE international conference on data engineering. IEEE Computer Society, Orlando, pp 324-331

22. Ng RT, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Bocca JB, Jarke M, Zaniolo C (eds) Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann, Santiago, pp 144-155

23. Abraham T, Roddick JF (1999) Survey of spatio-temporal databases. GeoInformatica 3:61-99

24. Güting RH (1994) An introduction to spatial database systems. VLDB J 3:357-399