
Systems Research Institute Polish Academy of Sciences

Victor B. Taylor, MSc

Ph.D. Thesis

HeBIS: A Biologically Inspired Data Classification System

Supervisor: Prof. dr hab. inż. Janusz Kacprzyk

Systems Research Institute Polish Academy of Sciences

Warsaw, September 2010


ABSTRACT

This research develops and investigates an algorithm for the self-organized development of a classification network. The idea for this classification network, known as HeBIS (Heterogeneous Biologically Inspired System), is based on a heterogeneous mixture of intelligent yet simple processing units (cells) that can potentially consist of several types of machine learning constructs. These constructs include self-organizing feature maps (SOFMs), artificial neural networks (ANNs), and support vector machines (SVMs) that communicate with each other via the diffusion of artificial proteins. The context for the self-organization of the network and the communication between the processing cells is that of an artificial genetic regulatory network (GRN). An evolved GRN based on an artificial chemistry of simulated proteins is used as the controller. Artificial genes within each processing cell are essentially excitatory and inhibitory switches that control the concentration and diffusion of artificial proteins throughout a simulated environmental lattice. This GRN controls both the growth of the classification network and the specific behaviors of the individual processing cells. These controls also use artificial-chemistry analogs of problem descriptors such as second-order statistics.
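The switch-and-diffusion mechanics described above can be sketched in a few lines; the thresholded on/off gene, the 1-D lattice, and the diffusion rate below are illustrative assumptions, not the actual HeBIS parameterization:

```python
import numpy as np

def gene_switch(concentration, threshold, inhibitory=False):
    """One artificial gene modeled as a switch: an excitatory gene activates
    when its regulatory protein concentration reaches a threshold, whereas an
    inhibitory gene deactivates. Rule and threshold are illustrative only."""
    active = concentration < threshold if inhibitory else concentration >= threshold
    return 1.0 if active else 0.0

def diffuse(lattice, rate=0.1):
    """One diffusion step on a 1-D environmental lattice: each site sends a
    fraction `rate` of its protein concentration to each of its two neighbors
    (reflecting boundaries, so total protein is conserved)."""
    left = np.roll(lattice, 1)
    right = np.roll(lattice, -1)
    left[0] = lattice[0]      # undo wrap-around: ghost cells mirror the edges
    right[-1] = lattice[-1]
    return (1 - 2 * rate) * lattice + rate * (left + right)
```

Repeated calls to `diffuse` spread a protein deposited at one lattice site outward over the whole environment, which is the mechanism by which distant cells can sense each other's activity.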

Self-organization and evolution of the network occur on several levels: the high-level topology of the network as well as the parameters and behaviors that affect the internal organization of each processing cell. The artificial proteins used for communication and the transfer of regulatory information between and within the processing elements are also evolved, as are the environmental proteins used to represent the input feature vector for the set of training and test exemplars. An evolutionary process incorporating particle swarm optimization is used to construct an artificial genome that defines these elements as well as the classification information that the network presents to the user.
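The particle-swarm stage can be sketched as follows, with the genome abstracted to a real vector in [0, 1]^dim; the swarm size, inertia `w`, and acceleration constants `c1`/`c2` are illustrative defaults, not the values used in the experiments:

```python
import random

def pso(fitness, dim, n_particles=20, iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (maximization). Each particle is a
    candidate genome encoded as a real vector clamped to [0, 1]^dim."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # per-particle best position
    pbest_f = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]       # swarm-wide best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_f[i]:
                pbest[i], pbest_f[i] = pos[i][:], f
                if f > gbest_f:
                    gbest, gbest_f = pos[i][:], f
    return gbest, gbest_f
```

Here `pso` maximizes an arbitrary `fitness` callable, which in HeBIS's setting would wrap genome decoding, network growth, and the evaluation of the resulting classifier.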

Behaviors such as input/output signal conditioning, machine learning processing, and environmental/regulatory communications are evolved and serve as the genome's input to the high-level evolutionary process.

The HeBIS algorithm architecture is discussed in detail and extensions for future research are proposed. The performance of a classification network based on this novel technique with a single type of cellular machine learning element, a SOFM, is examined and compared with that of a baseline standalone SOFM. For this case study, the problem considered is how well HeBIS learns the empirical algorithm for cloud/no-cloud pixel detection used by the National Aeronautics and Space Administration (NASA) for its multispectral optical datasets acquired from the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor on the earth-orbiting Aqua satellite.
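For context, a baseline standalone SOFM classifier of the kind compared against can be sketched as below; the 1-D map, the decay schedules, and the majority-vote labeling of neurons (binary labels {0, 1} assumed) are illustrative choices, not the experimental configuration:

```python
import numpy as np

def train_sofm(data, labels, n_neurons=4, iters=200, lr0=0.5, seed=0):
    """Tiny 1-D self-organizing feature map with post-hoc labeling: after
    unsupervised training, each neuron takes the majority class of the
    training samples it wins. Grid size and schedules are illustrative."""
    rng = np.random.default_rng(seed)
    w = rng.random((n_neurons, data.shape[1]))
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((w - x) ** 2).sum(axis=1))    # best-matching unit
        lr = lr0 * (1 - t / iters)                     # decaying learning rate
        sigma = max(1e-3, n_neurons / 2 * (1 - t / iters))
        dist = np.abs(np.arange(n_neurons) - bmu)      # grid distance to BMU
        h = np.exp(-(dist ** 2) / (2 * sigma ** 2))    # neighborhood kernel
        w += lr * h[:, None] * (x - w)
    votes = np.zeros((n_neurons, 2))                   # binary classes assumed
    for x, y in zip(data, labels):
        votes[np.argmin(((w - x) ** 2).sum(axis=1)), y] += 1
    return w, votes.argmax(axis=1)

def classify(w, neuron_labels, x):
    """Assign the label of the winning neuron to feature vector x."""
    return neuron_labels[np.argmin(((w - x) ** 2).sum(axis=1))]
```

In the cloud/no-cloud setting, `data` would hold the normalized multispectral feature vectors and `labels` the ground-truth cloud mask.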


ACKNOWLEDGMENTS


DECLARATION


Table of Contents

ABSTRACT ... 4

ACKNOWLEDGMENTS ... 6

DECLARATION ... 7

TABLE OF CONTENTS ... 8

FIGURES ... 11

TABLES ... 16

TERMS AND ACRONYMS ... 18

1. INTRODUCTION ... 19

1.1. PROBLEM STATEMENT ... 20

1.2. DELIMITATIONS OF THE RESEARCH ... 21

1.3. KEY CONTRIBUTIONS ... 21

1.4. ORGANIZATION OF THIS THESIS ... 21

2. LITERATURE REVIEW ... 24

2.1. MACHINE LEARNING AND SELF-ORGANIZATION ... 25

2.1.1. Classification overview ... 25

2.1.2. Artificial neural networks ... 28

2.1.3. Self-organizing feature maps ... 34

2.2. EVOLUTIONARY COMPUTATION ... 40

2.2.1. Particle swarm optimization ... 40

2.3. BIOLOGICAL AND ARTIFICIAL EVOLUTIONARY DEVELOPMENT ... 43

2.4. SUMMARY ... 51

3. HETEROGENEOUS BIOLOGICALLY INSPIRED SYSTEM (HEBIS) ... 54

3.1. OVERVIEW ... 54

3.2. FUNDAMENTALS ... 55

3.2.1. Processing cell ... 55

3.2.2. Environment ... 57

3.2.2.1. 1-D environment ... 58

3.2.2.2. 2-D environment ... 58

3.2.2.3. 3-D environment ... 59

3.2.3. Genetic regulatory network ... 59

3.2.3.1. Gene coding ... 60

3.2.3.2. Protein communications ... 64

3.2.4. Basic cell processing ... 66

3.2.4.1. Cell genome ... 66

3.2.4.2. Intrinsic behaviors ... 67

3.2.4.2.1. NumberProteinsInCell ... 67

3.2.4.2.2. NumberProteinsInLocalEnviro ... 67

3.2.4.2.3. ConcentrationStandardDeviationLocalEnviro ... 67

3.2.4.2.4. ConcentrationMeanLocalEnviro ... 68

3.2.4.2.5. ConcentrationMaxLocalEnviro ... 68

3.2.4.2.6. ConcentrationMinLocalEnviro ... 68

3.2.4.2.7. KillSelf ... 68

3.2.4.2.8. NumberFeatures ... 68

3.2.4.3. Learned behaviors ... 69

3.2.4.3.1. AddCell ... 69

3.2.4.3.2. PruneSelf ... 69

3.2.4.3.3. ChangeToSOFMAndTrain ... 69

3.2.4.3.4. Classify ... 70

3.2.4.4. Cell types ... 70

3.2.4.4.1. SOFM ... 70

3.2.4.4.2. Pass-Thru ... 70

3.3. INPUT FEATURE VECTOR REPRESENTATIONS ... 71

3.3.1. Direct feature-to-protein mapping ... 71

3.4. PATTERN TRAINING FOR CLASSIFICATION ... 72

3.4.1. Self-organization principles ... 73

3.4.1.1. Protein analogs of statistical features ... 74

3.4.1.2. Cellular fission and death ... 74

3.4.2. Particle swarm optimization ... 75

3.4.3. Training algorithm ... 75

3.4.3.1. Training algorithm: Presentation of training vectors and classes to the system ... 75

3.5. OUTPUT CODING ... 77

3.6. POST-PROCESSING OF CLASSIFICATION RESULTS ... 77

4. SIMULATIONS AND ANALYSES ... 78

4.1. SIMULATION LIMITS ... 78

4.2. GENERAL METHODOLOGY ... 80

4.2.1. Remote sensing cloud/no-cloud problem ... 80

4.2.1.1. Description ... 80

4.2.1.2. Sensor and datasets ... 80

4.3. THE CONSTRUCTION OF SIMPLE GENETIC REGULATORY NETWORKS ... 95

4.3.1. Introduction/Methodology ... 95

4.3.2. Experiments 1 and 2 – Proteins ... 96

4.3.2.1. Setup ... 96

4.3.2.2. Experiments 1 and 2 results and discussion ... 97

4.3.2.3. Experiments 1 and 2 conclusions ... 100

4.3.3. Experiments 3 and 4 - Protein Chemistry ... 101

4.3.3.1. Setup ... 101

4.3.3.2. Experiments 3 and 4 results and discussion ... 102

4.3.3.3. Experiments 3 and 4 conclusions ... 113

4.3.4. Experiment 5 - Gene activation ... 114

4.3.4.1. Setup ... 114

4.3.4.2. Experiment 5 results and discussion ... 115

4.3.4.3. Experiment 5 conclusions ... 127

4.4. SELF-ORGANIZATION IN THE HEBIS ENVIRONMENT ... 128

4.4.1. Introduction/Methodology ... 128

4.4.2. Fitness function description ... 129

4.4.3. Experiment 6 - Swarm fitness characterization ... 132

4.4.3.1. Setup ... 132

4.4.3.2. Experiment 6 results and discussion ... 132

4.4.3.3. Experiment 6 conclusions ... 135

4.4.4. Experiment 7 - Initial location of processing cells ... 136

4.4.4.1. Setup ... 136

4.4.4.2. Experiment 7 results and discussion ... 136

4.4.4.3. Experiment 7 conclusions ... 137

4.4.5. Experiment 8 - Cellular actions ... 138

4.4.5.1. Setup ... 138

4.4.5.2. Experiment 8 results and discussion ... 139

4.4.5.3. Experiment 8 conclusions ... 141

4.4.6. Experiment 9 - Protein statistical analogs... 141

4.4.6.1. Setup ... 141

4.4.6.2. Experiment 9 results and discussion ... 142

4.4.6.3. Experiment 9 conclusions ... 143

4.4.7. Experiment 10 - Output protein comparison ... 143

4.4.7.1. Setup ... 144

4.4.7.2. Experiment 10 results and discussion ... 144

4.4.7.3. Experiment 10 conclusions ... 145

4.5. CLASSIFICATION ACCURACY ... 146

4.5.1. Introduction ... 146

4.5.2. Training algorithm parameters description ... 146

4.5.3. Fully-engaged HeBIS ... 147

4.5.3.1. Experiment 11 - Size of geographic processing environment ... 147

4.5.3.1.1. Setup ... 148

4.5.3.1.2. Experiment 11 results and discussion ... 149

4.5.3.1.3. Experiment 11 conclusions ... 153

4.5.3.2. Experiment 12 - Size of intracellular SOFM kernel ... 153

4.5.3.2.1. Setup ... 153


4.5.3.2.2. Experiment 12 results and discussion ... 154

4.5.3.2.3. Experiment 12 conclusions ... 160

4.5.3.3. Experiment 13 - Protein chemistry reaction probability ... 161

4.5.3.3.1. Setup ... 161

4.5.3.3.2. Experiment 13 results and discussion ... 162

4.5.3.3.3. Experiment 13 conclusions ... 167

4.5.3.4. Experiment 14 - Shotgun ... 168

4.5.3.4.1. Setup ... 168

4.5.3.4.2. Experiment 14 results and discussion ... 171

4.5.3.4.3. Experiment 14 conclusions ... 194

4.6. CLASSIFICATION ROBUSTNESS ... 196

4.6.1. Introduction/Methodology ... 196

4.6.2. Experiment 15 - Noise ... 196

4.6.2.1. Setup ... 196

4.6.2.2. Experiment 15 results and discussion ... 198

4.6.2.3. Experiment 15 conclusions ... 199

4.6.3. Experiment 16 - Missing features ... 200

4.6.3.1. Setup ... 200

4.6.3.2. Experiment 16 results and discussion ... 200

4.6.3.3. Experiment 16 conclusions ... 203

5. SUMMARY AND CONCLUSIONS ... 204

6. FUTURE RESEARCH ... 209

6.1. FITNESS FUNCTIONS ... 209

6.2. ADDITIONAL MACHINE LEARNING KERNELS ... 209

6.3. MODULARITY AND LEARNED FUNCTIONALITY ... 209

6.4. ADDITIONAL CELLULAR ACTIONS ... 210

6.4.1. Scale-free networks ... 211

6.4.2. Mutual information ... 211

6.4.3. Cellular instantiation ... 212

6.5. GRAPHICS PROCESSING UNIT ... 213

6.6. PROTEIN-BASED COMMUNICATIONS FOR ARTIFICIAL DEVICES IN A BIOLOGICAL SYSTEM ... 214

7. APPENDICES ... 215

7.1. DATA ... 216

7.2. GENOME MAPPINGS FOR SHOTGUN DATA ... 234

7.3. PROTEIN DIFFUSION EXAMPLE ... 235

7.4. HEBIS FITNESS FUNCTION DETAILS ... 238

7.5. HEBIS SHOTGUN EXPERIMENT CORRELATION MAPS AND PDF ... 240

7.6. HEBIS TRAINING CYCLE DETAILS ... 241

BIBLIOGRAPHY ... 242


Figures

Figure 1. Performance of empirical learning systems. ... 26

Figure 2. An artificial neuron. ... 29

Figure 3. Artificial Feedforward Neural Network. ... 29

Figure 4. Self-Organizing Feature Map. ... 36

Figure 5. Self-Organizing Feature Map Topology Preservation Example [61]. ... 37

Figure 6. Particle Swarm Optimization Algorithm [90]. ... 42

Figure 7. Environmental lattice and processing cell overview. ... 55

Figure 8. Major blocks of cell functionality. ... 56

Figure 9. Linear grid numbering scheme. ... 58

Figure 10. Planar grid numbering scheme. ... 59

Figure 11. Three-dimensional lattice numbering scheme. ... 59

Figure 12. Genome/Gene protein hierarchy. ... 61

Figure 13. Standard regulatory/environmental and switch protein descriptions. ... 61

Figure 14. Direct feature-to-protein mapping. ... 72

Figure 15. Schematic of the training algorithm. ... 76

Figure 16. Training/Testing pixel and its relationship to its surrounding geographic pixels. ... 78

Figure 17. Training/Testing pixel and the surrounding multispectral information. ... 79

Figure 18. Pseudocolor image for A2002193183000 dataset. Grey and white colors correspond to cloud pixels, black corresponds to water, and green and brown refer to land pixels. ... 84

Figure 19. Ground truth (cloud/no-cloud) for A2002193183000 dataset. Red corresponds to land pixels whereas white pixels reference clouds and black corresponds to water. ... 85

Figure 20. Land mask (cloud/no-cloud) for A2002193183000 dataset. Red corresponds to land pixels and black corresponds to water pixels. ... 86

Figure 21. Cloud/no-cloud class breakdown according to specific wavelength-band feature. ... 87

Figure 22. 412 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 88

Figure 23. 443 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 88

Figure 24. 469 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 89

Figure 25. 488 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 89

Figure 26. 531 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 89

Figure 27. 551 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 90

Figure 28. 555 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 90

Figure 29. 645 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 91

Figure 30. 667 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 91

Figure 31. 678 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 91


Figure 32. 748 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 92

Figure 33. 859 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 92

Figure 34. 869 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 93

Figure 35. 1240 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 93

Figure 36. 2130 nm cloud/no-cloud scatter plot with cloud (C0) pixels represented as 1 and no-cloud (C0) pixels represented as -1 on the abscissa. Magnitudes are log-normalized and biased and scaled to fall within the range [0, 1]. ... 94

Figure 37. Baseline number of proteins in environmental lattice for zero-length genome in Experiment 1. Vertical bars correspond to the standard deviation of the sample mean. ... 98

Figure 38. Baseline number of proteins in environmental lattice for the 3-gene genome in Experiment 2. Vertical bars correspond to the standard deviation of the sample mean. ... 98

Figure 39. Number of proteins in environment compared between the baseline genome from Experiment 1 and the multi-gene genome from Experiment 2. Vertical bars correspond to the standard deviation of the sample mean. ... 100

Figure 40. Number of proteins in environmental lattice for a 0-gene genome vs. cellular iteration for reaction probabilities of 0%, 0.1%, 1%, and 10% with error bars removed for clarity. ... 103

Figure 41. Number of proteins in environmental lattice for a 3-gene genome vs. cellular iteration for reaction probabilities of 0%, 0.1%, 1%, and 10% with error bars removed for clarity. ... 104

Figure 42. Number of proteins in environmental lattice for a 0-gene genome vs. cellular iteration for a reaction probability of 0.1%. Vertical bars correspond to the standard deviation of the sample mean. ... 105

Figure 43. Number of proteins in environmental lattice for a 0-gene genome vs. cellular iteration for a reaction probability of 1.0%. Vertical bars correspond to the standard deviation of the sample mean. ... 105

Figure 44. Number of proteins in environmental lattice for a 0-gene genome vs. cellular iteration for a reaction probability of 10%. Vertical bars correspond to the standard deviation of the sample mean. ... 106

Figure 45. Number of proteins in environmental lattice for a 3-gene genome vs. cellular iteration for a reaction probability of 0.1%. Vertical bars correspond to the standard deviation of the sample mean. ... 106

Figure 46. Number of proteins in environmental lattice for a 3-gene genome vs. cellular iteration for a reaction probability of 1.0%. Vertical bars correspond to the standard deviation of the sample mean. ... 107

Figure 47. Number of proteins in environmental lattice for a 3-gene genome vs. cellular iteration for a reaction probability of 10%. Vertical bars correspond to the standard deviation of the sample mean. ... 107

Figure 48. Fitting and statistical information for a 0-gene genome in an environment with a reaction probability of 0.1%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 109

Figure 49. Fitting and statistical information for a 0-gene genome in an environment with a reaction probability of 1.0%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 110

Figure 50. Fitting and statistical information for a 0-gene genome in an environment with a reaction probability of 10%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 110

Figure 51. Fitting and statistical information for a 3-gene genome in an environment with a reaction probability of 0.1%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 111

Figure 52. Fitting and statistical information for a 3-gene genome in an environment with a reaction probability of 1.0%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 112


Figure 53. Fitting and statistical information for a 3-gene genome in an environment with a reaction probability of 10%. Vertical bars on the upper chart correspond to the standard deviation of the sample mean. ... 112

Figure 54. Comparison between 0-gene and 3-gene genomes for varying environmental reaction probabilities. Error bars removed for clarity. ... 113

Figure 55. Example gene activation map. ... 116

Figure 56. 3-gene genome activation vs. iteration for C0 (178) using genomes 93 and 146 in (a) and (b), respectively. ... 117

Figure 57. 3-gene genome activation vs. iteration for C0 (142) using genomes 9 and 147, respectively, in (a) and (b). ... 117

Figure 58. 3-gene genome activation vs. cellular iteration for C0 (48) using genome 178. ... 118

Figure 59. 10-gene genome activation vs. cellular iteration for original and two cloned cells within the environmental matrix - C0 (42) test pixel for genome 117. The original genome is presented in (a) while the two cloned genomes are presented in (b) and (c). ... 119

Figure 60. 10-gene genome activation vs. cellular iteration for original and two cloned cells within the environmental matrix - C0 (50) test pixel using genome 117. The activation for the original cell is shown in (a) and the activations for the cloned cells are shown in (b) and (c). ... 120

Figure 61. 10-gene genome activation vs. cellular iteration for original and two cloned cells within the environmental matrix - C1 (51) test pixel using genome 117. The original cell activation is presented in (a) and the activations for the cloned cells are listed in (b) and (c). ... 120

Figure 62. 10-gene genome activation vs. cellular iteration for the same test pixel C0 (42). Genomes 91, 123, and 135 are displayed in (a), (b), and (c), respectively. Genomes 123 and 135 show original and two cloned cell activations for each of these genomes. ... 122

Figure 63. 40-gene genome activation for genome 30 for test pixels C0 (26) in the top image (a) and C1 (25) in the bottom image (b). The genome shows multi-gene activation for C0 and single-gene activation for C1 with differing responses. ... 123

Figure 64. 40-gene genome activation for genome 66. The top (a) and middle (b) images are the multi-gene activation profiles for C0 (30) with an original and cloned cell. The bottom (c) image shows the activation for the same genome, but for a C1 (31). ... 124

Figure 65. 40-gene genome activation of genome 90 for test pixels C0 (30), C1 (31), C0 (32), C0 (34), C0 (54), and C1 (55), respectively from top to bottom, in figures (a) - (f). ... 126

Figure 66. Decision region mapping and boundary based on the value of CorrC0max. ... 130

Figure 67. Average best genome fitness vs. breed # for 1-particle PSO swarm. Vertical bars correspond to the standard deviation of the sample mean. ... 133

Figure 68. Average best genome fitness vs. breed # for 100-particle PSO swarm. Vertical bars correspond to the standard deviation of the sample mean. ... 133

Figure 69. Average best genome fitness vs. breed # for 250-particle PSO swarm. Vertical bars correspond to the standard deviation of the sample mean. ... 134

Figure 70. Average best genome fitness vs. breed # for 500-particle PSO swarm. Vertical bars correspond to the standard deviation of the sample mean. ... 134

Figure 71. Best average peak genome fitness vs. the number of particles in the PSO swarm. Vertical bars correspond to the standard deviation of the sample mean of the peak fitness for each swarm tested. ... 135

Figure 72. CV average fitness vs. initial cell location from best bred genome. Vertical bars correspond to the standard deviation of the sample mean. ... 137

Figure 73. CV average fitness vs. activated cellular action. Vertical bars correspond to the standard deviation of the sample mean. ... 139

Figure 74. CV average fitness vs. activity level of cellular protein statistics. Vertical bars correspond to the standard deviation of the sample mean. ... 142

Figure 75. CV average fitness vs. static or PSO-evolved setting of the output C0/C1 protein. Vertical bars correspond to the standard deviation of the sample mean. ... 145

Figure 76. Multi-spectral data cube for a 5 x 5 geographic region with 15 bands of multispectral data. ... 149

Figure 77. Full-image average classification accuracy vs. size of geographic regions surrounding test pixel. Vertical bars correspond to the standard deviation of the sample mean. ... 150

Figure 78. Full-image classification accuracy vs. fitness for a 3 x 3 geographic region surrounding test pixel. ... 151

Figure 79. Full-image classification accuracy vs. fitness for a 5 x 5 geographic region surrounding test pixel. ... 152

Figure 80. Classification accuracy vs. fitness for case with 0 neurons in intracellular SOFM. ... 155

Figure 81. Classification accuracy vs. fitness for case with 1 neuron in intracellular SOFM. ... 156


Figure 82. Classification accuracy vs. fitness for case with 4 neurons in intracellular SOFM. ... 157

Figure 83. Classification accuracy vs. fitness for case with 9 neurons in intracellular SOFM. ... 159

Figure 84. Classification accuracy vs. fitness for case with 81 neurons in intracellular SOFM. ... 160

Figure 85. Full-image average classification accuracy vs. the number of neurons in the intracellular SOFM. Vertical bars correspond to the standard deviation of the sample mean of the classification accuracy. ... 160

Figure 86. Classification accuracy vs. fitness with 0.0 reaction probability. ... 163

Figure 87. Classification accuracy vs. fitness with 0.001 reaction probability. ... 164

Figure 88. Classification accuracy vs. fitness with 0.01 reaction probability. ... 165

Figure 89. Classification accuracy vs. fitness with 0.1 reaction probability. ... 166

Figure 90. Full-image classification accuracy vs. the probability of protein reaction in the environmental lattice. Vertical bars correspond to the standard deviation of the sample mean. ... 167

Figure 91. Sample ROC curve with false-positive rate along the abscissa and true-positive rate as the ordinate. ... 170

Figure 92. HeBIS classification imagery and ROC for 2010_05_28_07_58_58_61 test. This is HeBIS selected trial #0. ... 175

Figure 93. HeBIS classification imagery and ROC for 2010_06_11_01_18_20_77 test. This is HeBIS selected trial #1. ... 176

Figure 94. HeBIS classification imagery and ROC for 2010_06_10_08_03_09_38 test. This is HeBIS selected trial #2. ... 177

Figure 95. HeBIS classification imagery and ROC for 2010_06_09_16_37_07_17 test. This is HeBIS selected trial #3. ... 178

Figure 96. HeBIS classification imagery and ROC for 2010_05_28_12_49_29_76 test. This is HeBIS selected trial #4. ... 179

Figure 97. HeBIS classification imagery and ROC for 2010_06_03_10_39_23_85 test. This is HeBIS selected trial #5. ... 180

Figure 98. HeBIS classification imagery and ROC for 2010_05_28_00_32_27_50 test. This is HeBIS selected trial #6. ... 181

Figure 99. HeBIS classification imagery and ROC for 2010_06_09_17_43_13_23 test. This is HeBIS selected trial #7. ... 182

Figure 100. HeBIS classification imagery and ROC for 2010_05_28_12_42_02_75 test. This is HeBIS selected trial #8. ... 183

Figure 101. HeBIS classification imagery and ROC for 2010_06_01_233_35_56_3 test. This is HeBIS selected trial #9. ... 184

Figure 102. HeBIS classification imagery and ROC for 2010_06_10_07_35_57_34 test. This is HeBIS selected trial #10. ... 185

Figure 103. Comparison plot of HeBIS classification accuracy vs. fitness of the genome for 200 trials. Protein chemistry is deactivated. ... 189

Figure 104. Comparison plot of HeBIS classification accuracy vs. fitness of the genome for 79 trials with protein chemistry activated. ... 189

Figure 105. HeBIS classification accuracy vs. reaction probability for 79 trials. Protein chemistry is activated. ... 190

Figure 106. HeBIS classification accuracy vs. reaction probability and fitness for 79 trials. Protein chemistry is activated. ... 191

Figure 107. HeBIS classification accuracy vs. reaction probability and minimum protein correlation for 79 trials. Protein chemistry is activated. ... 192

Figure 108. Classification accuracy vs. environmental diffusion rate for 200 shotgun trials. Protein chemistry is deactivated. ... 193

Figure 109. Classification accuracy vs. genome fitness and environmental diffusion rate for 200 shotgun trials. Protein chemistry is deactivated. ... 193

Figure 110. Noise comparison plot for classification accuracy vs. noise standard deviation for both HeBIS and SOFM trials. Vertical bars correspond to the standard deviation of the sample mean of classification accuracy. ... 199

Figure 111. Comparison of before and after classification accuracy for the MODIS band 16 knockout. ... 201

Figure 112. Comparison of before and after classification accuracy for the MODIS band 7 knockout. ... 202

Figure 113. One method of presenting training/test data to HeBIS. Each behavior is trained separately, candidate genomes are created, and the candidates then undergo evolutionary optimization in a final CV/GA loop. ... 210

Figure 114. Processing model for research infrastructure. ... 214


Figure 115. Parameter activation maps for genomes discovered during shotgun experiments. The color bar from Table 81 is applicable to the evolved elements of the genomes in this figure. ... 234

Figure 116. Single protein diffusion from four sites within an 11 x 11 x 11 cubic environment. This frame shows four initial sites of protein activation at the beginning of the simulation. ... 235

Figure 117. Frame #2 in the simulation. This frame is a snapshot of activity in the environmental lattice after iteration 2 of the diffusion simulation. The two red protein sites are still actively producing proteins whereas the two green sites are decaying. The light blue color represents sites within the lattice that have the lowest non-zero protein concentrations at this point in the simulation. ... 235

Figure 118. Frame #6. The sites colored red are still actively producing whereas the light-blue-colored and dark-blue-colored sites possess lower concentrations of the simulated protein. The darker blue sites contain lower concentrations of protein than the light-blue sites. This is the snapshot from iteration 6 of the simulation. ... 236

Figure 119. Frame #26. After 26 iterations, the artificial protein has diffused throughout a large portion of the 11 x 11 x 11 environmental matrix. Hotter colors (e.g. red, yellow, green) correspond to higher concentrations of the proteins whereas cooler colors (e.g. light blue, blue) correspond to areas of relatively low concentrations. ... 236

Figure 120. Frame #40. At iteration 40, the protein has diffused throughout the environmental lattice. The red sites are the locations of the original and continuing protein sources. Hotter colors correspond to higher protein concentrations whereas cooler colors correspond to lower concentrations. ... 237

Figure 121. Regions of equivalent CorrC0max for the fitness function. ... 238

Figure 122. Θcorr2 portion of fitness function. ... 238

Figure 123. Magcorr2 portion of fitness function. ... 239

Figure 124. Correlation coefficient grid for processing parameters and classification results. Parameters and results are numbered from 1 to 26. ... 240

Figure 125. Significance p-value grid for processing parameters and classification results obtained with a Student's t-test. Parameters and results are numbered from 1 to 26. ... 240

Figure 126. HeBIS classification training cycle. ... 241


Tables

Table 1. Bitwise XOR Functional Mapping ... 65
Table 2. Orbital Information for NASA's Aqua Satellite ... 80
Table 3. Nominal Resolutions for the MODIS Sensor ... 81
Table 4. 36 Bands of Multispectral Data from MODIS ... 81
Table 5. NASA/CEOS Dataset Level Definition ... 82
Table 6. 17 Bands from MODIS for A2002193183000 LAC_x_NIR ... 82
Table 7. Multispectral Bands Used from MODIS [111] ... 83
Table 8. Simulation Parameters for Experiment 1 ... 96
Table 9. Simulation Parameters for Experiment 2 ... 97
Table 10. Statistical Summary for Experiments 1 and 2 ... 99
Table 11. Simulation Parameters for Experiment 3 ... 101
Table 12. Simulation Parameters for Experiment 4 ... 102
Table 13. Statistical Summary for Experiment 3 ... 108
Table 14. Statistical Summary for Experiment 4 ... 108
Table 15. Simulation Parameters for Experiment 5 ... 115
Table 16. Simulation Parameters for Self-Organization Experiments ... 129
Table 17. Experiment 6 Trial Distribution ... 132
Table 18. Trial Distribution for Experiment 7 ... 136
Table 19. Trial Distribution for Experiment 8 ... 139
Table 20. Experiment 9 Parameters ... 142
Table 21. Experiment 10 Parameters ... 144
Table 22. Range of Pertinent HeBIS Training Parameters for Classification ... 147
Table 23. Pertinent SOFM Training Parameters for Classification ... 147
Table 24. Trial Distribution across Geographic Region Size for Experiment 11 ... 148
Table 25. Simulation Parameters for Experiment 11 ... 148
Table 26. Confusion Matrix for 3 x 3 Geographic Region ... 150
Table 27. Confusion Matrix for 5 x 5 Geographic Region ... 152
Table 28. HeBIS Kernel Sizes for the Intracellular SOFM in Experiment 12 ... 153
Table 29. Simulation Parameters for Experiment 12 ... 154
Table 30. Confusion Matrix for Intracellular SOFM with 0 Neurons ... 154
Table 31. Confusion Matrix for Intracellular SOFM with 1 Neuron ... 156
Table 32. Confusion Matrix for Intracellular SOFM with 4 Neurons ... 157
Table 33. Confusion Matrix for Intracellular SOFM with 9 Neurons ... 158
Table 34. Confusion Matrix for Intracellular SOFM with 81 Neurons ... 159
Table 35. Protein Reaction Probability Distribution for Experiment 13 ... 161
Table 36. Simulation Parameters for Experiment 13 ... 162
Table 37. Confusion Matrix for 0.0 Protein Reaction Probability ... 162
Table 38. Confusion Matrix for 0.001 Protein Reaction Probability ... 163
Table 39. Confusion Matrix for 0.01 Protein Reaction Probability ... 165
Table 40. Confusion Matrix for 0.1 Protein Reaction Probability ... 166
Table 41. List of HeBIS Parameters to Randomize for Experiment 14 ... 169
Table 42. List of SOFM Parameters to Randomize for Experiment 14 ... 169
Table 43. Distribution of Class and Infrastructure Pixels in A2002193183000 ... 171
Table 44. Operational Parameters for Selected HeBIS Shotgun Experiments ... 172
Table 45. Operational Parameters for SOFM Experiments ... 172
Table 46. Classification Results for Selected HeBIS Shotgun Experiments ... 173
Table 47. Classification Results for Selected SOFM Experiments ... 173
Table 48. Feature and Result Indices for Correlation Coefficient and P-Value Matrices ... 187
Table 49. Best Classification Accuracies for the Selected HeBIS and SOFM Examples ... 195
Table 50. Dataset Definitions for Experiment 15 ... 197
Table 51. Simulation Parameters for the HeBIS "Best" Genome for Experiment 15 ... 197
Table 52. Simulation Parameters for the 2 x 1 SOFM "Best" Codebook for Experiment 15 ... 197
Table 53. Simulation Parameters for the 3 x 1 SOFM "Best" Codebook for Experiment 15 ... 197
Table 54. Missing Feature Comparison for Classification Accuracy Using HeBIS and SOFM Algorithms with MODIS Band 16 Knockout ... 201
Table 55. Missing Feature Comparison for Classification Accuracy Using HeBIS and SOFM Algorithms with MODIS Band 7 Knockout ... 202
Table 56. Experiment 6 Aggregate Breeding Data ... 216
Table 57. Experiment 7 Data ... 216
Table 58. Experiment 8 Data ... 217
Table 59. Experiment 9 Data ... 217
Table 60. Experiment 10 Data ... 217
Table 61. Experiment 11 Data for 3 x 3 and 5 x 5 Geographic Region Comparison ... 217
Table 62. Experiment 11 Scatter Data for 3 x 3 Geographic Region ... 218
Table 63. Experiment 11 Scatter Data for 5 x 5 Geographic Region ... 219
Table 64. Experiment 12 - Intracellular SOFM Data ... 219
Table 65. Experiment 12 Scatter Data for 0 x 0 Intracellular SOFM ... 220
Table 66. Experiment 12 Scatter Data for 1 x 1 Intracellular SOFM ... 220
Table 67. Experiment 12 Scatter Data for 2 x 2 Intracellular SOFM ... 221
Table 68. Experiment 12 Scatter Data for 3 x 3 Intracellular SOFM ... 221
Table 69. Experiment 12 Scatter Data for 9 x 9 Intracellular SOFM ... 222
Table 70. Experiment 13 Aggregate Classification Data ... 222
Table 71. Experiment 13 Data for 0.0 Reaction Probability ... 223
Table 72. Experiment 13 Data for 0.001 Reaction Probability ... 223
Table 73. Experiment 13 Data for 0.01 Reaction Probability ... 224
Table 74. Experiment 13 Data for 0.1 Reaction Probability ... 224
Table 75. Experiment 14 - Statistical Summary Data for HeBIS Classification Accuracy and Fitness Scatter Data ... 225
Table 76. Experiment 14 HeBIS Scatter Data for Fitness and Classification Accuracy ... 225
Table 77. Experiment 14 - Aggregate Classification Accuracy Results for SOFM ... 230
Table 78. Selected Results from Experiment 14 ... 231
Table 79. Data for Experiment 15 - Classification Accuracy for 0.1 Probability Noise Injection with Varying Noise Standard Deviations ... 232
Table 80. Data for Experiment 16 - Classification Accuracy with MODIS Band 15 Knocked Out ... 232
Table 81. Comparison of Selected Genomes for Shotgun Experiments ... 233


Terms and Acronyms

ANN: Artificial Neural Network
CDF: Cumulative Distribution Function
CEOS: Committee on Earth Observation Satellites
CV: Cross Validation
EC: Evolutionary Computation
ED: Evolutionary Development
ESA: European Space Agency
Evodevo: Evolutionary Development
GA: Genetic Algorithm
GRN: Genetic Regulatory Network
ML: Machine Learning
MODIS: Moderate Resolution Imaging Spectroradiometer
NASA: National Aeronautics and Space Administration
PDF: Probability Density Function
PSO: Particle Swarm Optimization
ROC: Receiver Operating Characteristic curve
SOFM: Self-Organizing Feature Map
SVM: Support Vector Machine
SI: Swarm Intelligence
TOA: Top of Atmosphere
TBD: To Be Determined


1. Introduction

A worldwide need currently exists for the timely extraction of knowledge for the management of natural resources. Extraction of this knowledge for local, regional, and global applications is driven by the desire to assess more precisely the issues associated with both anthropogenic and natural drivers in the environment. Data are being collected from a multitude of earth-orbiting sensor platforms operated by the European Space Agency (ESA), the National Aeronautics and Space Administration (NASA), and a host of other national agencies as well as private concerns.

This torrent of raw data (NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) sensors, for example, routinely collect a terabyte of earth imagery daily) brings with it the challenge of extracting usable knowledge from these information-rich datasets for efficient management of the earth's natural resources.

State-of-the-art on-orbit optical sensors routinely collect earth resource data that are multispectral or hyperspectral in nature. This increase in the number of available bands of information promises to provide better discrimination of desired classes only if an appropriate level of class precision is available [1].

Remote sensing classification problems are difficult to solve for many reasons. A sampling of these pitfalls includes spectral and spatial noise in the geographic areas of interest, loss of usable data due to cloud cover, noisy multi-temporal datasets, and noisy human- and machine-generated class labels. A multitude of complex regional characteristics around the globe (e.g., particulates in different coastal regions, terrain effects, open-ocean effects, and other disturbances) also affects the analytical chain associated with knowledge acquisition and analysis techniques. These problems contribute to the difficulty of solving computational classification problems in satellite-based optical remote sensing.

This research examines a method through which classification knowledge for potential use in remote sensing applications may be acquired from multi-spectral datasets that are routinely used by researchers and government policy leaders. It is based on contemporary machine learning techniques that have been combined into a novel system grounded in recent thought in biological evolutionary development. In particular, it is based on the observation that biological evolution has provided life as we know it with successful means of navigating the data-rich pitfalls and rewards associated with day-to-day survival in a harsh environment. It is hoped that the application of ideas associated with biologically complex structures beyond the typical realm of neural networks can provide novel means for processing human-produced, information-rich constructions in the future.

The idea behind this dissertation is the research and development of an algorithm that creates a self-organizing classification network. This classification network can be based on a mixture of heterogeneous machine learning constructs such as self-organizing feature maps (SOFM), artificial neural networks (ANN), and support vector machines (SVM) [2]. The biological context for self-organization and communication between the simple processing constructs, or cells, in the network is that of genetic regulatory networks and evolutionary development [3,4,5]. Each processing cell contains an artificial genome with excitatory and inhibitory switches that are controlled through the communication of artificial proteins in the simulated environmental lattice in which the cells reside. Within each cell, protein switches control the expression of particular proteins. These proteins diffuse through the lattice and in turn are used for communication between the processing elements. This communication is based on an artificial protein chemistry with concentration levels.
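As an illustration of this diffusion-based communication, the sketch below performs discrete diffusion of a single protein from a point source on an 11 x 11 x 11 lattice (the environment size used in the figures). The six-neighbour kernel and the diffusion rate are illustrative assumptions, not the exact scheme used in HeBIS.

```python
import numpy as np

def diffuse(conc, rate=0.1):
    """One discrete diffusion step on a 3-D concentration lattice.

    Each site keeps (1 - rate) of its protein; each of its six face
    neighbours contributes rate/6 of its own concentration. Boundary
    sites simply lose the share that would cross the lattice edge.
    """
    out = (1.0 - rate) * conc
    for axis in range(3):
        for shift in (1, -1):
            neighbour = np.roll(conc, shift, axis=axis)
            # zero out the wrap-around contribution at the boundary
            idx = [slice(None)] * 3
            idx[axis] = 0 if shift == 1 else -1
            neighbour[tuple(idx)] = 0.0
            out += (rate / 6.0) * neighbour
    return out

lattice = np.zeros((11, 11, 11))
lattice[5, 5, 5] = 1.0          # a single protein point source
for _ in range(40):
    lattice = diffuse(lattice)  # protein spreads through the environment
```

After enough iterations the protein reaches every lattice site, with the highest concentration remaining at the source, which is the qualitative behaviour shown in the figures.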

Self-organization and evolution of the network occur on several levels. The high-level topology of the network (both the number and types of the simple processing cells: SOFM, ANN, or SVM) can change, as can the internal organization of each processing element. At this lower, processing-element level, the changes could include parameters such as the choice of kernel used for a particular SVM element. Artificial proteins used for communication between processing elements, as well as between the "outside" environment (the input data patterns) and the classification topology, also adapt to the application domain. These communication and environmental proteins are released into the classification lattice if their corresponding genes are switched on and expressed.
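A minimal sketch of such a regulatory switch follows, under the assumption that a gene is expressed when the summed local concentration of its excitatory proteins outweighs that of its inhibitory proteins by a threshold. The protein names and the threshold value are hypothetical illustrations, not HeBIS's actual encoding.

```python
def gene_expressed(concentrations, excitatory, inhibitory, threshold=0.5):
    """Toy regulatory switch: the gene turns on when excitation
    exceeds inhibition by at least `threshold`.

    `concentrations` maps protein name -> local concentration at
    the cell's lattice site; absent proteins count as zero.
    """
    excite = sum(concentrations.get(p, 0.0) for p in excitatory)
    inhibit = sum(concentrations.get(p, 0.0) for p in inhibitory)
    return excite - inhibit >= threshold

local = {"P1": 0.9, "P2": 0.2, "P3": 0.1}
# Excitation 0.9 minus inhibition 0.3 clears the 0.5 threshold,
# so the gene would express and release its product protein.
print(gene_expressed(local, excitatory=["P1"], inhibitory=["P2", "P3"]))
```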

1.1. Problem statement

Claim: Biological inspiration is a powerful paradigm in classification, and hybridization introduces interesting and useful qualities. Together, they can provide powerful tools for solving relevant satellite image classification problems.

We will investigate in detail whether the combination of an artificial genetic regulatory network (GRN) and a basic machine learning element in a rudimentary self-organizing network is effective when applied to binary classification in a multi-dimensional space, i.e., using multispectral feature vectors acquired from an optical satellite image. This detailed case study will examine how our novel hybridization idea works and also how to set it up for a class of practical classification problems.

1.2. Delimitations of the research

• Simple GRNs will be constructed

o Simplified artificial protein representation

o Protein interaction through a simple protein “chemistry”

o Size and complexity are limited by the available computational resources

• Self-organization

o Facilitated by evolutionary computation (particle swarm optimization) and a small set of self-organizing rules inspired by biological evolutionary development and statistical analysis

o The rudimentary classification training algorithm is based on repeated presentation of samples to the classification network within the PSO framework.

• Accuracy for a real-world binary cloud/no-cloud classification problem will be addressed

• Robustness for benchmark problems will be addressed in the areas of datasets with noisy features and datasets that have missing features

• A single binary classification using remotely sensed multispectral optical data will be examined

• Comparisons are limited to simple implementations of SOFM machine learning kernels; SVM and ANN kernels are left for future work.

• It is not the purpose of this research to delve deeply into the merits of one SOFM training algorithm versus another.

1.3. Key contributions

• Novel application of a hybridized biological construct with machine learning to a practical computational classification problem

• Determination of the effectiveness of a simplified GRN applied to multi-dimensional classification

• Use of artificial proteins to communicate classification information and results to and from the cellular machine learning kernels

• Training of a GRN via particle swarm optimization (PSO)

• Application of a GRN-based classification system to a real-world multispectral remote sensing problem domain, i.e., cloud detection in optical satellite imagery

• Performance comparison of HeBIS and a SOFM-only classification algorithm on a remotely-sensed multispectral dataset with unadulterated features, noisy features, and deleted features

1.4. Organization of this thesis

The thesis is organized as follows:


Chapter 1 outlines the objectives and limitations of this research and lists the key contributions of this dissertation.

Chapter 2 reviews the current state of the art for classification systems based on machine learning techniques that are embedded in self-organizing structures, as is the case with HeBIS. As such, this chapter provides direct linkage to the origins of the core ideas used in the HeBIS architecture.

Specifically, the areas examined are Artificial Neural Networks (ANN) and Self-Organizing Feature Maps (SOFM). ANNs and associated research into their self-organization are presented as important background in addition to being an introduction to the SOFM theory. Promising new developments are discussed as well as the problems associated with each of these machine learning paradigms.

Attempts to alleviate these problems through the use of a "bare bones" Genetic Regulatory Network (GRN) based on an artificial protein chemistry form the basis for the remainder of the dissertation on the HeBIS self-organizing system. Information on this GRN is presented from a computational development viewpoint, and an overview of the current state of the art in this research domain is also included. This information is presented in the context of the cell-to-cell and intra-cell communications that are enabled by a computational environment based on artificial proteins. A literature overview of GRNs applied to classification problems is also presented.

In Chapter 3, a detailed architecture of the Heterogeneous Biologically Inspired System (HeBIS) is presented. This treatment includes information on the protein environment and lattice structure of the system; the artificial proteins, their different types and encodings, and the associated simulated protein chemistry; and the basic processing cells. Also examined are the inherent and learned behaviors that each cell can acquire through the protein reactions within the simulated environment.

Finally, the classification training algorithm is examined. The underlying particle swarm optimizer that is used for system optimization is also discussed and sample training and classification processing data flows are given to solidify the presentation of this material.

In Chapter 4, the simulation results are presented. Through these simulations, HeBIS classifications are examined and compared to classifications based on SOFMs. Simulation results are also presented which characterize salient properties of various instantiations of the HeBIS architecture.

Chapter 5 discusses and summarizes the comparison results in detail.


Finally, in Chapter 6, the dissertation concludes with an outline of potential avenues of future work in this research area.


2. Literature review

• Hybridization Background
• Algorithm Discussion
• GRN Analyses
• GRN Training with PSO
• GRN Action Analyses
• Remote Sensing Background
• Remote Sensing Application: Analyses and Comparisons
• Robustness Analyses

The list of research areas that this dissertation touches upon is quite extensive. This stems from the fact that the idea that biology has something to teach engineers and computer scientists has become more accepted by researchers over the last several years. During the last decade, basic biological research has become cheaper to perform and its volume has increased. This has been coupled with an exponential increase in computing power and information processing. The intersection of these disciplines has seen a fertile exchange of ideas between biological researchers and "information processors". This dissertation is itself a result of this exchange, and it focuses on the attempt to apply biological principles to knowledge extraction, specifically automated multi-class classification. To do this, an overview of work that is directly applicable to this research is required.

This review is divided into four sections that concentrate on machine learning and evolutionary development within the context of self-organization and classification. Section 2.1 introduces machine learning as it is applied in the specific domains of general classification, artificial neural networks and self-organizing feature maps. Section 2.2 covers evolutionary computation with emphasis on particle swarm optimization. Section 2.3 presents biological and artificial evolutionary development and provides an overview of the research and applications associated with pattern classification and creation. Finally, Section 2.4 summarizes the advantages and limitations of the outlined techniques and proposes that there may be improvements in pattern classification if ideas from these different research areas are blended together through HeBIS, the Heterogeneous Biologically Inspired System.

HeBIS research is based on simple SOFM pattern-recognition kernels (cells) with a GRN-based communications infrastructure wrapped around them. The emphasis in this research is to determine whether a GRN can be successfully used to create a classification network which can be used as the basis for further research, not to examine the relative advantages or disadvantages of different subclasses of this or other simple processing kernels.


This review only provides a concise overview of these topics. Appropriate references are included for further detailed examination.

2.1. Machine learning and self-organization

2.1.1. Classification overview

Machine learning entails methods by which computers may be programmed to learn. The accuracy and precision of the resulting classification system are its primary attributes.

Learning tasks can be divided into analytical and empirical techniques. With analytical learning, no external experiences (data and environmental descriptions) are required, whereas empirical learning explicitly requires the use of external data and experience [6]. This research is primarily concerned with empirical learning in both supervised and unsupervised learning environments for classification applications.

Classification is the process through which an object is mapped to a specific class within a set of classes that has been defined for the problem. In this research, an object and its associated definition or class is referred to as a labeled example or exemplar. The set of labeled exemplars constitute the training set of data in which the object is a feature vector of many descriptive numerical features that is mapped to a specific class label. These training data are applied to a given learning algorithm and the result is a specific instantiation of a classifier. In turn, this classifier is evaluated for its precision and accuracy by applying it to a test data set that is composed of a separate set of labeled examples that have been taken from the same underlying statistical distribution as the training set.
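The train-then-test workflow described above can be sketched as follows. The one-feature synthetic data and the simple threshold "classifier" are illustrative stand-ins for a real learning algorithm and dataset; training and test exemplars are drawn from the same underlying distribution, as the text requires.

```python
import random

def accuracy(classifier, examples):
    """Fraction of (feature_vector, label) exemplars classified correctly."""
    return sum(classifier(x) == y for x, y in examples) / len(examples)

# Illustrative labeled exemplars from one underlying distribution:
# a single Gaussian feature, labeled 1 when it is positive.
random.seed(0)
examples = [([g], int(g > 0)) for g in (random.gauss(0.0, 1.0) for _ in range(200))]
train, test = examples[:150], examples[150:]   # disjoint training and test sets

# A deliberately simple "learned" classifier: threshold at the training mean.
mean = sum(x[0] for x, _ in train) / len(train)
classifier = lambda x: int(x[0] > mean)

print(accuracy(classifier, test))   # evaluated only on held-out exemplars
```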

Classifiers should generalize well to datasets that they have not been directly trained on. In other words, a good classifier is one that, once it has been trained on a small training set, may be used to effectively classify larger sets of data. The classification rate is the percentage of test examples that are correctly classified; the misclassification rate is its converse, the percentage of test exemplars that the classifier misclassifies [7]. These statistics are further refined in cases where successive classification decisions are not independent and where the classification decisions are not equally important [6]. The latter leads to the Receiver Operating Characteristic (ROC) curve, which is used to gauge classification performance for ranking classifiers [8]. With an ROC, the performance of a classifier is examined by changing the threshold that is used to decide between two classes. As this threshold changes, one can construct the ROC from the false-positive classification rates and the corresponding true-positive classification rates.

Empirical learning systems, whether supervised or unsupervised, trade performance among three factors: the complexity of the classifier, the amount of training data, and the generalization ability of the system when it is applied to new, unseen exemplars. As more training data become available to the classifier, more detailed information becomes available about the problem's statistical manifold. As the classifier's complexity increases, however, the system's generalization accuracy first increases, peaks, and then decreases after a certain point is reached. These points are noted in Figure 1.

Figure 1. Performance of empirical learning systems.


Low-variance and high-variance classifiers are defined, respectively, as systems that exhibit a small or a large degree of change in classification performance as different (and noisy) exemplars are presented and tested [7].

A high-bias system is defined as one which exhibits high classification precision on the problem but low recall, and a low-bias system is one which exhibits low precision with high recall, where

    recall = t_p / (t_p + f_n),    (1)

    precision = t_p / (t_p + f_p),    (2)

with t_p, f_p, and f_n the counts of true-positive, false-positive, and false-negative classifications, respectively.
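Equations (1) and (2) can be evaluated directly from the classification counts; the counts below are arbitrary examples chosen for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision (eq. 2) and recall (eq. 1) from true-positive,
    false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

p, r = precision_recall(tp=8, fp=2, fn=4)
print(p, r)  # 0.8 and 8/12 ≈ 0.667
```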

A low-bias system can represent almost any classifier, whereas a high-bias system is not complex enough to represent the optimal classifier.

Many mechanisms exist through which classifier complexity is matched to the complexity of the training data [9,10,11].

Unsupervised learning requires no exemplar-class training pairs because the interrelationships between the examples’ features are automatically categorized and clustered by these types of algorithms according to a set of rules that is defined before the feature vectors are presented to the learning system [12, 13].

Typically, machine learning algorithms make weak or no assumptions about the training data. Therefore, machine learning techniques generally require a large amount of training data so that the problem's statistical manifold can be adequately sampled. However, if domain knowledge is applied, the size of the required data set (for a given level of classifier precision) is typically much smaller than in the case in which no knowledge is used. This introduction of bias into the system is risky, however, since the a priori knowledge must be correct or the added bias may preclude the discovery of an accurate classifier.

2.1.2. Artificial neural networks

Historically, research into artificial neural networks (ANNs) has been motivated by the differences between mammalian brains and human-engineered digital computers. Researchers have typically focused their efforts in two ways: first, attempting to better understand how the brain works by simulating its topology through models of varying complexity; and second, attempting to mimic the brain's operation in a quest to improve engineered information processing systems.

It is through this second camp of researchers that the modern variants of artificial neural networks have been utilized for complex and nonlinear applications such as pattern recognition. The processing and self-organizing abilities of even the smallest mammalian brain outstrip current supercomputers given almost any applicable performance metric.

A small sampling of the early work in the field includes [14,15,16,17,18,19].

A generic artificial neural network is composed of a collection of simple processing elements that are interconnected. This type of architecture is one that is extended in this current work with the HeBIS network’s processing cells that are interconnected through a GRN. Each of the simple processing elements in an artificial neural network is called an artificial neuron and is based on a simple mathematical model of a biological neuron.

Figure 2 shows an artificial neuron that receives a set of numerical inputs, applies a multiplicative weighting function to each input, and then sums these results over all of the weighted inputs to the neuron.


Figure 2. An artificial neuron.

This individual weight is called a synaptic weight and it mimics the excitatory and inhibitory responses at the input of a biological neuron. In the artificial case, a negative-valued weight acts to inhibit that input whereas a positive-valued weight excites that input in the artificial neuron. The summed result of these weights is then nonlinearly mapped through a normalizing activation function to the neuron’s output. This activation function is typically chosen to be a scaled sigmoid function.
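This neuron model (weighted sum followed by a sigmoid activation) can be sketched in a few lines; the weight and input values below are arbitrary illustrations.

```python
import math

def neuron(inputs, weights):
    """Artificial neuron: weighted sum of inputs mapped through a
    sigmoid activation. A negative synaptic weight inhibits its
    input; a positive weight excites it."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

# Weighted sum is 2.0*1.0 + (-1.0)*0.5 = 1.5, then squashed to (0, 1).
out = neuron([1.0, 0.5], weights=[2.0, -1.0])
```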

At this point, the output signal (a number) is either passed on as an input to another neuron or is the output of a layered feedforward network as in Figure 3.

Figure 3. Artificial Feedforward Neural Network.

A feedforward neural network is typically composed of three layers: an input layer, a hidden layer, and an output layer. In its most general form, the neurons in these layers are interconnected within the layers and also between the layers via synapses that are excitatory or inhibitory. These responses are controlled by the previously mentioned synaptic weights. A properly-sized feedforward network composed of three layers can theoretically approximate any arbitrary function [20]. Similarly, the idea is that HeBIS forms a layered network of interconnected processing cells, albeit one that may not be as apparent as that of an ANN.
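A forward pass through such a layered network can be sketched as below. The weight values are arbitrary illustrations and no training is performed; the point is only how each layer's outputs feed the next layer.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weight_rows):
    """One fully connected layer: each row of weights drives one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weight_rows]

def feedforward(x, hidden_weights, output_weights):
    """Input -> hidden -> output pass through a three-layer network."""
    return layer(layer(x, hidden_weights), output_weights)

hidden_weights = [[0.5, -0.2], [0.3, 0.8]]  # two hidden neurons, two inputs
output_weights = [[1.0, -1.0]]              # one output neuron
y = feedforward([1.0, 0.0], hidden_weights, output_weights)
```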

ANN architectures are mature and can be differentiated according to the following characteristics:

• Neuron Models,

• Synaptic interconnection (network) models, and

• Training paradigms.

Common artificial neuron models use simplified versions of the actual operating properties inherent in the biological neuron. For example, biological neurons appear to actually process their inputs and outputs according to a pulsing signal model. This neuron model has been mostly ignored by the computational research community but interest has increased recently [21]. Besides the common sigmoidal activation functions, other nonlinear functions have also been examined [20].

Synaptic interconnection models define how the neurons in the different layers of a network are connected to each other. Types of architectures include feedforward networks in which the inputs to the network are processed by the input layer and then “fed forward” to the hidden layer and then the output layer. Recurrent neural networks are oscillatory and function by feeding information backwards through the network or directly back to the originating neuron [20,22]. Another model is the Self-Organizing Feature Map which is described in detail in a separate section [23].

Many paradigms for the training of artificial neural networks' synaptic weights have been researched. Backpropagation is a workhorse training technique for the neural network community [24]. Other neural network training techniques include [25, 26, 27, 28, 29, 30].

Researchers have also applied classical mathematical tools to the training issue. Sequential Monte Carlo methods are used to train the ANN as each new training example is presented to the network [31]. Iterative training such as this is useful in instances when the training datasets are large, consist of thousands (or more) of features, and when the dataset's statistics are time-varying and/or non-Gaussian. [31] uses Monte Carlo sampling to characterize the training set's probability distribution for such an iterative and time-varying process. This technique, HySIR, was shown in 2000 to have
