ISSRNS 2016: Abstracts / Extended abstracts / Synchrotron Radiation in Natural Science Vol. 15, No. 1-2 (2016)

P-12

Classification and clustering multivariate statistical methods for hyperspectral datasets in R Environment

K. Banas1*, A. Banas1, E. Jasek-Gajda2, M. Gajda2, W. M. Kwiatek3, B. Pawlicki4 and M. Breese1

1Singapore Synchrotron Light Source, National University of Singapore, 5 Research Link, Singapore 117603, Singapore

2Department of Histology, Jagiellonian University Medical College, Kopernika 7, 31-034 Krakow, Poland

3Institute of Nuclear Physics PAN, Radzikowskiego 152, 31-342 Krakow, Poland

4Gabriel Narutowicz Hospital, Pradnicka 37, 31-202 Krakow, Poland

Keywords: synchrotron radiation, x-ray fluorescence, multivariate statistical analysis, classification techniques

*e-mail: slskb@nus.edu.sg

Experiments performed at synchrotron light sources very often produce large datasets. This is especially true of 2D spectroscopy. Hyperspectral datasets (spectra accompanied by additional information, for example the position at which each spectrum was recorded) need special treatment. They are highly correlated in two ways: each spectrum is a superposition of a number of peaks and a baseline function, and spectra from adjacent regions are usually very similar owing to the local homogeneity of the sample. Additionally, these datasets very often represent the so-called wide-data case, in which the number of variables is larger than the number of observations.
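A small illustration of the wide-data case, as a sketch on synthetic numbers rather than measured spectra: when there are fewer observations than variables, the centred data matrix has rank at most n - 1, so the sample covariance matrix is singular and cannot be inverted as classical LDA would require. The sizes `n_obs` and `n_vars` are arbitrary assumptions chosen for the example.

```r
# Synthetic illustration only: random numbers stand in for spectra.
set.seed(42)
n_obs  <- 10     # observations (measured spectra), assumed value
n_vars <- 50     # variables (energy channels), assumed value
X <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)

# Singular values of the column-centred data matrix
d <- svd(scale(X, center = TRUE, scale = FALSE))$d

# Count singular values that are not numerically zero
effective_rank <- sum(d > 1e-8 * d[1])
effective_rank   # at most n_obs - 1, far below n_vars
```

This rank deficiency is one reason a dimension-reduction step is commonly placed before a discriminant model on full spectra.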

While a number of software solutions exist for evaluating experimental results in image format (for example from imaging and tomography experiments), a standardised approach to the evaluation of hyperspectral data is still missing.

This contribution discusses and compares two methods for evaluating X-ray fluorescence (XRF) spectral datasets.

Traditionally, each spectrum is deconvoluted by fitting a model in order to obtain elemental concentration values. These concentrations are then used as the variables in classification models built with linear discriminant analysis (LDA) or partial least-squares discriminant analysis (PLSDA).

The proposed alternative approach uses the complete spectral datasets directly. Multivariate statistical techniques first reduce the dimension of the data; the resulting new variables (principal or latent components) are then used to construct the classification functions.
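The full-spectrum route can be sketched in R as follows. This is a minimal example under stated assumptions, not the authors' actual pipeline: the spectra are synthetic (two Gaussian peaks plus a linear baseline and noise), the helper functions `peak` and `make_spectrum`, the class labels and the choice of three components are hypothetical, and only PCA (via `prcomp`) followed by LDA (via `MASS::lda`) is shown; a PLS-based variant is not covered.

```r
# Minimal sketch: PCA scores from complete spectra feed an LDA classifier.
# All data are synthetic; nothing here comes from the measured XRF maps.
library(MASS)  # provides lda()

set.seed(1)
n_per_class <- 20
n_channels  <- 200                 # more variables than observations
channel     <- seq_len(n_channels)

# Hypothetical helper: Gaussian peak on the channel axis
peak <- function(center, width, height) {
  height * exp(-(channel - center)^2 / (2 * width^2))
}

# Hypothetical helper: two peaks, a linear baseline and noise;
# the classes differ only in the height h of the second peak
make_spectrum <- function(h) {
  peak(60, 5, 10) + peak(120, 6, h) + 0.02 * channel +
    rnorm(n_channels, sd = 0.3)
}

spectra <- rbind(
  t(replicate(n_per_class, make_spectrum(4))),
  t(replicate(n_per_class, make_spectrum(6)))
)
class_label <- factor(rep(c("A", "B"), each = n_per_class))

# Dimension reduction: keep the first few principal components
pca    <- prcomp(spectra, center = TRUE, scale. = FALSE)
n_comp <- 3                        # assumed number of components
scores <- pca$x[, seq_len(n_comp)]

# Classification function built on the component scores
fit  <- lda(scores, grouping = class_label)
pred <- predict(fit, scores)$class

# Resubstitution confusion table (optimistic; not a validation result)
table(observed = class_label, predicted = pred)
```

The table above is computed on the training data and therefore overstates performance; it only shows that the component scores can serve as inputs to the discriminant function.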

The performance of the models constructed with both approaches, using either LDA or PLSDA, will be compared.

The models are cross-validated with the leave-one-out method (LOOCV). An alternative approach, unsupervised classification based on hierarchical cluster analysis, allows for additional independent validation.
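Both validation steps can be sketched in R on synthetic data. The example below is an illustration under assumptions, not the authors' code: the spectra, the helper `make_spectrum`, the number of components, the Euclidean distance and the Ward linkage (`"ward.D2"`) are all hypothetical choices. The PCA is refitted inside every leave-one-out fold so that the held-out spectrum does not influence the components used to classify it.

```r
# Minimal sketch of LOOCV for a PCA + LDA model, followed by an
# unsupervised check with hierarchical clustering. Synthetic data only.
library(MASS)  # provides lda()

set.seed(2)
n_per_class <- 15
n_channels  <- 150
channel     <- seq_len(n_channels)

# Hypothetical helper: two Gaussian peaks plus noise;
# the classes differ in the height h of the second peak
make_spectrum <- function(h) {
  10 * exp(-(channel - 50)^2 / 50) +
    h * exp(-(channel - 100)^2 / 72) +
    rnorm(n_channels, sd = 0.3)
}

spectra <- rbind(
  t(replicate(n_per_class, make_spectrum(4))),
  t(replicate(n_per_class, make_spectrum(6)))
)
class_label <- factor(rep(c("A", "B"), each = n_per_class))
n      <- nrow(spectra)
n_comp <- 3                        # assumed number of components

# Leave-one-out: refit PCA and LDA without spectrum i, then predict it
loo_pred <- character(n)
for (i in seq_len(n)) {
  pca_i <- prcomp(spectra[-i, ], center = TRUE)
  train <- pca_i$x[, seq_len(n_comp)]
  held  <- predict(pca_i, newdata = spectra[i, , drop = FALSE])
  held  <- held[, seq_len(n_comp), drop = FALSE]
  fit_i <- lda(train, grouping = class_label[-i])
  loo_pred[i] <- as.character(predict(fit_i, held)$class)
}
loo_accuracy <- mean(loo_pred == class_label)

# Unsupervised check: cluster the spectra without using the labels,
# then compare the two-cluster solution with the known classes
hc       <- hclust(dist(spectra), method = "ward.D2")
clusters <- cutree(hc, k = 2)
table(class = class_label, cluster = clusters)
```

Agreement between the cluster assignment and the known classes gives a check that does not depend on the supervised model, which is the sense in which the clustering acts as an independent validation.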

XRF spectral data were recorded at beamline L of the HASYLAB synchrotron source. The samples were 15 µm thick sections of biological material stretched on Mylar foil. A polycapillary was used to focus the X-rays into a small spot, allowing a spatially resolved study of the heterogeneous material.

The complete data preprocessing, evaluation and visualization (except for the deconvolution of the XRF spectra by model fitting) were performed in the R environment for statistical analysis [1] with the RStudio graphical user interface [2].

Figure 1. Diagram showing two possible approaches for hyperspectral data analysis.

Acknowledgments: This work was partially supported by NUS Core Support C-380-003-003-00 and by the European Community under Contract RII3-CT-2004-506008 (IA-SFS) (HASYLAB project No. II-20052050 EC).

___________________________________________________

[1] R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2016.

[2] RStudio: Integrated Development Environment for R, 2016; http://www.rstudio.com
