
Faculty of Electronics, Telecommunications and Informatics

Department of: Computer Architecture

Field of study: Informatics

Specialisation: Distributed applications and internet services

Mode of study: II level, stationary

Student's name and surname: Szymon Bultrowicz

ID: 119289

MASTER THESIS DIPLOMA

Subject in English: A subsystem for load visualisation of distributed clusters with GPU

Subject in Polish: Podsystem wizualizacji wykorzystania rozproszonej sieci klastrów z procesorami GPU

Supervisor: Paweł Czarnul, PhD MEng.

Head of Department: Prof. Henryk Krawczyk, PhD MEng.

Gdańsk, 03.12.2013


Abstract

Thesis subject: A subsystem for load visualisation of distributed clusters with GPU

The goal of this work was to extend the existing KernelHive system with the ability to monitor it and track work progress. KernelHive is an academic system that automatically distributes OpenCL applications written by the user. The developer's task is to create a control flow in the GUI by dragging individual nodes and building a directed workflow graph. Each node in the graph contains the code of one or more OpenCL kernels. Once an application is started on the server, it is split into individual jobs, which are then distributed across successive layers:

1. clusters,

2. nodes,

3. graphics cards,

4. threads on a graphics card.

The distribution of jobs to the individual elements and their management are fully automated by the KernelHive system.

KernelHive largely facilitates and supports the development of distributed applications, but like any distributed system it is prone to errors. For this reason, enabling monitoring of the environment and of the system itself is essential for ensuring stability.

After examining the existing solutions, they were divided into three groups:

1. solutions for monitoring hosts and clusters,

2. solutions tracking applications running on graphics cards,

3. tools for visualising application work previews.

The first group comprises systems such as Wolfpack and Nagios. They are very extensive, which is both their advantage and their drawback. They allow monitoring of many resources, although they lack built-in agents for collecting data about the graphics card. Additionally, integration with KernelHive is an important factor, and in their case it does not look easy, particularly considering that Wolfpack is written in .NET.

The second group consists of dedicated solutions for monitoring applications running on graphics cards. The most popular among them are Nvidia Visual Profiler and AMD CodeXL. They are also very extensive, but their fundamental drawback is being tied to the technologies of particular vendors. Nvidia's product supports only applications written in CUDA, which automatically eliminates it from the list of possible solutions. AMD CodeXL allows the use of OpenCL, but all low-level commands work only on AMD cards. There are also vendor-independent tools, for example the TAU Performance System or VampirTrace. Both solutions support many programming languages and many methods of injecting their agents into the code. VampirTrace allows using a manual API, code injection during compilation, or automatic code injection at runtime. The part of VampirTrace that collects data and saves it to an open OTF file is free. Its second part, called Vampir, is used to visualise the collected data and is published under a commercial licence. Leaving aside the advantages and disadvantages of the individual solutions, none of them supports monitoring a distributed system. Tools of this kind either attach to a locally running application or perform post-mortem analysis. The last group of existing solutions are visualisers of distributed application work. PARADE and PGL can serve as examples here. Both products have extensive 2D rendering mechanisms, and PGL additionally allows previewing results in three dimensions. PARADE, however, is a very niche product and it is hard to find readable documentation for it; consequently, its integration with KernelHive could cause many problems. The drawback of PGL, in turn, is the heavy network load caused by transferring ready-made graphics instead of using an API that renders the image only on the client side.

The solution implemented as part of this work can be divided, similarly to the existing solutions, into three modules:

1. a module for monitoring the system and individual nodes,

2. a module for tracking application execution progress,

3. a module for previewing application work.

The monitoring module was implemented as an extension of the agents located on the individual nodes that make up the KernelHive system. The agents periodically sample the host load state and then send the data, via the cluster agent, to the server. On the server, the data is stored in RRD databases, which are designed for storing cyclic data. On the GUI side, the data is represented as a topology graph, from which one can navigate to the Resource Monitor view of individual hosts. It consists of charts of the observed load metrics, which include e.g. CPU load, memory occupancy, fan speed, etc. A problem encountered here was integrating with the existing serialisation and message-passing mechanism so that it would be as efficient as possible. Messages from the agents are sent sequentially and relatively often. In the future, with a greater number of metrics, the data size may also grow. For this reason, the messages were split into two types: one-off and sequential messages. One-off messages contain immutable data about the node, e.g. memory size and the number of processor cores, and are serialised with the standard mechanism, i.e. written as a string of characters. For sequential data, the serialisation mechanism was extended with the ability to write binary data, which significantly reduces the amount of transferred data.

The second module, intended for measuring application work progress, exploits the fact that the server splits individual nodes of the application workflow into many smaller jobs that are later distributed to individual hosts in the system. The module observes which jobs have already been completed, which are being executed, and which are still waiting in the queue. With this information, a grid of tiles similar to those known from download managers is drawn in the GUI.

The last module of the implementation allows previewing the current state of application work. This was realised as an API available on the OpenCL kernel side. Such an application receives an array as input, which it fills during execution with data useful for drawing. The data is then sent to the server and stored. On the GUI side, the developer can create their own visualiser, which takes as a parameter the data sent by already completed jobs and uses it, together with Java Graphics, to draw shapes.

Functional tests of the module implemented as part of this work showed that it meets its requirements. A simple application computing the integral of a given function using the rectangle method was chosen as the test. It was created using one node of the expandable type, which contains three kernels: Data Partitioner, Data Processor and Data Merger. In this case, the Data Processor was split by the server into 8 jobs, which made it possible to show both the progress of the workflow execution and a visualised preview of the current work result. Performance tests also showed that after adding the monitoring module, the KernelHive agent loads the system only slightly, to a negligible degree. The only possible concern here is virtual memory usage, which reached up to 20 GB. Virtual memory is not an adequate metric, however, since it also includes all shared libraries loaded from disk, so this issue was considered insignificant. Private memory usage is not large and amounts to less than 5 MB.

The aim of this work was to create a module for the KernelHive system that would enable monitoring of the environment it runs on, previewing application execution progress, and previewing the data computed by the application while it is running. These goals were met, but there are many paths for further development. One proposal could be adding logic that detects potential problems with the environment or the application. Another way to extend the functionality could be adding further metrics collected by the agents, e.g. power consumption or network usage. There are also many possibilities for improving the GUI itself to make it more functional.

Contents

1 Introduction

2 Existing solutions
2.1 Cluster monitoring systems
2.1.1 Wolfpack
2.1.2 Nagios
2.1.3 Other solutions
2.2 GPU monitoring tools
2.2.1 NVIDIA Visual Profiler
2.2.2 AMD CodeXL
2.2.3 AMD gDEBugger and AMD APP Profiler
2.2.4 TAU Performance System
2.2.5 VampirTrace
2.3 Current application output preview systems
2.3.1 PARADE
2.3.2 PGL
2.4 Conclusions

3 Monitoring module specification
3.1 KernelHive overview
3.1.1 KernelHive architecture
3.2 Description of the existing system
3.3 System goals
3.4 System stakeholders
3.5 System context
3.6 Functional scope
3.7 Quality requirements
3.8 Limitations
3.9 Use case diagram

4 Implementation
4.1 Monitoring module architecture and implementation
4.1.1 Monitoring module architecture
4.1.2 Message flow
4.1.3 Data storage
4.1.4 Work preview component implementation
4.2 Implementation problems
4.2.1 Integration problems
4.2.2 Conceptual problems
4.2.3 Technical problems

5 Tests
5.1 Functional tests
5.1.1 Load measurement tests
5.1.2 Progress measurement tests
5.1.3 Work preview tests
5.1.4 Topology visualisation tests
5.2 Performance tests

6 Summary

Chapter 1 Introduction

I like control

- Michael Jordan

The predecessor of all distributed systems was the Advanced Research Projects Agency Network (ARPANET), invented in the late 1960s. It was then that, for the first time in computer science, multiple computing units were allowed to connect with each other. However, the first globally available network, called the Internet, was introduced with the standardisation of the Transmission Control Protocol/Internet Protocol (TCP/IP) in 1982. After that, terminal devices from around the world began to join that one massive distributed system.

Nowadays, the Internet connects about 2.5 billion users [1], and the number is still growing. Distributed systems have become part of our lives. However, not only the size of the Internet is growing but also the performance of particular devices. Engineers are constantly accelerating processors, and memory keeps getting cheaper. Processors have advanced from 12 MHz units in the 1980s [2] to multicore 3.6 GHz processing units with HyperThreading, available to ordinary people. In the case of memory, the situation is similar. The price of memory decreased from $400 for a 64 KB stick to $50-$60 per 8 GB stick [3]. However, at some point the growth slowed because of technological barriers, thereby breaking Moore's law. Starting from that moment, people began to focus more on parallelisation and on using distributed environments over the very fast network infrastructures that are available nowadays.

Distributed systems are vulnerable to errors for several reasons, e.g. network issues, diversity of components, constrained administration, and so on. Because of that, taking control of such a system is a very important activity.

Control is important in every aspect of life. Control is power, which comes from knowledge. This power gives people the opportunity to improve their lives. The same applies to computer science. The required knowledge is given to users by proper monitoring tools that can tell them in which state their system currently is. That power allows improving the entire system through debugging and finding bottlenecks. It is clear that monitoring is a very important, or even crucial, aspect.

One of the distributed solutions meant to take the best from many parallelisation techniques is KernelHive. It uses several levels of parallelism, distributing a single application across clusters, nodes, graphics cards and, finally, multiple threads. Such a combination of multiple layers makes the system very complex and hard to maintain. At this level it is essential to create an opportunity to monitor the whole system, which would allow developers to control it and find ways to optimise and develop their own applications. Moreover, in KernelHive, beyond standard infrastructure monitoring mechanisms, even more can be done. KernelHive, as a server for distributed OpenCL applications, can track not only the system state but also the current progress of the application itself. Such a solution could help the user predict the final output of a running application without having to wait until the whole workflow ends. It could allow the user to observe a continuous visualisation of the application's current state.

It is easy to see that monitoring such a complex, multilayer architecture is not a trivial task, but it is very important from the point of view of developers using KernelHive, which seems to have great potential.


Chapter 2

Existing solutions

In the majority of cases it is better to reuse already implemented and tested solutions. It reduces code redundancy, reduces the number of bugs in the code, and decreases overall development time. However, because KernelHive is a very young academic project, no such dedicated solution exists yet. On the other hand, there are many solutions with behaviour similar to the desired one, which is a module for KernelHive able to monitor both the entire distributed infrastructure and the progress of an OpenCL application running on a Graphics Processing Unit (GPU). The requirements of the monitoring module are described in more detail in Chapter 3.

Generally, the existing solutions can be divided into three groups, presented in more detail in the next sections:

Cluster monitoring systems which are able to provide diverse metrics gathered from an entire cluster of heterogeneous nodes.

Monitoring tools that provide the user with debugging and profiling mechanisms.

Application output preview systems able to visualise the work of distributed applications.


2.1 Cluster monitoring systems

2.1.1 Wolfpack

Wolfpack's [4] authors call it a "Swiss Army knife" of monitoring. It is written in .NET as an extendable Windows service. Its power comes from the ease of writing custom plug-ins, which, besides the many existing plug-ins, makes it very flexible. Moreover, Wolfpack is an open-source project, so it is even possible to adjust the engine if needed. The goal of Wolfpack is to monitor different kinds of metrics, spread through the entire infrastructure, using so-called "touch points". It allows monitoring of e.g.:

1. CPU load

2. RAM usage

3. MSMQ queues

4. firewall and IIS server logs

5. WWW server availability

Wolfpack also serves a GUI for mobile devices.

Architecture

Information from touch points is gathered by agents installed on remote machines; however, an agent can be installed on the local machine as well. After gathering, the data is sent to the server, implemented as a system service. The server listens for events with two entry points, depending on user preferences: a Windows Communication Foundation (WCF) service or NServiceBus.


Figure 2.1: Wolfpack example interface [4]

Figure 2.2: Wolfpack architecture [4]

After processing the data, the server publishes it in many ways. It can be saved in a database, sent to a cell phone as a notification, or relayed to another queue or service.

The way the system can be extended is also interesting. Wolfpack offers a set of so-called "extensibility points", which are slots where the user can attach custom plug-ins, e.g.:

Startup Plugin, Scheduler Plugin, HealthCheck Plugin Allow configuring an agent, its schedule and behaviour.

Publisher Plugin, Publisher Filters Define which data should be published and in which way.

Activity Plugin, RoleProfile Plugin Allow implementing custom server-side logic.

Growl Notification Finalisers Allow defining custom Growl notifications.

Restrictions

Wolfpack seems to be a good and flexible solution. However, three restrictions can be seen at a glance, which force excluding Wolfpack from the list of possible solutions to use with KernelHive.

First, there is no module for monitoring GPU applications. Writing a custom GPU agent is possible, but additional work on integration with KernelHive would be required, which makes the benefit of using the existing solution too small compared to writing a custom solution from scratch.

The second restriction is the complexity of Wolfpack. It makes the infrastructure quite large and heavy to maintain, which also implicitly impacts resource usage. The communication overhead also seems to be much bigger than in the case of a custom solution.

The last obstacle is the technology Wolfpack is written in, which is the .NET Framework. Although there are solutions that make running .NET applications possible on Linux-based systems, e.g. Mono, they are not perfect. Mono supports .NET up to a specific version, which usually is one from 2-3 years ago, and even then it does not support all the features. It means that it can be impossible to run Wolfpack on Linux, or at least it can make Wolfpack vulnerable to bugs and odd behaviour.

2.1.2 Nagios

Nagios [5] is another open-source project for infrastructure monitoring. Besides applications, it allows controlling services, whole operating systems and other components. An advantage of Nagios is the technology used: the core part of Nagios is written in C and the front-end in PHP, which allows running it on Linux without any problems.

Nagios is one of the most popular monitoring systems and has a huge community, which is also a great advantage. It has a well-developed Application Programming Interface (API), which, in addition to the big community, makes writing custom plug-ins very easy. Moreover, the Nagios source code is published under the GPL, so it is possible to adjust the code if necessary.

Figure 2.3: CPU usage monitoring using Nagios [6]

It would be worth considering the use of the Nagios network topology visualisation component in KernelHive to visualise the system infrastructure.

Figure 2.4: Network infrastructure visualisation using Nagios [6]

Restrictions

Nagios is a very complex system, which again implies Central Processing Unit (CPU), memory and communication overhead. In KernelHive it would be used only for host and GPU monitoring, which, considering the complexity of the entire system, would be a waste of resources. Additionally, a GPU plugin would have to be written by the user and integrated with KernelHive anyway.

Another obstacle is the complexity of installation, configuration and maintenance of Nagios, even by Linux standards. Proper configuration requires quite good administrator knowledge. Nagios has commercial technical support, but such a solution is not desired in this project.

2.1.3 Other solutions

There are several other solutions which are not described here because they are very similar. The described ones are the most common free solutions for monitoring distributed systems. The other projects found did not offer any additional functionality worth mentioning from the perspective of this work.

Examples of the other solutions:

• Ganglia Monitoring System

• Compuware APM

• New Relic

2.2 GPU monitoring tools

2.2.1 NVIDIA Visual Profiler

Nvidia Visual Profiler [7, 8], as the name suggests, is software for profiling GPU applications. It is contained in the standard CUDA SDK bundle and allows developers to profile applications written in CUDA. It consists of a whole set of tools for debugging applications and monitoring their runtime. Examples of such tools are:

Timeline A unified timeline that contains e.g. memory transfers or call stacks of CPU-side methods as well as GPU-side ones.

Lower-level data drill-down The possibility of drilling the call stack down to data at the base level, provided by the video card hardware.

The Visual Profiler supports automatic searching for bottlenecks in applications, gathering data from remote systems, and analysing it.

An interesting addition is the Nsight plug-in for Visual Studio and Eclipse, which transfers many NVP features directly into the Integrated Development Environment (IDE). It makes writing and debugging applications significantly easier, because the tooling is integrated and the developer does not need to use an external tool. The Visual Profiler itself has distributions for both Windows and Linux.

Figure 2.5: Nvidia Visual Profiler profiling results [8]


Restrictions

It seems that the Visual Profiler has most of the functionality that would be useful in KernelHive. However, its basic restriction is the technology it targets. It was created by Nvidia and is meant for the CUDA language, whereas KernelHive kernels are written in OpenCL. Additionally, it allows the execution units to download data from remote devices, but it was not created to monitor applications executed in parallel on many nodes.

The Visual Profiler is a great tool, but unfortunately it will not be useful in this project.

2.2.2 AMD CodeXL

AMD CodeXL [9, 10] is a complex set of tools for profiling and debugging GPU-dedicated applications. Considering functionality, it is very similar to the Nvidia Visual Profiler, with the difference that there is no restriction on the language the applications have to be written in, because it also supports OpenCL.

CodeXL consists of functions already known from the Visual Profiler, e.g. the timeline, call stack drill-down or static code analysis. It also introduces several new features that are worth mentioning:

Detailed GPU load statistics AMD CodeXL allows monitoring in detail which of the components is busy, how loaded they are, and what their limits are. It is a very useful tool to support manual bottleneck searches.

Assembler code preview Such a feature can be used to see exactly what the compiler generated, which can be useful for introducing further optimisations.

Results visualisation Used especially by developers of graphical applications; it allows previewing video card buffers and OpenCL or OpenGL draft graphics.

Figure 2.6: Application profiling in AMD CodeXL [10]

A very interesting CodeXL functionality is its ability to fully integrate with Visual Studio. In that case the developer gains the additional possibility to debug the application directly in the IDE. A similar feature was provided by the Visual Profiler, which also supported Eclipse. CodeXL offers no such option, which can be a blocker for Linux developers. However, while it does not integrate with Linux IDEs, it has a standalone Linux version.

Restrictions

In the case of CodeXL there is no technological restriction, because CodeXL supports both OpenCL and CUDA. CPU-level profiling is also independent of the hardware platform, and it supports AMD units as well as Intel ones. However, it cannot be forgotten that this is branded AMD software and it uses a specific API provided by AMD cards. It means that AMD CodeXL allows monitoring applications written in CUDA or OpenCL and supports both Intel and AMD platforms, but the low-level features are provided only for AMD devices, which, considering the requirement of Nvidia card support, eliminates this tool from the possible solutions.

Figure 2.7: Integration of Visual Studio with CodeXL [10]

Another obstacle, similar to the Visual Profiler, is the lack of support for distributed applications.

2.2.3 AMD gDEBugger and AMD APP Profiler

AMD gDEBugger is a simplified version of CodeXL. Its Windows version is distributed as a plug-in for Visual Studio, and on Linux as a standalone version. It allows debugging, previewing OpenCL and OpenGL textures, and so on, but it does not allow profiling applications.

AMD APP Profiler, on the other hand, is a supplement to AMD gDEBugger and is dedicated to profiling. It has distributions for both Linux and Windows, and also as a Visual Studio plug-in. The user can find in it features such as a timeline, video card load measurements or assembler code preview.

Together, gDEBugger and APP Profiler provide all the functionality of AMD CodeXL, but nothing more. There is therefore no reason to dwell on them. They have the same restrictions and cannot be used in the KernelHive solution for the same reasons, although the author found them worth mentioning as an alternative to CodeXL.

Figure 2.8: AMD APP Profiler (a) and AMD gDEBugger (b) [11, 12]

2.2.4 TAU Performance System

Tuning and Analysis Utilities (TAU) [13, 14] is a very interesting system for several reasons. First, it is the first of the solutions mentioned here that is fully independent of any hardware manufacturer. It was created as a joint effort of developers from the University of Oregon Performance Research Lab, the LANL Advanced Computing Laboratory, and the Research Centre Jülich at ZAM.

Another characteristic attribute is the way TAU measures application performance. In the previous solutions, the hardware exposed an API which provided some hardware and software data. Such an approach is the simplest and the most accurate, but it has some major restrictions. The API is written by the manufacturers and is different for each brand. The other issue is that applications can get only the data provided by such an API, and there are situations where manufacturers limit the provided data for business reasons.

TAU uses a custom instrumentation methodology, completely independent of branded APIs. It integrates with GPU compilers to inject its own API into the output code. Then, at runtime, the injected code is executed and the gathered data is sent to the TAU component. TAU gives developers huge flexibility in choosing the technology, because it supports many of them. Among the CPU programming languages one can find e.g. C++, Java, Python, and more. The developer can choose from many different GPU technologies as well, e.g. CUDA C, CUDA C++, OpenCL, pyCUDA, or HMPP.

TAU has distributions for both Linux and Windows, but the documentation is rather targeted at Linux users.

Figure 2.9: Visualisation of CUDA application runtime [14]


Restrictions

The disadvantage of this system is documentation that is very hard to read. Due to that fact, it is very hard to determine whether the system would be capable of handling a distributed environment with 2-level parallelism, and whether the time spent on configuring and adjusting the TAU system would be smaller than writing a custom module for a system with very precise requirements.

2.2.5 VampirTrace

VampirTrace [15] is another solution that was not implemented by hardware manufacturers. It is made by Technische Universität Dresden and consists of 2 parts:

VampirTrace The main part, responsible for gathering data and saving it to OTF log journals. VampirTrace is a free open-source project.

Vampir [16, 17] A tool for visualising data gathered using VampirTrace. It contains the whole logic for finding bottlenecks and tools for profiling applications. This analysis step is not distributed for free and users have to pay for it. However, there is no price on the Vampir website, so it is difficult to say how much.

VampirTrace, like TAU, has its own instrumentation mechanisms. TAU, however, offers only one way of adding instrumentation, i.e. injecting code at compile time. In that aspect VampirTrace significantly exceeds TAU's functionality, because it supports several methods, depending on user choice:

• manual instrumentation,

• automated code injection at compile time,

• code modification at runtime with the aid of the Dyninst API,

• usage as a library.

Figure 2.10: Visualisation of CUDA application runtime in Vampir [17]

The output file format is dedicated to visualisation in Vampir, but since it is an open format, it can be read by any application. One can also write one's own parser if necessary. Unfortunately, the author was not able to find any free tool capable of reading Open Trace Format (OTF) files.

Restrictions

The split of the project into two separate parts can be useful from the point of view of performance and licensing. VampirTrace is a great tool that creates only log files, and it is free, which is beneficial here, because creating logs would be the interesting part from the point of view of this project. On the other hand, it requires a custom file parser to process the log files. An additional aspect is performance: with such an approach, a whole log file has to be transferred instead of only the specialised data, as in a dedicated protocol solution.

Another drawback is the VampirTrace monitoring method. The optimisation process consists of several steps:

1. Run the application.

2. Generate an OTF file using the VampirTrace tool.

3. Open the OTF file using a tool which can handle it (e.g. Vampir).

4. Follow the application runtime log and find bottlenecks.

5. Rewrite the application.

6. Run the whole process again in a loop until the expected result is reached.

The expected solution in KernelHive assumes monitoring in real time, which is hard to implement using VampirTrace.

2.3 Current application output preview systems

2.3.1 PARADE

PARADE [18] is a system that visualises the runtime of a parallel application. Its goal is not to render a preview of application results but to show interactions and relations between components of a parallel application, treating it more as a living organism than as static code.

PARADE is split into 3 layers:

Instrumentation layer which gathers data about the application and events

Choreography layer which handles events and performs certain actions on them

Presentation layer which is responsible for displaying results prepared by the choreography layer


Figure 2.11: PARADE architecture [18]

Figure 2.12: Example visualisation of the quicksort algorithm in PARADE [18]

PARADE allows monitoring applications during runtime as well as after it. Instrumentation can be done in three ways, depending on the complexity level of an application or the variety of information the user wants to receive. These three ways are:

Manually Manual calls of instrumentation functions. It requires additional work from the programmer but gives the greatest control over what is instrumented and how.

Overriding standard communication libraries This is done by wrapping standard libraries with wrappers containing calls to instrumentation routines. It requires additional effort only once, to write such wrappers. Further usage is invisible to the developer; the only thing needed is to link the proper wrapped libraries.

Overriding standard system libraries A case similar to the previous one. The only difference is that all system libraries are overridden. The developer does not have to care about libraries at all in this case, but on the other hand, all applications are instrumented by default, which can be a disadvantage in some aspects.

For visualisations PARADE uses the POLKA system. It allows creating multi-windowed, two-dimensional and colourful animations. It provides built-in support for many components available for the developer to use.

2.3.2 PGL

A Parallel Graphics Library for Distributed Memory Applications (PGL) [18] is a set of libraries published under the GPL and created to present the results of parallel rendering applications. PGL consists of a series of libraries:

1. Parallel Graphics Library

2. Parallel Visualization Library

3. Parallel Object Library

4. Distributed User Interface

5. Set of other tools

Figure 2.13: Example application runtime visualisation in PGL [18]

PGL is meant for Single Process Multiple Data (SPMD) applications, which is the most popular parallelism technique used in rendering applications. The technique used here is called polygon rendering. The whole graphic is divided into regions, which in this case are triangles. After that, each such region is assigned an application instance and a piece of data the application should process. After task completion, the result of each cell is sent to the overall image.

There are several problems here regarding parallel computations:

Data distribution between processes

Load balancing A topic tightly bound to data distribution. In systems with multiple computing units, it is required to check whether each unit receives the proper amount of data. Another balancing opportunity is dynamic data distribution at runtime.

Communication between processes In the 3D graphics rendering process, the resulting 3D image is composed of several 2D ones. To achieve that, processes have to communicate with each other. Communication can cause noticeable CPU overhead and network consumption. Communication can be even more complex in KernelHive, where there are multiple levels of parallelism. In border cases, waiting for responses from other threads can lead to deadlocks and many more issues.

Output data gathering Dividing the image into multiple small, distributed pieces makes the whole application scale well, but on the other hand it can cause several subsequent problems. To avoid such problems there are 2 solutions available:

Push-up effect Sibling cells are merged and one huge image is created. However, it causes additional communication overhead and forces the use of buffers which contain temporary output.

Centralisation All instances send their own cells to one collector. That approach requires additional management which would gather the cells and compose a sequential stream.

These are general parallel application problems, which PGL is not able to solve, because there is no universal solution. The developer has to take care of them to be sure the application is robust and effective.

Restrictions

PGL is a very useful set of tools for parallel rendering that allows visualising program output in real time. From the point of view of KernelHive, however, it is worth noticing that it transfers exactly the output graphic. Such an approach may cause huge network overhead, because image streaming is very resource consuming. In the KernelHive case it would be good to reduce the amount of transferred data by using e.g. an API which would only notify the end user about changes.


2.4 Conclusions

The goal of the monitoring module for KernelHive is to monitor the nodes of a cluster on which GPU parallel applications will be running. As the comparison of existing solutions shows, it is hard to find a matching solution that would meet all the requirements. It would probably be possible to join several of these projects, but it would introduce a huge performance overhead as well as an integration one. A summary comparison of the described solutions is shown in Table 2.1.

It seems reasonable, then, to write a custom module from the beginning, dedicated to the KernelHive system. Moreover, the implementation of such a system will have many educational benefits. Therefore, the existing solutions will not be used as production-ready components, but rather as examples of possible good and trustworthy solutions.

Wolfpack
Advantages: highly developed; flexible; support for multiple sensors.
Drawbacks: written in .NET; no GPU agent.

Nagios
Advantages: written in C and PHP; huge community; well-developed API; published under the GPL licence.
Drawbacks: high complexity; hard to install and maintain; no GPU agent.

Nvidia Visual Profiler
Advantages: many GPU monitoring tools; possible integration with IDEs.
Drawbacks: supports only CUDA applications; no support for distributed monitoring.

AMD CodeXL
Advantages: many GPU monitoring tools; possible integration with IDEs; support for OpenCL applications.
Drawbacks: low-level API available only for AMD cards; no support for distributed monitoring.

TAU Performance System
Advantages: not branded; code injection at runtime; many supported languages.
Drawbacks: hard to understand documentation; hard to integrate.

VampirTrace
Advantages: open-source monitoring part; output exported to OTF files; several ways of use.
Drawbacks: commercial visualisation part; no ready and free OTF parsers yet; only post-mortem analysis.

PARADE
Advantages: capable of showing colourful animations; many ways of use.
Drawbacks: very niche project; poor and hard to read documentation; hard integration.

PGL
Advantages: set of libraries helping continuous visualisation.
Drawbacks: huge network overhead because of transferring ready images instead of a lightweight API.

Table 2.1: Comparison of existing solutions


Chapter 3

Monitoring module specification

3.1 KernelHive overview

First, it is essential to know what KernelHive is and how it works; this is needed to understand the nature of the problem.

The goal of KernelHive is to provide a developer with an environment which allows them to easily write distributed applications [19, 20, 21]. Such a developer can write an OpenCL application in a Graphical User Interface (GUI) and run it. The system handles the whole communication and distributes the application through the cluster. The developer can also create a workflow consisting of multiple nodes, through which the data is transferred and processed. KernelHive can manage multiple levels of parallelism, distributing data through multiple clusters, nodes and graphics cards. Figures 3.1 and 3.2 show an example workflow and the code editor of a KernelHive application.

This section is only a brief overview of the KernelHive system.


Figure 3.1: Example KernelHive application workflow

3.1.1 KernelHive architecture

KernelHive consists of several layers. Each of them, except the GUI, represents one of the distribution levels.

Layers of KernelHive:

Hive Worker The lowest level, which represents a physical graphics card. Each GPU introduces an additional parallelisation level because of its capability of running dozens of threads simultaneously.

Hive Unit Represents a physical node, which can contain several graphics cards. Additionally, besides GPUs, OpenCL also supports CPUs, so each CPU can be used as an extra processing unit.

Hive Cluster Represents a cluster, which is a group of nodes visible to each other.


Figure 3.2: KernelHive code editor with an example code

Figure 3.3: Sample KernelHive environment/architecture tree

Hive Engine The management centre of the entire system, capable of managing multiple clusters, receiving and distributing jobs, etc.


Hive UI A GUI where developers can write their own OpenCL applications, create a workflow and schedule a job.

Every layer has its own agent running on its device. It means that on each node there is a running Hive Unit, on each cluster there is a running Hive Cluster instance, and so on. The components are written in two different technologies: Java and C++. The lower ones, Hive Worker and Hive Unit, are supposed to be lightweight and are written in C++. Because they work on the same machine, there is no problem with communication between them. The higher levels of the system, i.e. Hive Cluster, Hive Engine and Hive UI, are written in Java due to ease of implementation. Communication between Java components is realised as web services. Cross-technology communication is done using standard TCP sockets. The component diagram of the KernelHive system is visible in Figure 3.4.
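As a rough illustration of such a cross-technology link (the message layout and class name below are invented for the sketch, not taken from the actual KernelHive code), a Java component could push a length-prefixed message to a C++ agent over a plain TCP socket:

import java.io.DataOutputStream;
import java.net.Socket;

public class AgentLink {
    public static void send(String host, int port, byte[] payload) throws Exception {
        try (Socket socket = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream())) {
            out.writeInt(payload.length); // length prefix so the C++ side knows how much to read
            out.write(payload);
            out.flush();
        }
    }
}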

Hive Worker

Hive Worker is responsible for downloading the part of the data it will operate on and, finally, compiling the OpenCL kernel and running it on a given graphics card. Hive Workers are run by a Hive Unit for each task it receives. Each Hive Worker gets the ID of the GPU it will be running on as a parameter, so it can be said that a Hive Worker represents one video card. However, not every graphics card has its own running Hive Worker, because a Hive Worker is created on demand, only for the specified device.

Hive Unit

Hive Unit is an agent running constantly on each node and represents a node capable of running jobs. Its task is to listen for new jobs, run the proper Hive Worker and pass the task parameters to it. After that, the whole communication is handled by the Hive Worker on its own.


Figure 3.4: KernelHive component diagram [22]

Hive Cluster

Hive Cluster is another type of agent. It represents a cluster, which is not a physical device as in the previous 2 cases, but rather a logical group of nodes. The cluster nodes are supposed to be in a secure network and to be able to see the units, and the other way round. Hive Cluster is, in that case, the single point of contact with the Hive Engine. It can run alongside a Hive Unit or on a completely different machine; in the latter case, however, that machine will not be capable of running jobs. Its task is to listen for new jobs from the Hive Engine and pass them to Hive Units. It also downloads data from the address given to it by the Hive Engine and then serves as a distributor for the Hive Units.

Hive Engine

Hive Engine is an application running on a GlassFish Server. It is the kernel of the system. It receives the job workflow from the GUI and divides it into specific jobs. Hive Engine keeps information about the whole infrastructure, e.g. available nodes and graphics cards. Having that information, it is capable of distributing jobs through the entire system, doing some optimisations first. Hive Engine contacts only Hive Clusters on one side and only Hive UI on the other.

Hive UI

Hive UI is a GUI that allows the user to write and run their own application. After creating a new project, the developer is able to construct a new application workflow and edit the kernels of specific nodes such as Data Processor, Data Partitioner, and so on. Hive UI contains an OpenCL code editor with syntax colouring. After finishing the application, the developer can run it and get a link to the results.

Hive Common

Hive Common is a library common to all components. It contains definitions of common types, web services, protocols, etc. Because of the use of different technologies, it is in fact split into two libraries, one for the Java and one for the C++ modules.


3.2 Description of the existing system

The project is a module for the KernelHive system, and its goal is to provide the functionality of measuring the load of nodes in the monitored clusters and displaying the work progress of applications run on graphics cards.

The existing system allows the user to automatically distribute an application and run it, with a very simple mechanism of work progress measurement. That simplified mechanism consists of three states:

1. the application is being distributed,

2. the application is running,

3. the application has ended.

It can easily be seen that for now there is no possibility of measuring the real progress of an application. The provided information is useful rather for the developers of the system, allowing them to debug the distribution part, than for the developers of the distributed application. The KernelHive user would like to be aware of the current progress of their application. Moreover, the existing system does not provide the user with information about the load of particular nodes. Such data can be valuable while finding the bottleneck of an application or just monitoring the health of the infrastructure. The load data can act as feedback and become an input of an automatic diagnosis module.

3.3 System goals

System goals specify the targets the system should focus on. In the case of KernelHive, they are specified as:

1. provide information about system load,

2. bottleneck finding assistance,

3. provide information about work progress - displaying the percentage progress of application work and the state it is currently in,

4. provide estimated current application output.

3.4 System stakeholders

System stakeholders are the users of the system and other people that have an interest in KernelHive.

1. Application developer

• Cluster load visualisation

• Application work progress overview

• Application optimisation

2. Application user

• Application work progress overview

• Estimated current application output overview

3. Cluster administrator

• Cluster load visualisation

3.5 System context

There will be three groups of target users:


Application developer An advanced user, aware of distributed programming rules, templates and techniques. The developer interface should be fully functional and provide all information that could be useful in optimisation and debugging.

Application user A user that knows computers well but has no deep knowledge of parallel programming. The application user needs just basic data about application progress and the possibility to overview the current application output.

Cluster administrator An advanced user, not interested in the actual application, but rather in the infrastructure. The interface should provide them with information about cluster load and system health.

The system is dedicated to a homogeneous Linux environment and has to integrate with the standard mechanisms and tools available in the Linux operating system.

3.6 Functional scope

The functional scope defines the functionality the system has to have from the perspectives of the particular stakeholders.

1. Application developer needs:

• load visualisation of the particular nodes,

• hints about possible bottlenecks in the system,

• state and progress tracking of the application on particular nodes,

• current application output overview.

2. Application user needs:


• average system load visualisation,

• average application progress tracking,

• current application output overview.

3. Cluster administrator needs:

• state and progress tracking of the application on particular nodes.

3.7 Quality requirements

A properly implemented system has to meet specified quality requirements, which can be split into a few categories:

Availability The target availability in a business environment is 24 hours per day, with reliability estimated at the 98% level. For now, only a test version is planned, where reliability at the 80% level is acceptable.

Mobility System agents should work on every homogeneous Linux cluster.

Security The system in the test version assumes that the provided infrastructure is safe and trusted, as are the users of the system, so for now there is no requirement for authentication and authorisation mechanisms. However, the possibility to easily add security mechanisms has to exist.

Performance Introducing the system can burden the environment with additional transfer and computing overhead; however, it has to stay at an acceptable level, which is set at 10% overhead of CPU load and 30% overhead of transferred data.

Configurability The user should be able to turn off some of the functionality if their priority is system performance. Moreover, the GUI should also be configurable, because too much information in one place can decrease readability and usability.

Flexibility The possibility of future development should be taken into account, especially in the case of the application output overview and resources management.

3.8 Limitations

Limitations specify other requirements for the system, e.g. the deadline or the target environment. The limitations for KernelHive are specified in the list below.

1. The system has to fully integrate with the existing communication mechanisms of KernelHive.

2. The system has to work, in particular, on the KASK laboratory cluster.

3. The project is being done by one person.

4. Free solutions are preferred.

5. The project should be ready for testing in June 2013 and production-ready in September 2013.

6. Project documentation is required.

3.9 Use case diagram

The use case diagram presents interactions between the different types of users and the system. The diagram for KernelHive is visible in Figure 3.5.


Figure 3.5: Use case diagram for KernelHive


Chapter 4

Implementation

4.1 Monitoring module architecture and implementation

4.1.1 Monitoring module architecture

The monitoring flow starts from the bottom-level components, Hive Worker and Hive Unit. Since the data has to be passed to Hive UI and processed in between, the monitoring module had to be added to each of the components, as can be seen in the component diagram in Figure 4.1, which is an extended version of the one shown in Figure 3.4.

4.1.2 Message flow

As mentioned before, monitoring results have to be transferred through all layers of KernelHive. A simplified schema is shown in Figure 4.2.


Figure 4.1: KernelHive component diagram

Figure 4.2: Message flow diagram


There are several types of messages sent through the mentioned flow:

1. monitoring data,

2. progress data,

3. work preview data.

Monitoring messages

The part of the monitoring module which gathers load statistics uses 2 different messaging mechanisms, because of the 2 different natures of the needed information.

The first type is standard sequential data messages that need to be sent every certain period of time, e.g. every 1 second. Due to the high frequency of dispatch, they have to be compressed as well as possible to keep them small. Because of that, a binary serialisation format was chosen. Each part of a sequential monitoring message has a specified order and length, so no needless data is transferred. This type of message contains the changeable values, which are:

• Host ID - size of 2B,

• Current CPU speed - size of 2B,

• CPU usage per core - size of 2B per core,

• Node memory usage - size of 2B,

• Number of GPU devices - size of 2B,

• Information about each GPU that the node has - size of 10B per device:

  - GPU ID - size of 2B,

  - GPU memory usage - size of 4B,

  - GPU usage - size of 2B,

  - GPU fan speed - size of 2B.

The format of the message can easily be changed or extended by adding support for the new fields to the formatter in Hive Unit and to the message parser in the Hive Engine. The only limitation is the size of the message, which is 65,507 bytes, imposed by the IPv4 and UDP datagram sizes [23, 24]. It is also possible to extend it using IPv6 jumbograms [25].
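The serialiser itself is not reproduced in this text; the following is a minimal sketch of the binary layout described above, assuming the stated field order. The class and field names and the use of java.nio.ByteBuffer are illustrative (the real agent-side serialiser is written in C++):

import java.nio.ByteBuffer;

public class MonitoringMessage {
    public static byte[] pack(short hostId, short cpuSpeed, short[] coreUsage,
                              short memUsage, Gpu[] gpus) {
        // 8 fixed header bytes + 2B per core + 10B per GPU device
        ByteBuffer buf = ByteBuffer.allocate(8 + 2 * coreUsage.length + 10 * gpus.length);
        buf.putShort(hostId);              // Host ID - 2B
        buf.putShort(cpuSpeed);            // Current CPU speed - 2B
        for (short core : coreUsage) buf.putShort(core); // CPU usage per core - 2B each
        buf.putShort(memUsage);            // Node memory usage - 2B
        buf.putShort((short) gpus.length); // Number of GPU devices - 2B
        for (Gpu gpu : gpus) {             // 10B per device
            buf.putShort(gpu.id);          // GPU ID - 2B
            buf.putInt(gpu.memoryUsage);   // GPU memory usage - 4B
            buf.putShort(gpu.usage);       // GPU usage - 2B
            buf.putShort(gpu.fanSpeed);    // GPU fan speed - 2B
        }
        return buf.array();
    }

    public static class Gpu {
        short id, usage, fanSpeed;
        int memoryUsage;
    }
}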

The second type of messages consists of static values. Such information is sent only once, after the unit connects to the cluster, so there is no pressure on message size here. Because the unit reporting mechanism had already been implemented, it was only extended to suit the monitoring requirements. This type of message contains information about:

• host name,

• CPUs or cores available,

• total amount of available memory,

• information about the available graphics cards:

  - vendor name,

  - graphics card friendly name,

  - graphics card ID,

  - graphics card total amount of memory.

Work progress messages

Work progress reporting uses already defined messages. In contrast to monitoring messages, the work progress messages are initiated by the Hive Worker. They are not transferred through the Hive Unit but go directly to the Hive Cluster component and then to the engine, where they are processed. Since reporting the progress of particular jobs is not essential, these messages are sent using the UDP protocol, which has a smaller overhead and leads to reduced network usage but, on the other hand, gives no guarantee of message delivery.

Work preview data messages

Work preview data messages use the already defined, but extended, mechanism of reporting work progress. Work preview data is also not crucial for the system. Similarly to the work progress messaging mechanism, this one is initiated by the Hive Worker.

Despite the fact that it integrates with the existing messages, it is a mixture of 2 serialisation mechanisms. Work progress messages are serialised to plain text, where each part of the data is a sequence of characters delimited by a space. Considering the fact that work preview messages can store a large amount of data, such a serialisation mechanism would be ineffective. To avoid that, the part of the message that contains the preview data is serialised to a binary format.

4.1.3 Data storage

The monitoring module uses 2 different mechanisms of storing data, dedicated to 2 different purposes. Because of the different natures of the data, it is not possible to keep it in one place using the same data storage. However, all databases are stored and maintained by the Hive Engine on the GlassFish server.

Monitoring time series data

Monitoring time series data contains the statistics about system load gathered by the Hive Unit agent. The agent sends data sequentially, every second by default. Each sample contains information about several points of interest.

Standard relational databases are not meant for such a purpose. The main issue is the size of the stored data. Relational databases contain sets of entities, and with every next sample such a set would grow. Considering several types of samples and a bigger number of monitored nodes, the growth could be considerably large. Some workarounds could be applied, e.g. flushing old data every hour, but that does not solve all the problems either.

The solution used in the monitoring module for KernelHive is an RRD, which stands for Round Robin Database. It is widely used by monitoring applications and, in general, by applications gathering statistical data. Using an RRD, the developer can specify the resolution of a database and its capacity. The round-robin mechanism takes care of the size of the data by overwriting the oldest data when the capacity reaches the given limit.

The next interesting feature of RRD is keeping samples in slots which represent specific moments in time. While gathering statistics, the exact time is usually not essential. An approximate date of packet arrival is fully acceptable and also solves some problems with data representation. Using an RRD, the developer can define the density of a slot and how tolerant the slot should be before a value is set as undefined. Additionally, when storing a new value, some metrics can be calculated immediately, e.g. the average value from the last 5 minutes, which allows reducing the size of the data and the effort put into selecting it.
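The text does not name the exact RRD binding used on the Java side; purely as an illustration, the concepts above map onto the rrd4j library roughly as follows (the file name, data source and archive parameters are assumptions):

import org.rrd4j.ConsolFun;
import org.rrd4j.DsType;
import org.rrd4j.core.RrdDb;
import org.rrd4j.core.RrdDef;
import org.rrd4j.core.Sample;

public class CpuLoadStore {
    public static void main(String[] args) throws Exception {
        RrdDef def = new RrdDef("cpu.rrd", 1); // step: 1-second slot density
        def.addDatasource("cpu", DsType.GAUGE, 5, 0, 100); // undefined after a 5 s heartbeat gap
        def.addArchive(ConsolFun.AVERAGE, 0.5, 1, 3600);   // raw 1 s samples kept for 1 hour
        def.addArchive(ConsolFun.AVERAGE, 0.5, 300, 288);  // 5-minute averages kept for 1 day
        RrdDb db = new RrdDb(def);
        Sample sample = db.createSample();
        sample.setTime(System.currentTimeMillis() / 1000);
        sample.setValue("cpu", 42.0); // one load sample reported by an agent
        sample.update(); // round robin: the oldest slot is overwritten once the archive is full
        db.close();
    }
}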

RRD also provides a built-in mechanism for rendering graphs of the stored data. Given a database and a series name, it creates an image file with the chart. Example charts are shown in Figures 4.3 and 4.4.

Figure 4.3: Sample visualisation of CPU usage using the built-in RRD mechanism

Figure 4.4: Sample visualisation of GPU memory usage using the built-in RRD mechanism


Topology data

The purpose of the topology database is to store all the information about clusters, nodes and graphics cards that is meaningful for the user. It can contain information that is useless for engine job management but valuable for the developer, e.g. device vendor and name, host total memory, number of processors, etc. Due to that fact, the topology storing mechanism is separated from the main engine and is used only in the monitoring module. In view of the characteristics of the stored data, the target database engine should be relational and quick.

As the topology storage, the H2 database engine [27, 26] was chosen. It is a very fast and lightweight engine, as shown in the performance comparison between several common database engines in Figure 4.5.

Additional advantage of H2 engine is operating on le system. H2 does not require installation. It can be run in 2 modes. The rst is a standard client- server mode when users starts a server as a daemon and whole communication runs over TCP/IP protocol. It allows then multiple client connections and serves a web user interface. Second mode is an embedded mode. In that case there is no server needed because H2 is used as a library. These 2 modes are fully interchangeable and, except from database initialisation, transparent for developer.

Figure 4.5: Performance comparison between several common database engines [26]

The features and use of both modes are the same. This is possible because there is no central database management system: when creating or opening a database, the developer passes the database file explicitly, so there is exactly one database per file and nothing to manage. Despite that, the H2 engine supports most commonly used SQL commands. The only difference between the two modes is that the server allows multiple connections, whereas the embedded mode allows only one, and that single slot is taken by the application itself. This is why developers tend to use the server mode during development: in embedded mode no other tool, such as a database GUI, can access the database while the application is running.
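In practice, the difference between the two modes boils down to the JDBC URL. The sketch below is a minimal illustration; the database name, credentials and table are invented for the example, and the server-mode connection additionally assumes an H2 TCP server has been started (e.g. with org.h2.tools.Server).

import java.sql.Connection;
import java.sql.DriverManager;

public class H2ModesSketch {
    public static void main(String[] args) throws Exception {
        // Embedded mode: H2 runs inside the JVM as a library; only one
        // connection to the database file is possible at a time.
        try (Connection embedded =
                DriverManager.getConnection("jdbc:h2:~/topology-example", "sa", "")) {
            embedded.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS Example(id INT PRIMARY KEY)");
        }

        // Client-server mode: the same database file served over TCP/IP,
        // so multiple clients (the application plus e.g. a database GUI)
        // can connect concurrently.
        try (Connection remote = DriverManager.getConnection(
                "jdbc:h2:tcp://localhost/~/topology-example", "sa", "")) {
            remote.createStatement().executeQuery("SELECT COUNT(*) FROM Example");
        }
    }
}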

The database schema used in KernelHive is shown in Figure 4.6. KernelHive stores three sets of entities in the database:

• clusters, represented by the table Clusters,

• nodes, represented by the table Units,

• graphics cards, represented by the table Devices.

Figure 4.6: Schema of the topology database

The visualisation of the topology stored in the database is available in the Hive UI, in the Topology browser section. A sample visualisation is shown in Figure 5.13 in Section 5.1.4.
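To make the schema more tangible, the following hedged sketch creates the three tables in an embedded H2 database. Only the table names (Clusters, Units, Devices) come from the schema above; all column names and types are illustrative guesses based on the attributes mentioned earlier (vendor, name, total memory, number of processors).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TopologySchemaSketch {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:h2:~/topology-example", "sa", "");
             Statement s = c.createStatement()) {
            // A cluster groups a number of computational nodes.
            s.execute("CREATE TABLE Clusters(id INT PRIMARY KEY, hostname VARCHAR)");
            // A node (unit) belongs to one cluster.
            s.execute("CREATE TABLE Units(id INT PRIMARY KEY, "
                    + "clusterId INT REFERENCES Clusters(id), "
                    + "hostname VARCHAR, totalMemory BIGINT, processors INT)");
            // A graphics card (device) belongs to one node.
            s.execute("CREATE TABLE Devices(id INT PRIMARY KEY, "
                    + "unitId INT REFERENCES Units(id), "
                    + "vendor VARCHAR, name VARCHAR, globalMemory BIGINT)");
        }
    }
}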

4.1.4 Work preview component implementation

The work preview component's task is to present partial application results graphically during the application runtime. To achieve this, an API was created with which the developer can build the preview graphics.

Figure 4.7: Overview diagram of visualisation architecture


The first concern is what, how and where the visualisation should be rendered, taking into account that the visualisation engine should be lightweight and should not consume much network traffic. Because it is a graphical process, the best choice seems to be to place it in the User Interface (UI) component. This relieves the Hive Engine of the need to render any images. Moreover, placing the whole presentation logic on the client side is more appropriate from the architectural perspective. The resulting visualisation process diagram is shown in Figure 4.7.

The nature of the rendered graphics can differ greatly between applications: for an application that calculates the positions of planets it can be orbits with circles or spheres representing the planets, while for an application calculating an integral with the rectangle method it could be a set of rectangles approximating the area below the function. There can also be many ideas for how to visualise a single application. All these requirements demand that the API be very flexible.

The adopted solution uses an array of objects that is passed to the OpenCL kernel. During kernel execution, the developer fills the array with objects representing partial results. After the kernel finishes processing a data package, the array of preview objects is binary-serialised and transferred to the engine, where the arrays received from multiple data packages are concatenated into one. This data is then provided to the Hive UI, where it is rendered.
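The thesis does not spell out the wire format, so the following is only a hypothetical sketch of what binary serialisation of preview values could look like; it illustrates why binary encoding is compact (4 bytes per float instead of a ten-or-more-character decimal string).

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public final class PreviewSerialiserSketch {

    // Encodes a length header followed by raw IEEE-754 floats.
    public static byte[] serialise(float[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.length);   // element count header
        for (float v : values) {
            out.writeFloat(v);         // 4 bytes per value, no text overhead
        }
        out.flush();
        return bytes.toByteArray();
    }
}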

The rendering mechanism depends on the application developer only. The developer is provided with a sample class implementing a given interface, which he has to fill with his own visualisation algorithm. The Java interface for the developer is shown in Listing 4.1. As its input, the algorithm receives the list of objects filled by the kernel; this can be, for instance, the coordinates of consecutive rectangles in the rectangle rule. Having such data and a Graphics object to render on, the developer can use the Java AWT API to draw custom shapes. The data is received from the Engine at a fixed interval, by default one second; after each interval the canvas is cleared and the visualisation method is called. An example of a visualisation algorithm that shows the progress of calculating an integral with the rectangle rule is shown in Listing 4.2.

Listing 4.1: Visualisation Java interface

package pl.gda.pg.eti.kernelhive.gui.component.workflow.preview;

import java.awt.Graphics;
import java.util.List;

import pl.gda.pg.eti.kernelhive.common.monitoring.service.PreviewObject;

/**
 * @author Szymon Bultrowicz
 */
public interface IPreviewProvider {
    void paintData(Graphics g, List<PreviewObject> data, int areaWidth, int areaHeight);
}

Listing 4.2: Sample implementation of visualisation algorithm

package pl.gda.pg.eti.kernelhive.gui.component.workflow.preview;

import java.awt.Color;
import java.awt.Graphics;
import java.util.List;

import pl.gda.pg.eti.kernelhive.common.monitoring.service.PreviewObject;

public class PreviewProvider implements IPreviewProvider {

    private static final int MAX_VALUE = 100;

    public void paintData(Graphics g, List<PreviewObject> data, int areaWidth, int areaHeight) {
        g.setColor(Color.YELLOW);

        // Find the bounding box of all valid preview objects
        // (f1 = rectangle x, f2 = rectangle width, f3 = rectangle height).
        float minX = Float.POSITIVE_INFINITY;
        float maxX = Float.NEGATIVE_INFINITY;
        float minY = 0;
        float maxY = Float.NEGATIVE_INFINITY;
        for (PreviewObject po : data) {
            if (validatePreviewObject(po)) {
                minX = Math.min(po.getF1(), minX);
                maxX = Math.max(po.getF1() + po.getF2(), maxX);
                maxY = Math.max(po.getF3(), maxY);
            }
        }

        // Scale factors mapping function coordinates onto the drawing area.
        float ratioX = areaWidth / (maxX - minX);
        float ratioY = areaHeight / (maxY - minY);

        // Draw one filled rectangle per preview object.
        for (PreviewObject po : data) {
            if (!validatePreviewObject(po)) {
                continue;
            }
            int width = Math.round(ratioX * po.getF2());
            int height = Math.round(ratioY * po.getF3());
            // Shift by minX so the leftmost rectangle starts at the canvas edge.
            int x = Math.round(ratioX * (po.getF1() - minX));
            int y = areaHeight - height;
            g.fillRect(x, y, width, height);
        }
    }

    private boolean validatePreviewObject(PreviewObject po) {
        return !Float.isNaN(po.getF1()) && Math.abs(po.getF1()) < MAX_VALUE
                && !Float.isNaN(po.getF2()) && Math.abs(po.getF2()) < MAX_VALUE
                && !Float.isNaN(po.getF3()) && Math.abs(po.getF3()) < MAX_VALUE;
    }
}
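The refresh cycle described before Listing 4.1 (clear the canvas, repaint the latest data, once per second by default) could be wired up roughly as follows. This is a hypothetical sketch: PreviewPanel, updateData and the Swing timer are illustrative, not KernelHive's actual UI classes.

package pl.gda.pg.eti.kernelhive.gui.component.workflow.preview;

import java.awt.Graphics;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.util.Collections;
import java.util.List;
import javax.swing.JPanel;
import javax.swing.Timer;

import pl.gda.pg.eti.kernelhive.common.monitoring.service.PreviewObject;

public class PreviewPanel extends JPanel {

    private final IPreviewProvider provider = new PreviewProvider();
    private volatile List<PreviewObject> latestData = Collections.emptyList();

    public PreviewPanel() {
        // Repaint once per second, matching the default refresh period.
        new Timer(1000, new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                repaint();
            }
        }).start();
    }

    // Called whenever a new batch of preview data arrives from the engine.
    public void updateData(List<PreviewObject> data) {
        this.latestData = data;
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);  // clears the canvas
        provider.paintData(g, latestData, getWidth(), getHeight());
    }
}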

Such a solution is lightweight, because only a set of binary-serialised values is sent through the network; the engine merely concatenates the data, and the whole rendering mechanism is defined on the client side. It is also flexible, because the whole rendering logic can be defined by the application developer.

4.2 Implementation problems

4.2.1 Integration problems

The main integration problems occurred while integrating with the existing KernelHive communication mechanisms. For maintenance and implementation reasons, it would be best if the monitoring module could reuse the already defined messaging channels; however, this was not always possible. Communication through TCP/UDP messages was based on human-readable strings. Although this is sufficient for occasional transmissions, in the monitoring module messages are either sent very often (monitoring statistics) or are very large (preview data messages). In that case, the existing mechanisms had to be extended with binary serialisation, which heavily reduces the transferred data size and increases performance.

Such an extension was introduced in the work preview component, described in more detail in Section 4.1.2.

A completely new messaging mechanism had to be introduced in the component collecting monitoring statistics: there was no appropriate channel for sending such messages, so one had to be implemented from scratch. The messaging mechanism is described in Section 4.1.2.

4.2.2 Conceptual problems

Load monitoring component

The first conceptual problem occurred during the implementation of storing the sequential statistics gathered from nodes. These are not the standard kind of data kept in databases: there are no relations between samples and, most importantly, there is a large number of them. Statistics are currently sent every second and contain data about several metrics per node. Common relational databases could be used, but this would generate several problems, e.g. coping with the amount of stored data or rendering irregular samples. To avoid such troubles, a storage format dedicated to statistical data, called RRD, was introduced. The RRD solution is described in detail in Section 4.1.3.

Work preview component

Another, bigger conceptual concern was met in the work preview component. There are a few possible ways of collecting data about uncompleted work:

Collecting partial information during computations This solution assumes that some partial results are available, e.g. output buffers that can be sampled after each loop execution or periodically at fixed intervals. However, it is easy to see that output buffers alone may not be sufficient to properly visualise a preview of the work.

Specification of an API for direct notifications about progress Such a solution is very flexible, because it allows the developer to pass any data he needs for the visualisation. It is also easy to introduce. On the other hand, it requires additional actions from the developer, such as calling specified methods, to be able to collect the preview data.

Injection of an agent that would collect data automatically This solution assumes that an agent injected into the executable is capable of
