
Augmented Visual Simultaneous Localization and Mapping System

Uogólniony, wielorobotowy schemat rozszerzonego, wizyjnego systemu równoczesnej lokalizacji i budowy mapy

by

Adam Schmidt

Doctoral Dissertation

Institute of Control and Information Engineering
Faculty of Electrical Engineering

Poznan University of Technology

Instytut Automatyki i Inżynierii Informatycznej
Wydział Elektryczny

Politechnika Poznańska

Advisor: Andrzej Kasiński, Ph.D., D.Sc.,
professor of Poznan University of Technology

Poznań, Poland, March 2014


The simultaneous localization and mapping (SLAM) algorithms provide mobile robots with the ability to explore an unknown environment while keeping track of their own position.

Over the years significant effort has been made to develop accurate and efficient visual SLAM systems. Although several successful and widely recognized algorithms have been proposed, the visual SLAM problem is still far from being solved.

This thesis presents a framework generalizing the filtration-based visual SLAM algorithms and allowing a comparison of various methods on common ground. The proposed framework uses the Extended Kalman Filter to manage the model of the environment consisting of an arbitrary number of heterogeneous robots, static cameras, environment markers and point features. Besides serving as an evaluation tool, the proposed framework can be used to develop visual SLAM systems aimed at specific scenarios using the top-down design paradigm.

Moreover, several modifications to contemporary visual SLAM systems improving the accuracy of trajectory reconstruction were proposed and evaluated. The proposed modifications include a new movement model of the robot allowing for measurements of the orientation change, simplified point feature representations, and measurements of the relative orientation between the current camera poses and the local maps.

Additionally, the environment model was extended with static cameras and artificial markers. The methods for marker detection and for calculation of the marker pose w.r.t. the observing camera's coordinates were proposed and evaluated. The observations of the markers placed within the environment and on the robots significantly improved the accuracy of the trajectory reconstruction.

The performance of the most popular point feature detectors and descriptors in the context of mobile robot navigation was evaluated. The extensive experiments assessed the accuracy of the essential matrix estimation, the robot's trajectory tracking within the visual SLAM system and the processing time.

The robot motion capture system was constructed to gather the data necessary for the evaluation of the proposed algorithms. Moreover, the methods for spatial calibration of the system's components and of the robots' markers and cameras were proposed and tested. The dataset consisting of the video sequences recorded by the robots' onboard cameras and the static cameras, as well as the reference robots' trajectories, was registered. The dataset was made publicly available in order to facilitate further research on visual robot navigation.


The simultaneous localization and mapping (SLAM) algorithms enable mobile robots to explore an unknown environment while simultaneously estimating the robot's position within that environment. Recent years have seen intensive development of accurate and efficient visual SLAM systems. However, despite the availability of systems widely recognized as successful, the problem of visual simultaneous localization and mapping still remains unsolved.

The dissertation presents a framework generalizing the filtration-based visual SLAM algorithms and enabling the comparison of different methods in a unified environment. The proposed framework uses the extended Kalman filter (EKF) to manage a model of the environment consisting of an arbitrary number of heterogeneous robots, static cameras, markers placed in the robot's surroundings, and point features. Besides testing algorithms, the proposed framework can be used to design visual SLAM systems tailored to selected operating scenarios according to the top-down design paradigm.

Moreover, the dissertation proposes and evaluates a number of modifications of contemporary visual SLAM systems increasing the accuracy of robot trajectory reconstruction. The proposed modifications include a new robot motion model enabling measurements of orientation change, simplified representations of point features, and measurements of the relative orientation between the current camera pose and the local maps.

The model of the robot's environment was extended with static cameras and artificial markers. A method was developed for detecting markers in the image and determining their pose in the coordinate system of the observing camera. Observations of the markers placed in the environment and on the mobile robots significantly increased the accuracy of robot tracking.

The thesis also contains an experimental evaluation of popular detectors and descriptors of image point features. The performance of the analyzed algorithms was assessed with respect to the accuracy of essential matrix estimation, the accuracy of robot motion tracking within the SLAM system, and the processing time.

In order to gather the experimental data necessary to evaluate the proposed system, a robot motion tracking system was designed and constructed. Moreover, methods were proposed and tested for estimating the relative poses of the system's components as well as of the cameras and markers mounted on the mobile robots. The system was used to record a dataset containing the video sequences registered by the robots' onboard cameras and the static cameras placed in the robot's surroundings, as well as the reference trajectories of the robots. The collected dataset is publicly available in order to facilitate the further development of visual robot navigation algorithms.


Numerous people contributed to the creation of this thesis with their kindness, patience, time, skills and knowledge. I would like to express my gratitude to those who influenced me the most.

First of all, I would like to thank my advisor, Professor Andrzej Kasiński, for his guidance, support and willingness to teach me. This work would not have been possible without his trust and encouragement to pursue my scientific goals.

I am very grateful to Professor Piotr Skrzypczyński for enlightening discussions and for constantly helping me with his immense knowledge of SLAM systems. I would also like to thank my colleagues working with me on the visual navigation of robots: Marek Kraft, for sharing his expertise on visual odometry with me as well as for the countless days and nights spent on correcting and polishing my thesis, and Zuzanna Domagała, Michał Fularz and Michał Nowicki, for their help with the experiments, which would have been impossible without them.

Working with you has been inspiring but also great fun.

A special thanks goes to my friend Michał Bruch, who always supported me in moments of doubt. Your firm belief in me has been a tremendous help.

I also owe my deepest gratitude to my wife Berenika and my Parents for making me who I am and putting up with me. You are the best!

Finally, I would also like to thank the institutions which supported my research: the National Science Centre for financing the research grant 2011/01/N/ST7/05940, and the Voivodeship Labour Office in Poznań for granting me the doctoral scholarship under the European Social Fund.

Adam Schmidt

Poznań

March 12, 2014


1 Introduction
1.1 Introduction
1.2 Goals of the thesis
1.3 Contributions of the thesis
1.4 Structure of the thesis

2 Review of the existing visual SLAM systems
2.1 The genesis
2.2 Algorithms
2.3 Environment representation
2.4 Multi-robot VSLAM
2.5 Benchmarking

3 The generalized visual SLAM framework
3.1 The probabilistic environment model
3.2 Coordinate systems hierarchy and pose representation
3.3 System initialization and operation
3.4 Motion prediction
3.5 Measurements
3.6 Update
3.7 Maintenance

4 Environment elements
4.1 Robot representation
4.1.1 General model
4.1.2 Agile model
4.1.3 Deliberate model
4.1.4 Measurable model
4.2 Camera representation
4.2.1 State vector
4.2.2 Camera models
4.3 Point features
4.3.2 Local maps
4.3.3 Simple inverse depth features
4.3.4 Extremely simple inverse depth features
4.3.5 Feature detection
4.3.6 Feature matching
4.4 Artificial markers
4.4.1 Introduction
4.4.2 Using markers for initialization
4.4.3 Measurement
4.4.4 Detection
4.4.5 Calculating marker pose

5 Dataset
5.1 Introduction
5.2 Motion registration system
5.2.1 Motion capture subsystem
5.2.2 Environment elements
5.3 Calibration
5.3.1 Introduction
5.3.2 Cameras calibration
5.3.3 Motion capture subsystem calibration
5.3.4 Robot cameras calibration
5.3.5 Robot markers calibration
5.4 Trajectory reconstruction

6 Experiments
6.1 Implementation
6.2 Detectors and descriptors evaluation
6.3 Marker detection and pose estimation
6.4 Visual SLAM evaluation
6.4.1 Evaluation procedure
6.4.2 Velocities and accelerations parametrization
6.4.3 Feature representation
6.4.4 Direct orientation measurements
6.4.5 Relative map and camera orientation
6.4.6 Static cameras and artificial markers
6.4.7 Feature detectors and descriptors
6.4.8 Two robots

7.2 Applications
7.3 Future work

Bibliography


Introduction

1.1 Introduction

The ability to explore an unknown environment while keeping track of its own position is a fundamental aspect of the autonomous operation of a mobile robot. The first works on the subject started in the 1980s and led to the creation of a prominent field of mobile robotics research known as Simultaneous Localization and Mapping (SLAM) [18]. Ever since, a significant effort has been made to develop algorithms offering better accuracy and computational efficiency, or suited to an increasing number of applications.

Though the early SLAM algorithms used mainly ultrasonic scanners as exteroceptive sensors [21], research on vision-based systems quickly followed [5][83][23]. The popularity of visual SLAM solutions has several reasons, including the constantly decreasing cost of cameras, relatively simple measurement models, information-rich measurements and the possibility of using the same camera for other tasks, e.g. human detection and interaction.

The intensive research in the field of visual SLAM has led to the availability of numerous algorithms, methods and implementations. Such diversity stimulates further development; however, it is becoming increasingly difficult to compare different solutions on a common ground. Moreover, different methods are usually developed with a specific scenario in mind and built using the bottom-up approach. Therefore, there is a growing need for a framework generalizing and unifying different approaches. Such a framework would allow comparing the efficiency of different visual SLAM components in an otherwise identical environment. Moreover, such a framework would facilitate the design of specific visual SLAM systems using the top-down paradigm.

Despite the fact that numerous successful algorithms have been proposed, the visual SLAM problem is far from being solved and several issues remain open. There is still a need for movement models that can cope with sudden changes of movement direction or speed. Moreover, the representation of the robot's environment needs to be made more efficient and compact in order to facilitate the development of large-scale systems. The visual SLAM systems can also be extended with additional sources of measurements, ranging from wheel odometry or inertial measurement unit (IMU) orientation tracking to ego-motion estimates obtained using the 8-point algorithm [45]. Finally, there is a growing interest in the field of cooperative, multi-robot SLAM, allowing exploration of the environment by teams of heterogeneous robots. It is also worth noting that static elements of the environment, such as cameras or artificial markers, can be included in the SLAM system, thus enhancing its performance.


It is also important to note that open and widely available datasets are required to facilitate the comparison and evaluation of different solutions, as well as the reproducibility of results. Though there are several datasets aimed at benchmarking single-robot systems [17][2][114][113][106][11], none of them incorporates additional, static elements of the environment. Moreover, to the extent of the author's knowledge, there is currently no dataset allowing the evaluation of a multi-robot system.

This thesis presents a generalized framework built upon and significantly extending the well-known and successful MonoSLAM system [23][26]. The system uses the Extended Kalman Filter (EKF) for the estimation of the environment's state vector. The monocular camera is considered to be the basic sensor used by the robots. The decision to use the EKF and monocular vision was based on the fact that many newer solutions are compared with MonoSLAM [23][26], which is in fact considered to be a standard for the evaluation of SLAM systems. However, the framework can be easily adapted to any other filtering algorithm, and different types of sensors can be integrated into it.

1.2 Goals of the thesis

The goal of the thesis is to provide a generalized framework facilitating fast design of single- and multi-robot systems, evaluation of visual SLAM algorithms and identification of their limitations. In particular, the framework is used to address the following issues influencing the efficiency of the EKF-based visual SLAM:

• incorporating static cameras and artificial markers in order to create a heterogeneous, multi-robot system,

• using additional sources of measurements increasing the accuracy of the trajectory reconstruction,

• defining new models of the environment elements reducing the dimensionality of the SLAM problem.

1.3 Contributions of the thesis

The main contribution of the thesis is the development of the generalized, multi-robot visual SLAM framework. The proposed framework facilitates fast design of visual SLAM systems composed of an arbitrary number of robots, cameras and markers according to the top-down design paradigm. Moreover, the framework enables the evaluation of different variations of visual SLAM systems on a common ground and can be easily extended to incorporate new models, techniques and algorithms.

The second contribution is the design of a new robot movement model enabling measurements of the robot's orientation change and thus improving the accuracy of the trajectory reconstruction. Moreover, the influence of the movement parametrization on the system's performance was evaluated.


Thirdly, two simplified representations of the point features and local maps were introduced. The reduced models not only significantly reduce the dimensionality of the SLAM problem but also improve the accuracy of the tracking. Additionally, the introduction of the local maps enabled vision-based measurements of the relative orientation between the maps and the moving cameras. Moreover, the local maps allow predicting the appearance of a point feature's neighborhood in the image from the current camera pose and warping the template, thus improving the feature matching.

Fourthly, the environment model was extended with static cameras and artificial markers installed both statically and on the moving robots. The additional measurement sources improved the robot's trajectory tracking. The mechanisms for initialization of the multi-robot system and for mutual robot observations were proposed. Moreover, an efficient method for detection and identification of the marker, as well as the calculation of its pose w.r.t. the observing camera's coordinates, was designed and evaluated.

Fifthly, a thorough evaluation of popular image feature point detectors and descriptors in the contexts of both the essential matrix estimation and visual SLAM was performed. Finally, the robot movement capture system was designed and constructed. The methods for calibration of the spatial relations between the system's cameras, as well as between the cameras and markers placed on the robots, were designed. The system was used to register unique experimental data facilitating the work on single- and multi-robot visual navigation.

All the proposed modifications of the visual SLAM system were evaluated using the newly obtained data. The registered video sequences along with the reference trajectories will be made publicly available to facilitate further research on visual robot navigation.

1.4 Structure of the thesis

The thesis is structured as follows. Chapter 2 presents the background of the thesis, including a short history of visual SLAM systems and highlighting some of the most influential algorithms.

Chapter 3 contains a detailed description of the proposed generalized SLAM framework. The structure of the environment model is defined along with assumptions regarding the environment's characteristics. The system's operation, including its initialization, the prediction, measurement and update stages of the EKF, as well as the system's maintenance, is presented.

The models of the environment’s elements such as the robots, static cameras, artificial markers, point features and local maps are presented in chapter 4. Additionally, the image processing algorithms used in the system are described in this chapter.

The experimental data registered during the study is described in chapter 5. The motion registration system, its elements, as well as the proposed calibration methods are presented in this chapter.

The performed experiments and their results are described in chapter 6. Finally, chapter 7 concludes the thesis and describes future goals.


Review of the existing visual SLAM systems

2.1 The genesis

The early works on building a map of the environment and estimating the position of a moving robot within that map at the same time include the sonar-based navigation systems of Chatila and Laumond [18] and Crowley [21], as well as the vision-based system of Ayache and Faugeras [5]. Along with the papers of Smith and Cheesman [108] and Durrant-Whyte [29], they served as a basis for the research on SLAM systems.

The next milestone was the realization that exploring an unknown environment while observing its elements leads to highly correlated estimates of the environment elements' state. Therefore, the solution of the concurrent localization and mapping problem requires updating the whole state vector (including the pose of the tracked robot and the poses of the environment elements) after each observation. Such an idea seemed prohibitive due to the limited computational resources available at the time. After a short recess, the work on SLAM systems resumed with the important work of Csorba [22], providing a proof of the convergence of the SLAM problem.

Afterwards, numerous SLAM systems using different kinds of sensors were presented. Those included sonars [118][92], laser scanners [84][124][62][105] and GPS sensors [120]. The vision-based systems included the early works of Neira et al. [83], the first real-time monocular SLAM proposed by Davison [23][26], the works of Se et al. [102][103] on SIFT-based [71] systems, and the mono- and stereo-vision SLAM by Lemaire [63].

A more detailed description of the history of SLAM systems can be found in a two-part tutorial by Durrant-Whyte [28][8]. A survey of the latest developments in the SLAM theory can be found in [27]. A review presenting the development of visual SLAM systems is given in [35]. The following sections focus on the background of specific areas of the vision-based SLAM systems.

2.2 Algorithms

The numerous visual SLAM systems can be divided into two main groups of solutions. The first one consists of the systems based on the probabilistic filters, while the second one contains solutions based on the Structure from Motion (SfM) approach.

The successful and widely known MonoSLAM [23][26] uses the Extended Kalman Filter (EKF) to manage the state vector of the moving camera and the natural features. The main disadvantage of the EKF-based systems is their quadratic numerical complexity, hindering the design of large-scale systems. Numerous attempts to alleviate this problem have been made, including exploiting the specific structure of covariance matrices [42], the ATLAS Framework [12], the Unscented Kalman Filter [115][49][46][86], the Sparse Extended Information Filter [119] and Conditionally Independent Submaps [89].

The FastSLAM proposed by Montemerlo et al. [79] is based on the Rao-Blackwellized particle filter, in which each particle contains the pose of the robot and a map of features managed by a separate EKF filter. Such an approach allowed achieving logarithmic complexity of the system. FastSLAM was also successfully extended to the multi-robot scenario [41].

The application of the SfM approach to SLAM was mainly enabled by the increased availability of memory, which made it possible to store large sets of observations. The works of Mouragnon [81][82] present an application of the Local Bundle Adjustment (LBA) to the task of visual SLAM. The LBA is also employed in the widely recognized Parallel Tracking and Mapping (PTAM) system [53][54] proposed by Klein and Murray.

The FrameSLAM proposed by Konolige and Agarwal [56] uses a graph-based representation of the environment, with the so-called skeleton frames serving as nodes and nonlinear constraints being the edges. Recently, the g2o graph optimization framework [60] has been proposed. The framework can be used as a base for different graph-based SLAM systems such as [55].

The two main approaches to the SLAM problem were thoroughly compared in the work by Strasdat et al. [111]. The main conclusion was that the number of tracked features has the strongest impact on the accuracy of the SLAM system, which favors the optimization-based systems. However, in the case of a large uncertainty the filtration-based approaches can perform better.

2.3 Environment representation

The camera or the entity carrying it is always the main element of the SLAM environment. In general, it is assumed that the movement of the camera is smooth, i.e. there are no sudden changes in the speed or the direction. However, this is rarely the case. Several methods have been proposed to cope with this problem, ranging from extending the movement model to include the estimated accelerations [39] to reducing the model's complexity and basing the prediction on the commands issued to the robot [96]. An alternative approach to improving the robustness w.r.t. erratic movements was proposed by Pupilli and Calway [90], whose system switches from the Unscented Kalman Filter tracking to the Particle Filter in case of erratic movement and returns to the UKF once the smooth movement resumes. The motion blur in the images registered by the tracked cameras is another problem arising from rapid and erratic movement. Mechanisms to compensate for it were proposed by Mei et al. [76] and Meilland et al. [77].

Several approaches to the representation of the static elements of the environment are possible. The grid occupancy map was used by Gutmann et al. [43], while Marks et al. used the elevation variance grid map [73]. However, such an approach to the visual SLAM is rare, as dense mapping significantly increases the computational costs and hinders the development of large-scale systems.

If accurate tracking of the robot or camera is the main goal of the system, then sparse, landmark-based maps are usually used. The landmarks are considered to be easily detectable and identifiable objects whose position in the environment is tracked. Usually, dimensionless point features are the objects of choice, as their position can be represented either by Cartesian points [23] or by using the inverse depth parametrization [19]. However, different possibilities were also explored, including systems using line segments [38][64][86] or planar structures [88][75]. The analysis of the influence of point and line landmarks on the performance of the SLAM system was presented in [109]. It is worth noting that artificial objects of a known size are also used in order to improve the quality of the trajectory reconstruction, as shown in the works of Baczyk [6][7].

As most of the visual SLAM systems are based on the tracking of point features, algorithms to detect and match such points are required. Numerous algorithms allowing detection and matching of such features have been developed over the years, and their detailed description is provided in sections 4.3.5 and 4.3.6. Several attempts to assess the usefulness of various detectors and descriptors for visual navigation were made. In [40], Gil et al. identified the characteristics of point feature detectors and descriptors that positively influence the performance of the visual SLAM. The comparative analysis of the detectors and descriptors focusing on their appropriateness for the task of visual odometry was presented in [100] and [97]. Recently, Hartmann et al. presented an evaluation of the most popular descriptors in the context of graph-based SLAM [47].

2.4 Multi-robot VSLAM

Recently, increasing attention has been paid to the development of multi-robot visual SLAM systems. Howard [51] presented a system based on the particle filter allowing tracking of the movement of multiple robots. Several different scenarios were considered, involving starting the exploration with the known poses of the robots as well as completely independent exploration and merging of maps. In [41], Gil et al. present the extension of the single-robot FastSLAM [79] system to the multi-robot scenario. Unfortunately, the method was evaluated only in a simulated environment.

The work by Bellesta et al. involved the operation of independent robots whose maps were merged to create a single, global map. The individual robots operated using the FastSLAM system [79], and the proposed algorithm focused on the alignment and fusion of the maps. Moreover, it was assumed that the robots move on a 2D plane and the alignment is 2D only.

The system proposed by Wu et al. [122] uses vertical lines and door plates as the environment representation. Two robots simultaneously explore a shared workspace, building a common map. A shared environment consisting of multiple maps was also used in the system by Castle et al. [16], which allows for localization of a wearable camera within one of many shared maps.

Most of the multi-robot systems focus on the cooperation of a team of wheeled mobile robots. Therefore, the considered environment is practically 2D, which significantly simplifies the alignment of the robots. The system proposed in [34] was designed for a team of unmanned aerial vehicles (UAVs). The ground station is used to create an individual map for each of the robots and to merge them when an overlap is detected. The authors noticed that multi-robot SLAM is computationally very demanding and focused on the development of specialized data structures allowing efficient operation. The same issue was noticed by Riazuelo et al. [91], who proposed a cloud-based framework for multi-robot, graph-based visual SLAM.

2.5 Benchmarking

Both the evaluation of the existing visual SLAM solutions and the further development of the algorithms depend on the availability of benchmarking datasets. Over the years, several such datasets have been created, differing in the complexity of the environment, the number of different sensors used, the reference trajectory reconstruction method, etc.

In their paper, Abdallah et al. [2] present a system aimed at gathering benchmarking data for outdoor visual SLAM systems. The main motivation behind this work was that the vast majority of benchmarking data focuses on the indoor environment. A robot equipped with a stereo-vision camera, an inertial measurement unit (IMU) and a GPS unit was used. The New College Vision and Laser Data Set [106] is another attempt to provide data for the evaluation of outdoor SLAM systems. The authors used a robot carrying a stereo-camera, a 5-view omni-directional camera, a GPS unit and two laser scanners. Moreover, 5-DOF odometry data was provided. A similar approach was presented by Blanco et al. [11]. The authors used a robot equipped with two cameras, three laser scanners, both Real-Time Kinematic (RTK) and consumer-grade GPS, and an IMU. In all three cases the GPS measurements are used to reconstruct the ground truth trajectory.

The Rawseeds Project dataset [17] is one of the most popular datasets used for the evaluation of visual navigation algorithms. The authors used a mobile robot equipped with a number of sensors:

• a low resolution color camera,

• three high-resolution grayscale cameras forming a trinocular system,

• an omnidirectional camera,

• short- and long-range laser range finders,

• ultrasound transducers,

• inertial measurement unit.

The robot was used to register both indoor and outdoor trajectories in both static and dynamic environments. The ground truth (GT) trajectories for the indoor data were reconstructed by fusing the measurements from the static multi-camera system and laser rangefinders tracking the robot. Unfortunately, the reference data are provided only for a part of the robot's indoor trajectories. The trajectories for the outdoor sequences were acquired using a high-precision RTK GPS system.

The Rawseeds dataset suffers from two main drawbacks. First, the data registered by different sensors is sometimes desynchronized. Second, the calibration of the sensor poses w.r.t. the robot's coordinate system is described very briefly, and the data provided by the authors differ from the observed sensors' poses (refer to section 5.3). Nevertheless, the dataset contains a lot of data useful for the development of single-robot visual navigation systems.

The increasing popularity of the Kinect sensor led to the creation of the Freiburg RGB-D dataset [114][113]. The dataset consists of a number of sequences registered with the Kinect sensor, which was either hand-held or attached to a Pioneer mobile robot. The data was gathered in an indoor environment comprising an office and an industrial hall. A high-precision movement tracking system was used to reconstruct the ground truth trajectories. The authors have also provided a methodology to evaluate the visual navigation algorithms. Despite being aimed mostly at RGB-D systems, the dataset is also widely used as a benchmark for purely visual systems. Its main weakness is the fact that the Kinect sensor captures images of relatively low resolution (640 × 480). Moreover, no additional sensors such as an IMU were used.

An interesting approach to the evaluation of visual SLAM systems was presented by Funke and Pietzsch [37]. The authors noted that a precise ground truth trajectory can be obtained only in artificial, rendered sequences. Therefore, they proposed a system generating both the reference trajectory and the video sequence registered by a virtual camera moving through a virtual environment.

To the extent of the author's knowledge, there are no publicly available benchmarking datasets for the evaluation of multi-robot visual navigation systems. Moreover, none of the currently used datasets incorporates video sequences from static cameras.


The generalized visual SLAM framework

3.1 The probabilistic environment model

The proposed probabilistic model of the environment is a core part of the generalized visual SLAM framework. It is based on the probabilistic feature map used in [107][23][26]. However, unlike the original approach, the environment is not limited to a single camera moving within a global map made of point features (figure 3.1). In order to provide more flexibility, it is assumed that the environment consists of an arbitrary number of (figure 3.2):

• robots,

• cameras,

• natural point features,

• artificial markers,

• local maps.

The robots are the main element of the environment because only they are active and capable of exploring the environment. Each robot is equipped with at least one camera and possibly other sensors allowing partial state measurements (e.g. an IMU for the orientation measurement). Moreover, markers can be attached to the robot, allowing visual observations of the robot. It is worth noting that though a robot may carry an arbitrary number of cameras, no stereo-vision algorithms are explicitly used. It is also assumed that the robots move independently of each other. The robot models implemented within the presented framework are described in section 4.1.

Figure 3.1: The environment model of the MonoSLAM system consisting of a camera and point features

The cameras are placed either on a moving robot or statically within the environment. They serve as the main sensor used to update the state of the environment. The robots' onboard cameras are used to observe point features, to estimate the orientation change between two different camera poses and to observe artificial markers placed on robots or within the environment. The static cameras are used to observe the markers attached to the moving robots and thus facilitate the tracking of the robots' movement. It is assumed that the position of the static cameras is not known during the system start-up and thus is initialized upon the first successful observation of a mobile robot. Section 4.2 provides the description of the camera representation.

The point features constitute the main part of the static environment. They correspond to characteristic elements of the environment that are easily trackable by cameras despite changes in lighting and camera position (e.g. door, window or furniture corners). Their observations by the cameras placed onboard the mobile robots are the main source of information used to track the robots' movement. It is assumed that no prior knowledge regarding the position and size of the features is available. As cameras are bearing-only sensors, initially only the line passing through the feature and the camera's focal point can be defined, and no depth information can be obtained. Consecutive observations of the feature allow its estimated position to converge to a point. The models of the point features used in the proposed framework are presented in section 4.3.

The artificial markers are easily observable and identifiable objects of a known size used to facilitate the robots' navigation [6][7]. As their size is known a priori, a single observation is sufficient to calculate their pose relative to the camera. The markers can be placed either in the environment (further called environment markers) or on the mobile robots (further called robot markers). In the first case, their position is not known during the system start-up, and they are initialized after the first successful observation by a robot's onboard camera. Their consecutive observations can be used to reduce the uncertainty of the robot's movement. In the case of the robot markers, their pose relative to the robot's main coordinate system is assumed to be known. The robot markers allow for visual observations of robots either by other robots or by the stationary cameras. Section 4.4 contains a description of the marker model implemented within the proposed framework.

The local maps are used to group features visible from a particular place. They are defined as local coordinate systems relative to the global coordinates and do not correspond to any physical entity. The position of a feature within a map is defined w.r.t. the local coordinate system of the appropriate map. The application of maps simplifies the feature management and feature observation, as described in section 4.3.2. Moreover, the relative orientation of the camera w.r.t. the map can be measured and used to update the system, as described in section 4.3.2.
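To make the composition of the environment model easier to picture, the sketch below outlines one possible set of containers for the element types listed above. It is a minimal illustration in Python with hypothetical class and field names (Environment, Robot, Camera, etc.); it is not the data structure used in the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Camera:
    state: np.ndarray                      # pose [rx, ry, rz, qa, qb, qc, qd] w.r.t. the parent frame

@dataclass
class Marker:
    state: np.ndarray                      # pose of the marker w.r.t. its parent frame

@dataclass
class PointFeature:
    state: np.ndarray                      # e.g. Cartesian or inverse-depth parametrization
    map_id: int = 0                        # index of the (local or global) map the feature belongs to

@dataclass
class LocalMap:
    state: np.ndarray                      # pose of the local map frame w.r.t. the global map

@dataclass
class Robot:
    state: np.ndarray                      # pose and motion parameters (model-dependent)
    cameras: List[Camera] = field(default_factory=list)   # onboard cameras, poses known w.r.t. the robot
    markers: List[Marker] = field(default_factory=list)   # robot markers, poses known w.r.t. the robot

@dataclass
class Environment:
    robots: List[Robot] = field(default_factory=list)
    static_cameras: List[Camera] = field(default_factory=list)
    maps: List[LocalMap] = field(default_factory=list)
    features: List[PointFeature] = field(default_factory=list)
    env_markers: List[Marker] = field(default_factory=list)
```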

The main task of the model is to provide the current estimates of the environment elements' state as well as the uncertainty of those estimates. The model is created and initialized at the start of the SLAM system and is maintained during the system's operation by the Extended Kalman Filter.

Figure 3.2: The environment model of the proposed system consisting of robots with onboard cameras, static cameras, markers, maps and point features

The model is represented by the state vector x, created by concatenating the state vectors of all the environment elements:

\[
x = \begin{bmatrix} x_r & x_s \end{bmatrix}^T
  = \begin{bmatrix} x_r & x_c & x_m & x_f & x_a \end{bmatrix}^T
  = \begin{bmatrix} x_r^1 & \dots & x_r^{n_R} & x_c^1 & \dots & x_c^{n_C} & x_m^1 & \dots & x_m^{n_M} & x_f^1 & \dots & x_f^{n_F} & x_a^1 & \dots & x_a^{n_A} \end{bmatrix}^T
\tag{3.1}
\]

where $x_r^i$ is the state vector estimate of the i-th robot, $n_R$ is the current number of robots and $x_r$ is the concatenation of all the robots' state vectors. Analogous symbols are defined for the cameras ($x_c^i$, $n_C$, $x_c$), maps ($x_m^i$, $n_M$, $x_m$), point features ($x_f^i$, $n_F$, $x_f$) and artificial markers ($x_a^i$, $n_A$, $x_a$). Finally, $x_s$ is the concatenation of the state vectors of all static elements of the environment.
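Assembling the stacked state vector of equation (3.1) then amounts to concatenating the per-element state vectors in a fixed order. A minimal sketch, assuming the hypothetical Environment container from the previous snippet and NumPy arrays for the element states:

```python
import numpy as np

def stack_state(env):
    """Concatenate the element state vectors in the order of eq. (3.1):
    robots, then cameras, maps, point features and markers (the static part)."""
    blocks = (
        [r.state for r in env.robots]
        + [c.state for c in env.static_cameras]
        + [m.state for m in env.maps]
        + [f.state for f in env.features]
        + [a.state for a in env.env_markers]
    )
    return np.concatenate(blocks) if blocks else np.zeros(0)
```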


It is assumed that the uncertainty of the state vector estimate can be modeled by a single, multivariate Gaussian distribution and is represented by a covariance matrix $P$:

\[
P = \begin{bmatrix} P_{rr} & P_{rs} \\ P_{sr} & P_{ss} \end{bmatrix}
  = \begin{bmatrix}
      P_{rr} & P_{rc} & P_{rm} & P_{rf} & P_{ra} \\
      P_{cr} & P_{cc} & P_{cm} & P_{cf} & P_{ca} \\
      P_{mr} & P_{mc} & P_{mm} & P_{mf} & P_{ma} \\
      P_{fr} & P_{fc} & P_{fm} & P_{ff} & P_{fa} \\
      P_{ar} & P_{ac} & P_{am} & P_{af} & P_{aa}
    \end{bmatrix}
\tag{3.2}
\]
\[
\phantom{P} = \begin{bmatrix}
      P_{x_r^1 x_r^1}      & \cdots & P_{x_r^1 x_r^{n_R}}      & P_{x_r^1 x_c^1}      & \cdots & P_{x_r^1 x_a^{n_A}} \\
      \vdots               & \ddots & \vdots                   & \vdots               &        & \vdots \\
      P_{x_r^{n_R} x_r^1}  & \cdots & P_{x_r^{n_R} x_r^{n_R}}  & P_{x_r^{n_R} x_c^1}  & \cdots & P_{x_r^{n_R} x_a^{n_A}} \\
      P_{x_c^1 x_r^1}      & \cdots & P_{x_c^1 x_r^{n_R}}      & P_{x_c^1 x_c^1}      & \cdots & P_{x_c^1 x_a^{n_A}} \\
      \vdots               &        & \vdots                   & \vdots               & \ddots & \vdots \\
      P_{x_a^{n_A} x_r^1}  & \cdots & P_{x_a^{n_A} x_r^{n_R}}  & P_{x_a^{n_A} x_c^1}  & \cdots & P_{x_a^{n_A} x_a^{n_A}}
    \end{bmatrix}
\]

where each block $P_{x_i x_j}$ is the cross-covariance between the state vectors of the corresponding elements.

It is worth noting that the model is dynamic, i.e. its dimension is not constant, as robots and cameras may be added or removed during its operation and features, markers and maps can be added or removed as a result of the environment exploration.
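Because $P$ in (3.2) is one joint covariance matrix whose dimension changes as elements come and go, a natural way to read a cross-covariance block is to slice it by each element's offset and state dimension. The snippet below is a small illustration under that assumption; the offset bookkeeping is hypothetical and would have to be updated whenever elements are added or removed.

```python
import numpy as np

def cross_covariance(P, offsets, dims, i, j):
    """Return the block P_{x_i x_j} of the joint covariance matrix (3.2).

    offsets[k] is the first row/column of element k in P and dims[k] is the
    length of its state vector; both change when elements are added or removed."""
    rows = slice(offsets[i], offsets[i] + dims[i])
    cols = slice(offsets[j], offsets[j] + dims[j])
    return P[rows, cols]

# Example: P = np.eye(20); offsets = [0, 7, 14]; dims = [7, 7, 6]
# block = cross_covariance(P, offsets, dims, i=0, j=2)   # 7 x 6 cross-covariance block
```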

In general, adding a new element to the model is the result of an observation by one of the cameras. The initial state vector is calculated as a function of the current state vector of the system $x$ and the observation $h$, whose uncertainty is modeled by an observation covariance matrix $P_h$. Moreover, some state variables of the new element can be initialized with arbitrarily chosen values $v$, whose initial uncertainty is described by a covariance matrix $P_v$. Thus, the new state vector of the system is constructed as:

\[
x_{new} = \begin{bmatrix} x_{old} \\ g\!\left(x_{old}, h, v\right) \end{bmatrix}
\tag{3.3}
\]

where $g\!\left(x_{old}, h, v\right)$ is the function describing the calculation of the new element's state vector.

As expanding the state vector must be accompanied by extending the state covariance matrix, the new covariance is given as:

\[
P_{new} = J_g \begin{bmatrix} P_{old} & 0 & 0 \\ 0 & P_h & 0 \\ 0 & 0 & P_v \end{bmatrix} J_g^T
\tag{3.4}
\]
\[
J_g = \begin{bmatrix} I & 0 & 0 \\ J_g(x_{old}) & J_g(h) & J_g(v) \end{bmatrix}
\tag{3.5}
\]
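Equations (3.3)-(3.5) translate directly into an augmentation of the state vector and covariance matrix. The sketch below is a minimal illustration under the assumption that the caller supplies the initialization function g and its three Jacobian blocks; it is not the thesis implementation.

```python
import numpy as np

def augment(x_old, P_old, h, P_h, v, P_v, g, Jg_x, Jg_h, Jg_v):
    """Add a new element initialized from observation h and arbitrary values v.

    g(x_old, h, v) returns the new element's state vector, eq. (3.3);
    Jg_x, Jg_h, Jg_v are the Jacobians of g w.r.t. x_old, h and v, eq. (3.5)."""
    x_e = g(x_old, h, v)
    x_new = np.concatenate([x_old, x_e])

    n, m, k, l = x_old.size, x_e.size, h.size, v.size
    # Jacobian of the augmented state w.r.t. [x_old, h, v], eq. (3.5)
    J = np.zeros((n + m, n + k + l))
    J[:n, :n] = np.eye(n)
    J[n:, :n] = Jg_x
    J[n:, n:n + k] = Jg_h
    J[n:, n + k:] = Jg_v

    # Block-diagonal covariance of [x_old, h, v] propagated through J, eq. (3.4)
    P_in = np.zeros((n + k + l, n + k + l))
    P_in[:n, :n] = P_old
    P_in[n:n + k, n:n + k] = P_h
    P_in[n + k:, n + k:] = P_v
    return x_new, J @ P_in @ J.T
```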

3.2 Coordinate systems hierarchy and pose representation

The modeled environment consists of many coordinate systems, either corresponding to physical objects (i.e. robots, cameras and markers) or being purely virtual (i.e. local maps), as well as numerous dimensionless points (i.e. point features). The spatial relations between those entities form a hierarchical structure (figure 3.3).

The global map serves as a constant reference coordinate system for the SLAM framework. The pose of robots, static cameras and environment markers is defined w.r.t. the global frame and depends on the current estimate of the appropriate object's state vector. The pose of the onboard cameras and markers is expressed within the local coordinate system of the appropriate robot.

The point features can be assigned to either the global or any of the local maps and their position is defined with regard to the appropriate map frame. The detailed description of the point features representation is provided in section 4.3.

The state vector of each environment element with a corresponding coordinate system (i.e. robots, maps, cameras and markers) contains the position vector $r$ and the unitary orientation quaternion $q$ defining the transformation between the local coordinate system and the parent's coordinates:

\[
x = \begin{bmatrix} r & q \end{bmatrix}^T = \begin{bmatrix} r_x & r_y & r_z & q_a & q_b & q_c & q_d \end{bmatrix}^T
\tag{3.6}
\]

Using the Cartesian vector to represent the position of an element is self-explanatory. However, the choice of the orientation representation is more complex. There are several possible options, including the rotation matrix, Euler angles, axis-angle notation, Rodrigues' formula and unitary quaternions. As there are only 3 degrees of freedom associated with the rotation, only the Euler angles and Rodrigues' formula are not redundant. However, only the rotation matrix and quaternion representations do not suffer from singularities. As the latter quality is crucial for the successful operation of the SLAM system, the more compact quaternion representation was selected.

During the operation of the SLAM system the position vectors, direction vectors and orientation quaternions are often transformed to a different coordinate system within the hierarchy presented above. The mechanism of those conversions, along with the calculation of the Jacobian matrices necessary to propagate the uncertainty, is presented below.


Figure 3.3: An example of the coordinate systems hierarchy

In case of operations on vectors, the orientation quaternion can be easily converted to the rotation matrix:

\[
R(q) = \begin{bmatrix}
q_a^2 + q_b^2 - q_c^2 - q_d^2 & 2 q_b q_c - 2 q_d q_a & 2 q_b q_d + 2 q_c q_a \\
2 q_b q_c + 2 q_d q_a & q_a^2 - q_b^2 + q_c^2 - q_d^2 & 2 q_c q_d - 2 q_b q_a \\
2 q_b q_d - 2 q_a q_c & 2 q_a q_b + 2 q_c q_d & q_a^2 - q_b^2 - q_c^2 + q_d^2
\end{bmatrix}
\tag{3.7}
\]

In case of the rotation of a free vector $p_L = \begin{bmatrix} p_{Lx} & p_{Ly} & p_{Lz} \end{bmatrix}^T$ from the local frame to the parent's coordinates, the rotated vector $p_P = \begin{bmatrix} p_{Px} & p_{Py} & p_{Pz} \end{bmatrix}^T$ is given as:
\[
p_P = R(q)\, p_L
\tag{3.8}
\]

The rotation matrix depends only on the orientation quaternion. However, the vector being rotated also affects the values of the Jacobian matrix. Therefore, the Jacobians of the free vector $p_P$, being the result of the rotation of the free vector $p_L$, w.r.t. the position vector $r$ and the orientation quaternion $q$ are further denoted as $JR_{p_L}^{p_P}(r)$ and $JR_{p_L}^{p_P}(q)$ correspondingly and are defined as:

\[
JR_{p_L}^{p_P}(r) = 0_{3,3}
\tag{3.9}
\]
\[
JR_{p_L}^{p_P}(q) = 2 \begin{bmatrix}
p_{Lx} q_a - p_{Ly} q_d + p_{Lz} q_c & p_{Lx} q_b + p_{Ly} q_c + p_{Lz} q_d & p_{Ly} q_b - p_{Lx} q_c + p_{Lz} q_a & p_{Lz} q_b - p_{Lx} q_d - p_{Ly} q_a \\
p_{Ly} q_a + p_{Lx} q_d - p_{Lz} q_b & p_{Lx} q_c - p_{Ly} q_b - p_{Lz} q_a & p_{Lx} q_b + p_{Ly} q_c + p_{Lz} q_d & p_{Lx} q_a - p_{Ly} q_d + p_{Lz} q_c \\
p_{Ly} q_b - p_{Lx} q_c + p_{Lz} q_a & p_{Ly} q_a + p_{Lx} q_d - p_{Lz} q_b & p_{Ly} q_d - p_{Lx} q_a - p_{Lz} q_c & p_{Lx} q_b + p_{Ly} q_c + p_{Lz} q_d
\end{bmatrix}
\tag{3.10}
\]

If the vector is rotated from the parent coordinate system to the local one, it is calculated as:
\[
p_L = R(q)^T p_P
\tag{3.11}
\]

Analogously, the rotation depends only on the orientation quaternion, yet the vector $p_P$ affects the values of the Jacobians. Thus, the Jacobians of the inverse rotation of the free vector $p_P$ are defined as:
\[
JRi_{p_P}^{p_L}(r) = 0_{3,3}
\tag{3.12}
\]

\[
JRi_{p_P}^{p_L}(q) = 2 \begin{bmatrix}
p_{Px} q_a + p_{Py} q_d - p_{Pz} q_c & p_{Px} q_b + p_{Py} q_c + p_{Pz} q_d & p_{Py} q_b - p_{Px} q_c - p_{Pz} q_a & p_{Py} q_a - p_{Px} q_d + p_{Pz} q_b \\
p_{Py} q_a - p_{Px} q_d + p_{Pz} q_b & p_{Px} q_c - p_{Py} q_b + p_{Pz} q_a & p_{Px} q_b + p_{Py} q_c + p_{Pz} q_d & p_{Pz} q_c - p_{Px} q_a - p_{Py} q_d \\
p_{Px} q_c - p_{Py} q_b + p_{Pz} q_a & p_{Px} q_d - p_{Py} q_a - p_{Pz} q_b & p_{Px} q_a + p_{Py} q_d - p_{Pz} q_c & p_{Px} q_b + p_{Py} q_c + p_{Pz} q_d
\end{bmatrix}
\tag{3.13}
\]

If the transformed vector is bound (i.e. it describes a position of a point within the frame rather than a direction), the translation vector $r$ has to be included in the transformation. The position of a point $p_L$ transformed to the parent coordinate system is calculated as:
\[
p_P = R_{PL}(q)\, p_L + r
\tag{3.14}
\]
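The bound-vector transformation (3.14) composes the rotation with the translation of the frame origin, and applying it repeatedly walks a point up the hierarchy of figure 3.3 (e.g. from an onboard camera frame through the robot frame to the global map). A minimal sketch, assuming the quat_to_rot helper from the eq. (3.7) snippet is in scope and using purely illustrative variable names:

```python
import numpy as np

# quat_to_rot(q) is the R(q) helper from the eq. (3.7) sketch above.

def to_parent(p_local, r, q):
    """Bound-point transformation of eq. (3.14): p_P = R_PL(q) p_L + r,
    where (r, q) is the pose of the local frame expressed in its parent frame."""
    return quat_to_rot(q) @ p_local + r

# Walking a point up the hierarchy of figure 3.3 (poses are illustrative):
# p_robot  = to_parent(p_camera, r_cam_in_robot, q_cam_in_robot)
# p_global = to_parent(p_robot,  r_robot_in_map, q_robot_in_map)
```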
