DEPTH EXTRACTION, REFINEMENT AND CONFIDENCE ESTIMATION FROM IMAGE DATA
Dissertation

for the purpose of obtaining the degree of doctor at Technische Universiteit Delft,

by the authority of the Rector Magnificus Prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates,

to be defended in public on Wednesday 17 June 2015 at 15:00

by

Görkem SAYGILI

Master of Electrical and Computer Engineering, Koç University, Turkey
born in İzmir, Turkey.


This dissertation has been approved by the promotor: Prof. dr. ir. M. J. T. Reinders

Copromotor: Dr. E. A. Hendriks

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. ir. M. J. T. Reinders, Technische Universiteit Delft, promotor
Dr. E. A. Hendriks, Technische Universiteit Delft, copromotor

Independent members:

Prof. dr. T. Gevers, University of Amsterdam
Prof. dr. ir. P. P. Jonker, Technische Universiteit Delft
Prof. dr. ir. B. Lelieveldt, Technische Universiteit Delft
Prof. dr. A. M. Tekalp, Koç University
Dr. ir. C. Varekamp, Philips Research
Prof. dr. ir. A. Hanjalic, Technische Universiteit Delft, reserve member

Advanced School for Computing and Imaging

This work was carried out in graduate school ASCI. ASCI dissertation series number 332.

Cover Designed by G. Saygılı

Printed by Proefschriftmaken.nl || Uitgeverij BOXPress

ISBN 978-94-6186-475-8

Copyright © 2015 by Görkem Saygılı

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without written permission from the author. An electronic version of this dissertation is available at


CONTENTS

1 Introduction 1

1.1 HVS and 3D Vision . . . 1

1.2 Challenges in Depth Extraction. . . 3

1.2.1 Depth Extraction with Passive Sensors . . . 4

1.2.2 Depth Measurement with Active Sensors. . . 5

1.2.3 Additional Information for Depth Extraction . . . 5

1.3 Thesis Contributions . . . 6

2 Stereo Matching with SURF Key Points 9

2.1 Disparity Estimation . . . 10

2.2 Experiments . . . 12

2.3 Conclusion. . . 13

3 GPU Optimization Performance of Stereo Algorithms 15

3.1 Similarity Measures and Aggregation Strategies . . . 17

3.1.1 Similarity Measures . . . 17

3.1.2 Aggregation Strategies . . . 18

3.2 GPU Optimization Techniques . . . 19

3.2.1 GPU Programming Model . . . 19

3.2.2 Performance Optimization on GPUs . . . 20

3.3 Experimental Evaluation . . . 21

3.3.1 Experimental Setup. . . 21

3.3.2 Speed Analysis. . . 21

3.3.3 Accuracy Analysis . . . 22

3.4 Conclusion. . . 23

4 Kinect Depth Map Refinement 25

4.1 Related Work . . . 27

4.2 The Proposed Depth Refinement Algorithm. . . 31

4.2.1 Depth-Guided Statistical Region Merging (DSRM). . . 32

4.2.2 Robust Polynomial Fitting Using Iteratively Reweighted Least Squares (IRLS) with Huber Weighting . . . 34

4.2.3 The Refinement . . . 35

4.3 Experiments . . . 36

4.3.1 Dataset. . . 37

4.3.2 Results. . . 38


5 Transparent Object Depth Refinement 45

5.1 MRF-Based Hybrid Depth Map Refinement . . . 47

5.1.1 Cross-Modal Stereo . . . 47

5.1.2 Fully-Connected CRF Energy Model. . . 48

5.2 Experiments . . . 51

5.3 Conclusion. . . 54

6 Stereo Similarity Measure Fusion 57

6.1 Introduction . . . 57

6.2 Related Work. . . 59

6.3 Stereo Matching with Similarity Fusion. . . 60

6.4 Experiments . . . 62

6.4.1 Performance of Stereo Confidence . . . 63

6.4.2 Experiments with Different Consensus Window Sizes. . . 64

6.4.3 Fusion of All Similarities. . . 65

6.4.4 Fusion of Intensity (AD) Similarity with Other Similarities . . . 66

6.4.5 Constant Window Aggregation. . . 67

6.4.6 Execution Time . . . 69

6.4.7 Effect of Fusion on Global Energy Minimization . . . 69

6.4.8 Results on the KITTI Data Set . . . 70

6.4.9 Cross-Modal Stereo between Infra-Red and RGB Views . . . 71

6.5 Discussion. . . 72

6.5.1 Effect of Confidence . . . 72

6.5.2 Effect of Consensus . . . 72

6.5.3 Effect of Improvement on the Initial DSI . . . 74

6.5.4 Possible Limitations . . . 74

6.5.5 Possible Future Directions . . . 74

6.6 Conclusion. . . 74

7 Confidence Estimation for Medical Image Registration 75

7.1 Introduction . . . 75

7.2 Related Work. . . 77

7.3 Feature Extraction . . . 78

7.4 Matching - Full Search. . . 79

7.5 Confidence Estimation. . . 81

7.6 Experiments . . . 84

7.6.1 Data and Experimental Setup. . . 86

7.6.2 Results. . . 89

7.7 Discussion and Conclusion. . . 91

8 Conclusion and Future Work 95

8.1 Conclusion . . . 95

8.1.1 Stereo Matching . . . 95

8.1.2 Depth Map Refinement for Depth Sensors . . . 96

8.1.3 Hybrid Refinement Techniques. . . 96

8.1.4 Similarity Fusion. . . 96


8.2 Future Work . . . 96

8.2.1 Stereo Matching . . . 97

8.2.2 Depth Map Refinement for Depth Sensors . . . 97

8.2.3 Hybrid Refinement Techniques. . . 98

8.2.4 Similarity Fusion. . . 98

8.2.5 Confidence Estimation for Medical Image Registration . . . 99

References . . . 100

Summary 111

Samenvatting 113

Acknowledgment 115

Curriculum Vitæ 119

List of Publications 121


1

INTRODUCTION

The visual system is one of the most important systems of many animal species, having evolved over millions of years, and it has a huge impact on social interactions between animals. The Human Visual System (HVS) is among the most advanced visual systems in the living world, providing fast and accurate 3D perception of the environment. For centuries, the HVS has been the focus of interest of many researchers, artists and philosophers seeking to understand and even challenge it. Only since the 1970s have we been trying to teach computers how to see and understand observed scenes, a research field that we call Computer Vision (CV). Yet there are still many challenges in making computers see.

1.1. HVS AND 3D VISION

In nature, measuring the distance (depth) to obstacles is crucial for almost all living organisms. For example, predators need the distance to their prey in order to hunt and survive. Distance information is also vital for humans in their daily lives; a typical example is the necessity of distance estimation while driving, as it would not be possible to avoid collisions in traffic without depth perception. Species have different receptors to obtain this vital information. For instance, bats use sonar echoes to estimate the distance to objects surrounding them [1], and dolphins fuse their visual system and their sonar echo system to improve their depth perception [2]. The visual system is, however, the most common means across species, including humans, of extracting distance information. Animals with different eye positions use various ways to extract depth information: animals with frontal eyes, such as dogs and cats, use stereo vision, whereas animals without frontal eyes, such as rabbits, use motion cues.

Since humans have frontal eyes, they can perceive depth using their two eyes (stereo vision). However, they still incorporate depth information from monocular cues, such as texture gradients, shadow effects and atmospheric perspective. The human eyes are aligned horizontally, which imposes a horizontal shift (disparity) between the corresponding points of the viewed objects in the scene. The disparity is inversely correlated with the distance of the objects. Hence, it is possible to extract the depth of the scene by using the inverse disparity. The neurons that perform this task reside in the primary visual cortex (V1) of the human brain.

Figure 1.1: 3D impression from 2D paintings by Giovanni Paolo Pannini: (a) St. Peter's Basilica, (b) Modern Rome.

Disparity extraction is possible as long as the corresponding points are visible in both views. In the case of non-corresponding regions or stereo estimation failure, monocular cues are used to extract the depth information. These monocular cues are also used as a priori knowledge to guide the depth perception of the brain. A typical example is linear perspective: parallel lines in the 3D world that recede from the observer appear to converge and intersect at infinity when projected onto 2D. This fact has long been exploited by artists to achieve more realistic impressions in their paintings, as depicted in Fig. 1.1. Another monocular depth cue that is well applied in art is shadow: the shadow of objects in the front can be observed on the objects in the back, which indicates their positions with respect to each other, i.e., their relative depths, as in Fig. 1.1b. This is especially useful when there is no textural or color difference between two objects. Similarly, the luminance of nearby surfaces appears brighter than that of objects farther away. Additionally, the size of objects can be used as an indication of their depth, as sizes become smaller when the distance to the observer increases.

The fusion of different depth cues in the brain is a complex procedure. Research shows that the HVS achieves the fusion by calculating a weighted linear combination of the cues [3]. The weights are assigned with respect to each cue's individual reliability. Experiments have further shown that as one of the depth cues becomes less reliable, its corresponding weight in the fusion is decreased in order not to distort the correct depth estimation obtained from the other cues [3]. Fig. 1.2 shows two examples of illusions in which the HVS produces an erroneous depth percept: even though the regions indicated in red in each picture have exactly the same size, the HVS perceives them as having different sizes because of the other monocular cues in the image.

Similar to natural vision systems, 3D computer vision also makes use of monocular and binocular cues to extract depth information from captured views of scenes. For binocular vision, stereo matching algorithms aim to find matching points in both views in order to calculate their disparity. If the camera geometry is known by means of calibration, the obtained disparity can be converted to depth. Monocular cues are used both for improving the results from stereo and for extracting depth using a single camera. Furthermore, similar to the sonar of dolphins and bats, active depth sensors are also used in computer vision to extract depth information without the need for additional computational effort. A popular example of a depth sensor is Microsoft's Kinect depth sensor.

Figure 1.2: Ponzo illusions [4] for the HVS: (a) the lengths of the red lines, and (b) the sizes of the red rectangles are the same. In both examples, the HVS perceives the lengths and sizes differently because of misleading monocular cues and the lack of other cues.

In addition to all these individual techniques, there is active research on improving depth estimation by fusing different technologies and techniques. An interesting example is automatically driving vehicles, as shown in Fig. 1.3a, which are being designed and tested by companies such as Google. Since fatal accident rates in traffic are high, the main aim of designing automatically driving cars is to achieve safer transportation. Such cars have multiple sensors, including RGB cameras and depth sensors, to measure the distance of obstacles in the vicinity of the vehicle. The data is processed by the computers inside the vehicle and decisions are taken accordingly. There are also smart cars with driver-assistance capabilities that can help drivers avoid accidents: while the car is controlled by the driver, its processing unit collects data from the visual sensors and can override the driver in case of emergency. In robotic vision as well, knowledge of depth is crucial for the interaction of the system with other objects. Although robotic vision sensors work well for most common objects, objects with transparent or glassy surfaces remain challenging to handle. The fusion of data from different sensors might solve these challenges.

1.2. CHALLENGES IN DEPTH EXTRACTION

The existing depth extraction methods can be grouped into two classes based on the sensors that are used: passive methods and active methods. In passive methods, the sensors are not specifically designed for depth extraction; they provide images of the scene without any depth information, as RGB cameras do, and depth information is obtained by processing these images with depth extraction algorithms. In contrast, active methods rely on depth sensors, such as the Kinect, that are specifically designed for depth extraction. They estimate the depth from an analysis of the reflection of a transmitted signal on the scene.

Figure 1.3: Examples of CV applications in daily life: (a) Google's automatically driving vehicle [5], and (b) Honda robots [6].

1.2.1. DEPTH EXTRACTION WITH PASSIVE SENSORS

Depth extraction with passive sensors is an under-constrained problem, which means there is no unique solution for every location in the image. Hence, many algorithms aim to improve the quality of the extracted depth by fusing different cues.

Stereo matching is one example of a passive depth extraction method. The aim of stereo matching is to find matching points between two stereo images of the same scene taken from different viewpoints. The correct matches are expected to be at horizontally-shifted positions, and the amount of horizontal shift defines the disparity. In order to find the disparity of each point, stereo algorithms search for the corresponding point in the other image over a reasonable horizontal displacement that depends on the distance between the cameras. This procedure of evaluating every candidate position is called a full-search. Each evaluation compares the similarity of the image content at the candidate position with the reference point, and the best matching candidate for a reference point is the point in the full-search with maximum similarity.

Occlusion is an important source of ambiguity for stereo matching algorithms. Occlusion happens when a region of the background that appears in one view cannot be seen in the other view because it is covered by a foreground region. Since the occluded points have no correspondence in the other view, stereo vision fails to find the correct disparity in occluded areas. To tackle the occlusion problem, stereo algorithms incorporate monocular cues. One example is to use the information from neighboring non-occluded points: as long as there are neighboring non-occluded points of the same object, stereo algorithms can propagate their disparity into the occluded regions.

In addition to occlusion, multiple correspondences can cause significant problems in stereo vision. This problem arises when there is texture repetition in an image: finding the correct correspondence becomes non-trivial because of multiple similar candidate matches. Multiple correspondences can also occur when there is not enough distinguishable texture in a region. In homogeneous color regions that have no texture, points are indistinguishable, which causes ambiguity and inconsistency in the disparity map.

Figure 1.4: The general principle of triangulation of the Kinect: (a) IR image, (b) triangulation of the received pattern, where the pattern reflected from the reference plane is already known.

1.2.2. DEPTH MEASUREMENT WITH ACTIVE SENSORS

Active depth sensors provide accurate depth measurements without requiring much additional computational effort. The Microsoft Kinect, introduced in 2010, is arguably the most popular depth sensor. It measures the depth of surfaces by emitting a structured infrared (IR) light pattern and receiving it back with its IR receiver. The projected pattern is visible in Fig. 1.4a as bright dots. Triangulation of the scattered (received) pattern against the original (transmitted) pattern gives a depth measurement at each location. The triangulation principle is depicted in Fig. 1.4b. The depth of the object, $Z_o$, can be estimated by:

$$Z_o = \frac{Z_r}{1 + \frac{Z_r}{fb}\, d}. \qquad (1.1)$$

In Eq. 1.1, $Z_r$ is the distance of the reference plane and $d$ is the disparity between the scattered and the original patterns. $f$ and $b$ are the focal length of the IR camera and the baseline between the IR receiver and the IR transmitter, respectively; both are known approximately and can also be measured via calibration.
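As a quick sanity check of Eq. 1.1, the C sketch below evaluates the triangulated depth for a few disparities. The parameter values are hypothetical placeholders for illustration, not calibration data from the thesis.

```c
#include <stdio.h>

/* Depth from Kinect triangulation (Eq. 1.1): Zo = Zr / (1 + Zr*d/(f*b)).
 * zr: reference plane distance [m]; f: IR focal length [px];
 * b: projector-receiver baseline [m]; d: pattern disparity [px]. */
static double kinect_depth(double zr, double f, double b, double d)
{
    return zr / (1.0 + (zr * d) / (f * b));
}

int main(void)
{
    /* Hypothetical values, for illustration only. */
    const double zr = 2.0, f = 580.0, b = 0.075;
    for (double d = -20.0; d <= 20.0; d += 10.0)
        printf("d = %+5.1f px  ->  Zo = %.3f m\n",
               d, kinect_depth(zr, f, b, d));
    return 0;
}
```

Note how a zero disparity reproduces the reference distance, while positive and negative disparities move the object plane closer and farther, consistent with the triangulation geometry of Fig. 1.4b.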

The sensor can measure depth as long as the received pattern is not distorted. However, objects with specular or transparent surfaces distort this pattern, which prevents the Kinect from measuring depth in those regions, as depicted in Fig. 1.5. Unfortunately, such challenging surfaces are also problematic for other depth sensors. Even our own HVS suffers from this problem; a familiar example is a glass door.

1.2.3. ADDITIONAL INFORMATION FOR DEPTH EXTRACTION

To estimate the depth of the aforementioned challenging surfaces, depth estimation algorithms can incorporate additional cues.

Single View

Different methods exist to estimate depth from a single view, such as supervised learning based on a combination of monocular cues [7], structure from texture [8], structure from shading [9] and structure from motion [10]. These methods can be used when there is neither a depth sensor nor more than one camera to record the scene from different viewpoints. However, these methods are outside the scope of this thesis.

Figure 1.5: Kinect depth measurements on objects with specular and transparent surfaces.

Confidence Measures

Due to ambiguities, stereo matching algorithms fail to find correct disparities at various points in the image. To locate such errors, stereo confidence measures have been proposed. A confidence measure assigns each point a value that reflects the ambiguity of its match. For very distinctive points in both images, the ambiguity is generally low, since the point is correctly matched with high probability. In the presence of occlusion, pattern repetition or textureless regions, the ambiguity is higher, and consequently the confidence is expected to be lower. The confidence can be used to filter out ambiguous matches.
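As a concrete illustration, the C sketch below computes a peak-ratio style confidence from the matching costs of one pixel. This is one common measure from the stereo confidence literature, used here as an assumed example; it is not necessarily the exact measure used later in this thesis.

```c
#include <float.h>

/* Peak-ratio confidence: a match is confident when the best matching cost
 * c1 is clearly lower than the second-best cost c2 over the disparity
 * range.  Flat or repetitive cost curves give c1 close to c2 and hence a
 * confidence near 0; a distinctive minimum gives a value near 1. */
static double peak_ratio_confidence(const double *cost, int num_disp)
{
    double c1 = DBL_MAX, c2 = DBL_MAX;   /* best and second-best cost */
    for (int d = 0; d < num_disp; ++d) {
        if (cost[d] < c1)      { c2 = c1; c1 = cost[d]; }
        else if (cost[d] < c2) { c2 = cost[d]; }
    }
    return (c2 > 0.0) ? 1.0 - c1 / c2 : 0.0;
}
```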

1.3. THESIS CONTRIBUTIONS

In Chapter 2, the pattern repetition problem in stereo matching is investigated. Our solution to the texture repetition problem is to use robust keypoints that provide sparse matching points in both images prior to running the stereo matching algorithm. The disparity information obtained from those keypoints is used as a boundary constraint on the disparity search range over segmented regions. This constraint prevents the matching algorithm from searching for matches in the next repetition, and we show that this improves the accuracy of the stereo algorithm substantially. Furthermore, the incorporated global energy minimization framework enhances the disparity estimates in homogeneous color regions that lack sufficient texture for finding correct matches.

Stereo matching algorithms require substantial computational resources, since a full-search is performed for each point over the predefined disparity search range. However, since the computation is done for each point independently, stereo matching algorithms are highly parallelizable. Graphical Processing Units (GPUs) are becoming more powerful and are widely used by CV algorithms. In Chapter 3, we test the accuracy versus the performance of state-of-the-art stereo algorithms implemented on a GPU using different optimization methods.

In Chapter 4, we address the refinement of depth maps from the Kinect sensor. Existing refinement algorithms are not adequate in many cases because of the limited number of depth cues that they incorporate. To obtain better accuracy, we propose to use both temporal and spatial depth cues in a surface depth modeling framework. The results show substantial improvements in occluded regions and on specular surfaces compared to state-of-the-art refinement algorithms.

Transparent objects are very challenging for all depth sensors. In contrast to specular surfaces, transparent surfaces yield no depth information at all, so it is not possible to interpolate the depth of the unknown regions from their neighbors. In Chapter 5, we tackle this problem by using cross-modal stereo between the IR and RGB sensors of the Kinect to obtain sparse depth estimates on transparent surfaces. The obtained depth information is combined with the Kinect's own depth estimates, and we incorporate piecewise smoothness constraints between all pixels in a fully-connected global energy minimization framework. The result of the energy minimization provides dense depth estimates over transparent objects. Furthermore, our algorithm also refines the Kinect's own depth estimates, especially on the boundaries of objects, which increases the overall accuracy of the depth map significantly.

Using multiple stereo similarity measures enhances the accuracy of stereo matching algorithms significantly. In general, however, these similarities are fused by calculating a weighted sum whose weights are set experimentally. In Chapter 6, we opt to set these weights automatically rather than experimentally by using stereo confidences. To achieve this, we propose a consensus-based strategy in which each point in the neighborhood of a center point votes for the disparity of the center point, and the weight of the vote is correlated with that point's confidence. The results show substantial gains in accuracy for different state-of-the-art stereo matching algorithms.

In Chapter 7, we investigate how to apply the stereo confidences developed in Chapter 6 to medical image registration algorithms. Both matching and registration aim to find correspondences between two images and use similar similarity cost measures to find a global minimum over their parameters. The main difference between the two is the number of parameters to optimize. Stereo matching optimizes a single parameter, the disparity, so it is feasible to search over all candidate values of this parameter. In contrast, non-rigid image registration optimizes up to millions of parameters, which makes a full-search over each parameter computationally infeasible. Rather than applying a full-search over all registration parameters, we apply a full-search over all spatial directions of the registered image pairs. Since we expect the image pairs to be aligned at the end of registration, the stereo confidences are expected to be high if there is a correct match at exactly the same position in both images. The results on both synthetically-created and real registrations show a high correlation between our registration confidence and the amount of registration error.

Chapter 8 concludes the thesis with a discussion of the future of the fields of 3D vision and confidence estimation for medical image registration.


2

STEREO MATCHING WITH SURF KEY POINTS

The goal of stereo matching is to estimate the depth information inside a scene. This depth estimate can be used for, among other things, 3D image reconstruction, virtual view rendering, and 3D object classification.

State-of-the-art stereo matching algorithms (see [11] for an overview) are based on locally matching small image patches to determine the disparity at each image location [12,13]. In many algorithms, the results of the local matching are refined by incorporating smoothness constraints that suppress noisy disparity estimates [14–18]. Although these algorithms work well in many cases, they have severe problems in the presence of repetitive patterns (see Fig. 2.1): when a repetitive pattern is present, the disparity estimates of neighbouring patches may vary widely. Employing smoothness constraints is generally insufficient to resolve the repetitive pattern problem. To our knowledge, there is no study that specifically aims to solve this problem for state-of-the-art segment-based methods.

In this chapter, we introduce a novel stereo disparity estimation algorithm that tries to resolve the repetitive pattern problem by employing SURF key points to refine local matching. Incorporating matched key points between rectified stereo pairs provides robustness against repetitive patterns by restricting the search space of the local matching. To further reduce the problems with repetitive patterns, we propose a global energy minimization formulation that penalizes large discrepancies between the initial and the refined disparity estimates. As a result, any improvement in the initial disparity estimate has a direct influence on the final disparity estimate.

Our approach comprises four main stages, which are discussed in Section 2.1. Section 2.2 presents the experimental evaluation of our approach. We draw our conclusions in Section 2.3.

This chapter is based on the works published in

G. Saygili, L.J.P. van der Maaten and E. A. Hendriks. Improving segment based stereo matching using SURF key points. In Proceedings of the IEEE Conference on Image Processing, pages 2973-2976, 2012.


2.1. DISPARITY ESTIMATION

Similar to other state-of-the-art stereo matching approaches, our approach comprises four main stages: (1) colour segmentation, (2) initial disparity estimation with SURF key points, (3) plane fitting, and (4) disparity plane assignment using graph cuts. These four stages are described separately below.

Colour Segmentation. The reference image is segmented into non-overlapping homogeneous color regions using mean-shift segmentation [19], resulting in a set of segments $T$. The main assumption behind segmenting the image is that disparity discontinuities can only occur at segment boundaries. Therefore, the disparity within a segment can be modelled by a planar surface in later stages of our approach.

Initial Disparity Estimation with SURF Key Points. In this work, we propose to incorporate information obtained from key points into the initial estimation of the disparity. Our approach decreases the noise caused by pattern repetition by using a restricted disparity search space, since this reduces the tendency of many algorithms to estimate the disparities wrongly. The details of the algorithm are as follows. Since the images are already rectified for stereo matching, matching key points should lie on the same epipolar line. Therefore, the vertical positions of the key points should satisfy:

$$\forall s \in S : |y_s^L - y_s^R| < 0.5, \qquad (2.1)$$

where $S$ is the set of matched key points, and $y_s^L$ and $y_s^R$ are the vertical positions of those points in the left and right views, respectively. The disparities of the key points and the segments in which the key points are located can be used to obtain new lower and upper bounds, $d_{t,low}$ and $d_{t,high}$, for the disparity search range of the pixels inside that segment, rather than the manually set $d_{min}$ and $d_{max}$:

$$\forall t \in T,\ \forall (x,y) \in t : d_{t,low} \le d(x,y) \le d_{t,high}, \qquad (2.2)$$

where $d_{t,low}$ and $d_{t,high}$ are given by:

$$d_{t,low} = \begin{cases} \max\{\lfloor \theta_1 - \theta_2 \rfloor,\ d_{min}\} & \text{if } K_t \ne \emptyset \\ d_{min} & \text{otherwise,} \end{cases} \qquad (2.3)$$

$$d_{t,high} = \begin{cases} \min\{\lceil \theta_3 + \theta_2 \rceil,\ d_{max}\} & \text{if } K_t \ne \emptyset \\ d_{max} & \text{otherwise,} \end{cases} \qquad (2.4)$$

where $\theta_1$, $\theta_2$ and $\theta_3$ are given by:

$$\begin{aligned} \theta_1 &= \min\{\forall (x,y) \in K_t : |x_L - x_R|\}, \\ \theta_2 &= \alpha \times (d_{max} - d_{min}), \\ \theta_3 &= \max\{\forall (x,y) \in K_t : |x_L - x_R|\}, \\ K_t &: \forall (x,y) \in t \cap S. \end{aligned} \qquad (2.5)$$

Herein, $d(x,y)$ is the disparity of the pixel at location $(x,y)$, and $\alpha$ is a scaling coefficient that ranges between 0 and 1. Since the disparity search space is bounded by $d_{t,low}$ and $d_{t,high}$, the search does not extend into the repetitions, as a result of which the correct disparity is more likely to be found by local matching.

Figure 2.1: The initial disparity result: (a) reference image (repetition of pattern encircled), (b) without SURF, (c) with SURF.
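The sketch below shows how the key-point disparities inside one segment tighten the search range according to Eqs. 2.3–2.5. The data structures and function names are hypothetical; only the arithmetic follows the equations.

```c
#include <math.h>

/* Hypothetical key-point record: column positions of one matched SURF
 * key point in the left and right rectified views. */
typedef struct { double xL, xR; } KeyPointMatch;

/* Disparity search bounds for one segment (Eqs. 2.3-2.5), computed from
 * the key points that fall inside that segment.  alpha in [0,1] widens
 * the bound by a fraction of the global search range. */
static void segment_bounds(const KeyPointMatch *kp, int n_kp,
                           int d_min, int d_max, double alpha,
                           int *d_low, int *d_high)
{
    if (n_kp == 0) {                 /* K_t empty: keep the global range */
        *d_low = d_min; *d_high = d_max;
        return;
    }
    double th1 = fabs(kp[0].xL - kp[0].xR);   /* theta_1: min |xL - xR| */
    double th3 = th1;                          /* theta_3: max |xL - xR| */
    for (int i = 1; i < n_kp; ++i) {
        double d = fabs(kp[i].xL - kp[i].xR);
        if (d < th1) th1 = d;
        if (d > th3) th3 = d;
    }
    double th2 = alpha * (d_max - d_min);      /* theta_2: safety margin */
    int lo = (int)floor(th1 - th2);
    int hi = (int)ceil(th3 + th2);
    *d_low  = lo > d_min ? lo : d_min;
    *d_high = hi < d_max ? hi : d_max;
}
```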

Local pixel matching is based on a matching cost function and an aggregation window around the pixel of interest. As matching costs, we used the sum of absolute differences between the images and between their image gradients:

$$C_I(x,y,d) = |I_L(x,y) - I_R(x+d,y)|, \qquad (2.6)$$

$$C_{\nabla}(x,y,d) = |\nabla_x I_L(x,y) - \nabla_x I_R(x+d,y)| + |\nabla_y I_L(x,y) - \nabla_y I_R(x+d,y)|, \qquad (2.7)$$

$$d_L(x,y) = \arg\min_d \sum_{\forall (x',y') \in t} \left( C_I(x',y',d) + C_{\nabla}(x',y',d) \right).$$

The cost of a pixel inside the box is aggregated if and only if that pixel resides in the same segment as the center pixel. The non-occluded pixels are used in the global energy minimization as trustworthy pixels in terms of disparity, and, similar to [16], these pixels are found using cross-check validation. Figure 2.1 illustrates the improvement in quality obtained by using SURF key points for the initial matching. The red circle indicates a region with pattern repetition.
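A minimal C sketch of the matching of Eqs. 2.6–2.7 with the segment-bounded winner-takes-all choice follows; images are assumed grayscale and row-major, gradients precomputed, and the segment-membership test and window aggregation are omitted for brevity.

```c
#include <math.h>

/* Pixel-wise cost of Eqs. 2.6-2.7: absolute intensity difference plus
 * absolute gradient differences.  The caller must keep x+d inside the
 * right image; gradient images gx*, gy* are assumed precomputed. */
static double pixel_cost(const double *L,  const double *R,
                         const double *gxL, const double *gxR,
                         const double *gyL, const double *gyR,
                         int w, int x, int y, int d)
{
    int i = y * w + x, j = y * w + (x + d);
    return fabs(L[i] - R[j])
         + fabs(gxL[i] - gxR[j])
         + fabs(gyL[i] - gyR[j]);
}

/* Winner-takes-all over the segment's bounded range [d_low, d_high]. */
static int wta_disparity(const double *L,  const double *R,
                         const double *gxL, const double *gxR,
                         const double *gyL, const double *gyR,
                         int w, int x, int y, int d_low, int d_high)
{
    int best_d = d_low;
    double best_c = HUGE_VAL;
    for (int d = d_low; d <= d_high; ++d) {
        double c = pixel_cost(L, R, gxL, gxR, gyL, gyR, w, x, y, d);
        if (c < best_c) { best_c = c; best_d = d; }
    }
    return best_d;
}
```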

Plane Fitting. Based on the initial disparity estimate, we model each segment by a plane and estimate the parameters of this plane using RANSAC. Since RANSAC works best when there are at least 50 percent inliers, and because large regions provide larger clusters of reliable disparities than small regions, we opt to apply RANSAC only to segments that contain more than 100 pixels, of which at least 50 percent are non-occluded. Planes that have similar surface normals and mean disparities are eliminated heuristically, which results in a small set of planes that are sufficient to represent the scene.
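A plain RANSAC plane fit might look like the C sketch below, under assumed data structures: sample three non-occluded disparity samples, fit $d = ax + by + c$, count inliers, and keep the best hypothesis. The iteration count and inlier tolerance are illustrative, not the settings used in the thesis.

```c
#include <math.h>
#include <stdlib.h>

/* One (x, y, d) sample of the initial disparity map inside a segment. */
typedef struct { double x, y, d; } Sample;

/* Fit d = a*x + b*y + c exactly through three samples (returns 0 if the
 * samples are degenerate).  Used as the minimal RANSAC hypothesis. */
static int plane_from_3(const Sample s[3], double *a, double *b, double *c)
{
    double det = (s[1].x - s[0].x) * (s[2].y - s[0].y)
               - (s[2].x - s[0].x) * (s[1].y - s[0].y);
    if (fabs(det) < 1e-9) return 0;
    *a = ((s[1].d - s[0].d) * (s[2].y - s[0].y)
        - (s[2].d - s[0].d) * (s[1].y - s[0].y)) / det;
    *b = ((s[2].d - s[0].d) * (s[1].x - s[0].x)
        - (s[1].d - s[0].d) * (s[2].x - s[0].x)) / det;
    *c = s[0].d - *a * s[0].x - *b * s[0].y;
    return 1;
}

/* RANSAC over the non-occluded samples of one segment: the returned value
 * is the inlier count of the best plane (-1 if no plane was found). */
static int ransac_plane(const Sample *s, int n, int iters, double tol,
                        double *a, double *b, double *c)
{
    int best = -1;
    for (int it = 0; it < iters && n >= 3; ++it) {
        Sample tri[3];
        for (int k = 0; k < 3; ++k) tri[k] = s[rand() % n];
        double pa, pb, pc;
        if (!plane_from_3(tri, &pa, &pb, &pc)) continue;
        int inl = 0;
        for (int i = 0; i < n; ++i)
            if (fabs(pa * s[i].x + pb * s[i].y + pc - s[i].d) <= tol) ++inl;
        if (inl > best) { best = inl; *a = pa; *b = pb; *c = pc; }
    }
    return best;
}
```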

Disparity Plane Assignment Using Graph Cuts. A disparity plane is assigned to each image segment by minimizing an energy function that incorporates both data costs and smoothness constraints. The energy minimization problem is solved using a graph-cut approach in which each node corresponds to a segment. Let $P$ be the set of disparity plane parameter labels. Our aim is to find a labelling $f$ that assigns each segment $t \in T$ to its plane label $p \in P$ by minimizing the following energy function:

$$E(f) = E_{data}(f) + E_{smooth}(f), \qquad (2.8)$$

where $E_{data}(f)$ is the cost of assigning plane labels to the segments.

In most state-of-the-art algorithms, such as [15,16], the matching costs of Eqs. 2.6 and 2.7 are used as the data term. In this work, we propose to use the following Disparity Estimate Discrepancy Cost (DEDC) instead:

$$E_{data}(f) = \sum_{t} \sum_{(x,y)} \lambda\, |d_{f(t)}(x,y) - d(x,y)|\, e^{-n/m}, \quad \forall t \in T,\ \forall (x,y) \in t - O_t, \qquad (2.9)$$

in which $O_t$ is the set of occluded pixels in $t$, $n$ is the number of non-occluded pixels whose initial disparity equals the disparity after plane fitting, $m$ is the number of non-occluded pixels inside the segment, $\lambda$ is a scaling coefficient, and $d_{f(t)}$ represents the disparity of the pixel $(x,y)$ after fitting the plane with label $f(t)$ to the pixels of segment $t$:

$$d_{f(t)}(x,y) = a_{f(t)}\, x + b_{f(t)}\, y + c_{f(t)}. \qquad (2.10)$$

By penalizing large discrepancies between the initial and the final disparity estimates, $E_{data}(f)$ favors solutions that are close to the initial estimate. As a result, the final estimate will not "jump" to the wrong part of a repetitive pattern, which is what traditional data costs would do.
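The per-segment data term of Eq. 2.9 might be computed as in the C sketch below; the pixel-list representation and the half-level tolerance used to count $n$ are assumptions for illustration.

```c
#include <math.h>

/* DEDC (Eq. 2.9) for one segment: discrepancy between the fitted plane
 * disparity a*x + b*y + c and the initial estimate d(x,y), damped by
 * e^(-n/m).  Occluded pixels are skipped via the `occluded` mask.
 * The 0.5-level tolerance for "same disparity" is an assumption. */
static double dedc_segment_cost(const double *d_init, const int *occluded,
                                const int *px, const int *py, int n_px,
                                double a, double b, double c,
                                int img_w, double lambda)
{
    int m = 0, n = 0;
    double sum = 0.0;
    for (int i = 0; i < n_px; ++i) {
        int idx = py[i] * img_w + px[i];
        if (occluded[idx]) continue;
        double diff = fabs(a * px[i] + b * py[i] + c - d_init[idx]);
        ++m;                        /* non-occluded pixel count        */
        if (diff < 0.5) ++n;        /* plane agrees with initial value */
        sum += diff;
    }
    /* e^(-n/m) is constant over the segment, so it factors out. */
    return (m > 0) ? lambda * sum * exp(-(double)n / (double)m) : 0.0;
}
```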

$E_{smooth}(f)$ is a smoothness term that penalizes discontinuities in the plane labels of neighbouring segments. We define $E_{smooth}(f)$ as:

$$E_{smooth}(f) = \sum_{t} \sum_{q} \gamma(t,q)\, (1 - \delta(f(t), f(q))), \quad \forall t \in T,\ \forall q \in N(t). \qquad (2.11)$$

Herein, $N(t)$ is the set of neighbors of $t$, and $\gamma(t,q)$ is given by:

$$\gamma(t,q) = w\, \beta\, e^{-\tau^2/\sigma^2}, \qquad (2.12)$$

where $w$ and $\sigma$ are scaling parameters, and $\beta$ and $\tau$ are the boundary length and the mean colour difference between $t$ and $q$, respectively.

2.2. EXPERIMENTS

Experimental Setup. To evaluate the performance of our algorithm, we performed experiments on the Middlebury benchmark data set. We evaluate the algorithm by measuring the percentage of pixels that have erroneous disparity values, where a disparity value is defined to be erroneous if its absolute difference from the ground truth is larger than 1. As is common practice in the evaluation of stereo algorithms, we report results for (1) non-occluded pixels only (nonocc), (2) all pixels (all), and (3) pixels in image regions close to a disparity discontinuity (disc). In all experiments, we set $\alpha$ in Eq. 2.5 equal to 0.25, $\lambda$ in Eq. 2.9 equal to 10, and $w = 25$, $\sigma = 150$. Additionally, we choose the two mean-shift parameters, $h_s$ and $h_r$, as


Table 2.1: Percentage of erroneous disparity values of the disparity estimates for the Tsukuba scene.

Algorithm     nonocc  all   disc
baseline      2.64    3.26  11.8
KP            1.56    2.23  7.42
DEDC          1.25    1.75  6.28
DEDC and KP   1.08    1.59  5.82

Figure 2.2: Estimated final disparities: (a) ground truth, (b) without KP and without DEDC, (c) with only KP, (d) with only DEDC, (e) with DEDC and KP.

Experimental Results. In order to show the effect of using key points (KP) and DEDC, all four possible variants of our algorithm are evaluated on the Tsukuba scene. Table 2.1 presents the quantitative results and Fig. 2.2 shows the corresponding disparity images. The results of the variant without KP and DEDC illustrate the problems that current stereo-matching algorithms have with repetitive patterns: the disturbance caused by pattern repetition is clearly recognizable in the final disparity image in Fig. 2.2. When key points are used, the quantitative results improve because this disturbance is suppressed, and when both KP and DEDC are incorporated, the best performance is obtained. These results illustrate the ability of our approach to deal with the repetitive pattern problem. Fig. 2.3 shows the performance of the best variant of our algorithm (KP+DEDC) on all four Middlebury scenes, and Table 2.2 compares our algorithm with four state-of-the-art disparity matching algorithms. The results in Fig. 2.3 and Table 2.2 indicate that our algorithm performs on par with the state-of-the-art on the first three scenes, and that it outperforms all other algorithms on the Venus scene.

2.3. CONCLUSION

In this chapter, we presented a novel stereo disparity estimation algorithm with two main contributions. The results indicate that the proposed algorithm performs on par with state-of-the-art algorithms on all four scenes of the Middlebury benchmark, and that it outperforms all state-of-the-art algorithms on the Venus scene.


Table 2.2: Percentage of erroneous disparity values of the proposed algorithm compared with top-performing algorithms.

                            Tsukuba              Venus                Teddy                Cones
Algorithm        Avg. Rank  nonocc  all   disc   nonocc  all   disc   nonocc  all   disc   nonocc  all   disc
Proposed         24.2       1.08    1.59  5.82   0.08    0.16  1.11   4.49    8.06  12.2   3.59    9.4   11.0
ADCensus [20]    5.8        1.07    1.48  5.73   0.09    0.25  1.15   4.1     6.22  10.9   2.42    7.25  6.95
AdaptingBP [15]  7.2        1.11    1.37  5.79   0.10    0.21  1.44   4.22    7.06  11.8   2.48    7.92  7.32
CoopRegion [14]  7.2        0.87    1.16  4.61   0.11    0.21  1.54   5.16    8.31  13.0   2.79    7.18  8.01
DoubleBP [18]    9.7        0.88    1.29  4.76   0.13    0.45  1.87   3.53    8.3   9.63   2.9     8.78  7.79

Figure 2.3: Results on the Middlebury scenes. From top to bottom: the Tsukuba, Venus, Teddy and Cones scenes. From left to right: reference images, ground truth disparities, the results of the proposed algorithm, and the error images, where the black regions represent the erroneous pixels.

3

GPU OPTIMIZATION PERFORMANCE OF STEREO ALGORITHMS

Stereo matching algorithms aim to estimate the depth inside a scene based on rectified stereo pairs of images. When the stereo geometry of the cameras is known, the depth of the scene can be computed from the estimated disparity. Hence, stereo matching is crucial in all computer vision algorithms that use stereo cameras and require depth as an input. Szeliski et al. [11] provided a taxonomy and evaluation of stereo-matching algorithms. Stereo algorithms can be divided into two main types: (1) local algorithms and (2) global algorithms. Local algorithms [20–23] aggregate similarity measures around a local support region, i.e., the energy minimization is performed over local patches. By contrast, global algorithms [24–26] incorporate smoothness constraints on the depth estimates, leading to an energy minimization problem that involves all pixels in the images. In general, global algorithms outperform local algorithms in terms of accuracy due to the smoothness constraints, especially in non-textured image regions. However, global algorithms are generally not fast enough for applications in which real-time performance is essential, e.g., stereo algorithms used in automatically driving cars that have to operate at a minimum of 100 fps [27]. Since local algorithms can be effectively parallelized and are computationally cheaper, they are preferred over global algorithms in real-time applications.

Local stereo-matching algorithms comprise three main steps and an optional refinement step: (1) matching cost computation, (2) cost aggregation, and (3) disparity estimation. The matching cost computation involves calculating pixel-wise differences between the two images in the stereo pair, commonly based on similarity measures such as image intensities or gradients.

This chapter is related to the works published in

J. Fang, A. L. Varbanescu, J. Shen, H. Sips, G. Saygili, and L.J.P. van der Maaten. Accelerating cost aggregation for real-time stereo matching. In Proceedings of the International Conference on Parallel and Distributed Systems, pages 472-481, 2012.


There are more complex similarity measures, such as the census and rank transforms [12], which improve the accuracy of stereo matching algorithms significantly [20]. Hirschmuller et al. [28] provided an extensive comparison of different similarity measures. The initial matching costs are aggregated over a support region of interest in the cost aggregation step. Depending on the complexity of the incorporated techniques, this step and the initial cost calculation are the most computationally expensive steps of a stereo matching algorithm. In the disparity computation step, the disparity that minimizes the aggregated cost at a location is selected as the disparity estimate for that location (generally using a winner-takes-all, WTA, procedure). In terms of computational requirements, WTA is the cheapest step of a stereo matching algorithm. The last, optional step refines the final disparity estimate, commonly using a simple form of smoothing.

As mentioned above, various applications of stereo matching, e.g., in automatically driving cars and robotics, require high-performance algorithms that can run at frame rates of 100 fps or higher. Such frame rates are difficult to achieve with serial implementations, even for the simplest local algorithms. Hence, to obtain the required frame rate, it is essential to parallelize stereo-matching algorithms, for instance with a GPU implementation. Wang et al. [29] optimized aggregation algorithms and implemented them on a GPU to measure their performance in terms of speed and accuracy. Wang et al. [30] applied vector processing and parallelism to improve the speed of their dynamic-programming based stereo matching algorithm. Mei et al. [20] improved the speed of their algorithm by maximizing its parallelism on a GPU; their algorithm is one of the best performers on the Middlebury benchmark because of the fusion of similarity measures and the cross-based aggregation technique they utilized. Yu et al. [31] were the first in the literature to utilize four main GPU optimization techniques, namely maximizing parallelism [32], coalesced memory access, local memory access [32] and loop unrolling, on their two stereo algorithms to improve their speed. With these four optimization techniques on a GPU, they achieved a speedup of up to 8.5×. In our previous work [33], we applied the same optimization techniques to three different cost aggregation methods to test the effect on the speed improvements, and presented a model to predict the aggregator speed on target machines for stereo matching algorithms. In this work, we extend our tests to state-of-the-art similarity metric calculation and fusion based on the experimental results of Mei et al. [20]. We test the effect of using optimization techniques, with respect to naive GPU parallelization, on different aggregation methods and similarity metrics in stereo matching. Furthermore, we investigate the accuracy of different aggregation and similarity metrics on the Middlebury benchmark dataset with respect to their speedup performance under GPU optimizations. Since there is a fundamental trade-off between the accuracy and the speed of stereo algorithms, the speed of different stereo algorithms is not meaningful without an analysis of their accuracy. Our experiments provide insights into how much the optimization approaches speed up current state-of-the-art similarity metric fusion and cost aggregation methods while preserving their original accuracy, and into which types of aggregation methods and similarity metrics are preferred for robotics applications.

This chapter is organized as follows. In Section 3.1, we describe the similarity measures and cost aggregation strategies that we implemented in our study. Section 3.2 introduces the optimization techniques that we used to speed up the stereo-matching algorithms without sacrificing accuracy. Section 3.3 presents the results of the experiments with our implementations. Section 3.4 concludes the chapter.


3.1. SIMILARITY MEASURES AND AGGREGATION STRATEGIES

In this section, we describe the similarity measures and aggregation strategies we investigate in our study.

3.1.1. SIMILARITY MEASURES

A cost measure $C(x,y,d)$ is a function that measures the penalty for assigning disparity $d$ to the image location $(x,y)$, based on the difference between the intensity at that location in the reference image, $I(x,y)$, and at the corresponding location in the target image, $I'(x-d,y)$. Below, we introduce five such cost measures: (1) absolute difference, (2) gradient, (3) rank, (4) census, and (5) combined similarity measures based on [20].

Absolute Difference (AD) measures the sum of the absolute intensity differences between pixels in the reference image $I(x,y)$ and the corresponding pixels in the target image $I'(x-d,y)$. In general, the result is truncated with a constant threshold value $T$:

$$C_{AD}(x,y,d) = \min\left( |I(x,y) - I'(x-d,y)|,\ T \right). \qquad (3.1)$$

Gradient measures the sum of absolute differences between the image gradients of the reference and the target image. Focusing on gradient differences has two main advantages: (1) it leads to a much stronger focus on correct localization of edges, and (2) it is more robust than AD because the gradient information does not vary with radiometric differences between the stereo pair. We used a 5 × 5 difference-of-Gaussians (DoG) filter with $\sigma = 1$ to compute the image gradients $\nabla_{x,y} I(x,y)$ and $\nabla_{x,y} I'(x,y)$. The similarity measure is given by:

$$C_{GRAD}(x,y,d) = |\nabla_{x,y} I(x,y) - \nabla_{x,y} I'(x-d,y)|. \qquad (3.2)$$

Rank [12] is a non-parametric image transform that models the structure of the neighborhood of a pixel by exploiting the intensity variation. Eq. 3.3 gives the rank transform of the pixel at location $(x,y)$ inside a local neighborhood $N(x,y)$, and the pixel-wise cost is calculated by Eq. 3.4:

$$RT(x,y) = \left| \{ \forall (x',y') \in N(x,y)\ |\ I(x',y') < I(x,y) \} \right|, \qquad (3.3)$$

$$C_{RANK}(x,y,d) = |RT(x,y) - RT'(x-d,y)|. \qquad (3.4)$$

Census [12] likewise models the structure of the neighborhood of a pixel. We used a 9 × 7 window around each pixel as the neighborhood. For each neighboring pixel $k$:

$$CT(x,y)[k] = \begin{cases} 1, & \text{if } I(x_k, y_k) > I(x,y) \\ 0, & \text{otherwise.} \end{cases} \qquad (3.5)$$

The similarity measure is calculated as the Hamming distance between $CT(x,y)$ and $CT'(x-d,y)$. The experimental results of Hirschmuller et al. [28] show that census is one of the top performing similarity measures.
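Both transforms are easy to state in code. The C sketch below implements the rank count of Eq. 3.3 and the census bit string of Eq. 3.5 for a square radius-r neighborhood, plus the Hamming-distance cost. The square window (the thesis uses 9 × 7 for census) and the packing into 64 bits are simplifications for illustration.

```c
#include <stdint.h>

/* Rank transform (Eq. 3.3): count neighbors darker than the center.
 * (x, y) must be at least r pixels away from the image border. */
static int rank_transform(const uint8_t *img, int w, int x, int y, int r)
{
    int count = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            if (img[(y + dy) * w + (x + dx)] < img[y * w + x])
                ++count;
    return count;
}

/* Census transform (Eq. 3.5): one bit per neighbor.  For r <= 3 the
 * (2r+1)^2 - 1 neighbor bits fit into a 64-bit word. */
static uint64_t census_transform(const uint8_t *img, int w,
                                 int x, int y, int r)
{
    uint64_t bits = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            if (dx == 0 && dy == 0) continue;
            bits = (bits << 1)
                 | (img[(y + dy) * w + (x + dx)] > img[y * w + x]);
        }
    return bits;
}

/* Census matching cost: Hamming distance between the two bit strings. */
static int census_cost(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;
    int n = 0;
    while (x) { x &= x - 1; ++n; }      /* count set bits */
    return n;
}
```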

Combined Similarity Measures are used to gain robustness against noise as well as to increase the accuracy of single similarity measures [20,34]. Different similarity measures can be combined through a variety of calculations. Wegner et al. [34] introduced a product of two different similarity measures, whereas the best performer on Middlebury proposed a summation, $C_{sim}$, of exponentials of the single similarity measures $C_1$ and $C_2$, given by:

$$C_{sim}(x,y,d) = \left(1 - e^{-C_1(x,y,d)/\lambda_1}\right) + \left(1 - e^{-C_2(x,y,d)/\lambda_2}\right), \qquad (3.6)$$

where $\lambda_1$ and $\lambda_2$ control the influence of each measure.
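In code, this exponential fusion is a two-liner; the λ values are tuning constants whose experimental settings are not restated here, so treat the arguments as placeholders.

```c
#include <math.h>

/* Exponential fusion of two matching costs (Eq. 3.6), in the style of the
 * AD+Census combination of Mei et al. [20].  Each term is normalized into
 * [0, 1), so neither measure can dominate the sum. */
static double combined_cost(double c1, double c2,
                            double lambda1, double lambda2)
{
    return (1.0 - exp(-c1 / lambda1)) + (1.0 - exp(-c2 / lambda2));
}
```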


3.1.2. AGGREGATION STRATEGIES

An aggregation strategy combines the cost measures over a (local) support region in order to obtain a reliable estimate of the cost of assigning disparity $d$ to image location $(x,y)$. We investigate three such aggregation strategies: (1) constant window aggregation, (2) locally adaptive support weight aggregation, and (3) cross-based aggregation.

Constant Window Aggregation (CW) is the straightforward aggregation of any similarity measure cost $C$ over a constant-size neighborhood:

$$C_{CW}(x,y,d) = \sum_{\forall (x',y') \in N(x,y)} C(x',y',d), \qquad (3.7)$$

where $N(x,y)$ represents the set of pixels that are neighbors of pixel $(x,y)$. Aggregation over constant-size windows carries the intrinsic assumption that every pixel in the window shares the same disparity. This assumption fails in the neighborhood of disparity discontinuities.
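A C sketch of Eq. 3.7 using a summed-area table, i.e. the integral-image trick referred to later in this chapter: after one O(w·h) pass, every window sum costs four lookups regardless of window size. One disparity slice of the cost volume is processed at a time, and border pixels are left untouched for brevity.

```c
#include <stdlib.h>

/* Constant-window aggregation (Eq. 3.7) for one disparity slice via a
 * summed-area table sat, where sat[y][x] = sum of cost over [0,y)x[0,x). */
static void aggregate_cw(const float *cost, float *out,
                         int w, int h, int r /* window radius */)
{
    double *sat = malloc((size_t)(w + 1) * (h + 1) * sizeof *sat);
    for (int x = 0; x <= w; ++x) sat[x] = 0.0;          /* first row */
    for (int y = 1; y <= h; ++y) {
        sat[y * (w + 1)] = 0.0;                         /* first column */
        for (int x = 1; x <= w; ++x)
            sat[y * (w + 1) + x] = cost[(y - 1) * w + (x - 1)]
                                 + sat[(y - 1) * (w + 1) + x]
                                 + sat[y * (w + 1) + (x - 1)]
                                 - sat[(y - 1) * (w + 1) + (x - 1)];
    }
    /* Window sum = four corner lookups, independent of window size. */
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            int x0 = x - r, y0 = y - r, x1 = x + r + 1, y1 = y + r + 1;
            out[y * w + x] = (float)(sat[y1 * (w + 1) + x1]
                                   - sat[y0 * (w + 1) + x1]
                                   - sat[y1 * (w + 1) + x0]
                                   + sat[y0 * (w + 1) + x0]);
        }
    free(sat);
}
```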

Cross-Based Aggregation (CROSS). The main assumption of block matching does not hold at disparity discontinuities for CW. Cross-based aggregation overcomes this problem by constructing a variable support region for aggregation that depends on color similarity [22]. The first step of the algorithm is to construct a cross for each pixel of the image. Four parameters are stored for each pixel, $h_p^-$, $h_p^+$, $v_p^-$, $v_p^+$, which identify the lengths of the four arms of the cross, as shown for pixel $p$ in Fig. 3.1a. The decision on whether or not a pixel $p'$ is included is made based on its color similarity with pixel $p$, as given in Eq. 3.8, and the length of each arm, $l$, is measured by Eq. 3.9 [22]. After calculating the arm lengths in the four directions for each pixel, the final aggregation support is constructed by integrating the horizontal arms of each neighboring pixel in the vertical direction, as shown in green in Fig. 3.1a. The same steps are also performed for the pixels of the target image, and during aggregation, costs are summed over the intersection of the reference and target support regions. The Orthogonal Integral Image (OII) technique was proposed in [22] to achieve faster aggregation. The technique separates the 2D aggregation procedure into two 1D aggregation steps, horizontal and vertical, as shown in Fig. 3.1b: the horizontal aggregation can be performed efficiently via the horizontal integral of the similarity measures, and the horizontal sums are then aggregated in the vertical direction, as depicted in Fig. 3.1c.

$$\delta\big((x,y),(x',y')\big) = \begin{cases} 1, & \text{if } |I(x,y) - I(x',y')| \le \tau \\ 0, & \text{otherwise,} \end{cases} \qquad (3.8)$$

$$l^* = \arg\max_{l \in [1:L]} \left( l \prod_{i=1}^{l} \delta\big((x,y),(x_i,y_i)\big) \right). \qquad (3.9)$$
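For one direction, the arm-length rule of Eqs. 3.8–3.9 reduces to a short scan, sketched below in C for the right arm of a grayscale image; the method of [22] uses color, so grayscale is a simplification here, and the other three arms are symmetric.

```c
#include <stdint.h>
#include <stdlib.h>

/* Right-arm length (Eqs. 3.8-3.9): extend while the neighbor's intensity
 * stays within tau of the anchor pixel, up to the maximum length L. */
static int arm_length_right(const uint8_t *img, int w, int x, int y,
                            int L, int tau)
{
    int anchor = img[y * w + x];
    int l = 0;
    while (l < L && x + l + 1 < w) {
        int v = img[y * w + (x + l + 1)];
        if (abs(v - anchor) > tau) break;   /* delta = 0: stop extending */
        ++l;
    }
    return l;
}
```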

Locally Adaptive Support Weight Aggregation (AW) [21] uses color-similarity and proximity based weights as aggregation coefficients for the similarity measures of the pixels around the pixel of interest. To simplify notation, let $p$ and $p_d$ denote the pixels at $(x,y)$ in the reference image and at $(x-d,y)$ in the target image, respectively. $C_{AW}$ can be calculated as:

$$C_{AW}(p,p_d) = \frac{\displaystyle\sum_{q \in N(p),\, q_d \in N(p_d)} w(p,q)\, w(p_d,q_d)\, C_{SIM}(q,d)}{\displaystyle\sum_{q \in N(p),\, q_d \in N(p_d)} w(p,q)\, w(p_d,q_d)}, \qquad (3.10)$$


Figure 3.1: Cross-based aggregation: (a) cross construction, (b) horizontal aggregation, (c) vertical aggregation.

where $p_d$ and $q_d$ are the pixels in the target image corresponding to $p$ and $q$ under a disparity shift of $d$. The weight $w(p,q)$ depends on the color similarity $\Delta c_{pq}$ and the Euclidean distance $\Delta s_{pq}$ between the pixels $p$ and $q$:

$$w(p,q) = \exp\left( -\left( \frac{\|I(p) - I(q)\|_2}{\gamma_c} + \frac{\|p - q\|_2}{\gamma_s} \right) \right), \qquad (3.11)$$

where $\gamma_c$ and $\gamma_s$ are constant parameters that are set empirically. The calculation of the target image weights $w(p_d,q_d)$ is identical for the target image pixels $p_d$ and $q_d$.
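A C sketch of the weight of Eq. 3.11 for a grayscale image follows; for color images the intensity difference becomes a Euclidean distance in color space. γ_c and γ_s are empirical constants, so the arguments here are placeholders.

```c
#include <math.h>

/* Adaptive support weight (Eq. 3.11): the weight decays with both the
 * intensity difference and the spatial distance between the anchor pixel
 * p = (px, py) and its neighbor q = (qx, qy). */
static double aw_weight(double ip, double iq,        /* intensities */
                        int px, int py, int qx, int qy,
                        double gamma_c, double gamma_s)
{
    double dc = fabs(ip - iq);                               /* color  */
    double ds = hypot((double)(px - qx), (double)(py - qy)); /* spatial */
    return exp(-(dc / gamma_c + ds / gamma_s));
}
```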

3.2. GPU OPTIMIZATION TECHNIQUES

This section outlines the GPU optimization techniques that we employed in our implementation of the stereo-matching algorithms. We first outline the GPU programming model in 3.2.1, and then discuss the four main optimizations that we implemented in our code in 3.2.2.

3.2.1. GPU PROGRAMMING MODEL

Originally designed for computer graphics, GPUs are receiving a lot of attention in general-purpose computing due to their large performance potential. Currently, CUDA and OpenCL are the prevailing programming models on GPUs. CUDA is limited to NVIDIA GPUs, while OpenCL is an open standard for computing and it allows parallel programs to be executed on various platforms, addressing the issue of portability.

An OpenCL program has two parts: kernels that execute on one or more OpenCL devices (typically, accelerators such as GPUs), and a host program that executes on the host (typically a traditional CPU). The host program defines the contexts for the kernels and manages their execution. The computational task is coded into kernel functions. When a kernel is submitted for execution by the host, an index space is defined; for each point in this space, a kernel instance is executed. Such an instance of the kernel is known as a work-item, which is typically executed by a thread on the device. Work-items are organized into work-groups, providing a more coarse-grained decomposition. Each work-item has its own private memory space, and can share data via local memory with the other items in the same group. All work-items can read/write the global device memory.
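To make the model concrete, the OpenCL C kernel below computes one truncated-AD cost entry (Eq. 3.1) per work-item over a 3D index space (x, y, d). The host-side context, buffer and enqueue code is omitted, and the kernel name and the flat cost-volume layout are illustrative assumptions.

```c
/* OpenCL C kernel: one work-item per (x, y, d) entry of the cost volume.
 * get_global_id() supplies the index-space coordinates described above. */
__kernel void ad_cost(__global const uchar *left,
                      __global const uchar *right,
                      __global float *cost,     /* w*h*num_disp entries */
                      const int w, const int h,
                      const int num_disp, const float trunc)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int d = get_global_id(2);
    if (x >= w || y >= h || d >= num_disp) return;

    /* Clamp the target column so x-d stays inside the image. */
    int xr = max(x - d, 0);
    float ad = fabs((float)left[y * w + x] - (float)right[y * w + xr]);
    cost[(d * h + y) * w + x] = fmin(ad, trunc);  /* truncated AD, Eq. 3.1 */
}
```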


Figure 3.2: Speedup performances of naive and optimized GPU implementations of stereo similarity and aggregation methods: (a) naive GPU aggregation time, (b) optimized GPU aggregation time, (c) speedup of the naive GPU version, (d) speedup of the optimized GPU version, (e) sequential CPU cost calculation time, (f) cost calculation speedup of the naive GPU version, (g) cost calculation speedup of the optimized GPU version.

3.2.2. PERFORMANCE OPTIMIZATION ON GPUS

Below, we summarize the four main techniques used in our implementations to improve their performance. Since the stereo cost initialization and cost aggregation approaches are memory-bound (as analyzed with the Roofline model [35]), we focus on improving memory bandwidth and maximizing parallelism.

MAXIMIZING PARALLELISM

When using GPUs to accelerate applications, we consider parallelism maximization the first priority. Maximizing parallel execution starts with structuring the algorithm so that it exposes as much data parallelism as possible [32]. When we parallelize stereo algorithms, letting each work-item process one pixel is the natural choice. However, this makes it difficult to calculate the prefix sum when using the integral image technique, because there is a data dependence in the x or y direction. In this case, we select (d, x) or (d, y), where d is the third dimension (representing the disparity range), as the parallelization strategy to maximize parallelism.
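As an illustration of the (d, x) strategy, the OpenCL C sketch below performs the vertical prefix sum of an integral-image aggregation: the y direction carries the data dependence, so each work-item owns one (disparity, column) pair and scans its column sequentially. Names and the cost-volume layout are assumptions.

```c
/* OpenCL C: vertical prefix sum over one cost-volume slice, parallelized
 * over (d, x) because y carries the dependence.  Runs in place. */
__kernel void vertical_prefix_sum(__global float *cost,
                                  const int w, const int h,
                                  const int num_disp)
{
    int d = get_global_id(0);
    int x = get_global_id(1);
    if (d >= num_disp || x >= w) return;

    __global float *slice = cost + (size_t)d * w * h;
    for (int y = 1; y < h; ++y)              /* sequential in y */
        slice[y * w + x] += slice[(y - 1) * w + x];
}
```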


COALESCED MEMORY ACCESS

Neighboring GPU work-items/threads should access spatially close data elements, a pattern known as coalesced memory access (different from the access patterns traditional CPUs prefer). For example, for an 8-element array (1, 2, 3, 4, 5, 6, 7, 8) and two threads T1 and T2, CPUs prefer the access pattern T1{1, 2, 3, 4}, T2{5, 6, 7, 8}, while GPUs prefer the coalesced pattern (T1, T2){(1, 2), (3, 4), (5, 6), (7, 8)}. Coalesced memory access can improve memory bandwidth by avoiding channel/bank conflicts. We apply this principle where possible in our stereo implementations.
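The access pattern above maps directly onto a grid-stride loop, sketched below in OpenCL C as an assumed minimal example.

```c
/* Coalesced access in OpenCL C: in the grid-stride loop, consecutive
 * work-items touch consecutive addresses at every iteration, so the
 * hardware can merge them into a few wide memory transactions.  A
 * CPU-style chunked loop (each thread owning a contiguous block) would
 * break this property on a GPU. */
__kernel void copy_coalesced(__global const float *src,
                             __global float *dst, const int n)
{
    int tid     = get_global_id(0);
    int threads = get_global_size(0);
    for (int i = tid; i < n; i += threads)  /* thread t reads t, t+T, ... */
        dst[i] = src[i];
}
```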

LOCAL MEMORY ACCESS

Local memory is located on-chip and has much lower latency and higher bandwidth (more than 10× higher than off-chip global memory) [32]. A typical programming pattern is to stage data from global device memory into local memory, process the data, and then write the results back to global device memory. Local memory offers significant advantages when data is re-used, and it can relax the requirement for coalesced memory accesses. In our case, the aggregation at each pixel needs to load its neighboring data elements when using window filtering. Thus, neighboring work-items within one work-group can share data by loading a block of data elements into local memory, leading to a bandwidth improvement.
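The staging pattern might look like the OpenCL C sketch below, which computes a horizontal box sum over one image row from a cooperatively loaded __local tile; the tile size, radius and kernel name are illustrative assumptions.

```c
/* Each work-group loads its tile plus an apron of R pixels on each side
 * once; every work-item then reads neighbors from fast on-chip memory.
 * GROUP must match the work-group size chosen on the host. */
#define R     4
#define GROUP 64

__kernel void row_box_sum(__global const float *in, __global float *out,
                          const int w)
{
    __local float tile[GROUP + 2 * R];
    int gx = get_global_id(0);           /* column in the image row */
    int lx = get_local_id(0);

    /* Cooperative load: interior pixel plus left/right apron (clamped). */
    tile[lx + R] = in[clamp(gx, 0, w - 1)];
    if (lx < R) {
        tile[lx]             = in[clamp(gx - R, 0, w - 1)];
        tile[lx + R + GROUP] = in[clamp(gx + GROUP, 0, w - 1)];
    }
    barrier(CLK_LOCAL_MEM_FENCE);        /* tile complete for the group */

    if (gx >= w) return;
    float s = 0.0f;
    for (int i = -R; i <= R; ++i)
        s += tile[lx + R + i];
    out[gx] = s;
}
```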

LOOP UNROLLING

Loop unrolling is an optimization technique in which the body of a loop is replaced with multiple copies of itself, while the control logic of the loop is updated accordingly. In this way, the number of dynamic instructions is reduced, which significantly improves performance, especially for data-parallel applications on GPUs [36].
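A manually unrolled variant of a census-style Hamming accumulation illustrates the idea; the 0/1-per-int bit layout and the factor of four are illustrative choices, and inside an OpenCL kernel the same effect is usually requested with an unroll pragma.

```c
/* Hamming distance between two 0/1 vectors of length n, unrolled by 4:
 * three of every four loop-condition checks disappear and the compiler
 * gets more scheduling freedom. */
static int hamming_unrolled(const int *a, const int *b, int n)
{
    int sum = 0, i = 0;
    /* #pragma unroll 4   -- the equivalent hint inside a kernel */
    for (; i + 4 <= n; i += 4) {          /* manually unrolled x4 */
        sum += a[i]     ^ b[i];
        sum += a[i + 1] ^ b[i + 1];
        sum += a[i + 2] ^ b[i + 2];
        sum += a[i + 3] ^ b[i + 3];
    }
    for (; i < n; ++i)                    /* remainder iterations */
        sum += a[i] ^ b[i];
    return sum;
}
```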

3.3. EXPERIMENTAL EVALUATION

In this section, we analyze the speed and accuracy of our stereo-matching implementations. The setup of our experiments is presented in 3.3.1. The speed and accuracy analyses are presented in 3.3.2 and 3.3.3, respectively.

3.3.1. EXPERIMENTAL SETUP

We tested the different similarity measures and cost aggregation strategies on the four benchmark image datasets provided by Scharstein et al. [11]. The accuracy of the algorithms is evaluated by measuring the percentage of pixels whose absolute disparity difference from the ground truth is more than 1. The experimental testbed configuration is shown in Table 3.1. The Quadro GPU is connected to the CPU via a PCIe connection. We implemented four cost initialization methods (AD, Census, Rank, Gradient, and their combinations), three cost aggregation strategies (CW, AW, CROSS), WTA disparity estimation, and multiple refinement optimizers in OpenCL.

3.3.2. SPEED ANALYSIS

Table 3.1: Experimental Testbed

Name          Value
GPU Type      NVIDIA Quadro5000
CPU           Intel Xeon X5650
OS            UBUNTU 10.04
GPU Compiler  CUDA v4.0
CPU Compiler  GCC v4.4.3

Fig. 3.2a and Fig. 3.2c show the aggregation time and the speedup when using naive GPU implementations of the cost aggregation algorithms. Since we use the integral image (prefix sum) for CW and CROSS, their performance depends on the image size but is independent of the window size. As can be seen from Fig. 3.2c, the speedups stabilize at 4× for CW and 19× for CROSS. For AW, the speedup starts at 9× because of the smaller window sizes, then decreases to 5× and finally settles around 4×. In terms of aggregation time, AW increases exponentially when the window is extended (it needs about 16 seconds with a 33 × 33 window), while CW and CROSS stay steady around 30 ms and 67 ms, respectively.

For the optimized GPU versions (shown in Figures 3.2b and 3.2d), we can clearly see the speedup over the sequential and naive implementations due to the optimization techniques. Specifically, the speedups for CW, AW, and CROSS increase to 17×, 90×, and 35×, respectively (shown in Fig. 3.2d). AW benefits most from the use of coalesced memory access and local memory. As can be seen from Figure 3.2b, the optimization techniques decrease the AW aggregation time to 670 ms for a 33 × 33 window. At the same time, the aggregation times for CW and CROSS stay around 9 ms and 37 ms, i.e., 3× faster for CW and 2× faster for CROSS than the naive implementations.

Fig. 3.2e shows the execution times for the different similarity measures using sequential code on the CPU. Census is by far the most computationally expensive metric, and Rank is also computationally more demanding than the remaining metrics. However, Rank is the most efficient metric for GPU optimization, as shown in Fig. 3.2g and 3.2f. Census is not efficient for naive parallelization, but it utilizes the optimization techniques better than AD and Gradient, and reaches a similar speedup after optimization.
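To see where the cost of Census comes from, consider the following illustrative C sketch: every pixel is turned into a bit string of brightness comparisons, and every cost evaluation is a Hamming distance over those strings. Names and the 5×5 window are illustrative.

/* Census code for one pixel: each neighbour contributes one bit (1 if
 * darker than the centre). Caller guarantees the pixel is at least two
 * pixels away from every image border. 24 bits are used. */
unsigned int census_5x5(const unsigned char *image, int width, int x, int y)
{
    unsigned int bits = 0u;
    unsigned char center = image[y * width + x];
    for (int dy = -2; dy <= 2; ++dy)
        for (int dx = -2; dx <= 2; ++dx) {
            if (dx == 0 && dy == 0) continue;
            bits = (bits << 1) | (image[(y + dy) * width + (x + dx)] < center);
        }
    return bits;
}

/* Matching cost between left and right Census codes: Hamming distance. */
int census_cost(unsigned int left_bits, unsigned int right_bits)
{
    unsigned int diff = left_bits ^ right_bits;
    int count = 0;
    while (diff) { diff &= diff - 1u; ++count; }   /* count set bits */
    return count;
}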

3.3.3. ACCURACY ANALYSIS

Each aggregation method is tested with four different similarity measures. The percentage of erroneous pixels over all image regions with respect to various window sizes for the four benchmark datasets is presented in Fig. 3.3. Since cross-based aggregation is an adaptive-size support aggregation technique, its results do not vary with the window size. The results indicate that AW slightly outperforms the other two aggregation strategies when the window sizes are large. As the window size increases, the accuracy of AW increases, while the accuracy of CW starts to decrease (after an optimum size around 11) as the blocks cross over disparity discontinuity locations more frequently.

Figure 3.3: The percentage of erroneous pixels with respect to window size, for each combination of similarity measure (AD, AD+Census, AD+Gradient, AD+Rank) and aggregation strategy (AW, CW, CROSS).

Similarity measures affect the performance of the cost aggregation methods significantly. The accuracies of CROSS and CW depend more strongly on the similarity measure than that of AW. In terms of resulting accuracy, AD is the worst performer, which supports the results of Mei et al. [20] on the enhanced performance of combined similarities. Our results indicate that AD+CENSUS outperforms the other similarity measures for almost all aggregation strategies and datasets. However, CENSUS is the most time-consuming similarity measure, as depicted in Fig. 3.2e. The accuracy obtained using AD+RANK is very close to that of AD+CENSUS, since both exploit the structural information of the local neighbourhood of pixels. However, AD+RANK is computationally efficient and more suitable for GPU parallelization and optimization than AD+CENSUS. Therefore, AD+RANK is the most promising similarity metric for stereo matching applications that require both high speed and accuracy.
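A minimal C sketch of the Rank transform (illustrative names; the matching cost between two pixels is then simply the absolute difference of their ranks, which AD+RANK combines with the absolute color difference):

/* Rank of one pixel: the number of neighbours in the window that are
 * darker than the centre pixel. Like Census, it encodes local structure
 * and is robust to radiometric differences between the two cameras, but
 * it produces a single small integer instead of a bit string. */
int rank_transform(const unsigned char *image, int width, int height,
                   int x, int y, int radius)
{
    int rank = 0;
    unsigned char center = image[y * width + x];
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int ix = x + dx, iy = y + dy;
            if (ix < 0 || ix >= width || iy < 0 || iy >= height) continue;
            if (image[iy * width + ix] < center) ++rank;
        }
    return rank;
}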

The top accuracies for each aggregation strategy are shown in Fig. 3.3. AW and CROSS perform similarly, and both outperform CW significantly. From the speed results, cost aggregation strategies with a constant window size, such as AW, have higher speedup opportunities; however, AW cannot be implemented with the integral-image technique, which provides the fastest setup for CROSS and CW. Since CROSS performs similarly to AW and is significantly faster with the GPU optimization techniques, CROSS is the more promising aggregation strategy for stereo matching applications that require both high speed and accuracy.

3.4. CONCLUSION

In this study, we have developed and evaluated highly optimized implementations of modern stereo-matching algorithms by exploiting GPU-specific optimizations. Our evaluation of the performance of combinations of various similarity measures and aggregation strategies shows that such GPU-specific optimizations can lead to significant computational gains whilst attaining state-of-the-art accuracies. In particular, our fastest algorithm produces good results on the Middlebury benchmark stereo images while running at more than 110 fps. Moreover, our results have shown that "slow" adaptive weighting strategies can be sped up much more than faster cross-based approaches, thereby significantly reducing the computational difference between the two aggregation techniques. In terms of similarity metrics, we showed that the most computationally heavy metric, the Census transform, can be parallelized as effectively as the other metrics with the optimization strategies. However, the Rank transform is the better option in terms of the accuracy-speed tradeoff for applications with high frame-rate requirements.


Figure 3.4: Results on Middlebury scenes. From top to bottom: the Tsukuba, Venus, Teddy and Cones scenes. From left to right: reference images, ground truth disparities, the best results of CW, CROSS and AW respectively.

4. KINECT DEPTH MAP REFINEMENT

The Kinect depth sensor was released in 2010 by Microsoft as a human-machine interaction interface, mainly for gaming purposes. It was succeeded by a newer version in late 2014, which has active IR technology so that IR images can be used instead of RGB when the lighting in the environment is not adequate for accurate tracking. Both Kinect versions provide high-resolution depth maps in real-time. The first Kinect measures the depth of the scene by measuring the scattering of the reflected infrared (IR) pattern from the scene. The Kinect 2 (Durango) measures the depth with its Time-of-Flight (ToF) sensor. In contrast to their adequate depth quality for tracking applications, the accuracy of the measured depth map is strongly influenced by the IR reflectance characteristics of the surfaces inside scenes and by the amount of occlusion caused by the horizontal shift between Kinect's IR transmitter, receiver and RGB camera. Specular and occluded surfaces, where the Kinect cannot measure the depth, lead to depth gaps in the depth map, as shown in Fig. 4.1.b. Additionally, Kinect's depth maps may not preserve sharp depth discontinuities accurately because of its hardware set-up. As shown in Fig. 4.1.c, the Kinect projects an IR pattern onto the scene, from which it calculates the depth using triangulation between the received pattern and a reference pattern stored inside the Kinect [37]. Measurement errors, lighting conditions and surface characteristics affect the received IR speckles and can create incorrect depth measurements at the depth discontinuities. Additionally, since the IR transmitter, receiver and RGB camera are shifted, the depth and RGB images are misaligned, which is also a problem for the new Kinect. This can be partly solved in a calibration step, but cannot be completely fixed due to incorrect depth measurements at the depth discontinuities.

This chapter is based on the works published in:

G. Saygili, C. Balim, H. Kalkan, and E. A. Hendriks. Hierarchical Grid-Based Learning Approach for Recovering Unknown Depths in Kinect Depth Maps. In Proceedings of Image Analysis and Recognition, pages 658-667, 2013.
G. Saygili, L. J. P. van der Maaten, D. M. J. Tax, and E. A. Hendriks. Robust Kinect Depth Map Refinement. Submitted to Signal, Image and Video Processing (SIVP).



Figure 4.1: Kinect views: (a) color, (b) depth, (c) IR.

We call this problem the mismatch problem, since the edges in the color view do not match exactly with the discontinuities in the depth view. The mismatch problem is illustrated in Fig. 4.2, where the edges between the green (foreground) and red (background) regions in Fig. 4.2.a do not match exactly with the synthetically generated depth view in Fig. 4.2.c. The black region represents the unknown (occlusion) regions around the foreground, of which the depth needs to be predicted. The mismatch region is indicated in blue in Fig. 4.2.d. Fig. 4.4 shows this problem in a real Kinect depth map: the region in red does not contain any foreground (human) pixels in the color image, Fig. 4.4.c, whereas the depth map includes foreground depth, as depicted in green in Fig. 4.4.d.

The gaps and the mismatch problem may negatively influence the accuracy of applications such as virtual-view rendering [38]. In Fig. 4.5.a, the region indicated with a red rectangle is missing in the raw depth map and therefore does not exist in the rendered view. The depth of the pixels with unknown depth values in the raw depth map can be recovered using bilateral-filtering-based inpainting algorithms [39–41]. Bilateral filtering is commonly used in image filtering to smooth the image while preserving the edges [42]. The main disadvantage of using bilateral filtering in depth map inpainting is the problem of misalignment. Kinect depth inpainting algorithms fail to correctly estimate the missing depth values because they do not assume any presence of outliers. This type of misalignment between the Kinect views causes inpainting algorithms to propagate wrong depth estimates through the occlusion regions between the foreground and the background, as shown in Fig. 4.2.e and 4.4.e. The wrong depth propagation at depth discontinuities perturbs the virtual-view rendering accuracy, as depicted in Fig. 4.5.f, and the point-cloud reconstruction performance, as shown in Fig. 4.5.h. Depth discontinuities are also a problem for ToF depth sensors: the red regions in Fig. 4.3 indicate wrong depth measurements of the Kinect 2 at depth discontinuities.
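To make this failure mode concrete, the following is a minimal C sketch of bilateral depth inpainting in the spirit of [39–41]; all names and parameters are illustrative, not taken from any of those methods. An unknown pixel is filled with a spatial- and color-weighted average of valid neighbors.

#include <math.h>

/* Fill one unknown depth pixel (x,y) from valid neighbours, weighted by
 * spatial distance and by colour similarity in the registered view.
 * Returns 0 if no valid neighbour exists (pixel stays unknown). */
float inpaint_depth_bilateral(const float *depth, const unsigned char *gray,
                              int width, int height, int x, int y,
                              int radius, float sigma_s, float sigma_c)
{
    float num = 0.0f, den = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            int ix = x + dx, iy = y + dy;
            if (ix < 0 || ix >= width || iy < 0 || iy >= height) continue;
            float d = depth[iy * width + ix];
            if (d <= 0.0f) continue;                 /* skip unknown depth */
            float ds2 = (float)(dx * dx + dy * dy);  /* spatial distance^2 */
            float dc = (float)gray[iy * width + ix] - (float)gray[y * width + x];
            float w = expf(-ds2 / (2.0f * sigma_s * sigma_s))
                    * expf(-(dc * dc) / (2.0f * sigma_c * sigma_c));
            num += w * d;
            den += w;
        }
    return (den > 0.0f) ? num / den : 0.0f;
}

Because the weights consider only spatial and color proximity, a background pixel that sits on the wrong side of a misaligned boundary can receive foreground depth with high weight, which is exactly the propagation error visible in Fig. 4.2.e and 4.4.e.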

In this chapter, we introduce a novel way to estimate the missing depth values by applying a depth-variation modeling algorithm. The proposed algorithm uses depth-guided segmentation to find regions that are unlikely to have depth discontinuities. The assumption that homogeneous color regions contain no depth discontinuity is used by many state-of-the-art stereo matching algorithms [14, 15, 26, 43]. Our algorithm uses iteratively reweighted least squares (IRLS) to model the depth variation inside each segment and recover the unknown depth values. Applying IRLS iteratively over multi-layered depth-guided statistical region merging (DSRM) segments is computationally expensive. To speed up the DSRM-based refinement in sequences, we introduce a way to recover subsequent frames of a sequence without using
