Exploration and Learning for Cognitive Robots

DISSERTATION

for the degree of Doctor at Delft University of Technology, by the authority of the Rector Magnificus, Prof. ir. K.C.A.M. Luyben, chairman of the Board for Doctorates, to be defended in public on Tuesday 8 January 2013 at 10:00

by

Maja RUDINAC

Diplom Ingenieur in Electrical Engineering, University of Belgrade, Serbia


Composition of the doctoral committee:

Rector Magnificus, chairman
Prof. dr. ir. P.P. Jonker, Technische Universiteit Delft, promotor
Prof. dr. ing. T. Asfour, Karlsruhe Institute of Technology (KIT)
Prof. dr. R. Babuška, Technische Universiteit Delft
Prof. dr. F.C.T. van der Helm, Technische Universiteit Delft
Prof. dr. C. Jonker, Technische Universiteit Delft
Prof. dr. L. de Witte, Universiteit Maastricht (UM)
Dr. M. Loog, Technische Universiteit Delft
Prof. dr. J. Dankelman, Technische Universiteit Delft, reserve member

This work has been carried out as part of the FALCON project under the responsibility of the Embedded Systems Institute with Vanderlande Industries as the carrying industrial partner. This project is partially supported by the Netherlands Ministry of Economic Affairs under the Embedded Systems Institute (BSIK03021) program.

Copyright © 2012 by M. Rudinac

Cover design by J. Popadić

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

ISBN 978-94-6186-111-5


1 Introduction
1.1 Why does the world need cognitive robots?
1.2 The concept of cognition in the history of philosophy
1.3 Development of cognition in contemporary psychology
1.4 State of the art on cognitive robots design
1.5 Requirements for cognitive robot design
1.6 Cognitive Robot Architecture
1.7 Thesis outline

2 Keypoint Extraction and Selection for Object Recognition
2.1 Abstract
2.2 Introduction
2.3 Related work
2.4 Combining keypoints
2.4.1 Building blocks
2.4.2 Syntheses
2.5 Method for keypoint reduction
2.5.1 Reduction using spatial criteria
2.5.2 Reduction using entropy
2.6 Testing and analysis
2.6.1 Repeatability results
2.7 Conclusion

3 A Fast and Robust Descriptor for Multiple-view Object Recognition
3.1 Abstract
3.2 Introduction
3.3 Descriptor extraction
3.3.1 Descriptor components
3.3.2 Feature combining
3.4 Normalization step
3.5 Testing and analysis
3.5.1 Training and testing datasets
3.5.2 Overall recognition performance
3.5.3 Testing under various conditions
3.5.4 Comparative analysis with the state of the art methods
3.5.5 Speed analysis
3.6 Conclusion

4 Multiple-View Object Recognition with small number of training examples
4.1 Abstract
4.2 Introduction
4.3 Object description
4.3.1 2D Global Feature Vector
4.3.2 3D Global Feature Vector
4.4 Dominant feature selection
4.5 Experimental setup and results
4.5.1 Baseline normalisation steps
4.5.2 Columbia Object Image Library Dataset (COIL-100)
4.5.3 ALOI dataset
4.5.4 RGBD dataset
4.6 Conclusion

5 Saliency Detection and Object Localization in Indoor Environments
5.1 Abstract
5.2 Introduction
5.3 Saliency detection in a scene
5.3.1 Saliency map generation
5.3.2 Detecting interest points in the saliency map
5.4 Clustering of the interest points
5.5 Results, Evaluation and Analysis
5.6 Conclusion

6 Active grasping of unknown objects using an underactuated gripper
6.1 Abstract
6.2 Introduction
6.4 Initial object localization
6.5 Visual tracking
6.5.1 Robust tracking based on online learning
6.6 Active grasping
6.6.1 Approach and grasp phase
6.6.2 Recovery Mechanisms
6.7 Experimental Setup and Results
6.7.1 Tracking results
6.7.2 Grasping experiments
6.8 Conclusion

7 Learning and Recognition of Objects Inspired by Early Cognition
7.1 Abstract
7.2 Introduction
7.3 System layout
7.3.1 Visual Attention
7.3.2 Smooth pursuit
7.3.3 Object description
7.3.4 Novelty detection
7.3.5 Incremental learning of objects
7.3.6 Active recognition of objects
7.4 Experimental Setup
7.4.1 Learning and recognition approaches
7.4.2 Dataset
7.4.3 Applying the SIFT descriptor
7.5 Experimental Results
7.5.1 System performance in uniform illumination conditions
7.5.2 System performance in classification of novel objects
7.5.3 System performance in variable illumination conditions
7.5.4 System performance in the presence of clutter
7.5.5 Performance dependence on the number of learned observations and dominance weighting
7.5.6 Multi-modal descriptor analysis
7.6 Conclusion
7.7 Acknowledgements

8 Item recognition, learning and manipulation in a warehouse input station
8.1 Abstract
8.2 Introduction
8.3 System layout
8.5 Item descriptors
8.5.1 Local descriptors
8.5.2 Global multiview descriptors
8.5.3 Elliptic Fourier Descriptors
8.6 Item learning
8.7 Detecting grasping points
8.8 Conclusion

9 Real Time Fall Detection and Pose Recognition in Home Environments
9.1 Abstract
9.2 Introduction
9.3 Related work
9.4 Motion detection
9.4.1 Background extraction
9.4.2 Motion History Images
9.4.3 Area measurements
9.4.4 Shape measurements
9.5 Action recognition
9.5.1 Action Triggering
9.5.2 Fall and walk detection
9.5.3 Action/Pose Models
9.5.4 Bend and Collapse Detection
9.6 Testing and Results
9.7 Conclusion

10 Conclusion
10.1 Research goal
10.2 Part 1: Knowledge formation
10.3 Part 2: Sensory motor integration
10.4 Part 3: Knowledge acquisition and formation
10.5 Part 4: Real world applications

Bibliography
Summary
Samenvatting
Acknowledgements
Curriculum Vitae
Colophon

Chapter 1

Introduction

"Intelligence is the ability to adapt to change."
Stephen Hawking

1.1 Why does the world need cognitive robots?

Over the last decades, robots have been replacing humans in dangerous or tedious tasks in many situations: first in industry, but also in remote exploration, disaster intervention, the military, etc. In all those situations, however, robots were kept away from humans and were often remotely controlled by humans. Two trends are about to change this paradigm: on the one hand, new technologies and designs make it possible to build affordable, safe and efficient machines; on the other hand, the evolution of modern societies creates the need for robots operating together with humans in human environments. Besides consumer market robotics, whose expansion mainly depends on an affordable offer of extremely versatile machines, there is one domain where the development of service robotics is a growing need: daily assistance and monitoring for elderly or disabled people.

One of the main concerns in current European society is the rapidly growing number of elderly [25]. Due to the modern healthcare system people tend to live longer, but eventually health problems pile up, resulting in a growing number of frail elderly. Diabetes, Alzheimer's disease and dementia are well-known issues, but everyday activities, such as putting on socks and shoes, turning around in bed, or going alone to the bathroom at night, also demand monitoring and intervention. Disabled people, on the other hand, need help with executing basic household jobs and are in need of constant care. Finally, people that need care and live alone often crave attention and someone to communicate with, and robots such as the therapeutic Paro robot might be a solution to that [47]. Robots can therefore also serve as companions that monitor and guard people, execute everyday household tasks such as cleaning, cooking and shopping, interact with them, and serve as a communication bridge to their families and to information on the internet.

However, in order for service robots to enter human-centered environments, it is indispensable to equip them with the manipulative, perceptive and communicative skills necessary for real-time interaction with the environment and humans. They also need to expand and reuse their knowledge in order to adapt to novel situations and users. The goal of our work and this thesis is to provide reliable and highly integrated robotic platforms which facilitate, on the one hand, the realization of service tasks in household scenarios and, on the other hand, the implementation and testing of various research activities.

1.2 The concept of cognition in the history of philosophy

The concept of cognition first appeared in ancient Greece. It originates from the ancient Greek verb gignōskō, 'to learn, to know' (noun: gnōsis = knowledge), which in the broad sense means 'to conceptualize' or 'to recognize'. The earliest philosophers tried to define the concept of knowledge. Beginning with Thales in the 6th century BC, philosophers constructed various theories about the ultimate nature of reality and the elements of which the world is composed. Already by about 500 BC, Heraclitus asserted that we can never say anything true about anything and therefore there is no real knowledge, since everything is constantly changing. Later, throughout the history of western philosophy, philosophers challenged this opinion and assumed that there were two kinds of knowledge: everyday knowledge, often referred to as mere belief, and real knowledge, which is absolutely certain and true for all time. The Greeks called this episteme (hence epistemology), and the Romans called it scientia (hence science).

The first philosopher who defined real knowledge was Socrates. His general approach was that if we are to have genuine knowledge, we must first have valid concepts, and we can only know that our concepts are valid if we can give a correct definition of them. The only way to derive a correct definition is through dialogue and constant revision of the definition. His successor, Plato [99], further extended this idea and stated that our sensory experiences consist only of floating images, which at best give a highly misleading picture of the nature of reality. The only realities are unchanging, abstract entities, such as concepts (ideas or forms), mathematical entities (numbers and perfect geometrical figures), and immaterial souls. Therefore, reality resides not in the concrete objects we perceive but in the abstract forms that these objects represent. Plato also laid the foundations of the first theory of perception and defined the following three principles:

• Everything is motion and our sensory experiences of the external world are constantly changing


• To see an object is to act on it

• Plato reconciles the outward motion towards the object and the inward motion from the object, by saying that perception takes place in the space between the eye and the object through their interaction

Today, more than two millennia later, the legacy of Plato can still be found in modern science. Mathematical representations are used to describe the visual properties of objects, while the principle of embodied cognition states that action is a crucial part of learning.

Plato's theory of perception was further broadened by Aristotle, who proposed the first theory of mind [9]. Aristotle describes mind as nous, often also rendered as 'intellect' or 'reason': the part of the soul by which it knows and understands. He distinguishes the 'practical mind' / 'practical intellect' / 'practical reason' from the 'theoretical mind' / 'theoretical intellect' / 'theoretical reason' [10]. He also devotes a great deal of attention to perception, discussing both the general connection between mind and senses and the individual senses. He states that perception is a case of interaction between two suitable agents: objects capable of acting and capacities capable of being affected. And, to be able to perceive, subjects need to have a suitable body. However, perceived objects, which in reality are constituted by matter, are configured by forms / concepts in the mind. The process of concept formation involves the intellect as an agent that proceeds at a higher level of abstraction than perception and in fact comprehends the structural features of the objects of thought. Hence, the state of the various sensory operations, and of the bodily organs that carry them out, profoundly affects intellectual cognition and affection. He also stated that the route to knowledge is through empirical evidence, obtained through experience and observation, and by this he laid the foundations for contemporary psychology.

Starting from the Renaissance, modern philosophers and scientists refined the old theories of perception [113] while preserving the following principles:

• Reality is not the same as our perceptual images of it

• There is a real world underlying the illusory world of sensory experience, and its only genuine properties are those which are measurable such as size, shape, weight, etc.

• It is only through mathematics (in particular, geometry) that we can know and understand the real world

In line with the assumptions above, Galileo [43] makes a sharp distinction between primary qualities such as shape, size, and motion, which are logically inseparable from bodily substance, and secondary qualities, which exist only in the mind of the observer. Descartes [31] additionally postulates that all things in the world are only accessed indirectly and that we only have access to the world of our ideas. This world includes all of the contents of the mind, such as perceptions, images, memories, concepts, beliefs, intentions, decisions, etc. He also introduces local memory and states that it is placed outside the physical human body. Furthermore, he states that humans only remember the most significant characteristics and memories of objects, and that these do not need to resemble the object itself. This is the first time that the idea of selective features was proposed. He further defines a cognitive being as a thinking thing (Cogito ergo sum) with an intelligent body: something that doubts, understands, affirms, denies, wills, refuses, senses and has mental images.

Bertrand Russell, in his theory of knowledge [112], provided an analysis of the differences which may occur between various cognitive relations, namely attention, sensation, memory, and imagination, and gave an explanation of how cognitive data such as perceptions and concepts may become elements of knowledge. He also explained how knowledge is formed from logical or empirical facts, and described the difference between direct and indirect knowledge (i.e. knowledge by acquaintance and knowledge by description). Knowledge by acquaintance is obtained through a direct interaction between a person and the object that the person is perceiving. Similar to Descartes, he claims that sensory data from that object are the only things that are learned and that we can never know the physical object itself. He also investigated whether we can obtain knowledge of things that are beyond our own personal experience, i.e. knowledge by description. Russell argued that such abstract knowledge is possible, because we can describe things which we have not experienced if we use terms which are within our own personal experience.

Immanuel Kant first introduced the term a-priori knowledge in his work 'The Critique of Pure Reason' [61], cited as the most significant volume of metaphysics and epistemology in modern philosophy. He elaborates that our understanding of the external world has its foundations not merely in experience, but in both perception and a-priori knowledge, thus offering a non-empirical critique of the rationalist philosophy of the Renaissance. He makes a distinction between the a-posteriori, being contingent and particular knowledge, and the a-priori, being universal and necessary knowledge. He further states that the external world provides those things that we sense, but it is our mind that processes this information about the world and gives it order, allowing us to comprehend it. He also defined two different types of knowledge representation, intuitions and concepts, which are still used in modern theories of vision. Concepts are mediated representations which represent general characteristics of things. If we take the example of a chair, the concepts 'brown', 'wooden', 'chair', and so forth are, according to Kant, mediate representations of the chair. They can represent the chair by representing its general characteristics: being brown, being wooden, being a chair, and so forth. Intuitions are immediate representations that represent things directly. The perception represents the chair directly, and not by means of any general characteristics. He also postulated that both representations are essential for complete knowledge of an object. Hence his famous statement, 'Thoughts without content are empty, intuitions without concepts are blind.'


In contrast to Kant, John Locke [75], founder of the empiricist movement, presumed that humans are born without knowledge, as a tabula rasa (a blank slate), and are formed by experiences in society and the environment. He believed that humans are analogous to machines and that the human mind can be modelled as a set of sensory inputs leading to outputs. In this thesis, a similar principle will be applied to cognitive development in robots, where we assume that a robot has no initial knowledge and that all data is acquired through exploration.

In the 20th century, different disciplines emerged that tried to solve the question of how brain and mind cooperate and how cognition develops. Karl Lashley [72], one of the world's foremost brain researchers, tried to locate the area in the brain where engrams or memory traces are stored. He sliced or removed sections of rat brains after teaching the rats to run mazes. He showed that most brain tissue is highly specialized and that a typical cognitive act does indeed activate many places in the brain, but that each area does something different from the others: something for which it is specialized.

Donald Hebb [4], a neuropsychologist, connected the biological function of the brain as an organ with the higher function of the mind and explained the adaptation of neurons in the brain during the learning process. He proposed the theory behind associative learning, in which simultaneous activation of cells leads to evident increases in synaptic strength between those cells. Such learning is known as Hebbian learning. His research opened up the way for the creation of computational machines that mimic the biological processes of a living nervous system and served as the inspiration for the design of artificial neural networks and other machine learning algorithms.
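As a worked illustration of Hebb's rule (our own sketch, not taken from [4]), the weight update can be written as Δw = η · y · xᵀ, so that weights between co-active units grow. A minimal NumPy version, with illustrative activity patterns and learning rate:

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.01):
    # One Hebbian step: weights grow where pre- and postsynaptic
    # activity coincide (dw = lr * outer(post, pre)).
    return w + lr * np.outer(post, pre)

w = np.zeros((3, 4))                 # 4 presynaptic, 3 postsynaptic units
x = np.array([1.0, 0.0, 1.0, 0.0])  # presynaptic activity pattern
y = np.array([0.0, 1.0, 1.0])       # postsynaptic activity pattern
for _ in range(2):                   # repeated co-activation...
    w = hebbian_update(w, x, y)
print(w)  # ...strengthens exactly the synapses between co-active cells
```

Repeated presentation of the same pattern keeps strengthening the same synapses, which is the associative effect Hebb described.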

Alan Turing [132], mathematician and computer scientist, made an analogy between computers and human minds. He stated that the hardware is the brain and the software is the mind, and further speculated that thinking can be described in terms of algorithmic manipulation of information. This showed how a machine, capable of being in a finite number of states, could compute any computable number through the processing of symbols. He also devised the Turing test, which proposed that a machine should be declared intelligent if a human questioner is unable to distinguish the responses of an unseen human from those of an unseen machine. The subsequent development of computers from the 1950s onwards also revolutionized the fields of neuroscience and psychology, which started using mathematical models to describe the complex processes behind cognition. In the following section, existing models of human cognition will be explained.

1.3 Development of cognition in contemporary psychology

The most advanced example of a cognitive system is a human being, and by observing cognitive development in infants, many questions about the design of intelligent AI systems can be answered. The emergence of intelligence in humans comes both from the development of neuronal pathways in a body and from a person's dynamic interactions with his environment [129]. A similar postulate holds for robotics: renowned researchers such as Rodney Brooks [21] and Rolf Pfeifer [97] argued that true artificial intelligence can only be achieved by machines that have both sensory and motor skills and are connected to the world through a body.

Humans are very complicated, and it is difficult to mimic a fully developed human with robotic technologies. Therefore, it is necessary to understand how humans develop complicated functionality during their growth process. We may be able to mimic infants' functions on robots and make them evolve by tracing the human developmental process.

Piaget [98] proposed the first theory of the cognitive development of children. The main postulates of his theory are that children construct their own knowledge in response to their experiences and that they are capable of learning alone, without interference from other children or adults. He also stated that the exploratory motive to acquire new knowledge gives strong intrinsic motivation for learning and that reward from adults is not explicitly necessary in this process. According to him, the first stage is a sensory-motor stage that occurs from 0 to 2 years of the baby's life; at this stage infants learn through their senses and from interactions with the environment. During this time, Piaget postulated, a child's cognitive system is limited to motor reflexes at birth, but the child builds on these reflexes to develop more sophisticated procedures. Children learn to generalize their activities to a wider range of situations and coordinate them into increasingly lengthy chains of behaviour. Therefore it is a very suitable period to observe and to mimic on a robot. According to Piaget, the cognitive development of infants during their first year is further divided into several periods, as described below:

1. 0-1 months: Reflex Schema Stage

Infants are born with inherited reflexes, and it is through those reflexes that the infant begins to build meaning and understanding. Babies first learn how their body can move and work. Their vision is blurred, but their attention models are developed, although attention spans remain very short. There is no object understanding; however, babies have an inherited preference for faces. The three primary achievements of this stage are sucking, visual tracking, and hand closure.

2. 1-4 months: Primary Circular Reactions

Babies notice objects, start following their movements, and are able to fixate on an object, showing that basic object recognition exists. They 'discover' their eyes, arms, hands and feet in the course of acting on objects. This stage is marked by responses to familiar images and sounds and anticipatory responses to familiar events. The infant's actions become less reflexive and intentionality emerges.

3. 4-8 months: Secondary Circular Reactions

The infant learns to coordinate vision and comprehension. Babies will reach for an object that is partially hidden, indicating knowledge of the object's location and appearance. Actions are intentional, but the child tends to repeat similar actions on the same object. Novel behaviours are not yet imitated.

4. 8-12 months: Coordination of Secondary Circular Reactions

This stage is deemed the most important for the cognitive development of the child. At this stage the child understands causality and is goal directed. The earliest understanding of object permanence emerges, as the child is now able to retrieve an object when its concealment is observed.

Renée Baillargeon [12] criticized Piaget's theory and stated that cognition and more advanced knowledge of objects and surroundings might arise sooner than Piaget assumed. All of Piaget's cognition experiments were based on the infant's performance in object manipulation tasks, which infants are unable to perform before 9 months, since their coordination, planning and execution of action sequences is not yet developed. She introduced experiments that measure cognitive development by measuring the infant's gaze. She showed that infants tend to look longer at novel than at familiar events and objects, which indicates that more advanced cognitive functionality must exist at that stage of development.

Spelke extended Piaget's theory and, by measuring the gaze of infants, observed how they form knowledge of objects and their environment [63]. She believes that humans are endowed with a small number of separable systems that stand at the foundation of all our beliefs and values, and that new flexible skills, concepts, and systems of knowledge build on these core foundations. The first core system is for object representation: independent of their cultural surroundings, all infants perceive the boundaries and shapes of objects in a similar way. The infants are also able to predict when objects will move and where they will come to rest. A second core system represents moving agents and their actions, and defines how the infant should act upon the objects or with the other agents in the surroundings. The third core system is for the numerical relationships of ordering, addition and subtraction. The fourth core system captures the geometry of the environment and serves both object localization and navigation. Finally, the last core system serves to identify members of one's own social group in relation to members of other groups and to guide social interactions with in- and out-group members.

Hadders-Algra stated that spontaneous motor activity is a major driving force in the development of the nervous system [48]. Motor development is based on the continuous interaction of genetic information and experience. During early human life, spontaneous motor activity in general is not goal directed. The most frequently observed movement is the General Movement, a movement in which all parts of the body participate. Typical General Movements are characterized by complexity, i.e. the simultaneous exploration of degrees of freedom in all participating joints, and variation, i.e. the ability to continue this exploration over time. Goal directed arm and hand activity emerges around 3-4 months after birth. The infant develops the ability to reach and grasp. The first reaching and grasping movements have an irregular and fragmented trajectory consisting of multiple movement units. The early reaches have a probing nature; the infant explores its repertoire of arm and hand movements [50]. During the following months, the reaching movement becomes increasingly fluent and straight, and the orientation of the hand becomes increasingly adapted to the object [51]. Concurrent with the emergence of reaching and grasping, visual acuity, stereopsis and postural control also show substantial improvement. Another major accomplishment during infancy is the development of postural control, resulting in the ability to stand and walk without support [49]. Furthermore, the development of looking in infants requires the ability to shift and maintain attention on specific objects and events. The ability to engage and disengage attention on targets is present at birth and develops rapidly over the first half year of life. The saccadic system for shifting gaze develops ahead of the system for smooth tracking; it is functional at birth, and newborn infants are fairly skilled at moving their gaze to significant events in the visual field, such as very attractive objects. The ability to control these actions is a basic aspect of cognitive development [138].

Most contemporary psychologists agree on the following facts regarding cognitive development in infants. This development depends crucially on motivations, which define the goals of actions. The two most important motives that drive actions and development are social (external) and explorative (internal) motivation. There are at least two exploratory motives: (a) the discovery of novelty and regularities in the world, and (b) the discovery of the potential of the infant's own actions. In the development of perception, there are two processes: (a) the detection of structure or regularity in the flow of sensory data, and (b) the selection of information which is relevant for guiding action. The loop of learning is as follows: the infant first preselects the most relevant / salient information, builds an efficient representation of it, keeping in mind the object's structure, and further learns its properties through exploration.

Before we can propose a cognitive robot architecture inspired by the principles outlined above, it is important to examine how they have already been used to design state of the art cognitive robots. This is the topic of the next section.

1.4 State of the art on cognitive robots design

This section introduces a series of integrated humanoid robot platforms that on the one hand allow testing of theories of cognitive development in humans and on the other hand allow the realization of service robot tasks in households or public spaces. These platforms can be divided into full humanoid platforms, wheeled robots with a humanoid upper body, and finally robots with a child-like appearance [137]. All robots described in the text below are depicted in Figure 1.1.

Figure 1.1: State of the art cognitive robots: Asimo, REEM, PR2, Robovie-II, Armar-III, Cosiro, the SOINN robot, CB2 and iCub

The most advanced robot in the world is certainly Asimo by Honda [1], designed to help elderly or disabled people in their homes. As one can see from Figure 1.1, it is a very complex and heavy humanoid platform with 57 degrees of freedom that can autonomously walk, jump and run, both forwards and backwards. It is equipped with two hands with independent finger control to perform sign language and efficient manipulation of small objects. It can also integrate information from multiple sensors to navigate in open space, track and predict the motion of multiple humans, and, from auditory input, perform voice recognition in noisy and crowded environments. Besides this, it can recognize faces, gestures and objects using offline learned data. Although the new generation of Asimo robots showed significant progress in intelligent body design and complex locomotion, it mainly executes preprogrammed tasks and lacks the real intelligence necessary to adapt to novel environments. Another disadvantage is its cost of over one million Euro.

Another commercial platform, the REEM robot from PAL Robotics [134], lowers the cost to about 200,000 Euro in an attempt to attract a larger market. Designed as an amusement robot for public spaces, it has a wheeled platform with a humanoid torso, able to safely navigate in crowded environments. Since it mainly focuses on human robot interaction, it has people detection and face recognition integrated, as well as speech recognition and synthesis. Additionally, it is equipped with a dynamic information point system that can be used in a wide variety of multimedia applications, for instance to display an interactive map of the surrounding area, look up tourist information, or offer tele-assistance via video-conferencing. Although it has basic social interaction skills, the main drawbacks of the REEM robot are its inability to manipulate small objects and its lack of algorithms for learning and adapting to new environments.

The PR2 robot from Willow Garage [2] is a commercial platform that costs approximately half a million Euro. It is a two-armed wheeled robot with complex 7 DOF arms and actuated grippers able to grasp household objects. It is capable of safely navigating and performing many advanced household tasks, such as folding clothes, opening the fridge and grasping a drink, or pushing elevator buttons. Another advantage is its advanced state machine with incorporated semantic representations and reasoning methods, which allows object search in large scale environments. Common sense knowledge acquired from Internet users is utilized to bootstrap probabilistic models of the typical locations of objects, which are updated using the robot's previous experiences and observations. Also, optimal search paths are chosen depending on the situational context of the robot. Such methods allow the PR2 to perform complex fetch and delivery tasks, such as bringing a sandwich to a user in a multi-floor building [115]. There are also attempts to model episodic and semantic memory on the robot. However, this robot still lacks the capabilities for online object and activity learning.

Besides commercial companies, various universities and research institutes have developed custom-made platforms for specific robotic research purposes.

Robovie-II [54] is a robot shopping assistant with a humanoid torso on a wheeled platform and vergent eyes in its head. The robot follows a shopper around the store, carrying the load, reminding the shopper of the items on his shopping list, and recommending additional products to pick up. Very interestingly, the robot is part of a larger network of sensors and wireless devices and uses network observations of the environment to update its behaviour accordingly. From a cognitive perspective, the robot was designed to test a joint attention module, shared by the robot and a human.

The humanoid robot Armar-III has 43 DOF and is a wheeled holonomic platform with a humanoid torso, 7 DOF arms and actuated five-fingered grippers. The platform was specially designed to test planning and execution of actions at all levels of its cognitive architecture [141]. Most importantly, it implements autonomous acquisition of visual object representations and autonomous grasping of objects from offline learned grasping models [100]. This was also the first attempt to broaden a robot's knowledge online through exploration.

The NimbRo [124] is another robot that is very similar in hardware configuration to Armar, in the sense that it can perform complex manipulation and navigation tasks. Its main advantage is its robot-robot coordination and advanced robot behaviour control, so that robots can efficiently communicate and distribute and share work among themselves. An additional advantage is its ability to perform cooperative tasks with humans, such as carrying a large object together [123]. This is a step towards the development of collective intelligence in robots.

The SOINN robot from the Tokyo Institute of Technology is programmed with reasoning processes similar to those of human thought and is capable of learning new tasks and actions on its own. It uses an algorithm called SOINN (self-organizing incremental neural network) to adapt to new situations and continuously learn new information. The robot takes visual, auditory, and tactile data as input, and when confronted with a new task, it uses its past experiences in conjunction with sensory data to determine how to act in a specific situation. It can also ask for help and communicate through the internet with other robots to share its knowledge [126]. However, the large drawback is that the robot's a-priori knowledge must be learned offline and extensively labelled by human experts.

Finally, the last group of robots are specially designed with a child's body to test the development of cognition in infants. CB2 [53] has a human-like appearance similar to a child, 56 actuators and a soft silicone skin with tactile sensors. It was developed to test mechanisms of the development of sensory-motor coordination in a social context by measuring the somatosensory map based on tactile interactions with people. Another study tested a mechanism of motor development of infants with a person's help. Another child robot, iCub [84], has a body the size of a 3.5-year-old child and 53 DOF, specially designed for research on embodied cognition. It is able to crawl, sit and manipulate objects, and has a fully articulated head and eyes. It also has visual, vestibular, auditory, and haptic sensory capabilities. The robot is designed to act in cognitive scenarios, performing tasks useful for learning while interacting with the environment and humans [138].

To summarize, most state of the art robots developed so far have a very complex body design and advanced control algorithms to manage and program their complex structure. However, they still do not manage to display real intelligence, development of knowledge, and application of corresponding previous knowledge in novel environments or situations. In this thesis we follow a different approach and present a very simple, easily controllable robot, able to develop and bootstrap its declarative knowledge, assuming that in the beginning it has no insights into the world around it. In that way, we follow John Locke's empiricist approach: our robot is born as a tabula rasa, and all its knowledge is formed by experiences from interaction with its environment.

The design of an artificial system capable of developing cognitive abilities demands satisfying multiple requirements. We will elaborate on them in the next section.

1.5 Requirements for cognitive robot design

Observing the development of cognition in infants and contemporary robot architectures, several cognitive processes arise that a robot needs to possess in order to augment its intelligence. As displayed in Table 1.1, they are divided into multiple groups, ranging from smart body design and perception and action capabilities to robots able to adapt to new environments and having their own motivation. Furthermore, we can argue whether all the robot's knowledge should be developed and shaped by experience, as advised by the empiricists, or whether the robot should be pre-equipped with basic functionalities. Although current research indicates that the peripheral nervous system, and hence the reflexes, are not entirely built up at birth, we decided to limit our research to mimicking the cognitive development of infants up to 6 months and to assume that the robot already possesses basic perception and action abilities to control its own embodiment and receive sensor inputs.

Table 1.1: Guidelines for the Development of Cognitive Systems

Embodiment
• Rich array of physical sensory and motor interfaces
• Morphology integral to the model of cognition

Perception
• Attention fixated on the goal of an action
• Perception of objects
• Discrimination and addition of small number of objects
• Attraction to people (faces, their sounds, movements, and features)
• Recognition of people and actions
• Involvement of the motor system in discrimination between percepts
• Mechanism to learn hierarchical representations
• Pre-motor theory of spatial attention
• Pre-motor theory of selective attention

Action
• Early movements constrained to reduce the number of degrees of freedom
• Navigation based on dynamic ego-centric path integration

Adaptation
• Autonomous generative model construction
• Partial learning of affordances
• Transient and generalized episodic memories of past experiences

Motivation
• Explorative motives

Autonomy
• Autonomy preserving processes of homeostasis
• Minimal set of innate behaviours for exploration and survival
• Encode space in motor and goal specific manner

Moreover, learning of motor actions and skills, like learning to walk by reinforcement learning [117], is not part of this thesis. We therefore focus on the prerequisites for the development of looking, reaching and grasping abilities, as depicted in Figure 1.2.

Regarding the development of looking, the first step is the robot's ability to redirect its gaze and move its eyes and head towards the object of interest [119]. For that, an attention model needs to be developed that preselects the most interesting object in a scene. The next step is the development of a smooth pursuit model, which allows the robot to fixate on an object of interest while it remains static, or to dynamically track the object when both robot and object move. Finally, the robot needs to be able to track the object through occlusions and to predict where and when it reappears.

Figure 1.2: Example of looking, reaching and grasping abilities in infants
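As a hedged illustration of the first requirement (redirecting gaze with an attention model), the sketch below uses the spectral residual saliency detector from the opencv-contrib package to pick a fixation target in a camera frame; it stands in for, and is not, the attention model developed in Chapter 5, and the function name is our own:

```python
import cv2

def most_salient_point(frame_bgr):
    # Compute a spectral residual saliency map (float values in [0, 1]).
    detector = cv2.saliency.StaticSaliencySpectralResidual_create()
    ok, saliency_map = detector.computeSaliency(frame_bgr)
    if not ok:
        return None
    # The brightest location is the candidate fixation target.
    _, _, _, max_loc = cv2.minMaxLoc(saliency_map)
    return max_loc  # (x, y) pixel towards which head and eyes are turned
```

A smooth pursuit controller would then servo the neck and eye joints so that the returned point stays centred in the image.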

Regarding the development of reaching, the robot needs to be able to reach towards a visual target, first with a hand and then with its entire body [136]. Finally, it needs to adapt its reach towards a moving target and to learn when reaching is not possible.

For grasping development, the robot should be able to reach for static or moving objects, to perform grasp closure during the approach, and to efficiently match the grasp pose to an object's axis of symmetry [92].

Besides reaching and grasping, the robot should also be able to learn how to recognize objects, both according to their appearance and their function, and to bootstrap its declarative knowledge from experience. Finally, imitation learning needs to be developed, whereby the robot enriches its manipulation repertoire by observing humans performing specific actions and mimicking their behaviour [57]. For that, people tracking, face recognition and action recognition all need to be present.

In the next section we discuss a cognitive robot architecture and explain how this satisfies the requirements above.

1.6 Cognitive Robot Architecture

Inspired by the cognitive development of infants up to the first year of their life, and following the design requirements elaborated in Section 1.5, we propose a cognitive robot architecture that allows a robot to develop and bootstrap its declarative knowledge by interaction with the environment. In contrast to the state of the art robots of Section 1.4, we assume that the robot has no insights into the world at its beginning and that all its knowledge is formed online by exploration. To prove the feasibility of our architectural proposal, and following Plato's postulate [99] that to learn an object is to act on it, we have designed a simple and affordable platform, the robot Robby, as depicted in Figure 1.3. In its design we followed Kant's principle [61] that the robot needs to possess universal and necessary knowledge about the control of its own body, and therefore we programmed the robot's basic functionalities in terms of basic sensor input and basic actuator output systems. We are aware of the fact that robot skills can be learned by e.g. reinforcement learning [117]; however, the focus in this thesis is on cognitive functionality. Consequently, all knowledge of the world and the objects in it is acquired by the robot itself.

Figure 1.3: Robby, cognitive robot of the Delft Robotics team

The robot Robby is a personal robot able to perform more tasks than this thesis describes. It is also used for the RoboCup@Home competition [14], which requires robots to perform household tasks such as cleaning a room, bringing drinks, and following people. Our robot achieved much recognition through its demonstrations in several popular shows on Dutch national TV, at healthcare fairs and at children's events [3]. One of the main advantages of our robot is its friendly appearance and simple robot body, which makes its control easy to understand. It is equipped with a head and neck used for information acquisition and sensing, a mobile base, and an arm with an underactuated gripper that can grasp almost any object regardless of its shape or weight [68]. For its control a single standard laptop is used. A detailed description of the robot can be found in Chapter 6.

In Figure 1.4 a detailed scheme of the working memory model of Baddeley is presented [11]. The main processing unit is the central executive, which receives inputs from the senses and from long term memory and decides how to act. The visuospatial sketchpad is responsible for processing the acquired information. It can process visual information on object attributes such as color, texture and shape, as well as information on facial features. The spatial functionality covers geometry processing and localization, while the kinetic functionality allows processing of motion information such as gesture and human action recognition. Speech inputs are processed in the phonological loop using learned information from auditory long term memory. Finally, an episodic buffer indicates in which state the robot is and allows for plan execution. Our robot architecture is inspired by this architecture, but does not follow it directly.
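A hedged data-structure rendering of this model may clarify its components; the field names follow Figure 1.4, while the types and the function body are purely illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    # Visuospatial sketchpad: visual (color, texture, shape, faces),
    # spatial (geometry processing) and kinetic (gestures, actions) parts.
    visuospatial_sketchpad: dict = field(default_factory=dict)
    phonological_loop: list = field(default_factory=list)  # speech inputs
    episodic_buffer: dict = field(default_factory=dict)    # state and plan

@dataclass
class LongTermMemory:
    semantic: dict = field(default_factory=dict)   # learned objects, actions
    auditory: dict = field(default_factory=dict)   # learned words
    episodic: list = field(default_factory=list)   # stored plans

def central_executive(wm: WorkingMemory, ltm: LongTermMemory, senses: dict):
    # Receives inputs from the senses and long term memory and decides how
    # to act; here it just stores percepts and returns the current plan.
    wm.visuospatial_sketchpad.update(senses)
    wm.episodic_buffer.setdefault(
        "plan", ltm.episodic[-1] if ltm.episodic else None)
    return wm.episodic_buffer["plan"]
```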

In Figure 1.5 the cognitive architecture of the robot is presented. With this architecture we mimic an infant's brain, as Spelke suggested [63]. The main structure is a sense-think-act loop that operates on the world. The main actuators are the motors plus encoders and joint encoders. The motors with their controllers are daisy-chained, such that we can control them through a single USB port on the robot's on-board laptop, which houses all software to control the robot. The controllers control the wheeled base, arm, gripper, neck and eyes. As such, the ACT module represents the human's peripheral nervous system and its control and proprioceptive pathways. The main sensors are the visual (USB-connected) sensors, in our case a laser scanner apt for collision avoidance during movements of the base, a Microsoft Kinect for navigation of the arm in 3D space, and a high resolution camera for foveated vision during object recognition and tracking. Especially the saliency detection and attention modules make the SENSE module somewhat mimic the human's subcortical structures that implement visual reflexes and attention handling. The THINK layer above the SENSE and ACT layers represents the working memory / central executive of Baddeley's model [11]. It contains basic skills modules for navigation, reaching, grasping, tracking and speech synthesis, all of which we implemented (for now hardcoded, without reinforcement learning). The THINK layer also contains a cognitive module that implements speech recognition, speaker recognition, face recognition, object recognition and action recognition, of which only speaker recognition is not implemented on Robby. The functions in this block are all based on declarative semantic learning and form the focus of interest of this thesis.

Figure 1.4: The working memory model of Baddeley (central executive; visuospatial sketchpad with visual, spatial and kinetic processing; phonological loop; episodic buffer; semantic, episodic and auditory long term memory)

On a higher level the skills module contains the functionality of automotion tracking, object tracking, people tracking, gesture tracking and emotion tracking. For their functioning they need information from the recognition modules and the skills modules, but also directly from the sense and act modules. In Figure 1.5 we omitted drawing the relations (lines) between functions that use each other, as this would make the picture unreadable. On an even higher level one can identify activity recognition and activity planning as functions from the class of declarative episodic learning. We have not implemented this functionality. Finally, the top layer hosts various forms of long term memory in which proven concepts and persistent information on objects, people, scenes, actions and activities can be stored. We have not implemented this explicitly in Robby, although we touch on this subject when treating object recognition and learning. Finally, social (external) and explorative (internal) motivation find their place in the top layer of the THINK module. Again, we only touch on this implicitly when we treat object recognition.
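A minimal sketch of this SENSE-THINK-ACT loop is given below. It is a hedged illustration only: the class and method names are our own assumptions and do not reflect the robot's actual software, which runs the modules named above:

```python
class Sense:
    """Reads the robot's sensors: laser scanner, Kinect, camera."""
    def read(self):
        return {"laser": None, "kinect": None, "camera": None}

class Think:
    """Working memory / central executive: skills plus cognitive modules."""
    def decide(self, percepts, long_term_memory):
        # Attention would select the salient percept; a skill (navigation,
        # reaching, grasping, tracking, speech) maps it to a command.
        return {"base": (0.0, 0.0), "arm": None, "speech": None}

class Act:
    """Sends commands to the daisy-chained motor controllers."""
    def execute(self, command):
        pass  # base, arm, gripper, neck and eyes

def run(sense, think, act, long_term_memory, steps=100):
    for _ in range(steps):
        percepts = sense.read()                             # SENSE
        command = think.decide(percepts, long_term_memory)  # THINK
        act.execute(command)                                # ACT

run(Sense(), Think(), Act(), long_term_memory={})
```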

Consequently, regarding sensing, our robot is able to acquire spatial information on the environment: it uses the Microsoft Kinect in its head for object segmentation and localization, similar to the peripheral vision system in humans, while a laser scanner on its base is used for navigation and dynamic obstacle avoidance. For the focal vision system we use a high resolution camera to increase the precision of the recognition.

Figure 1.5: Proposed cognitive robot architecture

The robot's gripper is equipped with an infra-red sensor to provide the robot with some sense of touch, while speakers and a microphone allow verbal communication between users and the robot. All information that is sensed is selectively processed using the attention module. The robot is able to act on its environment and has several embedded behaviours. With its mobile base it can safely navigate in its environment; with its arm it can reach towards interesting objects, both known and unknown, and efficiently manipulate them with its underactuated gripper. Its neck allows its head to efficiently track objects, faces and persons, while the speakers allow the robot to express its intentions to users. The robot is further able to learn and store information in its long term memory. It can learn objects and actions and store them in its semantic memory, as well as new words from conversations with users, for which it uses some form of auditory memory. We also mimic the episodic memory, which stores the plans that the robot needs to execute. Finally, it is able to feel and motivate itself. For this we now implicitly use two initial motives: an exploratory motive, to constantly update its knowledge from interaction with the environment, and a social motive, to execute commands given by users. Currently, the plans of the robot's activities are given offline, but in future work we plan to dynamically recognize and learn them online.

1.7 Thesis outline

In this thesis we treat the development of algorithms that contribute to our cognitive robot architecture. As we only treat perception, cognition and declarative learning, while we omit the learning of actions, skills and activities, we divide the thesis into four main parts:

• The first part is on knowledge formation: how knowledge of the world is formed and described

• The second part is on sensory motor integration: how a robot can explore its environment to gain knowledge

• The third part is on knowledge acquisition and learning: how online learning of objects is achieved

• The fourth part is on real world applications: it treats two realized applications using our approach

In the first part, on knowledge formation, we define how knowledge about the world is formed and described. In Chapters 2-4, we introduce various ways to describe visual information and to combine it, despite the challenges of cluttered environments and varying illumination conditions. We propose how to efficiently combine information on the color, texture and shape attributes of unknown objects in order to register only the most representative features. As Descartes suggested [31], one of the characteristics of intelligent beings is to store only the most selective and representative information on the objects in their surroundings.

The second part, on sensory motor integration, is outlined in Chapters 5-6 and describes how the robot can explore its environment based on sensor input signals. In Chapter 5 we first introduce an attention model that is able to find salient parts of the environment and efficiently group them in order to detect and segment objects without any prior knowledge of their appearance or their background. In Chapter 6 we extend this model with spatial processing of the scene and propose a method that allows a robot to efficiently explore and manipulate unknown objects. This manipulation is based on constant visual pursuit of the selected object and adaptation of the robot's posture to the target in order to reach and grasp it. As defined in the requirements of Section 1.5, grasping of both static and dynamically moving objects is achieved. In all situations the robot has no prior knowledge of the environment or its objects.


The third part of this thesis is on knowledge acquisition and learning, and explains how online learning of objects is achieved. In Chapter 7, memory models of working and long term memory are proposed and the loop from knowledge acquisition to knowledge learning is explained. Additionally, we propose methods to efficiently use learned knowledge in novel situations.

In the fourth part, on real world applications, we present two realized applications based on our architecture. In Chapter 8, online learning of objects is used for the input station of a fully automated warehouse. Currently, at such an input station, human workers manually label new items entering the warehouse. They are replaced by a robotic arm that automatically learns new object models through exploration. In Chapter 9 we present smart cameras for elderly homes, which monitor the daily actions of the elderly and inform the responsible caregivers in case an accident happens. The actions of people are described using the change of motion information during the execution of a specific action; therefore the privacy of the users is fully protected. The system was tested in several elderly care centres and received recognition by entering the finals of the Delft Innovation Award.

Finally, in Chapter 10, we present conclusions as well as recommendations for future work.

KNOWLEDGE REPRESENTATION

Chapter 2

Keypoint Extraction and Selection for Object Recognition

Chapter modified from the article:

Maja Rudinac, Boris Lenseigne, Pieter Jonker: Keypoints extraction and selection for object recognition, in Proc. of the IAPR Conference on Machine Vision Applications (MVA 2009), Japan, May 20-22, 2009

2.1 Abstract

In order to improve the performance of affine invariant detectors, approaches that combine different keypoint extraction methods can be found in the literature. However, such a combination has two major drawbacks: a high computational cost in matching similar objects, and a large number of false positives, because only a relatively small subset of those keypoints is really discriminative. In this chapter we propose a method to overcome these limitations. First, we combine different keypoint extractors in order to obtain a large set of possible interest points for a given object. Then a two-step filtering approach is applied: first, a reduction using a spatial criterion to reject points that are close together within a specified neighbourhood, and second, filtering based on information entropy in order to select only a small subset of keypoints that offer the highest information content. A qualitative analysis of this method is presented.

2.2 Introduction

In cluttered real world scenes, object recognition is a demanding task and its success depends on the algorithm's invariance to partial occlusions, illumination changes and main object variations. For these situations, local invariant features seem to provide the most promising results [76], [13], since they are robust to occlusions, background clutter and content changes [87]. Variations of these features are successfully used in many applications. They are used to describe object appearance in order to determine the object class in bag of words models [36], where the information about feature location is neglected, or they are applied in applications where the information about the spatial feature distribution is crucial, such as the localization of autonomous mobile robots [45].

In our research, recognition is used for the purpose of object localization and grasping with various robotic arms, so both information about the object class and its current location must be provided in real time. The conditions present while creating object models differ a lot from the situation in which an object should be recognized and grasped. Moreover, our recognition framework should work in industrial as well as in lab environments. For these reasons, experimenting with local invariant features is a logical step. In order to gain as much information as possible, we decided to combine different keypoint extraction methods for detecting the object and then to reduce the number of found keypoints using an independent measure of information content. This reduction is performed for two reasons: to keep the most discriminative points, and to speed up the matching. For creating the object models, the most representative keypoints are then described using SIFT [76] and GLOH [85] descriptors. This chapter is organized as follows: Section 2.3 gives a short overview of related work. The detailed explanation of our approach and the test results are presented in Sections 2.4, 2.5 and 2.6. The final conclusions are drawn in Section 2.7.


2.3 Related work

Robust and affine invariant keypoint extraction is a well-known problem, and intensive research in this area has been done recently. A very detailed evaluation of affine region detectors by Tuytelaars et al. [133] gives a framework for testing future detectors as well as the state of the art and their performance. The analysis showed that the detectors extract regions with different properties, and the overlap of these regions is so small, if not empty, that one detector can outperform the others only for one type of scene or one type of transformation. In order to obtain the best performance, several detectors should be used simultaneously. This observation inspired us to experiment with different keypoint extraction methods. The best overall results were obtained using MSER [80], followed by the Hessian Affine detector [133]. Apart from these two, several other evaluations of detectors were published, e.g. detectors for 3D by Morales [89] and local features for object class recognition by Mikolajczyk [86]. Experiments showed that the Hessian-Laplace detector in combination with GLOH gives the best overall result. Stark [122] confirmed this and concluded that the choice of detectors is much more important for the overall recognition performance than the choice of descriptors. For this reason we limited our descriptor set to just the two that proved to be the best: SIFT and GLOH. Several authors tried to combine local invariant features, but without significant results [103], [40]. Experiments showed that combinations of detectors perform better than one detector alone if they produce keypoints in different parts of the image. However, the main problems they encountered are the matching speed of the detected keypoints and a high number of false matches, due to the fact that only a small number of points is really discriminative. The conclusion that arises is that a reduction must be applied. In this chapter we propose a method to overcome the mentioned limitations.

2.4 Combining keypoints

Several detectors and descriptors were used as building blocks in our combined algorithm. A short description of each of them follows below.

2.4.1 Building blocks

1. Hessian Affine: spatially localizes and selects the scale- and affine-invariant points detected at multiple scales using the Harris corner measure on the second-moment matrix. On each individual scale, interest points are chosen based on the Hessian matrix at that point [87], [85].

2. Harris Affine: relies on the combination of corner points detected through Harris corner detection, multi-scale analysis through Gaussian scale space, and affine normalization using an iterative affine shape adaptation algorithm. It makes it possible to identify similar regions between images that are related through affine transformations and that have different illumination [87], [85].

3. Hessian Laplace: a method that responds to blob-like structures. It searches for local maxima of the Hessian determinant and selects a characteristic scale at which the Laplacian attains an extremum in scale space [85].

4. MSER: a method for blob detection in images which yields a set of distinguished regions, defined by an extremal property of the intensity function in the region and on its outer boundary. It was originally used to find correspondences between image elements from two images with different viewpoints [80].

2.4.2 Syntheses

In our approach we decided to combine different detectors in order to extract a large set of points which offer as much different information about the object as possible. Combinations of either two or three different detectors were applied simultaneously on the image, and all extracted keypoints were saved together in the same subset. We mostly combined the detectors Hessian Affine, Harris Affine and MSER, while the other combinations were used for comparison. Since the number of keypoints is extremely large, we subsequently apply a reduction method, so that only the n most representative points for every extracted combination are selected. For this reduced set of keypoints, a SIFT or GLOH descriptor is calculated, forming the feature matrix for a given image. The entire method is displayed in Figure 2.1.
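As an illustration, the sketch below pools keypoints from several detectors into one set before reduction. The affine detectors used in this chapter (Hessian Affine, Harris Affine) are not available in stock OpenCV, so MSER and SIFT serve here as hypothetical stand-ins for the detector bank; only the OpenCV calls are real, the rest is illustrative.

import cv2

def detect_combined(gray):
    """Run several keypoint detectors and pool their output into one set.

    MSER and SIFT stand in for the detector bank used in the chapter.
    """
    detectors = [cv2.MSER_create(), cv2.SIFT_create()]
    keypoints = []
    for det in detectors:
        keypoints.extend(det.detect(gray, None))
    return keypoints

def describe(gray, keypoints):
    """Compute SIFT descriptors for the (reduced) keypoint set."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors

# Usage:
# gray = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)
# kps = detect_combined(gray)
# ... reduce kps as described in Section 2.5 ...
# kps, desc = describe(gray, kps)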

2.5 Method for keypoint reduction

We propose a two-step algorithm for keypoint reduction: first, we apply a reduction using a spatial criterion to reject points that lie close together in a specified neighbourhood, and then we filter based on information entropy, in order to select only a small subset of the most representative keypoints offering the highest information content.

2.5.1 Reduction using spatial criteria

As keypoints close to each other represent redundant information, these points are first filtered using a spatial criterion. Every keypoint is represented by an ellipse that defines the affine region. For every pair of keypoints we evaluate whether the absolute distance between the centres of their ellipses lies within a certain threshold. If it does, the points lie in each other's neighbourhood and only one point of the pair is kept. The threshold was determined by manual tuning; in the end we established the 9-neighbourhood of a point as the measure of closeness for our application. For a more restrictive reduction, higher thresholds can be chosen.

Figure 2.1: Scheme of the proposed method
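A minimal sketch of this spatial filter, assuming keypoints are given as OpenCV KeyPoint objects and that the 9-neighbourhood criterion translates to a centre distance of roughly 1.5 pixels (both assumptions, not taken from the text):

import numpy as np

def spatial_filter(keypoints, radius=1.5):
    """Greedily drop keypoints whose ellipse centre falls within
    `radius` pixels of an already accepted keypoint.

    radius=1.5 roughly mimics the 9-neighbourhood (3x3 block);
    larger values give a more restrictive reduction.
    """
    kept, centres = [], []
    for kp in keypoints:
        c = np.array(kp.pt)
        # Keep the point only if no accepted centre is within radius.
        if all(np.linalg.norm(c - k) > radius for k in centres):
            kept.append(kp)
            centres.append(c)
    return kept

The greedy pass is quadratic in the number of keypoints, which is acceptable at this stage since the filter runs once per image, before matching.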

2.5.2 Reduction using entropy

Since the set of extracted keypoints is computed using different techniques, an independent measure of keypoint relevance must be applied. It has been shown that the probability of correct matching increases with increasing information content [41], [121]. This inspired us to use the entropy of local regions for distinctive keypoint selection. We propose the following algorithm:

1. For every keypoint, a region of interest is set as the 9-neighbourhood around it.

2. If the keypoint lies on the edge of the image and its region of interest falls outside the image boundary, we clone the pixel values from the existing part to fill in the missing values of the 9-neighbourhood (see Figure 2.2).

3. Calculate the local entropy using (2.1) for every pixel within the region of interest. In this formula, P_i is the probability of pixel value i:

   H = −Σ_i P_i log₂ P_i        (2.1)

4. The entropy of the region of interest is now estimated as the Euclidean norm of the entropy values calculated for every pixel in the previous step.

5. Repeat steps 1 to 4 for every keypoint.

6. Sort the keypoints in descending order, according to the entropy values of the regions of interest around them.

7. Select only the first n ranked keypoints, with the highest entropy values.

8. Calculate the SIFT or GLOH descriptor only for those most representative keypoints.

In our tests we used n = 200 as a threshold value, but this number depends on the application and it is difficult to predict how many keypoints are really necessary for efficient recognition. One should also bear in mind that a higher number of extracted keypoints leads to a larger number of false positives. Obviously, a trade-off must be made between a high threshold, with a small number of features that allows fast matching, and a low threshold, with a higher number of features that provides more information about the image content.
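The sketch below implements steps 1 to 7 under one possible reading of the algorithm: the text leaves open how P_i is estimated, so here the probabilities are taken from the grey-value histogram of the whole image, and the border cloning of step 2 is realized by replicating edge pixels. Expect 8-bit greyscale input; all names are illustrative.

import numpy as np

def entropy_rank(gray, keypoints, n=200):
    """Rank keypoints by the entropy of their 3x3 neighbourhood
    and keep the n highest-scoring ones (steps 1-7).

    gray: uint8 greyscale image; keypoints: cv2.KeyPoint list.
    """
    # Replicate border pixels so edge keypoints still get a full
    # 3x3 patch (the "cloning" of step 2).
    padded = np.pad(gray, 1, mode='edge')

    # Probability of each grey value, estimated over the image.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    # Per-pixel entropy contribution -P_i * log2(P_i), eq. (2.1).
    h = np.zeros_like(p)
    nz = p > 0
    h[nz] = -p[nz] * np.log2(p[nz])

    scores = []
    for kp in keypoints:
        x, y = int(round(kp.pt[0])), int(round(kp.pt[1]))
        patch = padded[y:y + 3, x:x + 3]          # 3x3 region of interest
        scores.append(np.linalg.norm(h[patch]))   # Euclidean norm (step 4)

    order = np.argsort(scores)[::-1]              # descending (step 6)
    return [keypoints[i] for i in order[:n]]      # top n (step 7)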

Figure 2.2: Cloning of the missing pixels

2.6 Testing and analysis

2.6.1 Repeatability results

In order to determine the quality of the selected keypoints, we tested our algorithm using a standard framework for detector performance evaluation proposed by Mikolajczyk et al. [85]. The detector performance is characterized using the repeatability, defined as the average number of corresponding regions detected in images under different transformations. The repeatability score is calculated using ground-truth data for three types of scene that represent the main transformations: a scene with a viewpoint change, a zoomed and rotated scene, and a scene with varying lighting conditions. Since we work with object recognition in real-world situations with a constant change of environmental conditions, good results under these transformations are of crucial importance. We tested the following detectors and their combinations, labelled: hesaff - Hessian Affine; mshesa - MSER + Hessian Affine; harhes - Harris Affine + Hessian Affine; mseraf - MSER; kombaf - MSER + Harris Affine + Hessian Affine; heslap - Hessian Laplace. For the purpose of testing we reduced the number of keypoints approximately 10 times compared to the original size, and the repeatability was calculated separately for the reduced set and for the original one. The results are shown in Figures 2.3-2.8.
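For reference, a simplified point-based approximation of the repeatability score is sketched below. The actual framework of Mikolajczyk et al. measures the overlap of the detected elliptical regions, which is not reproduced here; instead, a keypoint counts as repeated if its projection under the ground-truth homography H lands within eps pixels of a detection in the second image. All names and the tolerance are illustrative.

import numpy as np

def repeatability(kps1, kps2, H, shape2, eps=2.5):
    """Approximate repeatability between two keypoint sets.

    H: 3x3 ground-truth homography from image 1 to image 2;
    shape2: (height, width) of image 2.
    """
    pts1 = np.array([kp.pt for kp in kps1], dtype=float)
    pts2 = np.array([kp.pt for kp in kps2], dtype=float)

    # Project image-1 keypoints into image 2 (homogeneous coords).
    ones = np.ones((len(pts1), 1))
    proj = (H @ np.hstack([pts1, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]

    # Keep only projections that fall inside image 2.
    h, w = shape2
    vis = (proj[:, 0] >= 0) & (proj[:, 0] < w) & \
          (proj[:, 1] >= 0) & (proj[:, 1] < h)
    proj = proj[vis]

    # Count projected points with a neighbour within eps pixels.
    d = np.linalg.norm(proj[:, None, :] - pts2[None, :, :], axis=2)
    repeated = (d.min(axis=1) <= eps).sum()
    return repeated / min(len(proj), len(pts2))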

Figure 2.3: No reduction, scene with viewpoint change

The overall conclusion can be drawn that if the original set is reduced even 10 times in size, the repeatability score decreases by no more than 10% for all three types of scene, while the speed-up in matching is significant. We also tried higher reduction thresholds and noticed a linear decrease in repeatability, meaning that, depending on the application, a different number of keypoints can be selected. The best results in the repeatability tests were achieved by MSER, which underpins the conclusions from the literature. Since MSER shows significant drawbacks in clustering and localization [103], we looked at combined detectors as an alternative solution. Our simulation results justified our hypothesis, since the second best is kombaf, a combination of three different detectors, which also gives a more discriminative representation of the image.

In order to localize the objects in the scene, all keypoints were described with a 128-dimensional vector of either SIFT or GLOH features. For matching we deploy simple keypoint voting, using the Euclidean distance as the measure of closeness of points. The computational complexity of such a matching is proportional to the squared number of keypoints, so a reduction of 10 times yields a significant speed-up.
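A sketch of such brute-force Euclidean matching with OpenCV is shown below; the ratio test is an addition over the plain voting described above, included here to suppress ambiguous matches, and the function name is illustrative.

import cv2

def match_votes(desc_model, desc_scene, ratio=0.8):
    """Brute-force Euclidean (L2) matching; the number of surviving
    matches serves as the vote count for the object model.
    """
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(desc_model, desc_scene, k=2)
    good = []
    for pair in matches:
        # Lowe-style ratio test: keep a match only if it is clearly
        # better than the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)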

Figure 2.4: Reduction, scene with viewpoint change

Figure 2.5: No reduction, scene with light change

Figure 2.6: Reduction, scene with light change

Figure 2.7: No reduction, scene with scale change

Figure 2.8: Reduction, scene with scale change

An example of matching an object model with a scene is shown in Figure 2.9.

2.7 Conclusion

In this chapter we proposed an algorithm for the reduction of a large set of keypoints collected using different keypoint extraction methods and their combinations. Our approach consists of a two-step filtering: spatial filtering to remove points that lie close together as a first step, and the selection of the most discriminative points with the highest information content as the second step. The overall performance of the method was tested using a standard framework for testing the quality of detectors. Our results showed that reducing the set of keypoints to only 10% of its original size leads to a less than 10% decrease in the repeatability score, while the matching speed is significantly improved. We also showed the application of this approach to localizing objects in a scene. In future work we would like to expand our feature set with more global descriptors, such as shape context and different colour and texture descriptors, and to combine that information with the information gained from the keypoints. We hope that with such a versatile approach, more precise information about the object's appearance as well as its location in the scene can be obtained.


Chapter 3

A Fast and Robust Descriptor for Multiple-view Object Recognition

Chapter modified from the article:

Maja Rudinac, Pieter Jonker: A Fast and Robust Descriptor for Multiple-view Object Recognition, International Conference on Control, Automation, Robotics and Vision (ICARCV 2010), Singapore, 7-10 December, 2010
