
Poznań University of Technology Faculty of Computing

Chair of Control and Systems Engineering Division of Signal Processing and Electronic Systems

Human-computer supporting interfaces for automatic recognition of threats

Ph.D. Dissertation

Julian Balcerek

Supervisor: Prof. dr hab. eng. Adam Dąbrowski
Auxiliary supervisor: Dr eng. Paweł Pawłowski

Poznań 2016


Contents

1. Introduction
   1.1. Research area
   1.2. Aims and scientific thesis
   1.3. Scope of the research
      1.3.1. Vision subsystem
      1.3.2. Emergency telephone subsystem
   1.4. Main scientific achievements
2. Evoking stereovision impressions
   2.1. Human binocular perception
   2.2. Selected methods of stereovision visualization
   2.3. Generation of virtual view
   2.4. Indication of image content based on depth maps
   2.5. Direct shift conversion scheme
   2.6. Removal of information gaps by horizontal resizing of image segments
   2.7. Filling of information gaps
   2.8. Experiments on human perception of 3D effect
      2.8.1. Evoking stereovision impressions from processed 2D images
      2.8.2. Fast 2D to 3D image conversion based on reduced number of controlling parameters
      2.8.3. Stereovision image quality and depth perception for proposed schemes
3. Recognition of event details in 3D video monitoring
   3.1. Experiments with untrained viewers
      3.1.1. Simplified depth maps
      3.1.2. Extended range of view
      3.1.3. Distance estimation
      3.1.4. Possibility of direct contact among people
      3.1.5. Relations between trajectories
      3.1.6. People counting
   3.2. Experiments with trained viewers
      3.2.1. People counting
      3.2.2. Counting of groups of people
4. Automated selection of information from emergency telephone conversations
   4.1. Types of recognition procedures
   4.2. Generalization of formulas
   4.3. Extension to multi-feature (multidimensional) correlations
   4.4. Learning procedure
   4.5. Classification of conversations by neural networks
5. Caller and event recognition using proposed search mechanism
   5.1. Experimental database of calls to ENC
      5.1.1. Setup for collecting emergency notifications
      5.1.2. Software for telephone operators
      5.1.3. Statistics of database of calls to ENC
   5.2. Recognition for manually selected sets of parameters
      5.2.1. Caller recognition
      5.2.2. Event recognition
   5.3. Recognition for weights of features trained using learning procedure
      5.3.1. Caller recognition
      5.3.2. Event recognition
   5.4. Recognition of abnormal cases
      5.4.1. Relations between features
      5.4.2. Scalable multidimensional relations between features
   5.5. Classification of conversations using neural networks
      5.5.1. Classification of groups of callers and events
      5.5.2. Classification of selected callers and events
6. Conclusions
7. Bibliography


1. Introduction

1.1. Research area

Urban areas are inhabited by 3.9 billion people all over the world. Moreover, in the next 30 years, the number of people living in cities is expected to rise to more than 6 billion [Hera2014]. In the EU alone, more than 350 million people live in agglomerations bigger than 5 thousand inhabitants [EurC2011]. There is a causal relationship between the large number of inhabitants per square kilometer and the density of road traffic. Inhabitants use roads to travel or to transport goods, and the more inhabitants per square kilometer, the more cars on the roads. This situation generates many issues related to the safety of people and to the safety and efficiency of transportation. For example, in Warsaw (Poland), the average travel time is ca. 40% longer than it could be with uncongested traffic [Stat2015].

Due to the high density of traffic, many undesirable communication events occur, such as collisions, accidents, and pedestrians being hit. In 2013 alone, there were approximately 26 thousand road fatalities in the EU, with more than 3 thousand in Poland. In consequence, every year in Poland a population of a small city dies due to traffic fatalities. Statistically, for every person killed in a traffic accident, there are about four permanently disabling injuries, eight serious injuries, and 50 minor injuries [EurC2013].

In addition, there are the threats of contraventions, misdemeanors, and felonies. In the EU there are approximately 24 million crimes committed per year, with more than a million in Poland [EurC2014]. Examples of these include thefts, beatings, robberies, kidnappings, drug trafficking, and terrorism.

Fig. 1.1. Typical systems for threat prevention in urban environments

Typical contemporary systems for risk prevention in urban environments operate in such a way that if a threat occurs, relevant information about the risk is collected in the form of audio-visual and text data (cf. Fig. 1.1). Automatic signal processing algorithms are often used in many typical situations. For example, detection of a dangerous motion results in immediate alarm activation. However, in more complex situations, concerning e.g. decisions on sending an ambulance, an intervention of the experienced operators of the information centers may turn out to be necessary. Hence, an operator should reliably use her/his expert knowledge for a quick and proper reaction. An adequate reaction results in the elimination of hazards.


There are two important systems for threat prevention with two different kinds of human-computer interfaces. The first are video monitoring systems and the second are systems of emergency telephone number operation. They rely on the processing of a variety of information, and the operator plays a decisive role. These systems are presented and discussed in the following paragraphs.

The human-computer interfaces developed by the author support automatic detection of threats. The focus of this thesis is put on hazards in urban environments. Problems in Poznań and Wielkopolska Region in Poland are taken under examination. However, the presented results can also be used in the whole European Union (henceforth EU).

Nowadays, one can observe a quick and intensive progress in closed-circuit television (henceforth CCTV) monitoring applications. For example, in 2002 in London there were approximately 0.5 million monitoring cameras [McCa2002]. In Poznań, the home city of the author of this dissertation, an agglomeration with about a million inhabitants, there are currently several tens of thousands of cameras, including over 500 cameras of the urban video surveillance system [Pozn2016]. This monitoring system consists not only of the recording equipment but also of the base and node stations, and is composed of more than 50 operator workplaces in 18 locations. It is prepared to support more than 1.5 thousand cameras. The information is collected statically for fixed cameras, and in the defined positions of view for pan–tilt–zoom (henceforth PTZ) cameras. Moreover, the aforementioned monitoring system infrastructure of Poznań is being quickly expanded.

Based on the location of a camera, it is more or less known what kind of information the data stream will deliver. However, due to the large amount of collected data to be analyzed [Velt2012], the demand for new visualization techniques and for novel visual analytics for the multimedia content in the collected data is huge. In consequence, the development of new supporting tools for the detection of threats using a vision system is one of the most important tasks and still an essential research problem. However, some methods, referred to as the standard approaches, have already been developed. They include e.g. detection of motion, left luggage, smoke, fire, and people crossing roads on red lights, to mention a few of them [Data2012, Cetn2012]. Other important tasks are e.g. tracking and counting of people, detection of crossing security lines, face, gender, and gait analysis, including recognition of people using various biometric parameters [Data2012, Dabr2010]. Vision systems are also used for the detection, tracking, and counting of vehicles, including recognition of vehicle types, license plate numbers, traffic congestions, prohibited maneuvers, and accidents [Chen2012, Cetn2012, Zhao2012].

It has already been proven that standard methods for automatic detection of threats substantially improve operators' work efficiency, especially for complex tasks. On the other hand, for simple tasks, manual operator observations, analyses, and decisions can even be more efficient and effective [Rank2012]. Moreover, in the current systems, distracting areas may be reduced only by zooming in on rectangular image areas, i.e. without any possibility of choosing selected areas and objects. However, with the use of nonstandard visualization techniques such as 3D quality, the monotonous, boring, arduous, and cumbersome operator job can become much more attractive. As a result, it is much easier to constantly keep focus and concentration, and to stay alert.

The second way of reacting to threats and their elimination is based on reporting by people via emergency telephone services. Since 1991, the 112 number has become a standard emergency number in the EU [OBri2013]. The same standard emergency number was officially implemented in Poland in 2009 [Supr2013] and operates in parallel with the older system, based on a set of telephone numbers associated with particular services.

Nevertheless, many technical and organizational problems are associated with the emergency telephone services and related systems. For example, a proper reaction must be based on plausible, relevant, and consistent data, despite the fact that the same event may be repeatedly reported and the persons who called might have given inaccurate or incorrect data.

In addition, the operator can make mistakes during the data entry and the analysis.

Moreover, telephone emergency services are quite often abused. For example, according to the report of the European Emergency Number Association (henceforth EENA), the proportion of false emergency calls in 2011 ranged from 15% for Luxembourg, through 43%, 53%, and 65% for the UK, Italy, and Ireland, respectively, up to 70–80% for the Netherlands, Poland, Hungary, and Romania [EENA2011]. Another organization, the Communications Committee (henceforth COCOM) of the European Commission, which monitors the introduction and the functioning of the 112 number in the Member States of the EU, reported that the proportion of false and hoax calls among all emergency calls in 2011 ranged from 1% for Estonia, through 4–30% for Germany, 16% for Finland, and 55% for Spain, up to 99% for Greece [EENA2011].

In Poznań, the provincial Emergency Notification Center (henceforth ENC) registered more than 1.5 million notifications in Q1–Q4 of 2013, and more than 75% of them were false or unjustified. This gives more than 3000 false or unjustified notifications per day. In comparison, in Q1–Q2 of 2012 there were about 1700 such false notifications per day, with almost the same proportion of false and hoax emergency calls, equal to ca. 76% [ITPC2012].

The technical and organizational infrastructure of emergency services has been steadily developed and improved over the last years. These days, the notification telephone centers are becoming integrated with the operations of the emergency services [Ziaj2011]. Many technical capabilities for supporting operators of notification centers are applied in commercially available systems [Buch2013]. For example, it is possible to identify the telephone number, contact history, and the location of a caller. The list of rescue units and position data can be visualized on a map. All the information about the notification, the operator decisions, and messages is saved, and after the completion of the rescue operation, various reports can be generated [Buch2013]. Unfortunately, the information about the location may be inaccurate and the contact history may be false due to cheating. As an illustration, the real accuracy of the GSM positioning system may vary from 30 meters to even 5 kilometers, and the latter case, i.e. when the caller location is practically unavailable, occurs in about 10% of calls [ECCC2015]. On the other hand, there is manual search, which is typically performed off-line by police investigators. Such off-line processing is costly and time-consuming.

The operator of the ENC has the ability to respond to threats. There is a problem with checking whether a response is required or not. There is also a problem with the correct selection of rescue units. An important issue is the recognition of situations reported earlier by other callers and the verification of whether the incident has already been reported or not. The second issue concerns recognizing whether the current caller has called earlier, even if it was a year or more ago. This issue is difficult because it relates to long periods of time.


1.2. Aims and scientific thesis

The scientific aim of this Ph.D. dissertation is to develop automated mechanisms for audiovisual information processing to support operators of city monitoring and emergency information centers. Moreover, the application aim is to develop respective algorithms, especially for evoking stereovision effects in city monitoring and for recognition of events and suspects (callers) in emergency telephone systems.

The scientific thesis can be formulated as follows: the developed human-computer interfaces (stereovision in video monitoring, recognition of events and callers on the basis of the telephone calls to emergency notification centers) substantially support the work of operators of information centers and thus improve safety in urban areas.

The author proved this thesis via numerous experiments on the proposed 2D to 3D image conversion schemes, real 3D images, and event/caller search/recognition mechanisms, using captured monitoring scenes and recorded emergency telephone conversations.

The work is organized as follows. In Chapter 1, the research area, the aims of the dissertation, and the scientific thesis are described. Next, the author presents the scope of his research. The problems, which are solved in this work, are introduced with reference to the state-of-the-art solutions. Then, the author's main scientific achievements are also presented.

Chapter 2 describes the mechanisms of perception, generation, and visualization of stereovision impressions by humans. The author proposes a formation of the virtual view, schemes for evoking stereovision impressions on the basis of monocular images, and solutions to the problem of information gaps in artificial 3D images. Chapter 2 also contains the experiments on human perception of the 3D effect [Balc2012, Balc2014a] and the proposed approach to fast 2D to 3D image conversion schemes based on a reduced number of controlling parameters. This conversion is dedicated to the monitoring operators. An additional experiment on the image quality and depth perception verified the proposed schemes of evoking stereovision impressions [Balc2014].

In Chapter 3 the author compares results of recognition of important threat details watched by trained (skilled) as well as untrained observers in 2D and 3D monitoring scenes.

In Chapter 4 the author proposes a mechanism of an automated selection of information from emergency telephone conversations. For improvement of the recognition quality, the recognition procedures are then extended to multidimensional dependences between features.

Moreover, the author presents a machine learning procedure for training the searching procedure and an automatic classification of conversations by artificial neural networks.

Chapter 5 is devoted to experiments on caller and event recognition using the proposed search mechanism. A specially prepared experimental collection of emergency notifications with a detailed description is presented. Sets of parameters of the search mechanism are selected and trained using the proposed learning procedure. A case study for the automatic recognition of abnormal cases with the help of relations between features is then performed [Balc2015, Balc2015a]. Finally, a series of experiments on the classification of conversation records using artificial neural networks is conducted.

In the last chapter, the author presents conclusions of the research.


1.3. Scope of the research

The typical systems for threat prevention in urban environments are presented in Section 1.1. The author proposes to extend them with mechanisms improving the workplace of the modern operator, who is overburdened with work, difficulties, and responsibilities. In this dissertation the author focuses on operators of the two most important information centers: video monitoring, assisted by stereovision effects (referred to as the vision subsystem), and emergency telephone, supported by recognition of events and callers based on the analysis of reports from the conversations (referred to as the emergency telephone subsystem).

The proposed mechanisms are illustrated in Fig. 1.2.

Fig. 1.2. Proposed systems for threat prevention in urban environments

An introduction to the two aforementioned subsystems, with reference to the state-of-the-art solutions, is given in the following subsections.

1.3.1. Vision subsystem

Stereovision is used in engineering for visualization and for data acquisition and processing by machines. Stereovision techniques imitate the mechanism of human stereoscopic vision.

In automatic control and robotics, stereovision systems are used for navigation [LinC2005] and detection of objects [Bota2009], even of such a specific object as a road surface [Wang2013].

Most images and sequences of images are still recorded using 2D techniques, i.e., using monocular devices. Professional stereovision (3D) cameras are expensive and thus used rather rarely. In some situations, it is even impossible to use stereo recording, as in the special movie effect called "the forced perspective", an optical illusion with an object appearing closer to or farther away from the camera and perceived as larger or smaller than it really is. In some other situations, a typically recorded stereo effect is weakly visible. This occurs, for example, in the case of distant objects in the scene, due to too small a stereo separation between the left and the right stereo images. The camera lenses are typically placed too near to each other for far recorded objects.


In the case of 2D images, the same scene views are provided to both human eyes.

Nevertheless, humans have an ability to obtain some information about the depth in the scene even from this 2D information. The depth is conveyed by characteristic features such as perspective, occlusions, sizes of objects, textures, lights, and shadows. In the case of 3D images, human depth perception is strongly increased through the differences between the left and the right eye view. Thus, the 3D effect provides the depth as a new intrinsic image feature, which is salient and considerable. The depth from the viewer to objects in an image, estimated by the human brain based on features such as occlusions, object sizes, and light, and the depth calculated from disparity may differ. Even when a conflict between these different depth estimates occurs, the human visual system is able to correct the perception of a 3D scene [Huyn2011].

After some time, higher cognitive load may cause visual discomfort to the viewer, manifested by eye strain, and, in extreme cases, headache or nausea [Huyn2011]. Thus, the 3D monitoring visualization with a high range depth effect should be an optional tool in situations that demand careful attention.

There is an initial study on the impact of stereovision on the perception of the sizes of objects. The ability of subjects to match an object to the correct size is better under stereo than under mono conditions [LuoK2007]. There is a demand for conducting experiments on recognition of details of situations by the viewer, for example, counting people or estimating distances.

One of the major requirements for vision-monitoring operators is the ability to stay concentrated. The presence of two video channels and the information about disparity is influential and changes the deployment of human visual attention and attentive behavior.

Increasing the depth effect is a way of increasing the sensation of presence and immersion, which is one of the main targets and advantages of modern 3D television [Huyn2011]. It allows focusing attention in the best possible way. A strong 3D effect with well-seen values of perceived depth is a way to enhance the operator's concentration. It may allow the operator to have a sensation of being a participant in the observed situation. This personal attitude to the observed incident may be investigated as a facilitation of the allocation of the operator's attention, the recognition of unsafe or suspicious behavior in the CCTV video, and the processing of information in order to make the best possible decision under given conditions.

There is no need to replace the existing mono-cameras by stereo-cameras because the conversion from monoscopic to stereoscopic sequences of images can be executed.

Commercial television methods use effects-oriented conversion and are not appropriate for monitoring systems. Moreover, typical stereoscopic rendering results in a high computational complexity and the load on the graphics engine [Bart2008]. Thus, an efficient 2D to 3D conversion is an important issue and there is an urgent demand for the preparation of a simple and effective (i.e. real-time) tool for the 2D to 3D image or sequence of images conversion, with some control of the perceived depth [TamZ2006, KimD2007, LiWa2010, Chan2012].

The stereovision technique for the video monitoring operator may be accompanied by and complementary to other image processing methods. Among them are noise removal from digital images or high dynamic range (HDR) processing for the enhancement of details of images obtained in difficult lighting conditions [Koni2014, Koni2016].

The 3D effect can be visualized for a human with various types of 3D displays [Urey2011]. The application of 3D visualizations in security systems using portable head mounted displays (henceforth HMD) to improve the presentation of the user's information is mentioned in [Garc2007], but these components were used only to get access to the environment information added by the system. Taking into account the simplicity of visualization for experiments, which is adequate for the easy and approximate approach, the anaglyph technique can be considered [Dubo2001, Gall2010]. Results obtained for anaglyph-based imaging are reported and may be transferred to other visualization techniques like, for instance, autostereoscopic screens. Using autostereoscopy, viewers do not need to use any special headgear or glasses, but parallax barrier or lenticular-based displays are required [Urey2011].

1.3.2. Emergency telephone subsystem

Although emergency telephone systems are very commonly used, the details of their advanced processing procedures are not public. The only published examples of intelligent ENC systems, to the knowledge of the author of this dissertation, were presented in [Klem2009, Witk2014, Grzy2014, Galk2015]. In the first system [Klem2009] the experiments were conducted using a database of 25 thousand calls in order to find the notifications related to the hurricane Emma, but only three original features of calls were analyzed. The second system [Grzy2014] was designed for the identification of callers dialing the emergency number using various voice features. About 20 hours of real conversations from the Małopolska Region (Poland) were stored. In this system the identification procedure uses voice characteristics only.

An accurate speaker and/or event recognition requires selection, pre-processing, and finally the comparison of the relevant information taken from the available data sources. For example, the recognition of a set of similar calls should be based on similarities of various but relevant data (even multimedia data like audio, video, and text) among conversations. A clear data description for the recognition purpose should include the content description using metadata [Reve2008, Rawa2010, Sici2014]. All object representations should be noted and possibly the best representations for modeling the objects should be chosen [Krie2006, Rash2009]. For example, a mechanism for searching personal information using multimedia metadata was described in [Koji2013]. The metadata was divided into three levels of importance, namely attributes of files, content headers (e.g. a name), and content meaning (understood as the information extracted using image processing and sound analysis).

Typically, notifications are registered in the ENC. The witness or participant of an event calls the ENC and informs the telephone operator about what has happened. Possible kinds of information sources include a report written by the operator, a digital call recorder, the provider of telephone line services, transcriptions, annotations, and results from multi-level sound analysis. The last three are usually produced off-line by humans or machines, but must be performed in a limited time, because the ENC can store the calls for a prescribed period of time only [Dabr2012, Drga2015].

Typically, during the emergency telephone notification, the operator asks a series of questions in order to obtain the information necessary to proceed with the case. According to the procedures of the Polish Police Department, the main questions that should be asked by the officer are presented in Table 1.1. Depending on the answers to these questions, many crucial features, which describe the notification, are directly obtained. They are also presented in the right column of Table 1.1.

From the police officers' point of view, possible kinds of events are intervention, traffic, or crime. Each kind of event may be divided into several categories. The intervention event can fall into two categories, namely home and public. The traffic kind can be classified into a pedestrian crash, a collision, and an accident. The last one, the crime, can be divided into falsification and fraud, punishable threats, theft, pickpocketing, robbery and bodily injury, drug trafficking, and other offences. Accurate information about the kind of the event, together with the event localization, is a crucial issue, which implies sending appropriate emergency services.

Table 1.1. Main questions in the Polish Police operator procedure with related features

  Question                            Extracted feature
  What happened?                      Type of event, event category
  How did this happen?
  What is the impact of the event?
  Where did it happen?                Event location (city, street, house and staircase
                                      or apartment number)
  When did this happen?               Time of event
  What are your personal data?        Name, surname, and address data of the caller (city,
                                      street, house and staircase or apartment number,
                                      and zip code)
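For illustration only, the event taxonomy described above can be written down as a simple nested structure; the identifier names below are assumptions made for this sketch and do not come from the operator software described in the dissertation.

```python
# Hypothetical, minimal encoding of the event taxonomy from Section 1.3.2.
EVENT_TAXONOMY = {
    "intervention": ["home", "public"],
    "traffic": ["pedestrian crash", "collision", "accident"],
    "crime": ["falsification and fraud", "punishable threats", "theft",
              "pickpocketing", "robbery and bodily injury",
              "drug trafficking", "other offences"],
}

def event_categories(kind):
    """Return the categories defined for a given kind of event."""
    return EVENT_TAXONOMY.get(kind, [])
```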

The operator, who is receiving the emergency telephone calls, can also add some information to the emergency notification database. The telephone operator can recognize some characteristic features of the acoustic background and the voice of the calling person, for example: gender, age, pace of speech, linguistic errors, foreign accent, stuttering, word repetitions, hoarseness, filled intervals, and logical errors in the caller's speech. These data can be useful to support the recognition of the conversations, including recognition of the voices of possibly the same person. Some data, entered manually by the operator, may alternatively be calculated by multi-level automatic tools.

The gender of the speaker is one of the most distinguishing features to support the recognition mechanism, because it can significantly reduce the number of suspects. People are more likely to determine the gender correctly for adults than for children [Trau1997]. Gender can be classified automatically based on the speaker's speech with a high accuracy rate [Alsu2011, Alsu2012, Ichi2010, Nguy2010, Abdo2009]. Information about gender may be used to select an appropriate person model in automatic speaker recognition and may decrease the time of recognition [Drga2015, Ichi2010]. Bearing in mind the problem of gender determination, the three values of the gender parameter available to the operator are: female, male, or unspecified (unrecognized).

Another feature, namely the age of the speaker, can be judged by most listeners, but in some cases the estimation error may be higher than ten years [Scho2001]. On the other hand, the rate of the automatic classification of age into just three groups: young, middle, and old is high [Nguy2010]. For more accurate results, the classification of age may be extended to four groups: child, young, adult, senior, or undefined (unrecognized).

Speakers may also be distinguished by the minimum and the maximum speed of speech. The pace of speaking may enhance automatic speaker identification accuracy [Mirz2007]. For the operator's judgment the following speech pace values may be chosen: slow, medium, quick, and undefined.

The characteristic features of the caller's speech like the occurrence of linguistic errors, foreign accent, stuttering, word repetitions, hoarseness, filled intervals, and logical errors also seem to be important for comparison and the discrimination of call recordings. The data should contain the information about a particular feature occurrence.

In [Grzy2014] each call was tagged with several voice features. These features were e.g. gender, age class, speech rate, emotional state, acoustic background, and conversation style. The authors proposed some rules, understood as the possibility of occurrence of a particular feature value in the case of occurrence of other feature values. These rules were intended to be used not for recognition of a caller but for prediction of other features which could not be detected automatically. Unfortunately, the rules were not established on real statistics but on a limited database content only.

The additional information about the acoustic background is very important for reliable speaker and speech recognition. It allows choosing an adequate audio processing method and increasing the effectiveness of the reduction of noise or distortion effects [Dabr2013]. The reduction of noise also improves the performance of recognition [AhnK2005]. The acoustic backgrounds of emergency conversations may be divided into the following categories: silence, forest, sea, other conversations, breaking glass, street noise, interior of a car, interior of a rail vehicle, shot, or undefined.

In fact, the judgments made by an operator are not fully objective. Moreover, entering a lot of data is time-consuming and there might not be enough time to write every single detail down between successive calls. If the telephone operators try to do their job faster, they may make some errors. Thus, there is a demand for a hint system implemented in software. A default selection list should be embedded into the graphical user interface (henceforth GUI) and available while entering the values into the questionnaire.

The hardware equipment of the ENC can also support the recognition procedures. A digital call recorder and telephone line services provide additional information concerning the recorded calls. The telephone number of the caller, the date, the time, and the duration of the conversation are usually entered automatically. Additional information such as audio files and transcription files may also be stored in the described emergency telephone database [Dabr2012].

Data comparison may be executed using both metric and non-metric spaces [Zezu2006, Chen2008]. In the first case, the most similar objects in the analyzed set are recognized by the distances between the reference object and the compared objects [Alla2007]. The distances in the metric spaces may be considered for binary, numerical, and term-based data classes [Alla2007, Gant1999]. For example, speakers may be recognized using spectral features (numeric metrics) [Beig2011] or various multilevel features (term-based metrics) [Drga2015].

Semantic description of context may also be necessary for efficient data comparison and recognition [Hyou2011]. Appropriately weighted functions can also be included in order to represent the importance of various features for all object representations [Krie2006]. In intelligent recognition systems, the analyzed objects should be arranged using a number of similarity descriptors, and the largest scoring numbers should be assigned to those objects which are the most similar to the reference object [Berr2003]. It is possible to divide the recognized events into sub-events and assign separate metadata to them. The identification of sub-events based on texts, audio files, images, and videos in order to support the management of crisis situations was described in [Pohl2012]. The metadata for the event name, the description, and the tags differ in the level of importance and thus sub-events have to be detected. However, due to high computational complexity, this system does not operate in real time. Nevertheless, modern cloud computing technologies and/or large data centers can be used to achieve the required high computational efficiency. Object search in the cloud using a metadata model is described in [Imra2013]. It is possible to select types of documents, sizes of data, owners, and custom descriptions. However, cloud computing should rather be avoided in the emergency telephone system due to the confidential character of the processed information.

In the recognition system designed for the ENC and described in [Klem2009], the Kohonen Self-Organizing Map was used to recognize notifications of only one threat event. Features such as the time and the class of a call, together with the region of an incident, were taken into account. Other features, calculated with the use of the introduced features, became derivative characteristics.

Concluding, the comparison of calls based on the aforementioned features, performed in order to recognize the caller or the event, is not a trivial task. Furthermore, even if the features have different values, it may not mean that they are related to different persons. For example, the age changes over the years, and consequently the age range may change, e.g. from young to adult.

1.4. Main scientific achievements

The main scientific achievements of the author are several innovative modifications of the two most important threat prevention systems, namely CCTV monitoring and emergency telephone, regarding the proper presentation of the information to the operator. They include the stereovision visualization for the video monitoring operator and the mechanism of events and people recognition for the operator of emergency telephone system.

The well-known commercial stereovision methods, used e.g. in television, cannot be directly used for monitoring, as they are too strongly oriented toward evoking spectacular 3D effects.

In contrast, the stereovision methods for monitoring systems should offer plausible and rather smooth 3D effects. Moreover, their intensity should be entirely under the operator’s control.

Thus, the research and experiments were primarily concentrated on the study of perception of the stereovision effect.

The approach to the generation of the 3D impressions from source 2D images, together with the realization schemes of the 2D to 3D conversion, is described in [Balc2008, Balc2011b, Balc2012]. The functional relations between parameters that control the 3D effect allow reducing the number of those parameters [Dabr2012b]. The author's conversion schemes enhance the depth range with acceptable quality [Balc2013, Balc2014]. The simple and effective (i.e. real-time) tool for the 2D to 3D conversion of images or sequences of images, based on simplified depth maps, was implemented and tested [Balc2014a]. The experiments on the perception of quality, depth, and the recognition of important details in selected 3D scenes taken from the CCTV, and the comparison to 2D scenes, were performed. Image processing methods used for the 2D to 3D conversion offer an approximated 3D effect only.

However, they may be sufficient when considering people's visual perception abilities. In the video monitoring system, the proposed approach to stereovision presents the selected parts of the scene closer to the operator with a balanced 3D effect. The distance to the observed object decreases, while the object itself is not disturbed. The author's experiments confirm that the perception of details is fully preserved and the 3D enhancement of a selected suspicious object can be regarded as a distinguishing marker.

The author expects that, due to the 3D quality, detailed inspection tasks such as e.g. people counting or estimating the distances between objects should be faster and more exact. These advantages of the stereovision visualization quality are fully confirmed by the author's experiments.

Taking into account all aspects related to the emergency call services, the author proposes the recognition engine that may support ENCs in the recognition of threats. The recognition of threats in the system of emergency telephone number includes verification whether the event had already been reported and the recognition of events reported earlier by the particular caller.

Due to legal restrictions, the author, in cooperation with the staff and students of Poznań University of Technology (henceforth PUT), prepared an experimental collection of emergency conversations. The idea was to represent, as closely as possible, a sample of a real database of emergency calls [Dabr2012]. Before executing the call, the caller was watching a movie, which was randomly chosen from a set of previously prepared short movies containing various crime scenes. The speaker's task was to report the case to the operator. The collected database [Balc2015a] contains 669 conversations (sound files and metadata) with a total length of more than 21 hours. The conversations, as in a real ENC database, differ in length, annotation quality, data reliability, and completeness. This database allows conducting a series of experiments on caller and event recognition, as well as speaker recognition.

Relevant conversation data was chosen and analyzed [Balc2009]. The method of comparing the features describing the calls was introduced. Additional dependences between the features, referred to as the correlations, which can increase the efficiency of the system, were also defined [Balc2010].

A suite of the proposed procedures contains binary, numerical, term-based, and correlation-based comparisons. In all types of comparisons the feature values of the record, which is newly added to the database, are compared with the suitable feature values of all other records, which are already stored in the database. During the recognition process, the global score, computed as a sum of weighted partial matching scores, is calculated for each conversation and displayed as a percentage of matching with the reference conversation. This matching percentage determines the position of the call record on the ranking list of results. The best results of the recognition are those with the highest matching scores. The area of the operator's exploration is narrower and the threat is properly indicated and identified. The resulting high reliability of the system was presented in [Balc2011a].
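To make the scoring idea concrete, the sketch below shows one possible way (with entirely hypothetical feature names, weights, and values, not taken from the dissertation or its database) of combining binary, term-based, and numerical partial matching scores into a weighted global score expressed as a percentage of matching.

```python
def partial_score(feature_type, reference, candidate, max_diff=1.0):
    """Partial matching score in [0, 1] for a single feature."""
    if reference is None or candidate is None:
        return 0.0                       # missing data gives no evidence of a match
    if feature_type in ("binary", "term"):
        return 1.0 if reference == candidate else 0.0
    if feature_type == "numeric":        # closer values score higher
        return max(0.0, 1.0 - abs(reference - candidate) / max_diff)
    raise ValueError(f"unknown feature type: {feature_type}")


def global_score(reference_call, candidate_call, features):
    """Weighted sum of partial scores, reported as a percentage of matching."""
    total, weight_sum = 0.0, 0.0
    for name, ftype, weight, max_diff in features:
        total += weight * partial_score(ftype, reference_call.get(name),
                                        candidate_call.get(name), max_diff)
        weight_sum += weight
    return 100.0 * total / weight_sum if weight_sum else 0.0


# Hypothetical feature set: (name, type, weight, max numeric difference).
FEATURES = [
    ("gender", "term", 2.0, None),
    ("age_group", "term", 1.0, None),
    ("stuttering", "binary", 1.5, None),
    ("speech_pace", "term", 1.0, None),
    ("call_duration_s", "numeric", 0.5, 300.0),
]

reference = {"gender": "male", "age_group": "adult", "stuttering": True,
             "speech_pace": "quick", "call_duration_s": 95.0}
candidate = {"gender": "male", "age_group": "young", "stuttering": True,
             "speech_pace": "quick", "call_duration_s": 120.0}

print(f"matching: {global_score(reference, candidate, FEATURES):.1f}%")
```

In the actual system the weights of the features are either selected manually or trained with the learning procedure described later in the dissertation.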


Additional extensions, like a graphical interpretation of various types of the search procedures, similar to image processing techniques, were presented in [Balc2015, Balc2015a]. The author also proposed the extension of this relational mechanism by introducing a multidimensional space of features. All previous formulas were generalized in order to use any type and/or any number of the dependent features for the comparison. Such an approach provides a more detailed description of events and improves the recognition results. Moreover, it allows fluent scaling of the recognition accuracy, depending on a given case and the availability of details from the database of emergency notifications.

The efficiency of classification by the artificial neural network was also examined and the best results were obtained on the limited subset of records.

Concluding, the proposed methods will increase efficiency of work and enhance functionality of the workplace of modern CCTV and ENC operators.


2. Evoking stereovision impressions

2.1. Human binocular perception

A human has the ability to perceive two provided channels of images in the form of one 3D image with specific depth values. There are some differences in viewpoints between the two 2D images observed separately by the left and the right eye simultaneously. The same real world scene is observed from two different viewing angles, which corresponds to the distance between the human eyes. The average distance between human eyes is about 65 mm [Cell2013].

The information from the two channels of the visual signal is provided to the brain and interpreted as a three-dimensional (3D) scene (Fig. 2.1), i.e., as a collection of objects at different distances from the viewer [Onur2007]. The 3D shape is extracted from disparity in the human parietal cortex [Dura2009]. Thus, for generating the stereoscopic impressions, it is necessary to provide two different views, one for the left and one for the right eye.

Fig. 2.1. 3D scene perception based on information from particular eye views


The human brain allows people to perceive the 3D scene with the complete information from both eyes. As an example, two markers p1 and p2 are added to the stereoscopic image (Fig. 2.2). The first marker p1 is put in a scene region of the left eye view which is not seen by the right eye. The second marker p2 is put in a scene region of the right eye view which is not seen by the left eye. The 3D effect for the image is clearly and strongly seen. All markers are seen although they are not in the common left and right eye view area [Balc2011b].

Fig. 2.2. Perception of markers put in particular eye views

On the other hand, 3D impressions are still strongly seen even if noise between the two images occurs or there are regions of images without visible matches in the two views [IpDo2012].

2.2. Selected methods of stereovision visualization

Stereoscopic displays provide separate images for the left and the right eye. These images are needed for fusion and depth-based reconstruction of the 3D scene by the human brain. There are technologies which require eyewear and technologies which do not. Technologies with an eyewear requirement are direct-view methods based on color-, polarization-, or time-multiplexed approaches, and head-mounted displays. Autostereoscopic direct-view displays provide stereoscopic images without the need for any eyewear [Urey2011].

Taking into account the simplicity, availability, and usefulness of methods based on the color-multiplexed approach and of autostereoscopic displays, the author chose them for the experiments.

The anaglyph method is based on the color-multiplexed approach. An interesting feature of human 3D perception is that, in spite of some incomplete or even – to some extent – inconsistent information, evoking the three-dimensional illusion is still possible. For instance, with three color components such as RGB (red, green, blue) we can model almost the entire visible color space. However, even if one eye (e.g., the left one) receives only a single (e.g. the red) component of the left eye image, and the right eye sees the two remaining components (i.e. green and blue) of the right eye image, the human brain perceives not only a 3D scene but sees it in quasi-true colors [Dumb1992, Dubo2001].


This kind of the 3D effect can be realized by observing specially prepared flat images referred to as the anaglyph images with, for example, typical red and cyan filter glasses. The red glass removes blue and green components from the left eye view while the cyan glass filters out the red component from the right eye view (Fig. 2.3).

Fig. 2.3. Example of a 3D visualization based on 2D anaglyph image and color filter glasses for red (R), green (G), and blue (B) color components

The anaglyph method is simple but offers only low quality of color reproduction. Stereovision image projection for an audience is easy because there is no need for any additional equipment except passive color filter glasses. Notice that the dependences and parameters of the evoked 3D effect obtained for the color-multiplexed approach may be transferred to other methods of visualization.
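As an illustration of the color-multiplexed principle described above, a minimal sketch (assuming the stereo pair is given as H x W x 3 NumPy arrays; the function name is introduced here only for the example) could compose a red–cyan anaglyph as follows.

```python
import numpy as np

def make_red_cyan_anaglyph(left_rgb, right_rgb):
    """Compose a red-cyan anaglyph from a stereo pair (H x W x 3 uint8 arrays).

    The red channel is taken from the left-eye view and the green and blue
    channels from the right-eye view, so that red/cyan filter glasses route
    each view to the intended eye.
    """
    if left_rgb.shape != right_rgb.shape:
        raise ValueError("left and right views must have identical shapes")
    anaglyph = right_rgb.copy()              # keep G and B from the right view
    anaglyph[..., 0] = left_rgb[..., 0]      # take R from the left view
    return anaglyph
```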

Autostereoscopic displays do not require wearing any user-mounted devices like glasses and provide separate images in the left and right eye viewing zones. The autostereoscopic system may control and form the left and right eye view image zones using, e.g., parallax barrier or lenticular array components.

In the following paragraphs the author presents basic mathematical dependences describing autostereoscopic displays.

The first presented autostereoscopic method of 3D visualization is based on the parallax barrier. In displays, the parallax barrier is used to block the left display pixels from the right eye view and the right display pixels from the left eye view [Kong2011]. The arrangement of the display pixels and the parallax barrier is shown in Figs 2.4, 2.5, and 2.6. The presented geometry of the autostereoscopic visualization with the parallax barrier is based on [Kong2011].


Fig. 2.4. General idea of the 3D visualization based on the display with the left and right eye view pixels and the parallax barrier

There is the assumption that the viewed left pixel is visible at the center of the left eye viewing window and the viewed right pixel is visible at the center of the right eye viewing window. It means that the pair of corresponding left and right view pixels on the display is visible at the centers of both eye viewing windows [Kong2011].

In Figs 2.5 and 2.6 the parallax barrier pitch b is the distance between two consecutive centers of the parallax barrier holes, i is the display pitch, i.e., the distance between the centers of two consecutive display pixels, g is the distance between the parallax barrier and the display, Zv is the distance between the display pixels and the viewer's eyes, and e is the eye separation distance.

In Fig. 2.5 two triangles ABC and ADE are similar, because every angle of ABC triangle has the same measure as the corresponding angle in the ADE triangle.


Fig. 2.5. Geometry of the autostereoscopic visualization based on the parallax barrier

Corresponding sides of the two similar triangles ABC and ADE have lengths that are in the same proportion:

$$\frac{AE}{AC} = \frac{ED}{BC}. \qquad (2.1)$$

Using distances illustrated in Fig. 2.5, the relationship (2.1) can be rewritten as

$$\frac{i/2}{g} = \frac{e/2}{Z_v - g}. \qquad (2.2)$$

The formula (2.2) can be transformed to the form

$$Z_v = \frac{g\,(i + e)}{i}. \qquad (2.3)$$

The formula (2.2) can also be transformed to the form

$$g = \frac{i\,Z_v}{i + e}. \qquad (2.4)$$

As a further consideration, in Fig. 2.6 two triangles JFG and JIH are similar, because every angle of the JFG triangle has the same measure as the corresponding angle in the JIH triangle.

Fig. 2.6. Geometry of the autostereoscopic visualization based on the parallax barrier, illustrated for the derivation of the formula for the barrier pitch b

Corresponding sides of the two similar triangles JFG and JIH have lengths that are in the same proportion:

$$\frac{FG}{FJ} = \frac{IH}{IJ}. \qquad (2.5)$$

Using distances illustrated in Fig. 2.6, the relationship (2.5) can be rewritten as

$$\frac{i}{Z_v} = \frac{b/2}{Z_v - g}. \qquad (2.6)$$

The formula (2.6) can be transformed to the form

$$b = \frac{2\,i\,(Z_v - g)}{Z_v}. \qquad (2.7)$$

The formula (2.6) can also be transformed to the form

$$g = Z_v - \frac{b\,Z_v}{2\,i}. \qquad (2.8)$$

There is an assumption that the distance values b, i, g, e, and Zv are identical in Fig. 2.5 and Fig. 2.6. Thus, formulas (2.4) and (2.8) can be compared as follows:

$$\frac{i\,Z_v}{i + e} = Z_v - \frac{b\,Z_v}{2\,i}. \qquad (2.9)$$

The formula (2.9) can be transformed to the form

$$b = \frac{2\,e\,i}{e + i}. \qquad (2.10)$$

The formula (2.9) can also be transformed to the form

$$e = \frac{b\,i}{2\,i - b}. \qquad (2.11)$$

Formulas (2.7) and (2.10) can be compared as follows:

$$\frac{2\,i\,(Z_v - g)}{Z_v} = \frac{2\,e\,i}{e + i}. \qquad (2.12)$$

The expression (2.12) can be transformed to the form

$$e = \frac{i\,(Z_v - g)}{g}. \qquad (2.13)$$

Formulas (2.11) and (2.13) are equal:

$$\frac{b\,i}{2\,i - b} = \frac{i\,(Z_v - g)}{g}. \qquad (2.14)$$

The expression (2.14) can be transformed to the form

$$b = 2\,i\left(1 - \frac{g}{Z_v}\right). \qquad (2.15)$$

The expression (2.14) can also be transformed to the form

$$i = \frac{b}{2\left(1 - \dfrac{g}{Z_v}\right)}. \qquad (2.16)$$

According to formula (2.15), increasing the distance Zv between the display pixels and the viewer's eyes increases the value of the parallax barrier pitch b, whereas decreasing Zv decreases the value of b.

According to formula (2.16), increasing the distance Zv between the display pixels and the viewer's eyes decreases the value of the display pitch i, whereas decreasing Zv increases the display pitch i.

For more precise calculations, there is a need to describe two different isotropic media: one between the display pixels and the parallax barrier, and one between the parallax barrier and the human eyes. It can be glass, air, or water. From Snell's law [Smit2000], the relationship between the angles of incidence and refraction of light is described as

$$n_1 \sin\theta_1 = n_2 \sin\theta_2. \qquad (2.17)$$

In the case of autostereoscopic displays based on the parallax barrier:
– n1 is the refractive index of the medium between the display pixels and the parallax barrier,
– n2 is the refractive index of the medium between the parallax barrier and the human eyes,
– θ1 is the angle measured from the normal of the boundary for the light beam in the medium between the display pixels and the parallax barrier,
– θ2 is the angle measured from the normal of the boundary for the light beam in the medium between the parallax barrier and the human eyes.

For the relatively small angle θ1, being the angle between line segments AE and AD in Fig. 2.5, it can be approximately written that

$$\sin\theta_1 = \frac{i/2}{g}. \qquad (2.18)$$

For the relatively small angle θ2, being the angle between line segments AC and AB in Fig. 2.5, it can be approximately written that

$$\sin\theta_2 = \frac{e/2}{Z_v - g}. \qquad (2.19)$$

As the result of the comparison of formulas (2.17), (2.18), and (2.19), it can be written that

$$n_1\,\frac{i}{2\,g} = n_2\,\frac{e}{2\,(Z_v - g)}. \qquad (2.20)$$

The expression (2.20) can be transformed to the form, which is also the extended version of the expression (2.13),

$$e = \frac{n_1\, i\,(Z_v - g)}{n_2\, g}. \qquad (2.21)$$

For air, which is the typical medium between the parallax barrier and the human eyes, the refractive index n2 is equal to 1.000293 and can be rounded to 1. Thus, the expression (2.21) can be rewritten in the form

$$e = \frac{n_1\, i\,(Z_v - g)}{g}. \qquad (2.22)$$

Formula (2.22) describes the dependences between the display pitch, i.e., the distance between the centers of two consecutive display pixels (variable i), the distance between the display pixels and the viewer's eyes (variable Zv), and the distance between the parallax barrier and the display (variable g) for autostereoscopic visualization based on the parallax barrier.
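A short numerical sketch of these dependences, with assumed example values only (eye separation e = 65 mm, display pitch i = 0.1 mm, viewing distance Zv = 600 mm, refractive index n1 = 1.5), could look as follows; the barrier pitch uses formula (2.10) and the barrier-to-display gap is obtained by solving (2.22) for g.

```python
# Assumed example values only (not from the dissertation): display pitch i = 0.1 mm,
# eye separation e = 65 mm, viewing distance Zv = 600 mm, barrier substrate index n1 = 1.5.
e, i, Zv, n1 = 65.0, 0.1, 600.0, 1.5        # millimetres; n1 is dimensionless

b = 2 * e * i / (e + i)                      # barrier pitch from (2.10), single-medium case
g = n1 * i * Zv / (e + n1 * i)               # barrier-to-display gap, from (2.22) solved for g

print(f"barrier pitch b = {b:.4f} mm")       # slightly below 2i = 0.2 mm
print(f"barrier gap   g = {g:.3f} mm")       # about 1.4 mm for these values
```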

The second presented autostereoscopic method of 3D visualization is based on the lenticular lens. The autostereoscopic lenticular system is based on the array of cylindrical lenses. Cylindrical lenses form an array of line images of the object at a focal distance [Cell2013]. The light beam generates left and right eye viewing zones [Urey2011]. The simplified geometry of the lenticular autostereoscopic visualization is illustrated in Fig. 2.7.

In Fig. 2.7 two triangles are similar, OKL and OMN, because every angle of the OKL triangle has the same measure as the corresponding angle in the OMN triangle.

Fig. 2.7. Geometry of the lenticular autostereoscopic visualization based on [Kong2011]

In Fig. 2.7 the focal length is denoted as f, i is the display pitch, i.e., the distance between the centers of two consecutive display pixels, l is the lenticular pitch, i.e., the distance between the centers of two lenses, Zv is the distance between the display pixels and the viewer's eyes, and e is the eye separation distance.

There is the assumption that corresponding sides of the two similar triangles OKL and OMN have lengths that are in the same proportion:

$$\frac{KL}{KO} = \frac{MN}{MO}. \qquad (2.23)$$

Using distances illustrated in Fig. 2.7, the expression (2.23) can be rewritten as

$$\frac{i}{Z_v} = \frac{l/2}{Z_v - f}. \qquad (2.24)$$

The expression (2.24) can be rewritten in the form [Kong2011]

$$l = \frac{2\,i\,(Z_v - f)}{Z_v}. \qquad (2.25)$$

The focal length equation, known as the lensmaker's equation, for the refractive index of the lens material denoted as n1, the air refractive index rounded to 1, the thickness of the lens denoted as t, and the radiuses of curvature denoted as R1 and R2, is defined as follows [Smit2000]:

$$\frac{1}{f} = (n_1 - 1)\left(\frac{1}{R_1} - \frac{1}{R_2} + \frac{(n_1 - 1)\,t}{n_1\,R_1\,R_2}\right). \qquad (2.26)$$

In the case of a lenticular lens, the plano-convex lens case occurs. A plane surface has an infinite radius of curvature (R2), so equation (2.26) can be rewritten, using only one radius of curvature denoted as R, as

$$\frac{1}{f} = (n_1 - 1)\,\frac{1}{R}. \qquad (2.27)$$

Thus, the focal length is described by the following equation:

$$f = \frac{R}{n_1 - 1}. \qquad (2.28)$$

As the result of the comparison of formulas (2.25) and (2.28), the following equation, where R is the radius of curvature, can be written:

$$l = \frac{2\,i\left(Z_v - \dfrac{R}{n_1 - 1}\right)}{Z_v}. \qquad (2.29)$$

The formula (2.29) describes dependences between lenticular pitch (l), display pitch (i), radius of curvature (R) and the distance between display pixels and viewer eyes (Zv) for lenticular autostereoscopic displays.
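Analogously, formulas (2.28) and (2.29) can be evaluated numerically for assumed example lens parameters (n1 = 1.5, R = 2 mm, i = 0.1 mm, Zv = 600 mm); the values are illustrative only.

```python
# Assumed example values only (not from the dissertation).
n1, R, i, Zv = 1.5, 2.0, 0.1, 600.0          # refractive index; R, i, Zv in millimetres

f = R / (n1 - 1)                              # focal length from (2.28)
l = 2 * i * (Zv - f) / Zv                     # lenticular pitch from (2.29) (equivalently (2.25))

print(f"focal length     f = {f:.2f} mm")     # 4.00 mm
print(f"lenticular pitch l = {l:.4f} mm")     # slightly below 2i = 0.2 mm
```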


In Fig. 2.8 an example of a standard, commercially available digital stereoscopic camera with a color lenticular display is presented. The 3.5-inch autostereoscopic LCD display of the FujiFilm FinePix Real 3D W3 camera was used for previewing the captured photographs.

Fig. 2.8. FujiFilm FinePix Real 3D W3 as an example of a digital stereoscopic camera: (a) color lenticular display, (b) attached to a tripod

2.3. Generation of virtual view

In the case of the 2D to 3D conversion only a single view is available. From this view, the two camera views, i.e. for left and for right eye are generated. Generation of virtual left and right images from one center image and depth information using parallel configuration of cameras is described in [Flor2005, Lian2005]. There was an assumption that cl is a viewpoint of left eye, cc is a viewpoint of central image, cr is a viewpoint of right eye, and tx is a distance between left and right camera (cf. Fig. 2.9). Point p, with depth Z, of the real world scene is projected into image plane of three cameras with focal length f at pixel with horizontal coordinate shift xl for left image, xc for central image and xr for right image [Lian2005]. In Fig. 2.9 the distance value a is an auxiliary variable, introduced by the author, just to determine the relationship between variables.

For possibly low conversion complexity, the author assumed that the original 2D image will not be the center image but the right eye image. In consequence, the processing of only one image of the stereo pair is required, which is much easier and in most cases also more effective than transforming both images. This approach also offers better quality, sharpness, and plausibility of the 3D effect [Stel2000]. It was assumed that the virtual left image is generated directly from the input image.


Fig. 2.9. Parallel configuration of cameras used for 2D to 3D conversion and depth enhancement, based on illustration in [Lian2005]

From the affine transformation, the following expressions can be written:

$$\frac{x_l}{f} = \frac{t_x - a}{Z}, \qquad (2.30)$$

$$\frac{x_c}{f} = \frac{t_x/2 - a}{Z}, \qquad (2.31)$$

and

$$\frac{x_r}{f} = \frac{-a}{Z}. \qquad (2.32)$$

The expression (2.30) can be transformed to the form

$$a = t_x - \frac{x_l\, Z}{f}. \qquad (2.33)$$

The expression (2.32) can be transformed to the form

$$a = -\frac{x_r\, Z}{f}. \qquad (2.34)$$

Formulas (2.33) and (2.34) are equal:

$$t_x - \frac{x_l\, Z}{f} = -\frac{x_r\, Z}{f}. \qquad (2.35)$$

The expression (2.35) can be transformed to the form

$$x_l = x_r + \frac{f\, t_x}{Z}. \qquad (2.36)$$

Formula (2.36) presents the dependence between the horizontal coordinates of the corresponding left and right image pixels. If the depth value Z increases, then the value xl decreases. Thus, for an increasing distance between the viewpoint and the object, the appropriate pixel in the left image is shifted left. Conversely, if the depth value Z decreases, then the value xl increases, and for a decreasing distance between the viewpoint and the object, the appropriate pixel in the left image is shifted right. For the assumed parallel configuration, changes in the distances between the viewpoint and the scene objects are projected as left and right horizontal shifts of pixels on the image plane.

In other words, for the presented camera configuration, a horizontal pixel shift to the right (in the left view) has the effect of decreasing the distance from the scene point to the observer, while a shift to the left has the effect of increasing that distance.

Thus, it is possible to generate the left virtual view based on the real right view and various horizontal shifts of image areas. This virtual view generation constitutes an approach to the 2D to 3D conversion.
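A minimal sketch of this idea is given below. It assumes that an integer per-pixel disparity d, following formula (2.36) as d = round(f*tx/Z), is already available for every pixel of the right (source) view; the function and variable names are introduced only for this example. Pixels of the right view are shifted horizontally to form the virtual left view, and positions that receive no pixel are left as information gaps to be handled by the methods discussed later in Sections 2.6 and 2.7.

```python
import numpy as np

def virtual_left_view(right_view, disparity):
    """Generate a virtual left view from the real right view.

    right_view : H x W x 3 array (the original 2D image, used as the right eye view)
    disparity  : H x W integer array of horizontal shifts d = round(f * tx / Z),
                 larger for closer objects, as implied by formula (2.36)
    """
    h, w = disparity.shape
    left_view = np.zeros_like(right_view)        # unfilled pixels remain as information gaps
    cols = np.arange(w)
    for y in range(h):
        target = cols + disparity[y]             # x_l = x_r + d: shift pixels to the right
        valid = (target >= 0) & (target < w)
        left_view[y, target[valid]] = right_view[y, cols[valid]]
    return left_view
```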

In the case of images presented to the CCTV operator, there is often not enough knowledge about exact or even approximate values of the tx, f, and Z parameters. This is a significant problem in 2D to 3D conversion, because the real distances are projected onto the image as depth values that depend on the details of the vision system. A solution to this problem is proposed by the author: the parameters giving the best observable 3D impression and the highest possible image quality are determined experimentally.

2.4. Indication of image content based on depth maps

As described in Section 2.3, the left virtual view is generated by appropriate horizontal shifts of selected areas (e.g. related to objects) in the right real view image. By decreasing or increasing the horizontal shift, the distance from the scene point to the observer also changes, which may be observed as a 3D impression. Objects in the scene may be indicated by so-called depth maps. The depth map stores the relative depth for each pixel [TamZ2006]. Information about depth can be obtained using passive or active methods [Chan2007]. Passive methods are based only on images. Active methods also require additional sensor equipment.

In the case of passive methods, the depth information may be obtained from static image parameters like edges, colors, shadows, contrast, textures, and geometric perspective [TamZ2006, Chan2007, Rede2006, Chen2008a]. The other information source is the data about motion, obtained from the analysis of differences between consecutive video frames [KimD2007, Chan2007, LiuC2012]. Computed parameters like motion direction and luminance changes are components of the image flow or optical flow [Chen2008a].

A typical depth map is a monochromatic image, which contains N-bit, e.g. 8-bit, grayscale levels. For the depth map generation, it was assumed that the metric depth value Zmax is the distance from the camera lens to the point of the farthest background plane, Zmin is the distance from the camera lens to the point of the nearest foreground, and Z is the distance from the camera lens to the point of interest. The distance Z certainly lies between Zmin and Zmax. Other designations are: vmin – the pixel depth value of the point of the farthest background, vmax – the pixel depth value of the point of the nearest foreground, and v – the pixel depth value of the point of interest. The depth value v is between vmin and vmax [Balc2008, Fehn2004]. For an N-bit depth map an inverse proportionality can be written as:

$$Z_{\min} \ \text{related with} \ v_{\max} = 2^N - 1, \qquad (2.37\text{a})$$

$$Z \ \text{related with} \ v, \qquad (2.37\text{b})$$

$$Z_{\max} \ \text{related with} \ v_{\min} = 0. \qquad (2.37\text{c})$$

The inverse proportionality from expressions (2.37a–c) can be rewritten as a direct proportionality:

$$Z_{\max} - Z_{\min} \ \text{related with} \ v_{\max} - v_{\min} = 2^N - 1, \qquad (2.38\text{a})$$

$$Z - Z_{\min} \ \text{related with} \ v_{\max} - v = 2^N - 1 - v. \qquad (2.38\text{b})$$

Thus, using the direct proportionality, it can be written that

$$(Z_{\max} - Z_{\min})(2^N - 1 - v) = (Z - Z_{\min})(2^N - 1). \qquad (2.39)$$

For the depth map preparation using the known metric values of distances from the camera lens to the 3D scene objects, expression (2.39) can be transformed into the form

$$v = \operatorname{round}\!\left(\frac{Z_{\max} - Z}{Z_{\max} - Z_{\min}}\,(2^N - 1)\right), \qquad v \in \langle 0,\, 2^N - 1\rangle. \qquad (2.40)$$

Expression (2.40) describes the translation of metric values to depth values for linear depth quantization. It can be transformed to the form which describes the translation of depth values to metric values for linear quantization of depth [Fehn2004]:

$$Z = Z_{\max} - \frac{v\,(Z_{\max} - Z_{\min})}{2^N - 1}, \qquad v \in \langle 0,\, 2^N - 1\rangle. \qquad (2.41)$$
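The two translations (2.40) and (2.41) can be sketched directly as follows, assuming metric depths expressed in the same units as Zmin and Zmax (the example values are illustrative only).

```python
import numpy as np

def depth_to_value(Z, Z_min, Z_max, N=8):
    """Metric depth -> N-bit depth map value, linear quantization (formula (2.40))."""
    v = np.round((Z_max - np.asarray(Z, dtype=float)) / (Z_max - Z_min) * (2**N - 1))
    return np.clip(v, 0, 2**N - 1).astype(int)

def value_to_depth(v, Z_min, Z_max, N=8):
    """N-bit depth map value -> metric depth (formula (2.41))."""
    return Z_max - np.asarray(v, dtype=float) * (Z_max - Z_min) / (2**N - 1)

# Illustrative example: a scene spanning 2 m (nearest) to 50 m (farthest).
print(depth_to_value([2.0, 26.0, 50.0], 2.0, 50.0))   # -> [255 128   0]  (nearest is brightest)
print(value_to_depth([255, 128, 0], 2.0, 50.0))        # -> approximately [2.0, 25.9, 50.0]
```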

With 8 bits it is possible to obtain up to 256 quantization levels, but only about 20 quantization levels of a depth map are sufficient for excellent 3D effect synthesis, and in many cases simpler depth maps are sufficient [Ides2007]. The simplest depth map is a 1-bit (bivalent, binary) map, which contains two depth values only. A binary depth map is relatively easy to prepare because it contains only one depth value for the object or the group of objects and
