
Magdalena Szubielska

Institute of Psychology

The John Paul II Catholic University of Lublin

Honorata Nitkiewicz

Institute of Psychology

The John Paul II Catholic University of Lublin

VISUAL, HAPTIC, AUDITORY, AND CROSSMODAL RECOGNITION OF SEMANTIC SCENES

The present study investigates the accuracy of visual, haptic, and auditory recognition of scenes presented either in the same modality as during scene learning or in a different modality. The scene was a square board on which six animal models were arranged. Participants learned the scene, after which the researcher swapped the positions of two or three models. The participant's task was to identify which animals had been swapped.

The results indicate that scenes are encoded and recognized less effectively by hearing than by seeing or touching. There were also significant costs of modality change (visual to auditory and haptic to auditory) in comparison with learning and recognizing in the same modality.

Key words: scene recognition, vision, haptics, audition, crossmodal cognition

PL ISSN 0081-685X DOI: 10.2478/V10167-010-0119-6

INTRODUCTION

The subject of the study is the perception of small-scale space through the visual, haptic, or auditory modality and under cross-modal conditions. The main research problem was to test whether a mental representation of space constructed on the basis of hearing is less accurate than representations generated through visual and haptic perception. We were also interested in the costs of modality change: from visual to auditory and from haptic to auditory.

Following Millar (1994), we understand small-scale space as a whole that can be observed from a single point of view. It can also be understood as a scene (cf. Newell, Woods, Mernagh & Bülthoff, 2005). Francuz (2013) argues that scenes can be described by two categories: objects and relations. The first category is connected with recognizing the particular things assembling the scene, the second with estimating their locations in relation to the observer's body. The division of a scene into the categories of objects and relations corresponds to two research strands in cognitive psychology: research on visual imagery and research on space perception (cf. Cornoldi & Vecchi, 2003). A natural way to merge these approaches is the concept of visuo-spatial working memory (Logie, 1995). This memory system specializes in retaining and processing visuo-spatial representations. The character of these representations changes depending on the kind of task performed at the moment. For example, when the task requires remembering the shape of a presented stimulus, the representation is mainly visual, while when remembering the locations of sequentially exposed elements of a scene, the spatial aspects prevail. Cornoldi and Vecchi (2003) indicate three different ways of remembering a scene, each engaging memory differently.


A memory representation may include exclusively the locations of the objects (when we know where they are but not what they are) or exclusively their characteristics (when we know what they are but not where they are). It is also possible to construct a representation that includes information about both the locations and the features of the objects composing the scene; in this situation we exploit so-called binding memory. It should be noted that only by using binding memory can we accurately remember a scene that consists of different objects.

Information connected with the scene is stored in working memory as mental representations called spatial images. Loomis, Klatzky and Giudice (2013) argue that the source of spatial images can be visual, haptic, or auditory perception (we can also use data from cognitive maps stored in long-term memory, or verbal descriptions). In other words, each of these modalities is a source of information about the characteristics of, and the spatial relations between, the objects creating the scene. Spatial images integrate spatial data from different sources, but they do not contain modality-specific content: they are amodal. Unlike visual images, they are perceived as external to the body (they exist in all directions, not only in front of our eyes) and they change as the person moves (they exhibit relative parallax). The creators of the spatial images concept argue that images created using sight, touch, and hearing are functionally equivalent (see also Loomis & Klatzky, 2007). However, given that spatial images originate in perception, and all mistakes made in the process of perception also affect the creation of the image, comparing the functional equivalence of spatial images developed on the basis of different modalities is a big challenge and still remains at the level of theoretical deliberation (Loomis et al., 2013).

Results of experiments show that distortions in scene perception affect hearing to a greater degree than sight. For example, research on estimating the distance from a target to an observer (Loomis, Klatzky, Philbeck & Golledge, 1998) showed better accuracy in estimating targets perceived with sight than with hearing. Moreover, Yamamoto and Shelton (2009) found that it is more difficult to learn a scene arranged in a room and to update it, i.e. to adopt different points of view, when information about the successive elements of the scene is received by hearing rather than by sight (the dependent variable in the experiment was the error in pointing the direction of the stimulus location). The superiority of visual over haptic coding was demonstrated by Cattaneo and Vecchi (2008) in experiments that required retaining abstract scenes in working memory (square matrices on which some areas had been marked) and reconstructing them afterwards. Subjects reconstructed scenes learned by sight more accurately than those learned by touch. Such an effect has not been observed in studies on recognizing scenes consisting of animal models (Newell, Woods, Mernagh & Bülthoff, 2005; Szubielska, Jaroszek & Kiljanek, 2013). Therefore, it is possible that only asemantic scenes, where the subject has to remember only the locations of the elements, are remembered better when they have been seen than when they have been touched.

Interesting from the perspective of the discussion of the amodal spatial images concept are the results of research on cross-modal scene recognition by Newell and others (2005). The authors concluded that visual and haptic spatial representations retained in working memory are modality-specific and thereby not amodal. Their research on semantic scene recognition (with models of animals arranged on a round rotating platform) showed that changing the modality between learning and recognition, from visual to haptic or vice versa, causes a decline in the accuracy of recognizing these scenes, compared to conditions where the modality had not been changed. Moreover, the experiments conducted by Newell and team (2005) showed that a factor that can modulate the modality-change effect (i.e. the costs caused by recognizing a scene in a different modality than the one in which it had been learned) is the degree of working memory load.

A similar effect emerged in a study on scene recognition that manipulated the difficulty of the recognition task (Szubielska et al., 2013). A change of modality caused an elongation of response time, but only when the scenes were learned by touch (sequential exploration of the original scene) and the task was more difficult (i.e. three instead of two elements of the scene had been swapped).

An analysis of the literature makes it possible to propose hypotheses and research questions to be verified in the present study. The authors expect that scenes are represented in working memory less accurately when a person has learned them by hearing than by sight or touch. Moreover, the aim of the study is to test whether there are costs of modality change when the modality at the stages of learning and recognition changes from visual to auditory and from haptic to auditory. It is also important to check whether the decline in recognition accuracy connected with modality change depends on the difficulty of the task.

METHODS¹

Participants

135 undergraduate students (111 women, 24 men) aged from 18 to 35 (M = 21.56; SD = 1.99) participated in the study. They were randomly divided into 9 equal groups of 15. Table 1 shows the percentage of men and women in each group.

Variables

The dependent variable in the experiment was the accuracy of scene recognition. The modality of learning the scene, the modality of recognition, and task difficulty were the independent variables. The variables connected with the senses had three levels, determined by the modality in which subjects learned or recognized the scene: visual, haptic, and auditory. Their combination was different in each group. Task difficulty had two levels: easy, when two elements of the scene were swapped, and hard, when the positions of three elements changed.
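The resulting design can be summarized compactly in code. Below is a minimal sketch (Python) enumerating the nine between-subject groups and the two within-subject difficulty levels; the labels and the printout format are ours, not part of the original materials:

```python
from itertools import product

MODALITIES = ["visual", "haptic", "auditory"]   # between-subject: learning and recognition modality
DIFFICULTIES = {"easy": 2, "hard": 3}           # within-subject: number of swapped figures

# 3 x 3 = 9 between-subject groups, each performing both difficulty levels
for learning, recognition in product(MODALITIES, MODALITIES):
    group = f"{learning}-{recognition}"
    for difficulty, n_swapped in DIFFICULTIES.items():
        print(f"group {group:20s} | {difficulty}: {n_swapped} figures swapped")
```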

Materials and apparatus

In the experiment the authors used a LEGO Duplo board sized 38.5 × 38.5 cm with 24 rows of studs, and 6 LEGO Duplo animal figures: horse, cow, sheep, pig, cat, and hen (the same materials had been used by Szubielska et al., 2013, in their research). The figures were selected in a pilot study that tested the ability to identify each animal by touch and by hearing the noises it makes. They were attached to the board in places that had been randomly settled beforehand. Additional materials were 3D recordings generated with a binaural microphone².

The successive noises of the animals appearing on the recording (in a randomly chosen sequence) were emitted from the places in a 400 × 400 cm space that corresponded to the arrangement of the figures of these animals on the LEGO board. In a study conducted at the stage of preparing the materials, it had been established that enlarging the space of the board at a 9.6:1 scale makes it possible to distinguish changes in the location of the sound source across successive noises. It had also been verified in the pilot study that each sound can be identified as coming from a different place than the previous one, in the left-right as well as the front-back plane. During recording, the microphone was placed in a location analogous to that where the subject's head would be in front of the board in the visual and haptic conditions (taking the scale differences into account).
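The correspondence between figure positions on the board and sound-source positions in the recording space is, in effect, a uniform scaling. A minimal sketch (Python), assuming the scale factor reported above and a simple corner-origin coordinate convention of our own:

```python
SCALE = 9.6  # enlargement of the board space for the binaural recording, as reported above

def board_to_sound_space(x_cm: float, y_cm: float) -> tuple[float, float]:
    """Map a figure's position on the 38.5 x 38.5 cm board (left-right, front-back)
    to the corresponding sound-source position in the enlarged recording space."""
    return (x_cm * SCALE, y_cm * SCALE)

# e.g. a figure 20 cm to the right and 10 cm from the front edge of the board
print(board_to_sound_space(20.0, 10.0))  # -> (192.0, 96.0) cm in the recording space
```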

¹ We would like to thank Katarzyna Jaroszek, Bartłomiej Kiljanek, Dawid Wolniczak and Mateusz Czech for their help in conducting the study. The research report is a result of work within the grant Multimodal cognition: perception, memory, imagery, funded by the Faculty of Social Sciences, KUL.

² A binaural microphone is also called a 'dummy head' because it resembles a human head, complete with earlobes. Capacitor microphones inside the 'ears' record sounds in the same way as they would be perceived by a human standing in the place of the dummy head. Recordings made with such a microphone give accurate information about the direction and distance of the sound source.


Each recording was 36 seconds long and consisted of 6 successive animal noises, the same animals as in the visual and haptic conditions.

The apparatus used in the experiment consisted of a computer (to play the recordings), stereophonic headphones, and a stopwatch.

Procedure

Each subject went through 8 subtests, presented in random order. Each subtest consisted of two stages: learning and recognizing the scene. Subjects learned the original scene in the visual condition (seeing the board for 10 seconds), the haptic condition (exploring it by touch for 60 seconds), or the auditory condition (hearing the recording). The exposure times in the visual and haptic conditions were set on the basis of Newell et al.'s (2005) experiments. Their adequacy was supported in the pilot study, where the exposure time of the auditory materials was also checked by analyzing the time needed to identify a single animal by the noise it makes. After a 20-second interval (time to change the figures' locations in the conditions of visual and haptic recognition), subjects learned the test scene by seeing it, touching it, or hearing the 3D recording. The task was to point out which figures had been swapped.

The elements of the scene that changed their locations had been randomly chosen for each subtest at the stage of preparing the materials (the same changes were presented to each subject). There were two difficulty levels: in 4 subtests two animals were swapped (easier tasks), and in the other 4 subtests three animals were swapped (harder tasks). In the haptic and auditory conditions subjects were blindfolded. A subtest was completed correctly when the subject pointed out all the figures that had swapped places and did not name any animal that was in the same place as in the original scene. The percentage of correctly completed subtests was counted separately for the easier and the harder tasks.
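The scoring rule can be stated compactly: a subtest counts as passed only if the set of figures the subject points out is exactly the set of figures that were swapped. A minimal sketch (Python; the function names and the animal labels in the example are ours):

```python
def subtest_correct(pointed: set[str], swapped: set[str]) -> bool:
    """A subtest is passed only if all swapped figures are named
    and no figure that stayed in place is named."""
    return pointed == swapped

def accuracy_percent(responses: list[tuple[set[str], set[str]]]) -> float:
    """Percentage of correctly completed subtests (computed separately
    for the 4 easier and the 4 harder subtests)."""
    correct = sum(subtest_correct(p, s) for p, s in responses)
    return 100.0 * correct / len(responses)

# e.g. an easier subtest with 'cow' and 'pig' swapped
print(subtest_correct({"cow", "pig"}, {"cow", "pig"}))  # True
print(subtest_correct({"cow", "cat"}, {"cow", "pig"}))  # False: one miss, one false alarm
```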

RESULTS

Descriptive statistics for each experimental group are presented in Table 1. When the experiment was planned, the authors' aim was to conduct an analysis of variance with the within-subject factor of task difficulty (easier; harder) and the between-subject factors of learning modality (visual; haptic; auditory) and recognition modality (visual; haptic; auditory).

Table 1. Means (M) and standard deviations (SD) of scene recognition accuracy (%) for easier and harder tasks, and the percentage of women and men in each subgroup

| Group           | Easier tasks M | Easier tasks SD | Harder tasks M | Harder tasks SD | Women (%) | Men (%) |
|-----------------|----------------|-----------------|----------------|-----------------|-----------|---------|
| Sight-sight     | 66.67          | 29.38           | 41.67          | 27.82           | 86.67     | 13.33   |
| Sight-touch     | 63.33          | 26.50           | 41.67          | 27.82           | 73.33     | 26.67   |
| Sight-hearing   | 28.33          | 20.85           | 5.00           | 10.35           | 53.33     | 46.67   |
| Touch-sight     | 76.67          | 25.82           | 63.33          | 20.85           | 93.33     | 6.67    |
| Touch-touch     | 66.67          | 38.58           | 41.67          | 30.86           | 93.33     | 6.67    |
| Touch-hearing   | 36.67          | 22.89           | 3.33           | 8.80            | 86.67     | 13.33   |
| Hearing-sight   | 21.67          | 12.91           | 10.00          | 12.68           | 66.67     | 33.33   |
| Hearing-touch   | 10.00          | 15.81           | 0.00           | 0.00            | 93.33     | 6.67    |
| Hearing-hearing | 15.00          | 18.42           | 3.33           | 8.80            | 93.33     | 6.67    |


However, a preliminary analysis of the data showed that they did not satisfy the assumption of equality of variances (Levene's test: p < .001 for both the easier and the harder tasks). Moreover, subjects in the 'hearing-touch' group were not able to recognize any of the harder subtests accurately (see Table 1). Therefore, the results were analyzed with nonparametric tests.
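A homogeneity-of-variance check of this kind can be reproduced with scipy. A minimal sketch (Python); the per-group score vectors below are placeholders, not the study's raw data:

```python
from scipy.stats import levene

# accuracy scores (%) per group for one difficulty level -- placeholder data
groups = {
    "sight-sight":   [50, 75, 100, 25, 75],
    "sight-hearing": [25, 0, 25, 50, 0],
    "hearing-touch": [0, 0, 25, 0, 0],
    # ... remaining six groups would be listed here
}

stat, p = levene(*groups.values())
if p < .05:
    print(f"Levene's test: W = {stat:.2f}, p = {p:.3f} -> variances unequal; use nonparametric tests")
```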

To test the hypothesis that scenes learned by hearing are represented in memory less accurately than scenes learned visually or haptically, the authors analyzed the data from the easier and the harder tasks separately.

For the easier tasks, a Kruskal-Wallis ANOVA by ranks revealed a significant influence of learning modality on recognition accuracy (H = 46.46; p < .001). Mean ranks for the visual, haptic, and auditory modalities were 80.40, 86.92, and 36.68, respectively. The next step was pairwise comparison conducted with the Mann-Whitney U test. Scenes learned by sight were recognized significantly better than those learned by hearing (z = 5.52; p < .001). Scenes learned by touch were also recognized more accurately than those learned by hearing (z = 5.85; p < .001). The recognition accuracy of visually and haptically learned scenes was comparable (z = –1.02; p = .309).

The Kruskal-Wallis test for the harder tasks also showed a significant impact of learning modality on recognition correctness (H = 31.22; p < .001). Mean ranks for the visual, haptic, and auditory modalities were 77.54, 82.83, and 43.62, respectively. The Mann-Whitney U test revealed that scenes learned by sight were recognized significantly better than those learned by hearing (z = 4.33; p < .001). Scenes learned by touch were also recognized more accurately than those learned by hearing (z = 4.52; p < .001). The accuracy of recognition of visually and haptically learned scenes did not differ significantly (z = –1.02; p = .309).
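The analysis pipeline used here, an omnibus Kruskal-Wallis test followed by pairwise Mann-Whitney U comparisons, can be sketched with scipy (Python). The score vectors are placeholders, and scipy reports the U statistic rather than the z approximation quoted above:

```python
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

# accuracy on one difficulty level, pooled by learning modality -- placeholder data
by_learning_modality = {
    "visual":   [75, 50, 100, 50, 75],
    "haptic":   [75, 100, 50, 75, 50],
    "auditory": [25, 0, 25, 0, 50],
}

H, p = kruskal(*by_learning_modality.values())
print(f"Kruskal-Wallis: H = {H:.2f}, p = {p:.3f}")

# pairwise follow-up comparisons
for (name_a, a), (name_b, b) in combinations(by_learning_modality.items(), 2):
    u, p_pair = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs {name_b}: U = {u:.1f}, p = {p_pair:.3f}")
```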

To verify whether a modality change from visual to auditory and from haptic to auditory decreases the accuracy of scene recognition (compared to within-modal recognition conditions), the Mann-Whitney U test was used to compare the 'sight-sight' and 'sight-hearing' groups, and the 'touch-touch' and 'touch-hearing' groups. The analyses were run for the easier and the harder conditions separately.

Scene recognition was significantly less accurate in the 'sight-hearing' than in the 'sight-sight' condition (see Table 1) for both the easier (z = 3.24; p = .001) and the harder tasks (z = 3.59; p < .001). Moreover, recognition in the 'touch-hearing' condition was significantly less accurate than in the 'touch-touch' condition (see Table 1) for both the easier (z = 2.09; p = .036) and the harder tasks (z = 3.42; p < .001).
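Each modality-change comparison is simply a two-sample Mann-Whitney U test between a within-modal group and its cross-modal counterpart, e.g. (placeholder data, Python):

```python
from scipy.stats import mannwhitneyu

# placeholder accuracy scores (%) for one difficulty level
sight_sight   = [75, 50, 100, 75, 50]
sight_hearing = [25, 50, 0, 25, 25]

u, p = mannwhitneyu(sight_sight, sight_hearing, alternative="two-sided")
print(f"'sight-sight' vs 'sight-hearing': U = {u:.1f}, p = {p:.3f}")
```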

DISCUSSION

The research hypotheses were confirmed. Semantic scenes learned by hearing are recognized less accurately than those learned by sight (which supports previous research: Loomis et al., 1998; Yamamoto & Shelton, 2009) or by touch. Moreover, the results did not confirm the superiority effect of visual over haptic coding in visuo-haptic working memory reported by Cattaneo and Vecchi (2008), thereby supporting the findings of Newell and team (2005). The procedures of both the present study and the experiment of Newell et al. (2005) engaged the subjects' binding memory: they had to create an accurate representation of objects and their locations in a small space (Cornoldi & Vecchi, 2003), while in the experiment conducted by Cattaneo and Vecchi (2008) only the locations of the elements were important. Therefore, it is possible that the superiority effect of visual over haptic coding does not apply to memory binding the 'what' and 'where' information about a small space.

Regrettably, the present study does not allow us to determine the source of the better performance when the scenes were learned visually or haptically compared to auditory learning. The superiority of visual and haptic coding can be a result of distortions in auditory perception. But the distortions can also occur at higher levels of processing, when auditory data are recoded into, depending on the theoretical approach, amodal spatial images (Loomis et al., 2013), modality-specific representations (Newell et al., 2005), or a representation in binding memory retained in visuo-spatial working memory (Cornoldi & Vecchi, 2003). It would be highly desirable, but in practice very difficult, to test whether the accuracy of assessing objects' locations with the visual, haptic, and auditory modalities is comparable at the level of perception. Firstly, we think that perception tests conducted before the recognition tests could distort the results of the study. Secondly, such tests of perception accuracy would de facto include recognition tests: subjects would learn the scene for some time and have to retain it in working memory. As a result, we still would not know whether presumptive differences in performance derive from perceptual limitations or from the difficulty of the memory task. Besides, it should be noted that in previous research on cross-modal scene recognition conducted by other authors, the accuracy of perception in different modalities was also not controlled (Cattaneo & Vecchi, 2008; Newell, Woods, Mernagh & Bülthoff, 2005; Yamamoto and Shelton (2009) tested it to some extent, but referring not directly to perception but to memory indicators).

Moreover, the experiment showed that within-modal scenes (both learned and recognized by sight or by touch) were recognized more accurately than cross-modal scenes recognized by hearing (the 'sight-hearing' and 'touch-hearing' conditions). Modality-change costs emerged in both the easier and the harder tasks, so they do not increase along with visuo-spatial working memory load. Therefore, it is possible that the relation between modality-change costs and working memory load occurs only in the visual and haptic modalities (cf. Newell et al., 2005; Szubielska et al., 2013).

A critical analysis of the procedure and results of the current study allows us to conclude that the findings are not a consequence of the adopted research method. The reported modality-change costs are not merely a result of the sequential presentation of the auditory data. It is known that sequentially presented material is harder to retain than material presented simultaneously (cf. Cattaneo & Vecchi, 2008). However, there were no differences in recognition accuracy between haptically explored scenes, where exploration is also sequential, and scenes explored visually (holistic learning) (cf. Cattaneo & Vecchi, 2008; Newell et al., 2005). Nor is the effect dependent on the scene's size, which differentiated the auditory from the visual and haptic conditions: a change of scale should not affect scene recognition, because spatial images can be updated to a different size (Loomis et al., 2013).

A limitation of the experiment's procedure was the system of performance measurement, which was not precise enough. Subjects' responses were coded as 1 (all figures pointed out correctly) or 0 (any mistake). It is possible that with more precise performance indicators (taken, for example, from signal detection theory) we would obtain mildly different results.
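For illustration, a graded alternative to the all-or-nothing coding could score hits and false alarms per subtest, in the spirit of signal detection theory. This scheme is our sketch, not an analysis reported in the study (Python):

```python
def graded_score(pointed: set[str], swapped: set[str], n_figures: int = 6) -> dict[str, float]:
    """Hit rate and false-alarm rate for one subtest, instead of 0/1 coding."""
    hits = len(pointed & swapped)            # swapped figures correctly identified
    false_alarms = len(pointed - swapped)    # unmoved figures incorrectly named
    return {
        "hit_rate": hits / len(swapped),
        "false_alarm_rate": false_alarms / (n_figures - len(swapped)),
    }

# harder subtest: three figures swapped; the subject finds two of them plus one wrong figure
print(graded_score({"cow", "pig", "cat"}, {"cow", "pig", "horse"}))
# -> {'hit_rate': 0.666..., 'false_alarm_rate': 0.333...}
```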

To sum up, in tasks which require remembering both what the elements of the scene are and where they are, the auditory modality is a less accurate source of information than the senses of sight and touch. However, further empirical research is necessary to verify whether this is connected with errors of auditory perception or with higher levels of data processing.

REFERENCES

Cattaneo, Z. & Vecchi, T. (2008). Supramodality effects in visual and haptic spatial processes. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34(3), 631–642.

Cornoldi, C. & Vecchi, T. (2003). Visuo-spatial working memory and individual differences. Hove, New York: Taylor and Francis/Psychology Press.

Francuz, P. (2013). Imagia. W kierunku neurokognitywnej teorii obrazu [Imagia: Towards a neurocognitive theory of the image]. Lublin: Wydawnictwo KUL.

Logie, R. H. (1995). Visuo-spatial working memory. Hove, UK: Lawrence Erlbaum Associates Ltd.

Loomis, J. M. & Klatzky, R. L. (2007). Functional equivalence of spatial representations from vision, touch, and hearing: Relevance for sensory substitution. In: J. J. Rieser, D. H. Ashmead, F. F. Ebner & A. L. Corn (Eds.), Blindness and brain plasticity in navigation and object perception (pp. 155–184). New York: Lawrence Erlbaum Associates.

Loomis, J. M., Klatzky, R. L. & Giudice, N. A. (2013). Representing 3D space in working memory: Spatial images from vision, touch, hearing, and language. In: S. Lacey & R. Lawson (Eds.), Multisensory Imagery: Theory & Applications (pp. 131–156). New York: Springer.

Loomis, J. M., Klatzky, R. L., Philbeck, J. W. & Golledge, R. G. (1998). Assessing auditory distance perception using perceptually directed action. Perception & Psychophysics, 60(6), 966–980.

Millar, S. (1994). Understanding and representing space: Theory and evidence from studies with blind and sighted children. Oxford: Clarendon Press.

Newell, F. N., Woods, A. T., Mernagh, M. & Bülthoff, H. H. (2005). Visual, haptic and crossmodal recognition of scenes. Experimental Brain Research, 161(2), 233–242.

Szubielska, M., Jaroszek, K. & Kiljanek, B. (2013). Koszty zmiany modalności we wzrokowo-dotykowym rozpoznawaniu scen [Costs of changing modality in visuo-haptic recognition of scenes]. Roczniki Psychologiczne / Annals of Psychology, 16(2), 343–365.

Yamamoto, N. & Shelton, A. L. (2009). Orientation dependence of spatial memory acquired from auditory experience. Psychonomic Bulletin & Review, 16(2), 301–305.
