Sonified distance in sensory substitution does not always improve localization: comparison with a 2D and 3D handheld device

by   Louis Commère, et al.
Université de Sherbrooke

Early visual to auditory substitution devices encode 2D monocular images into sounds, while more recent devices use distance information from 3D sensors. This study assesses whether the addition of sound-encoded distance in recent systems helps to convey the "where" information. This is important to the design of new sensory substitution devices. We conducted experiments on object localization and navigation tasks with a handheld visual to audio substitution system that comprises 2D and 3D modes. Both modes encode in real time the position of objects in images captured by a camera. The 3D mode additionally encodes the distance between the system and the object. Experiments were conducted with 16 blindfolded sighted participants. For localization, participants were quicker to understand the scene with the 3D mode that encodes distances. On the other hand, with the 2D-only mode, they were able to compensate for the lack of distance encoding after a short training. For navigation, participants were as good with the 2D-only mode as with the 3D mode encoding distance.



1 Introduction

Sensory substitution is a mechanism by which one sense can stimulate and consolidate brain areas usually associated with another sense, given the extremely plastic nature of the brain [bach1987brain, kokovayNeuron2008]. Initial research conducted by P. Bach-y-Rita [bachYRita1969] established that the back can be used to mediate visual stimuli to the brain. He then showed that trained blind people can navigate and “visually perceive” features of the environment via electric pulses on the tongue that encode camera images [bachYRitaTrends2003, ptitoBrain2005]. The substitution from vision to audition has also been studied, and Sensory Substitution Devices (SSDs) have been designed for that purpose [csapo2015survey]. We are interested in the potential of visual to auditory SSDs to convey 3D spatial information through sounds.

Access to 3D information is necessary to perform everyday tasks such as detecting obstacles while navigating or localizing and grasping objects. Three-dimensional information is, however, mainly provided by vision and is acquired through monocular or binocular cues. Monocular cues include, for example, relative size (far objects are perceived as smaller) and motion parallax (for a moving observer, nearby objects appear to move faster than distant objects). Binocular cues include, for example, binocular disparity (the difference in the position of an object on the retina as seen by the left and the right eyes). A complete review of visual cues for 3D perception is available in the work of Palmer et al. [palmer1999vision].

In this work, we study the effectiveness of encoding distance into sounds with a handheld visual to auditory SSD for localization and navigation tasks. As far as we know, no previous research has quantified the potential advantages of using audio encoded distance in such a context. We review below the main visual to auditory substitution devices that have been successfully tested for localization and navigation. We classify these systems into two categories based on whether or not they use distance information. Then, we propose a protocol for comparing them.

Early visual to auditory SSDs convert 2D monocular images from a camera into sounds [Meijer1992, Capelle1998, Auvray2006, abboud2014eyemusic, ambard2015mobile]. We refer to these as “2D systems” (i.e. without distance encoded into sounds). Studies have shown that it is possible to infer depth with such systems despite the absence of 3D sensors. Ward et al. [ward2010] conducted a phenomenology study with two long term blind users of the vOICe [Meijer1992]. The vOICe is the first known vision to audition device; it turns 2D monocular greyscale images into sound by mapping the positions and greyscale values of pixels to the frequency and amplitude of short-time sine tones. After months of immersive use of the system, the two blind users reported being able to evaluate distances. Auvray et al. [auvray2007learning] also conducted experiments with the vOICe and showed that blindfolded participants were able to localize objects within a 3D environment. Renier and Volder [Renier2010] conducted localization experiments with early blind and sighted blindfolded participants using the Prosthesis for Substitution of Vision by Audition (PSVA) [Capelle1998]. The PSVA holistically encodes the position of pixels from 2D monocular images into frequencies of sine tones, with a greater resolution in the center of the image to mimic the fovea. Participants had to estimate the position of cubes placed on a table. The cubes were then removed from the table and participants were able to correctly put them back at their initial position. People were thus able to use the audio encoded monocular cues of visual to auditory SSDs to perceive distance.

A new generation of vision to audition SSDs takes advantage of the accessibility of 3D sensors to directly encode distance information with sound. We refer to these as “3D systems” (i.e. with distance encoded into sounds). Some sonify the raw 3D data, and the user has to interpret the sounds to understand the environment [Brock2013, stoll2015navigating, Skulimowski2018]. Others provide higher level information to the user. Several systems algorithmically find the free path to allow users to move safely without hitting obstacles [yang2016expanding, yang2018long, aladren2016navigation]. Kayukawa et al. [Kayukawa2019] proposed a device to detect pedestrians who would collide with the blind user. Recent systems also combine object recognition based on machine learning with navigation aid [Bai2019, Li2020].

2D systems are cheaper and distance can be inferred with monocular cues. With 3D ones, the distance is given directly to the user and does not need to be inferred. However, this additional information might overload the hearing system which can confuse users [kristjansson2016designing].

To our knowledge, no studies have been conducted on the comparison between 2D and 3D handheld devices. Therefore, we designed a protocol to compare the behaviors and performance of individuals using a 2D and a 3D version of a vision to audition SSD.

The protocol comprises localization and navigation experiments with a 2D and a 3D mode of the same handheld vision to audition SSD. One mode encodes distance and the other does not. Both rely on the same approach: detecting the position of objects in 2D camera images and encoding it into sounds. With the 3D mode, the distance between the object and the system is also encoded with an additional sound parameter. Both modes run in real time. Our experimental protocol consists of 2 tasks. During the first task, participants became familiar with the device and had to locate, by pointing with the index finger, 3 objects on a table. For the second task, they had to navigate through a corridor while avoiding obstacles randomly placed on the floor. Localization performance is measured with the pointing error distance and the time taken to find the 3 objects. Navigation performance is evaluated with the number of objects that participants did not detect on their path and the time they took to complete the courses. At the end of the experiment, participants had to fill out a questionnaire.

Experiments were conducted with 16 sighted blindfolded participants. All were able to effectively localize objects and navigate after a short training time (approximately 10 min) with both the 2D and 3D modes. At first, they were better at localization with the 3D mode. Then, with learning, similar performance for object localization was observed with both modes. For navigation, participants were as good with the 2D mode as with the 3D mode. They improved across the courses. Nevertheless, the questionnaire showed that participants generally felt more comfortable with the 3D mode.

2 Material and methods

2.1 Sensory substitution system

The overall system consists of a handheld camera (Playstation Eye Camera), a computer carried in a backpack and a headset plugged in the computer and worn by participants (Fig. 1).

Figure 2 illustrates the principle of the substitution device. We adapted the 2D device See Differently [rouatICAD2014] to the needs of our study. The new device comprises 2D and 3D modes. Both encode in real time the position of objects in images captured by a camera. In addition, with the 3D mode, the distance between the camera and the object is encoded.

We used the AprilTag algorithm [olson2011tags] to detect visual markers that we pasted on objects and to estimate their 3D relative positions (Fig. 2). With AprilTag, there is no need for object recognition algorithms, so the errors such algorithms could introduce are avoided. The AprilTag algorithm is robust because it uses known visual markers. The AprilTag markers were cm squares for the localization task and cm squares for the navigation task. Visual markers are detected at distances between 4 and 200 cm for the localization task and between 14 and 900 cm for the navigation task.

The area of 2D images of the camera is divided into 15 rectangular cells. Each cell is associated with prerecorded spatialized piano sounds. The horizontal position of the cells is encoded into the azimuth of the sounds (Fig. 2). The spatialization is made via convolution with Head Related Transfer Function (HRTF) filters. We use the HRTF measured on the KEMAR dummy head from the CIPIC database [CIPIC2001]. To compensate for the poor elevation encoding with non-individual tunings of HRTF [stitt2019auditory], we use pitches of piano notes to encode the vertical position of the cells which provides an estimate of the elevation of the objects. If a marker of an object is detected in a cell, its respective sound is triggered and looped until the visual marker is no longer detected. There is no limit to the number of sound cells that can be played simultaneously. For the 2D mode, each spatialized piano note lasts 2 seconds. For the 3D mode, the duration of the note (in seconds) is equal to the distance (in meters) between the camera and the visual marker. For example, if the marker is  cm from the camera the duration of the note will be seconds.

Figure 1: Sensory substitution system worn by a participant. The camera and the headset are connected to the computer which is in the backpack worn by the participant.
Figure 2: Principle of the modified version of See Differently [rouatICAD2014]. The 3D mode is illustrated here. When the object is far, the sound signal is repeated slowly (object on the left), while it is repeated more quickly when the object is closer (object on the right). Three different musical notes are used to encode the rows of cells: C3 (bottom cells), E3 (middle cells) and G3 (top cells). These notes are spatialized with HRTF filters to encode the columns of cells.

The visual marker recognition algorithm is implemented with Python and the sound synthesis is performed with the Supercollider environment [mccartney2002rethinking].
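The encoding described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the 3 x 5 grid is inferred from the 15 cells and the three row notes, the column azimuth values are hypothetical, and the names `SoundEvent` and `sonify_marker` are invented for the example. Only the note-to-row mapping, the 2-second note of the 2D mode, and the duration-equals-distance rule of the 3D mode come from the text.

```python
from dataclasses import dataclass
from typing import Optional

ROWS, COLS = 3, 5                 # 15 cells; the 3 x 5 layout is an inference
ROW_NOTES = ("G3", "E3", "C3")    # top, middle, bottom rows of cells (Fig. 2)
AZIMUTHS_DEG = (-60.0, -30.0, 0.0, 30.0, 60.0)  # hypothetical column azimuths
DURATION_2D_S = 2.0               # fixed note duration in the 2D mode


@dataclass(frozen=True)
class SoundEvent:
    note: str           # piano note encoding the row (elevation)
    azimuth_deg: float  # HRTF azimuth encoding the column
    duration_s: float   # note duration; encodes distance in the 3D mode


def sonify_marker(x: float, y: float, width: int, height: int,
                  distance_m: Optional[float] = None) -> SoundEvent:
    """Map a marker centre (pixels) to the sound of the cell containing it.

    Pass distance_m for the 3D mode; leave it as None for the 2D mode.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("marker centre lies outside the image")
    col = int(x * COLS / width)
    row = int(y * ROWS / height)  # image y grows downward, so row 0 is the top
    # 3D mode: duration in seconds equals distance in meters.
    duration = DURATION_2D_S if distance_m is None else distance_m
    return SoundEvent(ROW_NOTES[row], AZIMUTHS_DEG[col], duration)
```

For instance, a marker at the centre of a 640 x 480 image maps to the E3 note at 0° azimuth, with a 2-second note in the 2D mode and a note as long as the measured distance in the 3D mode.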

2.2 Protocol

The protocol was designed to compare the 2D and 3D modes of the system. Experiments comprised two sessions, one for the 2D mode and one for the 3D mode. One session included a localization task along with a navigation task (three courses) with the same mode, either 2D or 3D (Table 1). Both tasks were approved by the ethical committee from Letters and Human Sciences from Université de Sherbrooke under reference number 2014-85/Rouat. Participants were randomly split into two groups. Half of them completed their first experimental session with the 2D mode and their second experimental session with the 3D mode. The other half completed the experimental sessions using the modes in the reverse order. This allowed us to study the impact of the order in which the 2D and 3D modes were used.

Group            Session 1          Session 2
                 Loc.    Navig.     Loc.    Navig.
Began with 2D    2D      2D         3D      3D
Began with 3D    3D      3D         2D      2D
Table 1: Protocol. One localization task (Loc.) comprises the finding of 3 objects on a table. One navigation task (Navig.) comprises 3 courses. Participants in the first group completed the localization and the navigation with the 2D mode and then with the 3D mode. Participants in the second group completed the tasks using the modes in the reverse order.
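As an illustration of the counterbalanced design in Table 1, the following minimal sketch randomly splits participants into the two groups and returns the order of modes for each of them. The function name `assign_groups`, the participant identifiers and the use of a seed are assumptions made for the example; the paper does not describe how the random split was carried out.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Randomly split participants into two equal halves.

    Returns a dict mapping each participant to (session 1 mode, session 2 mode).
    """
    rng = random.Random(seed)      # seed only makes the example reproducible
    ids = list(participant_ids)
    rng.shuffle(ids)
    half = len(ids) // 2
    schedule = {}
    for pid in ids[:half]:
        schedule[pid] = ("2D", "3D")   # group that begins with the 2D mode
    for pid in ids[half:]:
        schedule[pid] = ("3D", "2D")   # group that begins with the 3D mode
    return schedule


if __name__ == "__main__":
    for pid, modes in sorted(assign_groups(range(1, 17), seed=0).items()):
        print(pid, modes)
```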

The two experimental sessions, including the completion of the 2 tasks with the 2D and 3D mode, lasted on average between 30 minutes and 1 hour, depending on the time taken by participants to complete the tasks. Participants could either use their own headphones or the Audio Technica (model ATH-M50X) pair we provided. They did not have any distinctive health-related problems. For the first task, participants had to localize 3 objects on a table (Fig. 3a). Once they became familiar with the device and located the 3 objects on the table, they could proceed to the navigation task.

Figure 3: Localization (a) and navigation (b). (a): A computer mouse, a flashlight, and a cellular phone have to be found on a table. (b) The participant scans the environment by moving the handheld device, while the assistant takes notes of the strategy being used by the participant.

Participants had to fill out a questionnaire after completing the 2 experimental sessions with the 2D and 3D modes. Each participant received the same amount of financial compensation to cover their expenses and participation. None of them were relatives or friends of the authors.

2.2.1 Localization task

Participants were standing in front of a table, blindfolded, and were asked first to gradually increase the sound volume (set at the minimum before the experiment) to a comfortable level. They had to complete a short familiarization phase (5 to 10 minutes). During familiarization, they were allowed to explore and touch the table and the objects while listening to the sounds played by the device through headphones. Then, they were asked to locate three objects on the table: a wireless computer mouse, a mobile phone, and a flashlight (Fig. 3 a). They did not have to identify the objects. We measured their location accuracy and speed in pointing at the 3 objects. We proceeded as follows:

  (i) Participants are blindfolded and the 3 objects are placed randomly on the table within a range of 5 to 100 centimeters from participants;

  (ii) Participants are asked to point in the direction of the first object they find;

  (iii) Then, they need to touch the object with their index finger, or the table if they miss the object, so that we can measure the distance between the object and the index finger;

  (iv) When the object is missed, the distance between the center of the object and the index finger of participants is labeled as the “localization error distance”;

  (v) Then, participants have to find the 2 other objects by following the same procedure from (iii) to (iv).

Once an object is located, it is not removed from the table. The time taken to find each object and the distance to the object are recorded.

2.2.2 Navigation task

Participants were asked to walk from one end of the corridor (15 meters long and 6 meters wide) to the other without colliding with the objects from the scene (Fig. 3 b). Two chairs, two garbage bins, two small bags, and two cardboard boxes were randomly placed in the corridor. Visual markers were pasted to the back and the front of each object. Each participant had to make 6 passes at their own pace: 3 passes with each mode (either 2D or 3D). Participants were allowed to explore slowly or to go straight and fast. We proceeded as follows to measure participants’ performance:

  (i) Before each course, we blindfold participants and randomly move the obstacles.

  (ii) We start the stopwatch once participants are ready.

  (iii) One assistant remains close to participants to take notes about their strategy and to make sure that they do not collide with objects or walls. When participants detect an object on their way, they have to stop walking and scanning and describe to the assistant where they think the object is located. The object is then reported by the assistant as “seen” if the description of the object location is accurate. Objects are classified as missed if the assistant has to momentarily interrupt the experiment to avoid a collision.

  (iv) After the completion of a run, the time is noted.

  (v) Then, participants can rest or begin another run, restarting the process from step (i).

  (vi) Once the 3 courses are completed, participants have to complete the second experimental session (starting from the localization) with the mode they have not yet used (the 2D mode if they began with the 3D mode, and vice versa). If this is already their second session, participants have to answer a questionnaire to give their feedback.

2.3 Graphical representation and analysis of results

Boxplots [rice2006mathematicalBoxplot] are used to show the distribution of localization errors (Fig. 4), navigation times (Fig. 5), and questionnaire responses (Fig. 6). The horizontal lines of a boxplot are the first quartile (lower horizontal line), the median (middle horizontal line), and the third quartile (top line). The triangle is the average. The ends of the vertical lines are the minimum and maximum (excluding outliers). Data points that are more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile are outliers and are marked with black dots.
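The boxplot conventions above can be made concrete with a short sketch. It uses only the Python standard library; the function name `boxplot_summary`, the choice of the "inclusive" quartile method and the sample values are assumptions for illustration (plotting libraries may compute quartiles slightly differently), and the data are made up rather than taken from the experiments.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, mean, whisker ends and outliers under the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inliers = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),   # drawn as a triangle in the figures
        "whisker_low": min(inliers),       # minimum excluding outliers
        "whisker_high": max(inliers),      # maximum excluding outliers
        "outliers": outliers,              # drawn as black dots
    }


if __name__ == "__main__":
    made_up_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # illustrative values only
    print(boxplot_summary(made_up_times))
```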

We use different versions of the ANalysis Of Variance (ANOVA) [rice2006mathematicalANOVA] to analyze the effect of the variables in Table 1 (mode, group and session number) on participants’ performance. An ANOVA determines whether the means of two or more distributions are different by comparing inter- and intra-group variances. For each ANOVA, we give the F statistic (the ratio between the inter- and the intra-group variance) and the p-value (computed from the F statistic and the inter- and intra-group degrees of freedom using the Fisher distribution; it is the probability of obtaining an F statistic at least as large as the observed one if the group means were in fact equal). The significance level for the ANOVAs is set with the p-value at .

A one-factor repeated-measures ANOVA is used to analyze the effect of the mode on localization (section 3.1); a repeated-measures ANOVA is used when data are collected from the same participants under different conditions or at different times. A two-factor repeated-measures ANOVA is used to evaluate the impact of the mode (2D or 3D) and the course number on navigation (section 3.2); an ANOVA with several factors analyzes the effect of several independent variables on one outcome variable. A two-factor ANOVA is also used to evaluate the effect of the mode and the group (began with 2D or began with 3D) on the questionnaire (section 4.2).
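To make the F statistic of a one-factor repeated-measures ANOVA concrete, here is a minimal standard-library sketch of the textbook computation. It is not the authors' analysis code: the function name `repeated_measures_f` and the sample scores are hypothetical, and the p-value is left out because it would additionally require the survival function of the Fisher distribution for the returned degrees of freedom.

```python
def repeated_measures_f(scores):
    """F statistic of a one-factor repeated-measures ANOVA.

    scores[s][c] is the measurement of subject s under condition c.
    Returns (F, df_conditions, df_error).
    """
    n_subjects = len(scores)
    n_conditions = len(scores[0])
    n_total = n_subjects * n_conditions
    grand_mean = sum(sum(row) for row in scores) / n_total
    cond_means = [sum(row[c] for row in scores) / n_subjects
                  for c in range(n_conditions)]
    subj_means = [sum(row) / n_conditions for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_cond = n_subjects * sum((m - grand_mean) ** 2 for m in cond_means)
    ss_subj = n_conditions * sum((m - grand_mean) ** 2 for m in subj_means)
    # Variability between subjects is removed from the error term, which is
    # what distinguishes the repeated-measures design from a plain ANOVA.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = n_conditions - 1
    df_error = (n_subjects - 1) * (n_conditions - 1)
    f_stat = (ss_cond / df_cond) / (ss_error / df_error)
    return f_stat, df_cond, df_error


if __name__ == "__main__":
    # Made-up errors (cm) for three subjects under two conditions.
    print(repeated_measures_f([[1, 2], [2, 4], [3, 6]]))
```

With these made-up scores, the sums of squares are 6 for the conditions, 9 for the subjects and 1 for the error, so the F statistic is 12 with 1 and 2 degrees of freedom.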

3 Results

3.1 Localization

Participants completed the localization task twice: once for each mode (2D and 3D). We study the impact of the mode and the group (began with 2D or began with 3D).

The global average localization error (distance between participants’ index finger and the center of objects) is  cm.

Figure 4 shows the distribution of the localization errors for each session number and each mode. To analyze the effect of the mode on the localization error, one-factor repeated-measures ANOVAs were performed separately on each group. In the group of participants who began with the 2D mode, the analysis reveals that the mode (either 2D or 3D) has a significant effect on localization errors. This group is significantly better with the 3D mode during its second session than with the 2D mode during its first session. In the group of participants who began with the 3D mode, the analysis shows no significant effect of the mode. This group obtains similar results at session 1 with the 3D mode and at session 2 with the 2D mode. In addition, one-factor ANOVAs were computed per session (1 or 2) to study the effect of the group on localization errors. They show that the group has a significant impact during session 1 but not during session 2. Participants who began with the 3D mode perform well in the first session. They are then able to develop a strategy to accurately represent the 3D environment with the 2D mode during session 2. Participants who began with the 2D mode are initially less accurate and perform significantly better during their second session with the 3D mode.

Nevertheless, by taking into account the two experimental sessions, 11 out of 16 participants (69 %) are more accurate with the 3D mode of the system.

Figure 4: Localization task. Distributions of the participants’ localization errors as a function of the session number (1 or 2) and the system mode (2D or 3D). The groups of participants who began with the 2D and 3D modes are shown with the continuous and dashed lines, respectively.

Similar ANOVAs to those of the localization error were conducted to investigate the impact of mode and group on localization times. The analysis shows that neither the group nor the mode has a significant impact on the localization times.

Finally, based on the speed-accuracy tradeoff [heitz2014speed], we tested the hypothesis that participants could have adopted one of two approaches to complete the task: being quick but not accurate, or being accurate but slower. However, Pearson’s correlation coefficient shows no relation between the time taken to complete the localization task and the localization error.
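For reference, Pearson's correlation coefficient between two series, such as completion times and localization errors, can be computed as in the sketch below. The function name `pearson_r` and the sample values are assumptions for illustration; a coefficient close to 0 corresponds to the absence of a linear relation between speed and accuracy, whereas a speed-accuracy tradeoff would appear as a clearly negative coefficient (longer times, smaller errors).

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    times_s = [30.0, 45.0, 60.0, 90.0]   # made-up completion times
    errors_cm = [4.0, 1.0, 5.0, 2.0]     # made-up localization errors
    print(pearson_r(times_s, errors_cm))
```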

3.2 Navigation

Participants completed 6 courses in total: three courses in a row with the 2D (or 3D) mode during session 1 and then three courses in a row with the 3D (or 2D) mode during session 2 (Table 1). We studied the impact of the mode, the group (i.e. order at which 2D and 3D were used) and the course number on performance.

Figure 5 shows the distribution of the time taken by the participants to complete one course as a function of the session number, the mode (2D or 3D) and the course number. Overall, seven participants (44%) achieved their best course time with the 2D mode and nine (56%) with the 3D mode.

To analyze the effect of the mode and the course number on the course completion time, two-factor repeated-measures ANOVAs were performed separately on the group that began with 2D and the group that began with 3D. For both groups, no significant effect of the mode or of the course number is found. There is also no interaction effect between the course number and the mode in either group.

Yet no participant is considered an outlier (average course time above 400 seconds) in session 2, and average times decrease across courses and sessions (Fig. 5). There is therefore still a learning effect, in the sense that performance becomes more consistent over time.

Figure 5: Navigation task. Distribution of the time taken by participants to complete the courses as a function of the session number, the mode and the course number. For session 1, some participants are considered to be outliers because of the difficulties they had in navigating. For session 2, there are no outliers anymore. Averages (triangles) are lower for session 2.

A two-factor repeated-measures ANOVA was also performed on each group for the number of missed obstacles. As described in section 2.2.2, we consider that an object is missed if the assistant had to momentarily interrupt the experiment to avoid a collision. The analysis shows no significant effect of the mode or the course number on the number of missed obstacles. All participants are able to complete the tasks with both modes while missing few objects (global average of missed obstacles).

As with localization, we hypothesized that participants could have been either fast while missing many objects or slow while missing few objects. As before, Pearson’s correlation coefficient shows no relation between the time taken to complete the navigation and the number of missed objects.

Regardless of the mode (2D or 3D), participants managed to use the system efficiently to detect obstacles and became more comfortable and faster. We report in the discussion (section 5.1) results of complementary experiments conducted with 42 blindfolded sighted participants [commere2020evaluation], who were using the initial 2D version of See Differently. They are consistent with this work.

4 Questionnaire and qualitative observations

4.1 Description

After the completion of the localization and navigation tasks, each participant was asked to complete a questionnaire (see Table 2). Three types of questions were used: “rating scale question”, “open-ended question” and “multiple choice question”. For “rating scale questions”, participants were asked to provide a numerical response ranging from 1 to 5. For the [Understand., Loc. Ease, Navig. Ease] questions, 1 and 5 meant “very easy” and “very difficult” respectively. For [Navig. Afraid], 1 and 5 meant “not at all afraid” and “very afraid” respectively.

The questionnaire was divided into three sections: one for the localization (questions [Understand., Loc. Ease, Loc. Strat.]), one for the navigation (questions [Navig. Ease, Navig. Afraid, Navig. Strat.]) and one to provide general feedback ([Comfort] question). Except for the [Comfort] question, participants were asked to answer the questions twice: once for each system mode.

Question | Type
Q1 - [Understand.] - How did you find the understanding of the device’s manipulation? | Rating
Q2 - [Loc. Ease] - How did you find the task of pointing the object with your index finger? | Rating
Q3 - [Loc. Strat.] - What was your strategy to find the objects on the table? | Open
Q4 - [Navig. Ease] - How did you find the task of moving with the system? | Rating
Q5 - [Navig. Afraid] - Were you afraid to hit the obstacles? | Rating
Q6 - [Navig. Strat.] - What was your strategy to move in the corridor? | Open
Q7 - [Comfort] - In general, were you more comfortable with or without the depth information? | Multiple choice
Table 2: List of questions that were answered after localization and navigation tasks.

4.2 Results

Figure 6 shows the distribution of participants’ answers to rating scale questions by mode and group.

Two-factor ANOVAs were conducted for each rating question to study the impact of the mode and the group on the ratings given by participants. We report below the significant ANOVAs. For question [Understand.], the two-factor ANOVA reveals a significant effect of the group (i.e. the order in which the modes were used) on ratings. Beginning with the 2D mode makes it easier for participants to understand the system. Indeed, there is less information with the 2D mode (no audio encoded distance), which facilitates the interpretation of the sound. For question [Navig. Afraid], the two-factor ANOVA reveals a significant effect of the mode on ratings. Participants felt safer with the 3D mode, although they did not perform better during navigation.

Figure 6: Questionnaire. Distribution of participants’ ratings for the questions [Understand., Loc. Ease, Navig. Ease, Navig. Afraid] as a function of the mode (2D or 3D) and the group (i.e. participants who began with the 2D or the 3D mode).

Answers to questions [Loc. Strat.,Navig. Strat.] and observations made during the experiment revealed the strategies used for the localization and the navigation.

For localization, a common and effective strategy is to first scan the entire surface of the table (i.e., hold the system high above the table) to get an approximate localization, and then to move closer to localize an object more precisely. With the 3D mode, participants keep the sound centered (i.e., keep the object on the cell that produces the E3 note with an azimuth angle of 0°, Fig. 2) and use the sound rhythm that encodes the distance. With the 2D mode, participants estimate the distance by listening to the speed at which the sound changes when they move the device from right to left or up and down (see Fig. 8 for an “up and down” movement example). When the movement required for the object to be detected in another cell (i.e. for the sound to change) is large, then the object is close, and the other way around.

We noticed that localization errors occurred primarily when participants attempted to point while holding the system away from the object. Indeed, the index finger of the pointing hand is often not oriented in exactly the same direction as the optical axis of the camera. When the camera is far from the object, the localization error is therefore larger (Fig. 7).

Figure 7: Localization task. Illustration of why localization errors are larger when participants try to point while the system is far from the object. The angle between the direction of the index finger and the camera is the same in both situations (close and far). When the camera is far from the object, the localization error is larger.
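The geometry of Figure 7 can be approximated with a simple model, sketched below under stated assumptions: the finger direction deviates from the optical axis of the camera by a fixed angle, both start at roughly the same point, and the object lies on a plane perpendicular to the optical axis. Under this simplification the pointing error grows linearly with the camera-to-object distance, as d · tan(θ). The function name `pointing_error_cm` and the numerical values are hypothetical.

```python
import math


def pointing_error_cm(camera_to_object_cm: float, misalignment_deg: float) -> float:
    """Lateral pointing error for a fixed finger/camera misalignment angle.

    Simplified model: error = distance * tan(angle).
    """
    return camera_to_object_cm * math.tan(math.radians(misalignment_deg))


if __name__ == "__main__":
    angle_deg = 5.0  # hypothetical misalignment between finger and camera
    for distance_cm in (10.0, 50.0, 100.0):
        error = pointing_error_cm(distance_cm, angle_deg)
        print(f"{distance_cm:5.0f} cm away -> error of about {error:.1f} cm")
```

For the same misalignment angle, holding the device ten times farther from the object gives a pointing error ten times larger, which matches the observation that errors occurred mainly when participants pointed from far away.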

For navigation, regardless of the mode, participants directed the camera towards the end of the corridor during the first course. The system then detected many obstacles on the course, which made the sound cacophonous and difficult to interpret. Most participants then decided to scan right in front of them so that the device did not detect too many obstacles at the same time. Only 4 out of 16 participants reported using the audio encoded distance in the [Navig. Strat.] question, compared to 12 out of 16 for localization in the [Loc. Strat.] question. They used a similar strategy to move and detect obstacles with both modes, as shown in Figure 8.

For the [Comfort] question, 13 out of 16 participants reported feeling more comfortable with the 3D mode. Among the 3 who felt more comfortable with the 2D mode, two (one in each group) found that the addition of the audio encoded distance in the 3D mode overloaded their hearing. The third reported feeling more comfortable with the 2D mode because he began the experiment with it.

Figure 8: Navigation. Illustration of the main strategy used by participants to estimate the distance to an object. The distance is related to the up and down movement of the camera required to detect an obstacle. Top figure: the movement from zone 1 without obstacle (i.e. no sound) to zone 2 with obstacle (sound) is small which means that the object is close. Bottom figure: the movement from zone 1 to zone 2 is large which means that the object is far.

5 Discussion

Early visual to auditory substitution devices encode 2D monocular images into sound, while more recent systems increasingly use 3D cameras. However, no studies have investigated the impact of using audio encoded distance on the user experience. The aim of the study was to quantify the potential advantages of encoding distance in visual to auditory substitution devices. This question is important to the design of new sensory substitution devices. We therefore conducted experiments involving object localization and navigation tasks with a handheld visual to auditory substitution system. The system comprises 2 modes. Both modes encode the position of objects in 2D images, and one of the two modes additionally encodes the distance between the system and the object. The experiments were conducted with 16 blindfolded sighted participants.

5.1 Consistency of the results with prior experiments

In [commere2020evaluation], we conducted prior experiments with forty-two blindfolded sighted participants who used the initial 2D version of the See Differently substitution device [rouatICAD2014]. This initial version used a neural-network based image processing algorithm to detect objects. As in this work, the system encoded the positions of objects in 2D images by triggering sounds associated with the cells. There were 12 cells (3 rows, 4 columns) in that initial version, and it did not provide distance information. Participants had to perform the same localization and navigation tasks. We used the same protocols as described in sections 2.2.1 (localization) and 2.2.2 (navigation) of this paper. We also used the same objects for localization (Fig. 3 a) and the same obstacles for navigation (Fig. 3 b) in the same corridor.

For the navigation task, we observed a learning effect across the 5 courses (average time decreased between the first and last trials of seconds) and participants missed very few objects ( ). As in this work, the number of missed objects did not decrease significantly between courses. Overall, the results presented here with the 2D mode are consistent with those of the experiments in [commere2020evaluation], which were conducted with 42 participants. We are therefore confident that the results presented in this paper generalize.

5.2 The handheld 2D device allows perception of the 3D environment

In line with the studies of Ward et al. [ward2010] and Auvray et al. [auvray2007learning] with the vOICe, and the study of Renier and De Volder with the PSVA [Renier2010], our participants were able to perceive the 3D environment from 2D monocular images encoded into sounds. They inferred distance with wrist movements, as illustrated in Figure 8. This suggests that participants are able to assess the amplitude of their wrist movements and relate it to the sound changes.

The ability to perceive the position of a joint without visual feedback or other external cues is called “joint position sense” (or joint proprioception). Wrist position sense has been widely studied [gay2010new, marini2016wrist, pilbeam2018test, li2019wrist]. Although it depends on the protocol, the measurement system and the geometrical axis along which the joint is studied, the error of wrist position sense has consistently been found to be below 10°. Marini et al. [marini2016wrist] found that wrist position sense supports the assessment of the difference between two positions ("vector coding") rather than of an absolute position ("positional coding"). This supports the idea that our participants were able to accurately estimate the wrist movement required to direct the camera from zone 1 with no sound to zone 2 with sound (Fig. 8). This allows them to evaluate the distance of objects without the need for explicit distance encoding.
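The strategy of Figure 8 can be illustrated with simple trigonometry. The sketch below is not part of the study. It assumes a hypothetical geometry in which the camera is held at a known height, starts pointing straight down, and is tilted up until its optical axis meets the base of an obstacle on the floor. The function names, the camera height and the tilt values are illustrative assumptions; only the 10° bound is taken from the wrist position sense literature cited above.

```python
import math


def distance_from_tilt(height_m: float, tilt_deg: float) -> float:
    """Horizontal distance to the floor point hit by the optical axis.

    Assumption: tilt_deg is measured from the straight-down direction,
    so 0 degrees points at the user's feet.
    """
    if not 0.0 <= tilt_deg < 90.0:
        raise ValueError("tilt_deg must be in [0, 90)")
    return height_m * math.tan(math.radians(tilt_deg))


def distance_bounds(height_m: float, tilt_deg: float, error_deg: float = 10.0):
    """Distance interval implied by a +/- error_deg uncertainty on the tilt."""
    low = max(tilt_deg - error_deg, 0.0)
    high = min(tilt_deg + error_deg, 89.0)  # stay away from the horizon
    return distance_from_tilt(height_m, low), distance_from_tilt(height_m, high)


if __name__ == "__main__":
    camera_height_m = 1.2  # hypothetical hand height
    for tilt in (20.0, 45.0, 70.0):
        d = distance_from_tilt(camera_height_m, tilt)
        lo, hi = distance_bounds(camera_height_m, tilt)
        print(f"tilt {tilt:4.1f} deg: d = {d:.2f} m, range [{lo:.2f}, {hi:.2f}] m")
```

Under these assumptions, a small tilt corresponds to a close object and a large tilt to a far one, which matches the two cases of Figure 8. The same angular error also produces a wider distance interval for far objects than for near ones, so this kind of implicit distance estimate would be most precise at short range.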

5.3 The 3D handheld device allows faster learning

For object localization, an effective strategy is to move the device as close as possible to the targeted object before pointing (Fig. 7). The quantitative results show that the 3D mode initially yields smaller localization errors than the 2D mode. The strategy is easier and faster to implement with the 3D mode thanks to the distance encoded by the sound rhythm. After practice, the 2D mode also allows participants to get close to the object and thus make small localization errors. This suggests that the sonification of 2D monocular images allows users to perceive the 3D environment after some training. On the other hand, the explicit sonification of distances allows the perception of the 3D environment without prior learning.

As reported in the literature, “2D systems” often require more training time than “3D systems” to perform spatial tasks. Participants using the vOICe and the PSVA (which are “2D systems”) needed to train for 5 and 15 hours [auvray2007learning, Renier2010] before successfully completing the spatial tasks. When performing spatial tasks with 3D systems [Brock2013, stoll2015navigating, Skulimowski2018], participants often only need to know how the system works.

5.4 Do we need explicit audio encoded distances?

Qualitative results show that most participants felt more comfortable with the mode that explicitly encodes distances. Indeed, the 3D mode initially allows for better localization accuracy. In addition, participants felt more secure with the 3D mode during navigation. Although most participants did not exploit the audio-encoded distance of the 3D mode, they felt that they could better anticipate obstacle avoidance.

Still, 3 of the 16 participants (about 19%) were more comfortable without the audio-encoded distance. Two of them reported that it overloaded their hearing. Also, for navigation, participants used a similar strategy with the 2D and 3D modes, and this strategy did not require distance information (Fig. 8). This explains why we did not observe differences between the two modes during this task. Therefore, audio-encoded distance did not improve participants’ performance during navigation.

On the one hand, the 3D mode gives equivalent or better quantitative and qualitative results. On the other hand, practice allowed users of the 2D mode to compensate for the lack of distance encoding during localization. During navigation, performance was similar with and without the explicit encoding of distance. In addition, although they are increasingly accessible, 3D sensors are still more expensive. Overall, we support the idea of Dakopoulos and Bourbakis that an SSD should be “easy to use (operation and interface not loaded with unnecessary features) and without the need of an extensive training period” [Dakopoulos2010]. Thus, removing unnecessary audio-encoded information can make the interpretation of the sound easier for some people.

Since we have shown that the effectiveness of audio encoded distance depends on the task and people’s preferences, we believe that systems should give users the ability to choose which information (especially distance information) will be encoded in sound based on their preferences and on usage contexts.

5.5 Limitations of the work

We chose to investigate handheld systems rather than head-mounted ones. As suggested by Auvray et al. [auvray2007learning], handheld systems give the user more freedom to explore the visual scene. In addition, a handheld system can be implemented in a smartphone or similar device. Since most people nowadays have a smartphone, the system would be more discreet in public environments. However, it would still be interesting to conduct experiments with head-mounted devices. For example, Brown et al. [Brown2011] showed that the vOICe substitution system was more efficient for identifying objects when held in the hand, while it was more efficient for navigating when mounted on the head.

In this work, we used the same encoding of the 2D characteristics of the visual scenes in both modes for a more objective comparison of the 2D and 3D modes. In this context, the 3D mode does not bring much improvement over the 2D mode when the participant is well trained. An interesting strategy that we leave for future work would be to use the distance information to segregate the closest objects in the scene from those that are far away. During navigation, this would have made the sound less confusing when participants were pointing the camera towards the end of the corridor. Participants could then have adopted different strategies with the 2D and 3D modes.
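The segregation strategy mentioned above could take the following form. This is a minimal sketch under stated assumptions, not an implementation of the See Differently device: the `Detection` record, the `keep_nearest` function, the 2 m range and the limit of three simultaneous sounds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one detected object."""
    row: int           # grid row of the cell containing the object
    col: int           # grid column of the cell containing the object
    distance_m: float  # distance reported by the 3D sensor


def keep_nearest(detections, max_distance_m=2.0, max_sounds=3):
    """Keep only nearby objects so that fewer sounds play at once.

    Objects beyond max_distance_m are dropped, then at most max_sounds
    of the remaining objects are kept, closest first.
    """
    in_range = [d for d in detections if d.distance_m <= max_distance_m]
    in_range.sort(key=lambda d: d.distance_m)
    return in_range[:max_sounds]


if __name__ == "__main__":
    # Camera pointed down a corridor: several obstacles at various ranges.
    scene = [
        Detection(row=2, col=1, distance_m=0.8),
        Detection(row=1, col=2, distance_m=1.5),
        Detection(row=1, col=0, distance_m=6.0),
        Detection(row=0, col=3, distance_m=9.0),
    ]
    for d in keep_nearest(scene):
        print(f"sonify cell ({d.row}, {d.col}) at {d.distance_m:.1f} m")
```

With such a filter, pointing the camera towards the end of the corridor would trigger sounds only for the nearest obstacles. The threshold and the number of sounds would need to be tuned with users.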

Finally, the experiments were conducted with blindfolded but sighted individuals. We expect the results to be similar with late-blind participants, since they adopt strategies similar to those used by sighted individuals when performing spatial tasks [ungar2000, Pasqualotto2013]. In future work, it would be interesting to compare our results with further experiments conducted with late-blind individuals and, eventually, with congenitally blind individuals.

6 Conclusion

Vision to audition substitution devices are very promising for blind or low vision people. In this work, we evaluated the potential advantages of sonified distances for spatial tasks. Overall, the 3D mode that encodes distances into sounds yields performance comparable to or better than that of the 2D mode. We show that audio-encoded distance allows users to understand 3D scenes with less learning than when distance is not encoded. Users are also more confident and feel more secure when the distance is explicitly encoded. However, these results should be nuanced. Without audio-encoded distance, users learn to develop efficient strategies to perceive the 3D environment. Also, depending on the type of task, explicit distance encoding does not necessarily yield better results. Finally, some participants prefer the 2D mode, which does not encode distance. Ideally, we would recommend that designers of vision to audition substitution systems let users decide which information to sonify, depending on their preferences and tasks.


We thank D. Lescal for a preliminary implementation and design of See Differently; FRQNT and NSERC-CRSNG for funding this research; A. Yarga and E. Calvet for their assistance during the experiments; L. Celotti, S. Wood and the members of the NECOTIS research group for proofreading the paper; the participants, E. Plourde and the members of the NECOTIS research group for testing and providing feedback; Franco Lepore and Patrice Voss for stimulating and fruitful discussions on the planning of the protocol; and François Côté for fruitful discussions on the blind community.