Many humans have a need or desire for a robust form of sensory feedback. We seek to provide an effective avenue for that feedback through their ears. This is applicable to those with a visual impairment, those in need of feedback from medical or external devices (e.g. prosthetics or robotic limbs), and, more generally, anyone who wishes to augment neurological or biological systems and requires a robust form of feedback.
The question that initially motivated this work was “Can visual information be communicated to a human through sound as effectively and concisely as it is through light?” In the opinion of the authors, the answer seems to be: not through English, but perhaps through another type of audio. For example, a human with average sight, after a brief (e.g. one-second) glance at a room or face, can describe key features of that room or face. For any description longer than a quick phrase, however, it takes more than a second to communicate the description through spoken English words. Consider also that some mammals (e.g. bats) can effectively use their auditory systems for navigation in low-light environments. There are even examples (easily found on YouTube) of humans without sight who have reportedly learned to use a form of echolocation as well. Additionally, recent advancements in machine learning (i.e. deep learning) have provided methods of effectively embedding high-level visual information in relatively low-dimensional Euclidean space (e.g. Schroff et al. (2015), Arandjelovic et al. (2015), and Peng et al. (2017), to name just a few of the many existing examples).
In this work, we propose an unsupervised neural translation model that translates visual information into a perceptual audio domain. We imagine this as a system that a user would likely need to spend time to learn to use. Our proposed model leverages a pre-trained feature embedding (trained in a weakly supervised fashion using triplet loss) that embeds visual information into a metric space where the Euclidean distance between feature vectors gives a notion of visual similarity. We exploit this learned metric space by mapping the feature vectors into audio signals while enforcing that the Euclidean distance between each pair of untranslated feature vectors is equal (up to scaling) to a mel-frequency cepstrum-based psychoacoustic distance between the translated audio signals. To enforce this geometric preservation, we use a simple loss term given by Eq. 1. This loss term is novel as far as the authors are aware; however, due to its simplicity and because it is arguably a natural choice, it seems very likely to have been used in other applications related to machine learning or computer vision.
Our translation model is based on WaveGAN Donahue et al. (2018), a GAN Goodfellow et al. (2014b) designed to synthesize audio which fits into a given distribution – that distribution being defined by a dataset of audio files.
We find that this technique allows the learning of a meaningful translation between two modalities and demonstrate the technique by mapping images of faces to human speech sounds using a model we refer to as “Earballs” (a portmanteau of “ears” and “eyeballs”). We include results from a human subject test to support the claim that information is preserved in a perceptually meaningful way.
We call the translation task described here “transmodal translation”. The term multimodal is commonly used to describe models and tasks which take into account information from multiple modalities (as inputs), e.g. classifying videos using both the visual content and audio. The term crossmodal has been used in previous works to describe models and tasks which take advantage of pre-existing connections between modalities in nature and human perception, e.g. reconstructing frames of video based on the video’s audio Duan et al. (2019) or reconstructing images based on English language descriptions of those images Qi and Peng (2018). In contrast to crossmodal translation, transmodal translation is the task of finding a novel map between two domains of distinct modalities, as opposed to the also quite interesting problem of learning a pre-existing map or a map which relies on pre-existing interactions between modalities.
The primary contributions of our work can be summarized as follows.
We propose the novel machine learning task of transmodal translation, the task of learning an information preserving map between the domains of two distinct modalities without relying on any pre-existing maps between those modalities.
We provide the implementation details of “Earballs”, a novel system for the transmodal translation of feature vectors into an audio domain. To the authors’ knowledge, this work is the first attempt to automatically learn a translation from an arbitrary domain of feature vectors (equipped with some meaningful metric) into a non-language-based audio domain.
2 Related Work
2.1 Hearing Visual and Geometric Information
In 1992, Meijer (1992) proposed a system, vOICe, for translating images into audio (represented in the time-frequency domain) by assigning each row and column of the image to a specific frequency and time, respectively, with the amplitude of each frequency at each time determined by the corresponding pixel’s intensity. This idea has been further studied in a variety of works by other researchers, including a 2015 Scientific Reports article, Stiles and Shimojo (2015). In comparison, our system does not seek to encode or preserve the information of each individual pixel, but instead higher-level information extracted by a learned feature embedding. By focusing on mapping the geometric structure of the feature space into the perceptual audio domain, we can translate information from a wide range of domains into perceptually meaningful audio. Our proposed system can also be applied to translate information from non-visual domains.
There have also been efforts towards exploring other methods of improving humans’ ability to perceive their environments through their ears, such as Sohl-Dickstein et al. (2015), which investigated the prospect of improving humans’ ability to use echolocation through the use of an ultrasonic emitter coupled with a hearing aid that lowers the emitter’s ultrasonic frequencies into the normal range of human hearing.
As mentioned in the previous section, there have been many attempts at using deep neural networks to perform crossmodal translation tasks, such as reconstructing frames of video based on the video’s audio Duan et al. (2019) or reconstructing images based on English language descriptions of those images Qi and Peng (2018). In contrast with these works, our transmodal translation task has the goal of learning a novel map between two domains of distinct modalities, as opposed to learning a pre-existing map or a map which relies on pre-existing interactions between modalities.
2.2 Geometric Preservation
of GAN models and also to alleviate mode collapse and the vanishing gradient problem Tran et al. (2018). Benaim and Wolf (2017) used a geometric preservation term similar to Eq. 1 for the task of unpaired image-to-image translation, achieving similar results to CycleGAN Zhu et al. (2017) and DiscoGAN Kim et al. (2017) without the need for a secondary inverted generator. Our geometric preservation methodology and loss term differ from the previous works we are aware of in the choice of normalization method (see Sec. 3.4).
In this work, we propose a deep learning based framework that translates high-level information extracted from an image or other signal (e.g. facial identity/expression, location of objects, etc.) into audio. This system can be built on top of any feature embedding model that embeds inputs into a metric space (i.e. any model, $f$, where $d(f(x_1), f(x_2))$ is meaningful for some metric $d$). Our proposed “Earballs” system (shown in Fig. 1) starts with a pre-trained feature embedding model capable of extracting desired features from an image (e.g. FaceNet Schroff et al. (2015)). We then train a generative network which maps the features into a target perceptual audio domain. This perceptual audio domain can be determined by any sufficiently large and diverse dataset of sounds (e.g. clips of human speech Panayotov et al. (2015), musical sounds Engel et al. (2017), etc.).
We use a GAN approach Goodfellow et al. (2014b) to train the generative network, enforcing both that the output sounds fit into the distribution of sounds specified by the target dataset and that the geometry of the source domain (e.g. of feature vectors) is preserved as it is mapped into the target audio domain.
To generate sounds which fit into a target audio dataset, we use a GAN. As is typical with GANs, this is a convolutional neural network composed of two smaller networks trained simultaneously: a generator network and a discriminator network. The discriminator network is trained as a classifier to differentiate between real samples taken from a target (in our case audio) dataset and samples output by the generator network. The generator, given a latent input vector (or, in our case, a feature vector from our source domain), is tasked with generating output which fools the discriminator into predicting it is a real sample from the dataset. For our purposes we use an additional loss term, Eq. 1, to enforce that the generator preserves pairwise distances between the feature vectors in the input batch. Except for this geometric preservation component, our network architecture and weight update scheme (described in Sec. 4.6) follow those of WaveGAN Donahue et al. (2018), including the use of a Wasserstein gradient penalty Gulrajani et al. (2017).
3.2 Mel-Frequency Cepstral Coefficients (MFCCs)
The mel scale Stevens et al. (1937) is a logarithmic scaling of frequency designed to more closely match human hearing – the goal being that if humans perceive two frequencies, $f_1$ and $f_2$, as being the same distance apart as two other frequencies, $f_3$ and $f_4$, then $f_1$ and $f_2$ should be the same number of mels apart as $f_3$ and $f_4$, i.e. we should have that
$$|m(f_1) - m(f_2)| = |m(f_3) - m(f_4)|.$$
There are multiple implementations of the mel scale; ours follows that used in the TensorFlow signal module:
$$m(f) = 1127 \ln\left(1 + \frac{f}{700}\right).$$
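As a small illustration of the mel scale, the sketch below converts between hertz and mels. It assumes the logarithmic form with constants 1127 and 700 (the form we understand TensorFlow's signal module to use); the function names `hz_to_mel` and `mel_to_hz` are our own and are not taken from any library.

```python
import math

# Constants of the assumed mel formula: m(f) = 1127 * ln(1 + f / 700).
MEL_Q = 1127.0
MEL_BREAK_HZ = 700.0


def hz_to_mel(f_hz):
    """Map a frequency in hertz to mels under the assumed formula."""
    return MEL_Q * math.log(1.0 + f_hz / MEL_BREAK_HZ)


def mel_to_hz(mels):
    """Inverse of hz_to_mel."""
    return MEL_BREAK_HZ * (math.exp(mels / MEL_Q) - 1.0)


if __name__ == "__main__":
    # Equal steps in mels correspond to increasingly wide steps in hertz.
    for mels in (500.0, 1000.0, 1500.0, 2000.0):
        print(mels, round(mel_to_hz(mels), 1))
```

Under this formula, equal spacing in mels corresponds to increasingly wide spacing in hertz at higher frequencies, which is the property the perceptual argument above relies on.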
MFCCs are a spectrogram variant which spaces frequency bands on the mel scale. They are a commonly used audio featurization technique for speech Ganchev et al. (2005) and musical Terasawa and Berger (2005) applications. Our exact implementation uses 80 bins over the frequency range of 80–7600 Hz, a window length of 1024 samples, a fast Fourier transform (FFT) length of 1024, and a frame step of 256 samples.
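To make these parameters concrete, the following sketch computes the band edges for 80 mel bins between 80 Hz and 7600 Hz, and the number of analysis frames for a clip of a given length. It is an illustration only: the mel formula (constants 1127 and 700), the use of `num_bins + 2` equally spaced edges, and the no-padding framing convention are assumptions, and all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    # Assumed mel formula: 1127 * ln(1 + f / 700).
    return 1127.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(mels):
    return 700.0 * (math.exp(mels / 1127.0) - 1.0)


def mel_band_edges(num_bins=80, f_min=80.0, f_max=7600.0):
    """Return num_bins + 2 edge frequencies (Hz), equally spaced in mels.

    Bin i is assumed to span edges[i] .. edges[i + 2], peaking at edges[i + 1].
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_bins + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_bins + 2)]


def num_frames(num_samples, frame_length=1024, frame_step=256):
    """Frame count when no padding is applied (an assumed convention)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step


if __name__ == "__main__":
    edges = mel_band_edges()
    print(len(edges), round(edges[0], 1), round(edges[-1], 1))
    print(num_frames(16384))
```

Whether the end of each clip is padded before framing is not stated here, so the frame count printed by this sketch should be read as one possible convention rather than the exact value used.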
3.3 Audio Metrics
For our target domain metric we investigate two metrics, distance, and an MFCC-based metric: where is as described above in Sec. 3.2.
3.4 Geometric Preservation Constraint
Here we’ll define our geometric preservation constraint, which encourages our model to learn a translation which, up to a scaling factor, is as close as possible to an isometry. Below we’ll use $G$ to denote our translation, $Y$ to denote the target audio domain we’re mapping into, equipped with some audio metric, $d_Y$, and $X$ to denote our source domain. The source domain can be any metric space we’d like to learn a translation into audio for (e.g. a space of feature vectors, points from a 3D model, LiDAR/RGB-D data, etc.). Note that the information we wish to communicate from $X$ must be contained in the geometry of $X$. For instance, in the example we’re using as a demonstration, $X$ is a space of feature vectors output by FaceNet, and because triplet loss is used to train FaceNet, the entirety of the information we’d like to translate into audio is contained in the geometry (i.e. the Euclidean distances) of $X$.
Let $(X, d_X)$ and $(Y, d_Y)$ be our source and target metric spaces respectively, and let $G: X \to Y$ be the mapping between these spaces given by our generator. Our metric preservation loss term is then given by
$$\mathcal{L}_{metric} = \frac{1}{|B|^2} \sum_{x_i, x_j \in B} \left| \frac{d_X(x_i, x_j)}{\mu_X} - \frac{d_Y(G(x_i), G(x_j))}{\mu_Y} \right| \qquad (1)$$
where $B$ is a batch of source samples, and $\mu_X$ and $\mu_Y$ are the mean pairwise distances of samples in the input and in the output batch respectively.
Note that we normalize by the batch means as we want to enforce metric preservation only up to a scaling constant.
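The batch-level computation described above can be sketched in a few lines. This is a minimal, framework-free illustration rather than the training implementation: distances default to Euclidean for both domains, the mean is taken over all ordered pairs including the zero diagonal (an assumption about the exact averaging), and the names `metric_loss` and `pairwise` are hypothetical.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(points, dist):
    """Full matrix of pairwise distances within one batch."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def metric_loss(sources, outputs, d_src=euclidean, d_tgt=euclidean):
    """Mean absolute difference between mean-normalized pairwise distances.

    sources: batch of source points; outputs: their translations, same order.
    """
    n = len(sources)
    dx = pairwise(sources, d_src)
    dy = pairwise(outputs, d_tgt)
    mu_x = sum(map(sum, dx)) / (n * n)
    mu_y = sum(map(sum, dy)) / (n * n)
    if mu_x == 0.0 or mu_y == 0.0:
        raise ValueError("batch has zero mean pairwise distance")
    total = sum(
        abs(dx[i][j] / mu_x - dy[i][j] / mu_y)
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


if __name__ == "__main__":
    src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    scaled = [(3.0 * x, 3.0 * y) for x, y in src]   # isometry up to scale
    warped = [(0.0, 0.0), (1.0, 0.0), (0.0, 5.0)]   # distorts the geometry
    print(metric_loss(src, scaled), metric_loss(src, warped))
```

Because both distance matrices are divided by their own batch mean, a translation that uniformly rescales all distances incurs (numerically) zero loss, whereas one that distorts relative distances does not.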
Note that previous works (e.g. Benaim and Wolf (2017)) have normalized using global statistics computed over all of the source domain $X$ and the target domain $Y$, and use standardization instead of our normalization method of dividing by the mean. Conceptually, dividing by the mean can be thought of as scaling both domains to have a mean pairwise distance of one. Using global statistics to normalize would enforce that the image of $X$ has the same diameter as $Y$, which we’d rather let the network decide. We believe this flexibility can be helpful. This belief is supported by the experimental results shown in the rightmost plot in Fig. 2. Furthermore, as mentioned in Sec. 4.8, we did attempt to use an additional loss term to control the diameter of the image of $X$ under our translation, but found that this led to a degradation in audio quality.
In the example we present below (faces to speech-like sounds), $X$ is a set of 128-dimensional unit feature vectors output by (or, regarding the technique mentioned in Sec. 4.2, which could be output by) the OpenFace Amos et al. (2016) implementation of FaceNet. FaceNet is trained using triplet loss, making Euclidean distance the natural choice for the source metric $d_X$. Our $Y$ is given by audio from the TIMIT dataset (preprocessed as described in Sec. 4.3). For the target metric $d_Y$ we use our MFCC-based audio metric as described in Sec. 3.3.
3.5 Total Loss Functions
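Our generator and discriminator loss functions are, in their totality, given by
$$\mathcal{L}_G = -\frac{1}{|B_X|} \sum_{x \in B_X} D(G(x)) + \lambda_{metric}\, \mathcal{L}_{metric}$$
$$\mathcal{L}_D = \frac{1}{|B_X|} \sum_{x \in B_X} D(G(x)) - \frac{1}{|B_Y|} \sum_{y \in B_Y} D(y) + \lambda_{GP}\, \mathcal{L}_{GP}$$
where $B_X$ is a batch of feature vectors from the source domain, $B_Y$ is a batch of audio clips from the target audio domain, $G$ and $D$ denote the generator and discriminator respectively, $\mathcal{L}_{metric}$ is the metric preservation loss of Eq. 1, and $\mathcal{L}_{GP}$ is a Wasserstein gradient penalty term (the implementation from Donahue et al. (2018)). Following Donahue et al. (2018), we set $\lambda_{GP} = 10$. We experiment with tuning $\lambda_{metric}$ in Fig. 2.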
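As a rough sketch of how these terms combine, the snippet below assembles generator and discriminator losses from precomputed quantities. It assumes the standard Wasserstein formulation with a gradient penalty; the discriminator scores, the metric loss value, and the gradient penalty value are taken as given inputs, the default penalty weight of 10 reflects our understanding of the WaveGAN setting, and all function and parameter names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def generator_loss(fake_scores, metric_loss_value, metric_weight):
    """Wasserstein generator loss plus the weighted metric preservation term.

    fake_scores: discriminator outputs D(G(x)) for a batch of source vectors.
    """
    return -mean(fake_scores) + metric_weight * metric_loss_value


def discriminator_loss(fake_scores, real_scores, gradient_penalty, gp_weight=10.0):
    """Wasserstein discriminator loss plus the weighted gradient penalty.

    real_scores: discriminator outputs D(y) for a batch of real audio clips.
    gradient_penalty: precomputed penalty value (not derived here).
    """
    return mean(fake_scores) - mean(real_scores) + gp_weight * gradient_penalty


if __name__ == "__main__":
    fake = [1.0, 3.0]
    real = [2.0, 4.0]
    print(generator_loss(fake, metric_loss_value=0.5, metric_weight=2.0))
    print(discriminator_loss(fake, real, gradient_penalty=0.1))
```

In an actual training loop these values would be tensors produced by an automatic differentiation framework; plain floats are used here only to show the arithmetic.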
In this section, we demonstrate both variants of our Earballs model (one for each of the target audio metrics of Sec. 3.3) on the task of transmodal translation from images of faces to human speech-like sounds, using WaveGAN without geometric preservation as a control.
4.1 Source Dataset: FaceNet Feature Vectors
For our source dataset we use feature vectors created by the pre-trained OpenFace Amos et al. (2016) implementation of FaceNet Schroff et al. (2015) from images in the Labeled Faces in the Wild (LFW) dataset Huang et al. (2012).
FaceNet is trained using triplet loss to embed images of the same face close together and images of different faces far apart by Euclidean distance. This makes Euclidean distance the natural metric choice for our source domain.
LFW is a collection of 13233 images of 5749 people, 1680 of whom have two or more images in the dataset. The dataset also comes with automatically generated attribute scores for a collection of attributes related to eyewear, hair color, age, etc.
4.2 Undiscriminated Random Inputs (URI)
Our proposed model makes use of a feature we’ll refer to as undiscriminated random inputs (URI). This means that, with a fixed probability (the URI proportion), each batch is not taken from the source dataset of feature vectors but is instead sampled uniformly at random from the source domain. The discriminator is not used to evaluate the quality of these samples. Because the feature vectors output by our pretrained embedding network are highly clustered, this methodology was implemented in an attempt to help the generator learn the spherical geometry of the input feature vectors, thus easing its training. Potentially, feeding in random feature vectors like this could also help with generalization, enforcing that all possible faces (not just those in the training set) have a unique sound assigned to them. We analyze the effect of this technique in Fig. 3 and give some discussion of the results in Sec. 4.8. In the case of our FaceNet-based demonstration, our source domain (see Sec. 4.1) is composed of 128-dimensional unit vectors, or equivalently, points on the unit hypersphere in $\mathbb{R}^{128}$. To uniformly sample points on this sphere, we sample a 128-dimensional Gaussian and normalize the samples with respect to the Euclidean norm to get unit vectors. This works with any Gaussian that’s spherically symmetric about the origin, as all unit vectors (i.e. all directions) will then be sampled with equal probability density.
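The sampling procedure just described can be sketched as follows. This is an illustrative, standard-library-only version: the function names and the default batch size are our own choices, and a training implementation would more likely draw the whole batch at once with a tensor library.

```python
import math
import random


def sample_unit_vector(dim=128, rng=random):
    """Draw one point uniformly from the unit hypersphere in R^dim.

    A standard (isotropic, zero-mean) Gaussian sample is rescaled to unit
    Euclidean norm; the retry guards against the degenerate all-zero draw.
    """
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0.0:
            return [x / norm for x in g]


def sample_uri_batch(batch_size=64, dim=128, rng=random):
    """A batch of random unit vectors to feed to the generator."""
    return [sample_unit_vector(dim, rng) for _ in range(batch_size)]


if __name__ == "__main__":
    batch = sample_uri_batch(batch_size=4, rng=random.Random(0))
    print([round(math.sqrt(sum(x * x for x in v)), 6) for v in batch])
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing runs with and without URI batches.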
4.3 Target Dataset: TIMIT
For our target audio dataset we chose one composed of English human speech sounds, the DARPA TIMIT acoustic-phonetic continuous speech corpus (TIMIT) Garofolo et al. (1993). We chose an English speech dataset thinking that human ears (especially those of native and fluent speakers) might be especially sensitive to these types of sounds and might also find the sounds to be more memorable. The dataset is composed of audio recordings of 630 speakers of eight major dialects of American English, each reading ten “phonetically rich” sentences. We follow Donahue et al. (2018)’s method of inputting the audio samples: clips are randomly shifted and then truncated to 1.024 seconds. Note that with our proposed MFCC parameters this gives our target space an effective dimensionality of (w.r.t. geometric preservation).
Following Donahue et al. (2018), the audio is input (into the discriminator network) as 1.024 second clips of 16kHz mono PCM audio (vectors of length 16384), cast as floats and normalized by dividing by 32768.
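A minimal sketch of this preprocessing step is given below, assuming the recording is already available as a list of 16-bit PCM integer samples at 16 kHz. The function name `random_clip` and the zero-padding of recordings shorter than one clip are assumptions made for illustration and are not stated above.

```python
import random

SAMPLE_RATE = 16000
CLIP_LEN = 16384  # 16384 / 16000 = 1.024 seconds


def random_clip(pcm, clip_len=CLIP_LEN, rng=random):
    """Take a randomly shifted window of clip_len samples, scaled to [-1, 1).

    pcm: list of 16-bit integer samples. Recordings shorter than clip_len
    are zero-padded at the end (an assumption, for illustration only).
    """
    if len(pcm) > clip_len:
        start = rng.randrange(len(pcm) - clip_len + 1)
    else:
        start = 0
    clip = list(pcm[start:start + clip_len])
    clip.extend([0] * (clip_len - len(clip)))
    return [s / 32768.0 for s in clip]


if __name__ == "__main__":
    fake_recording = [((i * 37) % 65536) - 32768 for i in range(40000)]
    clip = random_clip(fake_recording, rng=random.Random(0))
    print(len(clip), min(clip) >= -1.0, max(clip) < 1.0)
```

Dividing by 32768 maps the full signed 16-bit range onto [-1, 1), matching the normalization described above.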
4.4 Evaluation Metrics
We use three computational metrics to evaluate our model as well as two human subject test scores: mean absolute error (MAE), Pearson product-moment correlation (PC), nearest centroid accuracy (NCA), human subject classification accuracy (HSA), and a human subject memorability score (HSM). The human subject scores and our NCA metric are defined below. By MAE here we mean exactly our metric loss as defined in Eq. 1. PC is defined as usual.
4.4.1 Nearest Centroid Accuracy (NCA)
As our source vectors come from a triplet loss trained model (FaceNet), the source feature vectors in our demonstration can be classified with high accuracy using nearest centroid. Using the identity of (the person in the source photo for) each feature vector as its label, all feature vectors (in our test set) with the same label are averaged to find the centroid for each label. The nearest centroid prediction for any given feature vector is then the label with the closest centroid. The NCA for our test set source feature vectors output by OpenFace’s FaceNet implementation is . As we’re focused on preserving the geometry of the source domain, we use the nearest centroid predictions from the source feature vectors as the labels when computing the NCA scores in Fig. 2 – even if a source vector is nearer to another label’s centroid than its own label’s, we want to preserve this distance.
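The nearest centroid procedure can be sketched as below. This is a small stand-alone illustration using Euclidean distance and toy 2-D points; the function names are hypothetical. To score translated audio in the way described above, one would pass the audio representations as `vectors` together with the nearest centroid predictions from the source features as `labels`, and swap in the appropriate audio distance.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroids(vectors, labels):
    """Average all vectors sharing a label to get one centroid per label."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        if label in sums:
            sums[label] = [a + b for a, b in zip(sums[label], vec)]
            counts[label] += 1
        else:
            sums[label] = list(vec)
            counts[label] = 1
    return {lab: [a / counts[lab] for a in s] for lab, s in sums.items()}


def nearest_centroid(vec, cents, dist=euclidean):
    """Label whose centroid is closest to vec."""
    return min(cents, key=lambda lab: dist(vec, cents[lab]))


def nearest_centroid_accuracy(vectors, labels, dist=euclidean):
    cents = centroids(vectors, labels)
    hits = sum(nearest_centroid(v, cents, dist) == lab
               for v, lab in zip(vectors, labels))
    return hits / len(labels)


if __name__ == "__main__":
    vecs = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (0.4, 0.4)]
    labs = ["a", "a", "b", "b", "b"]
    print(nearest_centroid_accuracy(vecs, labs))
```

In the toy data the last point carries label "b" but lies closest to the centroid of "a", so it counts as a miss.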
4.5 Human Subject Tests
In order to demonstrate that the translated information can be interpreted by humans, we conducted a human subject test. All participants were students or alumni who responded to a mass email and received a gift card as an incentive. All tests were administered via one-on-one teleconference by the first author, who was blind to which examples were included in the test and to which of the three models (the two Earballs variants or the WaveGAN control) was being tested. An institutional review board (IRB) oversight exemption was obtained from the first author’s institution’s IRB. Subjects were given a 15 minute time limit. A total of 23 subjects were tested. Two subjects returned incomplete forms; these datapoints were removed, as the omissions were not noticed until the tests were graded. The results from the remaining 21 participants are shown in Tab. 2.
4.5.1 Human Subject Classification Accuracy Test (HSA)
Participants were provided with three audio clips labeled “A.wav”, “B.wav”, and “C.wav” and eight additional clips labeled “0.wav” through “7.wav”, and were asked to classify each of the eight as sounding most similar to A, B, or C. These human subject accuracies are reported in Tab. 2. A, B, and C were each a translation of a photo of a different person, each chosen at random from our test set for each participant. The model used for the test (one of the two Earballs variants or the WaveGAN control) was also chosen at random. If any feature vectors from category A were farther apart from each other than the minimum distance between feature vectors in A and B or A and C, then we’d generate a new test. This was appropriate as we were seeking to test the preservation of information and not the accuracy of OpenFace’s FaceNet implementation. The probability of a generated test being inseparable by distance like this was about 68%. The tests were generated from examples in our test set in the following manner.
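The separability check used to reject a generated test can be sketched as follows. The rule above is stated for category A; applying the same rule to every category, as done here, is an assumption, as are the function names and the use of Euclidean distance on toy 2-D points.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_separable(groups, dist=euclidean):
    """Return True if no category is more spread out than it is isolated.

    groups: dict mapping a category name to its list of feature vectors.
    A test is rejected when, for some category, two of its own vectors are
    farther apart than its closest vector pair with any other category.
    """
    for name, members in groups.items():
        intra = max(
            (dist(u, v) for i, u in enumerate(members) for v in members[i + 1:]),
            default=0.0,
        )
        inter = min(
            dist(u, v)
            for other, others in groups.items() if other != name
            for u in members
            for v in others
        )
        if intra > inter:
            return False
    return True


if __name__ == "__main__":
    good = {"A": [(0, 0), (0, 1)], "B": [(5, 0), (5, 1)], "C": [(0, 9), (1, 9)]}
    bad = {"A": [(0, 0), (0, 6)], "B": [(5, 0)], "C": [(0, 9)]}
    print(is_separable(good), is_separable(bad))
```

A test generator would presumably redraw A, B, and C until a check of this kind passes, which is consistent with the rejection probability quoted above.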
4.5.2 Human Subject Memorability Test (HSM)
We, the authors, noticed that some of the generated sounds were somewhat memorable: we’d hear them one day and still recall them the next. As a rough test for the prevalence of this property among generated samples, in each of the teleconferenced tests, after showing the subject an example, the test administrator would play one of that subject’s A, B, or C sounds, chosen at random, three times. Directly after this, the test administrator turned off his camera and microphone and emailed the participant the test. The first question on the test’s response form asked which sound was played by the test administrator. The idea here was to test whether the sound was memorable enough that the participant would be able to hold the sound in their head for the 30 seconds or so that it took to receive the email, unzip the test, and listen to “A.wav”, “B.wav”, and “C.wav”. The results are shown in Tab. 2.
We train all models for 30k steps using a batch size of 64. This took about 11 hours per run on an NVIDIA 2080 Ti. Following Donahue et al. (2018), we set the gradient penalty weight to 10 and use Adam to optimize both our generator and discriminator, with the learning rate, $\beta_1$, and $\beta_2$ set to 0.0001, 0.5, and 0.9 respectively. The discriminator is updated 5 times for each time the generator is updated. We experiment with tuning the metric loss weight and the URI proportion (see Fig. 2 and Fig. 3 respectively), finding a URI proportion of 0.5 to be a good choice; our choices of metric loss weight are discussed in Sec. 4.7. We also experimented extensively with MFCC parameters and other audio metrics; however, we do not provide any related results here, as these choices were decided primarily based on personal evaluations of the audio fidelity. The MFCC parameters and target metric choices greatly affected the quality of the produced audio. That said, we found the parameters and implementation specifics proposed in this work to consistently lead to good results when using the above-described LFW/FaceNet source domain and TIMIT target domain.
We use LFW’s official train-test split which splits the dataset by person, i.e. no person is featured in both the train and test set. We further split the 4038 person, 9525 sample training set, reserving 415 celebrity identities (1174 samples) for validation.
4.7 Experimental Results
Tables 1 and 2 show our computational and human subject metric results respectively. The two (and MFCC) variations of Earballs are competitive. Results from WaveGAN (without geometric preservation) are included as a control. In Fig. 2 we show how three metrics of geometric preservation change as the metric loss weight, , varies. The PC, MAE, and NCA improve as increases from (the control case) to until they plateau around on our MFCC-based model or on our based model. This explains our choices of setting and for our two Earballs variations.
Table 2 columns: Method, sample size, HSA, HSA range, HSM.
4.8 Audio Quality
The generated audio samples might be described as noisy human speech. There was a trade-off between metric preservation and audio fidelity, which can be seen in Fig. 2. In all models, audio fidelity tended to degrade from the clean sound of a single speaker, to a noisy chorus, to simply noise as the geometric preservation constraints were increased. As the metric loss weight, , increased past about the audio quality began to decrease. By , the generated audio sounded entirely like noise, not at all like human speech, and translations were difficult to tell apart by ear. We attempted to use an additional loss, , to control the ratio, , of the mean pairwise distance of generated samples and the mean pairwise distance in our target dataset, but this led to other undesirable audio qualities. The parameters recommended above resulted in acceptable audio fidelity without compromising the metric preservation. This was especially true for the MFCC variant, which, by the authors’ personal judgement, allowed for increased audio fidelity compared to the other variant at the metric preservation plateau point. We used the URI in our proposed model because models trained with it tended to have greater audio fidelity. Given the results in Fig. 3, it may have been possible to achieve similar results by tuning the number of times the discriminator is updated per generator update.
Both variants of Earballs learned to preserve the desired information (as seen by the computational metric results) and much of it was preserved in a perceptually usable fashion (as seen by the human subject metric results), if not all. We converted faces to sounds as a simplest case example of transmodal translation. The results are promising, raising hopes that we might be successful using the same technique to communicate other high level visual features through sound. It may also be possible to communicate lower dimensional information such as the geometry of a user’s surroundings (as viewed through a depth sensitive camera) or feedback from 2D surfaces (e.g. prosthetics). We picture a practical transmodal translation system to be a tool that a user (at least for the near future) would need to spend significant time to learn to use. That said, the reward for learning to use such a system could be great, and the system itself could be quite inexpensive, possibly for some applications, even implementable using the sensors and compute power available on a modern smartphone.
Many humans have a need or desire for a robust form of sensory feedback. The methodology proposed here could, with further refinement, provide an effective avenue for providing that sensory feedback through a user’s ears. This is applicable to those with a visual impairment, those in need of feedback from medical or external devices (e.g. prosthetics or robotic limbs), and, more generally, anyone who wishes to augment neurological or biological systems and requires a robust form of feedback.
As new technologies for human enhancement grow in popularity, it’s possible that those without access to (or desire/ability to use) these technologies will be disadvantaged by their existence. This, along with negatives related to potential military applications, are perhaps the most likely negative consequences of advancing this technology.
System failures in a system like Earballs could lead to unexpected temporary sensory blindness. Extensive user testing would be necessary for users to safely rely on this technology for operating heavy machinery or navigating in dangerous environments. It’s also possible that this technology would be susceptible to adversarial attacks Goodfellow et al. (2014a) that biological neural systems are not.
Cultural biases in training datasets can influence the sensory responses this system generates. For example, training Earballs on a biased dataset of faces could lead to a system which performs unsatisfactorily when recognizing individuals in underrepresented groups.
- OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: §3.4, §4.1.
- NetVLAD: CNN architecture for weakly supervised place recognition. CoRR abs/1511.07247. Cited by: §1.
- One-sided unsupervised domain mapping. In Advances in Neural Information Processing Systems, pp. 752–762. Cited by: §2.2, §3.4.
- Adversarial audio synthesis. arXiv preprint arXiv:1802.04208. Cited by: §1, §3.1, §3.5, §4.3, §4.6.
- Cascade attention guided residue learning GAN for cross-modal translation. CoRR abs/1907.01826. Cited by: §1, §2.1.
- Neural audio synthesis of musical notes with WaveNet autoencoders. Cited by: §3.
- Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.
- Comparative evaluation of various MFCC implementations on the speaker verification task. In Proc. of the SPECOM-2005, pp. 191–194. Cited by: §3.2.
- DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N 93, pp. 27403. Cited by: §4.3.
- Explaining and harnessing adversarial examples. Cited by: Broader Impact.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1, §3.
- Improved training of Wasserstein GANs. CoRR abs/1704.00028. Cited by: §3.1.
- Learning to align from scratch. In NIPS. Cited by: §4.1.
- Learning to discover cross-domain relations with generative adversarial networks. CoRR abs/1703.05192. Cited by: §2.2.
- An experimental system for auditory image representations. IEEE Transactions on Biomedical Engineering 39, pp. 112–121. Cited by: §2.1.
- Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §3.
- Reconstruction for feature disentanglement in pose-invariant face recognition. CoRR abs/1702.03041. Cited by: §1.
- Cross-modal bidirectional translation via reinforcement learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 2630–2636. Cited by: §1, §2.1.
- A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America 8 (3), pp. 185–190. Cited by: §3.2.
- FaceNet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §1, §3, §4.1.
- A device for human ultrasonic echolocation. IEEE Transactions on Bio-medical Engineering 62 (6), pp. 1526–1534. Cited by: §2.1.
- Auditory sensory substitution is intuitive and automatic with texture stimuli. Scientific Reports 5, pp. 15628. Cited by: §2.1.
- Perceptual distance in timbre space. In International Conference on Auditory Display (ICAD 2005), pp. 61–68. Cited by: §3.2.
- Dist-GAN: an improved GAN using distance constraints. arXiv:1803.08887 [cs]. Published as a conference paper at ECCV 2018. Cited by: §2.2.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV). Cited by: §2.2.