Trajectory annotation using sequences of spatial perception

04/11/2020 ∙ by Sebastian Feld, et al. ∙ Universität München

In the near future, more and more machines will perform tasks in the vicinity of human spaces or support humans directly in their spatially bound activities. To simplify verbal communication and interaction between robotic units and/or humans, systems are needed that are reliable and robust w.r.t. noise and processing results. This work builds a foundation to address this task. By using a continuous representation of spatial perception in interiors learned from trajectory data, our approach clusters movement depending on its spatial context. We propose an unsupervised learning approach based on a neural autoencoder that learns semantically meaningful continuous encodings of spatio-temporal trajectory data. This learned encoding can be used to form prototypical representations. We present promising results that clear the path for future applications.




1. Introduction

Mobile robots enter our daily lives, be it in a private or business context, to raise either productivity or comfort. Probably the most popular use case for such autonomously acting hardware and software systems is way-finding support in complex public environments like airports (Ruppel et al., 2009), fairs, or hospitals (Cosma et al., 2004). Even daily-routine support in private environments like high-income households or homes for the elderly (i.e., Active Assisted Living (AAL) (Rashidi and Mihailidis, 2013)) may benefit from intelligent and autonomous mobile robots. Another important use case is Simultaneous Localization and Mapping (SLAM), in particular in unknown or hazardous environments (Nagatani et al., 2013).

Originating from cognitive psychology, spatial perception describes attempts to understand the environment that human beings, or entities in general, are surrounded by (De Smith et al., 2007). Space syntax (Hillier and Hanson, 1989) comprises corresponding techniques to measure and analyze such local environments, with isovist analysis (Tandy, 1967; Benedikt, 1979) as a popular implementation focusing on all points visible from a given point of view. Besides analyzing a single point in space, one can also measure the spatial perception along a trajectory. The idea is that while continuously moving through space, one may also measure continuous changes in the corresponding spatial perception. There are several attempts utilizing space syntax techniques that define and analyze (psychology, e.g. (Wiener and Franz, 2004)) or recognize and learn (computer science, e.g. (Sedlmeier and Feld, 2018)) recurrent although fuzzy structures inside buildings like rooms, halls, or combinations of them. Thus, transforming visual sensor input into some kind of spatio-temporal awareness may help create human-machine interfaces and wayfinding systems with special requirements, e.g. the identification and presentation of routes for visually impaired persons that avoid identified hazardous situations (Li et al., 2004; Hub et al., 2004). Even SLAM units may benefit from spatio-temporal awareness when enabled to communicate their situational understanding or identified patterns in spatial perception along their path.

The machine learning community has made huge advances in creating techniques that may be suitable in the contexts mentioned above. There are numerous results in information extraction and pattern recognition on large-scale and unknown data sets. Regarding the visual impression of space, Convolutional Neural Networks (CNNs) are able to detect patterns in images, i.e. visual imagery. Concerning the aspect of time, Recurrent Neural Networks (e.g. LSTM or GRU) or temporal convolutions are able to detect temporal dependencies in given data. Finally, the latent representation learned by some generative models, such as Variational Auto-Encoders (VAE), can be used to create low-dimensional informative representations of factors of variation in the data distribution of interest, such as trajectory data.

We hypothesize that it is possible to cluster movement through buildings based on spatial perception. Thus, the question to be answered is how to implement, train, and apply an artificial neural network in an unsupervised way using isovist sequences along trajectories through spatial structures in 2D worlds as input. More precisely, we suppose that a neural network is able to learn recurring spatio-temporal patterns in sequences of bitmaps of isovists and to cluster them for further usage (e.g., annotation). As mentioned above, such a framework may evaluate and interpret environmental information and communicate it in a human-like way. The identification of patterns in an entity’s spatial perception along a path may support the development of human-machine-interfaces by, for example, the annotation of trajectories based on the identified spatial perception.

After discussing related work regarding the semantic annotation of floor plans in Section 2, we describe our methods and background in Section 3. We then propose the concept of our system that is able to analyze movement through space in Section 4. Basically, it consists of (a) data synthesis, i.e. the creation of isovists along trajectories through floor plans, (b) the neural network architecture including CNN, GRU, and VAE, and finally (c) some visualization aspects. Section 5 provides an in-depth analysis of our system including the evaluation and discussion of several results. We conclude our paper in Section 6 and briefly give hints on future work.

2. Related Work

This section contains related work regarding the semantic annotation of floor plans. Basically, floor plans constitute a subset of map representations of spatial environments, an important concept inside the communities of GIS and LBS. There are many different types of map representations, ranging from geometrical or topological to logical maps.

There is a huge corpus of existing literature regarding the semantic annotation of floor plans in the context of SLAM (Leonard and Durrant-Whyte, 1991). There are techniques that detect rooms and doors in order to create topological maps using virtual sensors, 2D laser scans, or camera images (Buschka and Saffiotti, 2002; Anguelov et al., 2004; Chen et al., 2014). Further work semantically annotates maps using supervised learning techniques, creating labels like room, corridor, hallway, doorway, or free-space (Mozos and Burgard, 2006; Mozos, 2010; Goerke and Braun, 2009). We delimit our work regarding three facts: (1) we focus on the analysis of floor plans without the integration of further sensors, (2) we focus on the impression of movement through space and not on the analysis of space itself, and (3) we follow an unsupervised and fuzzy approach.

Besides, there is related work that incorporates isovists for the analysis of architectural space, just as the paper at hand. (Bhatia et al., 2012) estimate salient regions, i.e. regions with strong visual characteristics, in architectural and urban environments using 3D isovists. (Feld et al., 2016) approximate isovist measures along trajectories in order to identify traversed doors, for example. Finally, (Feld et al., 2017) calculate and cluster isovist measures on 2D floor plans, showing that the identified clusters correspond to regions such as streets, rooms, or hallways. These approaches, however, do not take the impression of movement into account.

The most significant related work may be (Sedlmeier and Feld, 2018), where a framework for creating 2D isovist measures along trajectories traversing a 3D simulation environment is presented. The authors show that these isovist measures reflect the recurring structures found in buildings and that the recurring patterns are encoded in a way that unsupervised machine learning is able to identify meaningful structures like rooms, hallways, and doorways. The labeled data sets are further used for neural network based supervised learning. The models generated this way do generalize and are able to identify structures in different environments. Again, our paper differs in that we focus on the analysis of spatial perception during movement.

3. Methods & Background

The main goal of this paper is the clustering of interior movement using sequences of spatial perception. Thus, this section will describe methods and background used in this concept. We describe isovists as a representation of spatial perception in Section 3.1, followed by several machine learning techniques that focus on spatial and temporal pattern recognition as well as on unsupervised clustering (Section 3.2).

3.1. Environments and Isovists

In GI-science, the definition of an environment ranges from landscape-sized objects to space as a social construct and further. In general, it can be seen as an immovable object with a surrounding character, while its surfaces give structure to an observer’s topologically perceived immersion. We consider an environment as a three-dimensional finite structure that consists of a walkable floor, any kind of obstacles forming a boundary to the structure, and a ceiling as a closure to this construct. From an observer’s perspective at any given point in this environment, the perceived visual space would be the area which can be described by all directly visible surfaces.

Isovist analysis is a method first introduced by Tandy (Tandy, 1967) in 1967, which was later extended by Benedikt (Benedikt, 1979) in 1979. It transforms the perception of space into a measurable representation (cf. Figure 3). Isovist analysis is focused on retrieving and analyzing quantitative environmental information (structures and arrangements), rather than qualitative object attributes (texture, color, movability, function). Benedikt (Benedikt, 1979) describes the isovist as “location-specific patterns of visibility” (Benedikt, 1979, p. 7). Thus, each isovist describes the spatial perception at a specific position, whereas a chronology of isovists describes the spatial perception during a movement along multiple points in space. Since isovists make spatial descriptions measurable and comparable, a sequence of isovists makes a series of spatial descriptions during motion through space comparable. In literature, the concept of tracking motion using isovist analysis is referred to as isovist fields (Batty, 2001).

(a) Continuous Isovist
(b) Discrete Isovist
Figure 3. Batty (Batty, 2001) demonstrates the difference between a continuous vector-based isovist model and a discrete grid-based one.

Besides isovist analysis, there exist other methods to measure and process visible space. In (Llobera, 2003), Llobera gives an overview of available methods in visual analysis. In geographical and archaeological contexts, the concept of viewshed captures visual space at the scale of natural landscapes using Digital Elevation Models (DEM). Furthermore, the concept of Visual Graphs arises from the definition of “isovists as a subgraph of a visibility graph”. It allows measurable properties such as distance, area, perimeter, compactness, and cluster ratio to be calculated and mapped back into space (Llobera, 2003; Turner et al., 2001). Llobera defines visualscapes as an extension to Benedikt’s initial ideas (Llobera, 2003).

3.2. Machine Learning & Neural Networks

Our system consists of machine learning techniques that are able to describe visited training samples and build a classifier.

3.2.1. Spatial Patterns & Convolutional Neural Networks

A CNN tries to learn and reveal spatial patterns by applying several filters and local pooling to an input. It basically consists of three layer types that are combined (O’Shea and Nash, 2015). First, the convolutional layer can be described as a locally connected weight multiplier. Small areas of its input are multiplied with an internally stored weight matrix on multiple filter levels. Each of those filters becomes specialized on a certain characteristic, like a discrete color or a spatial pattern. A feature is found through a high-magnitude outcome of the filter’s weight multiplication operation in relation to neighboring pixels (Lacey et al., 2016). In detail, the convolution operation itself is achieved by calculating the scalar product within a kernel window (e.g., 3x3), which is moved over the rows and columns of the input matrix.
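The sliding scalar product described above can be illustrated with a few lines of NumPy. This is a minimal sketch for a single filter without padding; the function name `conv2d_valid` and the edge-detecting kernel are assumptions made for illustration, and, as in most CNN libraries, the kernel is not flipped (strictly speaking a cross-correlation).

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` and take the scalar product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            window = image[top:top + kh, left:left + kw]
            out[r, c] = np.sum(window * kernel)  # scalar product inside the window
    return out

# Binary toy image: two columns of wall (0) followed by three columns of floor (1).
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hypothetical 3x3 filter that responds to a wall-to-floor transition from left to right.
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)
response = conv2d_valid(image, kernel)
```

The response is high where the window covers the wall/floor boundary and zero where the window only sees floor, which is the sense in which a filter "finds" a feature.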

Additionally, pooling layers act as dimensionality reducers and merge semantically similar features into one feature using an arithmetic function. The most common kind of pooling is max pooling, which works by splitting the input into (usually non-overlapping) patches and outputting the maximum value of each patch (LeCun et al., 2015). Besides providing invariance, pooling operations also reduce a model’s computational complexity (O’Shea and Nash, 2015; Dumoulin and Visin, 2016).
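Max pooling over non-overlapping patches can be sketched as follows. The helper `max_pool2d` is a hypothetical name, and the sketch assumes a single two-dimensional feature map whose trailing rows and columns are dropped if they do not fill a complete patch.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Split the input into non-overlapping size x size patches and keep each maximum."""
    h, w = feature_map.shape
    h_trim, w_trim = h - h % size, w - w % size
    trimmed = feature_map[:h_trim, :w_trim]
    # Axes after reshaping: (patch row, row inside patch, patch column, column inside patch)
    patches = trimmed.reshape(h_trim // size, size, w_trim // size, size)
    return patches.max(axis=(1, 3))

feature_map = np.array([[1, 2, 0, 0],
                        [3, 4, 0, 1],
                        [0, 0, 5, 6],
                        [1, 0, 7, 8]], dtype=float)
pooled = max_pool2d(feature_map)  # 4x4 input is reduced to 2x2
```

Each output value summarizes a 2x2 neighborhood, so small shifts of a feature inside a patch do not change the result.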

Lastly, a fully-connected layer usually serves as the final stage of a CNN. By connecting every neuron of the previous layer with each of the current layer’s neurons, it attempts to produce class scores which can be used for classification (usually by applying the softmax function). Its way of operation is determined by the chosen activation function and its structural position within a neural network (O’Shea and Nash, 2015).

Since isovists can be processed in form of spatially correlated binary images, CNNs are an adequate tool for finding recurrent structures and natural patterns.

3.2.2. Temporal Patterns & Gated Recurrent Units

We assume a temporal relation within a sequence of isovist images. Sequential data input can be processed using Recurrent Neural Networks (RNN). While one element of a sequence is processed at a time, a hidden vector carries the history of all past elements of a sequence so that the output at a time step is the result of each previously evaluated input combined with the current input. Thus, RNNs are comparable with a single layer that is reused multiple times in one iteration while tracking all subsequent computations (LeCun et al., 2015).

While regular RNNs are known for their problems in processing long time dependencies (the learning gradient is often known to either explode or vanish (LeCun et al., 2015; Chung et al., 2014)), Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) unit, which performs better on distant temporal relations (Hochreiter and Schmidhuber, 1997). The main difference between regular RNNs and LSTM units lies in the unit’s connection to itself at the next time step through a memory cell and the introduction of a forget gate (LeCun et al., 2015). Recently, Cho et al. proposed Gated Recurrent Units (GRU) as a modified LSTM unit that delivers comparable results with a lower number of weights/parameters. Those units include a reset gate and an update gate that control how much each hidden unit remembers or forgets while reading a sequence (Cho et al., 2014).
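To make the role of the two gates concrete, a single GRU time step can be written out in NumPy. This is a minimal sketch following the common formulation in which the update gate z blends the previous hidden state with a candidate state and the reset gate r scales the previous state inside the candidate; the parameter names and the random initialization are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru_params(input_dim, hidden_dim, rng):
    """Small random weights, for illustration only (hypothetical initialization)."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        params["U_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def gru_step(x, h_prev, p):
    """One GRU time step for input vector x and previous hidden state h_prev."""
    z = sigmoid(p["W_z"] @ x + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ x + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    h_cand = np.tanh(p["W_h"] @ x + p["U_h"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand  # z close to 1 keeps the old state

rng = np.random.default_rng(0)
params = init_gru_params(input_dim=8, hidden_dim=4, rng=rng)
sequence = rng.normal(size=(5, 8))  # five time steps of 8-dimensional features
h = np.zeros(4)
for x_t in sequence:
    h = gru_step(x_t, h, params)  # the hidden state carries the sequence history
```

After the loop, `h` summarizes the whole sequence in a single vector, which is the property an encoder for isovist sequences relies on.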

3.2.3. Clustering & Variational Auto-Encoder

Since human perception is highly subjective we need an unsupervised clustering approach that proves the assumption of a temporal and spatial relation within the training data in general.

Figure 4. Overview of the NN architecture; blue: processing a sequential input through several layers to form an internal representation; green: network prediction; orange: building a sequential output as training target.

The Auto-Encoder (AE) can be seen as a type of high-dimensional clustering network capable of learning prototypical representations. By comparing the previous network input to the generated output, an estimated error is calculated and afterwards back-propagated as the training signal (cf. Figure 4) (Kamyshanska and Memisevic, 2015). Its general concept consists of two stages: First, the probabilistic encoder model reduces an input’s dimensionality and learns a representation. This representation vector is established in a small fully connected bottleneck layer (Figure 4, blue). Second, the probabilistic decoder or generator model reconstructs the original input based on the given representation (Figure 4, orange). During training, a loss function evaluates the decoder’s output by comparing it to the encoder’s input. Since AEs operate without the need for labeled data, the entire process is considered to be unsupervised.

Typically, the AE’s layers consist of fully connected neurons; however, there are cases in which AEs are combined with other ANN structures like CNNs (Section 3.2.1) or recurrent layers (Section 3.2.2). An AE can be simply described as a deep discriminative ANN whose output targets are the data input itself rather than class labels (Deng et al., 2014; LeCun et al., 2015). In 2013, Kingma and Welling (Kingma and Welling, 2013) introduced a learning enhancement called the Auto-Encoding Variational Bayes algorithm, which allows for a better approximation of “posterior inference using simple ancestral sampling” (Kingma and Welling, 2013, p. 1). They proposed the Variational Auto-Encoder (VAE), whose main benefit lies in the structure that can be discovered within the bottleneck layer. The resulting probability distribution of sample representations in latent space can be approximated to any desired choice. For instance, a Gaussian distribution for real-valued data or a Bernoulli distribution for binary data input is applicable.

As Auto-Encoders have shown to provide state-of-the-art performance in a variety of tasks like object recognition or learning invariance in representations (Kamyshanska and Memisevic, 2015), we employ a VAE as the overall neural network structure capable of handling sequences of isovists.

4. Concept

Our system’s main goal is the clustering of movement based on the visual impression gained while following a path that runs through an interior space (e.g., a building). For this purpose, we have presented our environment (Section 3.1), which is modeled as a discrete occupancy grid together with trajectories traversing it. For each visited position on a discrete trajectory, we compute an isovist and use sequences of these isovists, called isovist-sequences, as our data basis. Such sequences can therefore be used to analyze changes in spatial perception.

In Section 3.2 we have explained three key methods of machine learning used in this paper. We now propose the combination of Convolutional Neural Network (CNN) and Gated Recurrent Unit (GRU) elements embedded in an Auto-Encoder (AE) structure, which allows clustering temporal correlations within sequences of two-dimensional bitmaps as representations of spatial structures. Subsequently, we present the clustering outcome as a visualization of annotated trajectories in Section 5.

4.1. Data Synthesis

In this work, machine perception is considered as an entity’s capability to recognize spatio-temporal patterns based on a sequence of spatial sensory input in form of isovists, a technique that makes the description and perception of architectural space quantifiable and, thus, appropriate for utilization (Benedikt, 1979; Tandy, 1967). Based on the review of existing techniques in the field of visual space, we decided to use isovists as our measure of perception of space (Section 3.1). In literature and application, the simplest and most often used form of isovists is a 2D top-down floor plan (Llobera, 2003; Batty, 2001). A problem reduction to 2D floor plans as source of spatial structures encoded in visible floor and non-visible floor is legitimate, since a robotic unit only needs information about the floor on which it can safely move. Those binary images are composed of black pixels (non-visible floor, wall elements, or obstacles) and white pixels (visible and walkable floor).

Thus, our system’s data preprocessing and input creation consists of three parts: (1) reading floor plans from binary images to build a routable graph, (2) generating paths to simulate motion/trajectories, and (3) computing isovists for each visited pixel position of a trajectory using a shadow casting algorithm (Bergström, 2017, 2015). Just like Benedikt’s explanation of isovists, the shadow casting algorithm spreads from a source pixel. This procedure generates a sequence of binary isovists that are consecutively rotated in walking direction.
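Step (3) can be illustrated with a small grid-based example. Note that the sketch below uses plain ray casting rather than the cited shadow casting algorithm, and it omits the rotation in walking direction; the function name `isovist`, the number of rays, and the half-pixel sampling distance are assumptions chosen for readability, not the parameters used in this work.

```python
import math
import numpy as np

def isovist(floor, origin, radius, n_rays=360):
    """Binary isovist on an occupancy grid (True = walkable floor).

    Rays are sampled every half pixel so that no wall cell can be skipped.
    """
    visible = np.zeros_like(floor, dtype=bool)
    r0, c0 = origin
    if not floor[r0, c0]:
        return visible  # the observer stands inside a wall
    visible[r0, c0] = True
    for k in range(n_rays):
        angle = 2.0 * math.pi * k / n_rays
        dr, dc = math.sin(angle), math.cos(angle)
        for step in range(1, 2 * radius + 1):
            dist = 0.5 * step
            r = int(round(r0 + dr * dist))
            c = int(round(c0 + dc * dist))
            inside = 0 <= r < floor.shape[0] and 0 <= c < floor.shape[1]
            if not inside or not floor[r, c]:
                break  # the ray left the map or hit a wall
            visible[r, c] = True
    return visible

# Toy floor plan: a wall column separates the left part from the right part.
floor = np.ones((5, 7), dtype=bool)
floor[:, 3] = False
visible = isovist(floor, origin=(2, 1), radius=6)
```

The result is a binary image of the same size as the floor plan in which only floor cells with a free line of sight to the observer are set, so everything behind the wall column stays dark.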

Figure 5. Schematic isovist-sequence setup.

Depending on the use-case, instead of simulating the visual perception as described above, it would also be possible to generate the isovists from real world sensory input, e.g. collecting measurements via laser scanning. As there are various ways of collecting or generating the required isovists, the work at hand does not focus on the specifics (performance etc.) of the data synthesis or collection procedure.

Figure 5 shows a schematic overview of our isovist-sequence setup.

4.2. Neural Network Architecture

After introducing our understanding of environment and isovist-sequences as a tool to capture changes in spatial perception, we now propose an Artificial Neural Network (ANN) architecture that clusters a sequence of temporal-related binary bitmaps of two-dimensional spatial representations.

We utilize several machine learning techniques to automatically discover a function by employing CNNs (Section 3.2.1) and GRU (Section 3.2.2). Both are combined within an unsupervised trained VAE structure (Section 3.2.3).

The first step of our architecture is the extraction of visual patterns. We chose to use a CNN structure as described in Section 3.2.1, employing two convolution/pooling bundles with 10 filters each at a stride of 1, followed by a nonlinear activation function. Since we assume a temporal relation within a sequence of isovist images, we use a single GRU layer equipped with 250 cells to process the sequential data input. After detecting spatial and temporal patterns, the next logical step in our architecture’s design aims at detecting classes or groups within the isovist sequences. As stated above, due to the high performance in a variety of clustering tasks, we employ a VAE as the overall neural network structure. It has been implemented as described by Kingma et al. using a variational constraint. Using a subsequent, similarly built generative model (first GRU, then convolution/pooling bundles), we establish a reconstruction of the raw input data as a valid training target. A cross entropy loss with applied Kullback-Leibler (KL) divergence to a Gaussian distribution is utilized to produce an error signal that can be back-propagated as the training signal to perform the network’s internal weight adjustment (Kingma and Welling, 2013).
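The loss described above can be sketched in NumPy. This is a minimal sketch, assuming a standard normal prior, a diagonal Gaussian posterior parameterized by `mu` and `log_var`, and an unweighted sum of both terms; all function names are hypothetical and the actual Keras implementation may differ in these details.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw a latent sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def binary_cross_entropy(target, prediction, eps=1e-7):
    """Reconstruction term for binary pixels, summed over all pixels."""
    p = np.clip(prediction, eps, 1.0 - eps)
    return -np.sum(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, exp(log_var)) and N(0, I), in closed form."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))

def vae_loss(target, prediction, mu, log_var):
    # Unweighted sum of both terms; any relative weighting is an assumption.
    return binary_cross_entropy(target, prediction) + kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(1), np.zeros(1)    # one-dimensional latent space
z = reparameterize(mu, log_var, rng)      # sample that would be fed to the decoder
target = np.array([1.0, 0.0, 1.0, 1.0])   # binary isovist pixels
prediction = np.array([0.9, 0.2, 0.8, 0.6])
loss = vae_loss(target, prediction, mu, log_var)
```

The KL term vanishes when the posterior equals the prior, so in this toy call the loss consists of the reconstruction term only.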

After training, the VAE structure no longer needs the decoder model. Thus, the ANN’s bottleneck can be exposed as the clustering result (Figure 4, green).

In summary, the ANN architecture has been implemented using Keras with a TensorFlow backend in Python (Abadi et al., 2015; Python Software Foundation, 2017; Chollet and others, 2017).

4.3. Visualization

Figure 6. Visualization work-flow: comparing encoder input with the associated decoder output (blue), latent activation of encoder inputs (green), and isovist-sequence reconstruction (orange).

In the previous sections we have introduced the overall structure of our network which can be trained in an unsupervised fashion. This section now introduces the visualization process that exploits the VAE’s internal representation to annotate trajectories traversing a two-dimensional environment. Having an AE structure, we provide three possibilities of visualizing the network’s outputs as pictured in Figure 6.

First, the encoder input (an isovist sequence) is viewed next to the related decoder output. This way, it is possible to validate the network’s reconstruction capability visually to get insights into its way of function (Figure 6, blue). Second, by sampling from the compressed latent space in either random fashion or by using a steady pattern along a regular grid of the size of the latent dimension, associated sequences are revealed. In other words, the overall latent structure can be observed by feeding a synthetic latent vector to the generative model (orange). Third, isovist sequences along a trajectory are used for annotation by collecting latent vector activations. Those are afterwards drawn on a related floor plan to reveal patterns of spatial and temporal similarity (green).

5. Results & Evaluation

We see our approach as a high-dimensional clustering concept using sequences of two-dimensional isovists in a general way to analyze changes in spatial perception. As such, it may support the implementation of sophisticated robotic navigation and assistance systems.

To verify our assumptions, we have implemented the proposed neural network architecture (Section 4) as a system that processes sequences of isovists. These isovists are in the form of a three-dimensional binary grid holding spatio-temporal relations. The network then produces fuzzy labels that can be used for trajectory annotation or other purposes.

We have evaluated the results of training the network on a total of six different floor plans (Figure 13) with varying shapes of rooms and hallways. This setup has been chosen to allow the network to learn generalizations across different floor plans and possibly generalize to unknown floor plans. Therefore, attention was given to including as many variations as possible, for example of room sizes, orientations, and round and square structures. All floor plans were scaled to an equal average door width of four pixels. Besides, we built a pixel-wise routable graph and applied Dijkstra’s algorithm (Dijkstra, 1959) to generate several thousand artificial paths through the environments. Our network has been trained on equally sized isovist sequences of fixed length whose isovists were separated by a fixed step of pixels. The path segment covered by such a sequence therefore follows from the sequence length and the step.
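The path generation and the cutting of paths into equally sized sequences can be sketched as follows. This is a minimal sketch, assuming a 4-neighborhood with unit edge costs on the pixel graph; the helper names and the example values for `length` and `step` are hypothetical and are not the parameters used in the experiments.

```python
import heapq
import numpy as np

def dijkstra_path(floor, start, goal):
    """Shortest path between two walkable pixels (4-neighborhood, unit costs)."""
    dist, prev = {start: 0}, {}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist[node]:
            continue  # outdated heap entry
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (node[0] + dr, node[1] + dc)
            inside = 0 <= nxt[0] < floor.shape[0] and 0 <= nxt[1] < floor.shape[1]
            if not inside or not floor[nxt]:
                continue
            if d + 1 < dist.get(nxt, float("inf")):
                dist[nxt] = d + 1
                prev[nxt] = node
                heapq.heappush(heap, (d + 1, nxt))
    if goal not in dist:
        return []  # goal is not reachable
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

def sequence_positions(path, length, step):
    """All windows of `length` positions that are `step` pixels apart on the path."""
    span = (length - 1) * step  # path segment covered by one sequence
    return [path[i:i + span + 1:step] for i in range(len(path) - span)]

# Toy floor plan: a wall stub forces the path to detour through the bottom row.
floor = np.ones((3, 5), dtype=bool)
floor[0:2, 2] = False
path = dijkstra_path(floor, start=(0, 0), goal=(0, 4))
windows = sequence_positions(path, length=3, step=2)
```

Computing one isovist per position in each window then yields the equally sized isovist sequences that serve as network input.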

There is no fixed number of necessary training repetitions, so-called epochs, that can be directly taken from literature. The correct amount depends on the number of samples, the variation within the data, and the problem’s complexity. We therefore determined a suitable number of epochs and of visited training samples experimentally.

Figure 13. Environments used in training.
Figure 14. A custom colorbar represents the network’s certainty on isovist-sequence reconstructions.

Since this work deals with unsupervised clustering of perception without any labeled data, there is no recognized metric that can be applied “out of the box”. We therefore performed a visual evaluation of results using the proposed methods of Section 4.3. The desired outcome would be the analogous colorization of similar movement in a related spatial situation. For that, we have placed the network predictions at the center pixel of an isovist-sequence from a longer artificial trajectory on an underlying floor plan (Figure 6, green). Additionally, VAE decoder predictions of evenly spaced latent samples are evaluated (Figure 6, orange). Those artificial isovist-sequences can be used to understand a VAE’s internal latent bottleneck structure.

For completeness, when choosing a higher latent space dimensionality, the output for tested samples could be subsequently clustered. Well-established techniques like k-means clustering (Ball and Hall, 1965) or the DBSCAN algorithm (Ester et al., 1996) are only two of many possible post-processing options. For visual insights, a dimensionality reduction using Principal Component Analysis (PCA) (Jackson, 2005) or t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) could be applied to a set of latent space activations.
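As an illustration of such a post-processing step, a PCA projection of latent activations can be computed with a singular value decomposition. This is a minimal sketch on random placeholder data; the function name `pca_project` and the eight-dimensional latent space are assumptions, since the experiments reported here use a one-dimensional bottleneck.

```python
import numpy as np

def pca_project(latents, n_components=2):
    """Project latent vectors (one per row) onto their first principal components."""
    centered = latents - latents.mean(axis=0)
    # Rows of `components` are the principal axes, ordered by explained variance.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
latents = rng.normal(size=(100, 8))  # placeholder for collected encoder activations
projected, explained = pca_project(latents)
```

The two-dimensional `projected` points could then be plotted or handed to a clustering algorithm such as k-means or DBSCAN.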

The following results were generated by predicting on a new environment which was not part of the training data. Since it is hard to annotate multiple overlapping trajectories within a two-dimensional figure, several manually created paths are presented. Not only had the network to generalize from its learned spatial structure to an unknown environment, but also the process that formed the underlying trajectories was of foreign nature as well.

5.1. Perceptive Trajectory Annotation

In this section, we will demonstrate our system’s annotation capabilities and the clustering outcome, and verify the correct encoding of temporal patterns along sequences of spatial representation captured by isovists. For that, we first show annotated trajectories in a foreign environment next to an evenly sampled latent space that results in sequential decoder reconstructions.

The color coding of such reconstructed isovist-sequences is shown in Figure 14. It runs from black (0, wall) over green (0.5, total uncertainty) to white (1, floor). A black pixel, for example, indicates maximum certainty that this particular pixel was black when it entered the network, whereas a green pixel implies an even likelihood for both binary extremes. Such reconstructions can be visualized in order to gain insights into the learning process and to reveal the internal latent VAE structure. We picture them as horizontal bundles of mostly five isovists, ordered chronologically by time step t. Please mind the correct orientation along those visualizations. Sequence position indicators are always positioned on the back side of an imagined agent’s movement direction.

(a) A single cell VAE has been used to color hand drawn paths through the test environment. The training data parameters were set to .
(b) Top: Equally spaced samples of the VAE’s latent space used for the coloration of Figure 17(a). Every column represents an isovist sequence from bottom to top. The walking direction points to the right; Bottom: Color map used to color hand-drawn paths through the test environment.
Figure 17. Pixel-wise annotation of trajectories based on spatio-temporal context.

Figure 17(a) shows the two-dimensional trajectory annotation results based on high-dimensional spatio-temporal clustering through a VAE. For that, our model was used to predict consecutive isovist-sequences along hand-drawn trajectories which have not been visited in training. For better orientation during reading, floor plans have been divided into areas referred to as Sn, where S refers to the horizontal letter and n refers to the vertical digit.

In Figure 17(b), a one-dimensional VAE’s latent space has been sampled at a regular interval based on the predicted values in Figure 17(a). Both color bars are directly connected; thus, a color found in the base map (Figure 17(a)) is further explained by the sequence next to the same color in Figure 17(b). The custom trajectories on the test map were then evaluated sequentially by the VAE encoder. The resulting predictions have been normalized and transformed into RGBA value tuples.
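The normalization and color mapping can be sketched with the Python standard library. This is a minimal sketch, assuming min-max normalization and a hue-based color wheel; the color map actually used for the figures is a custom one, so the function `latent_to_rgba` and the hue range are purely illustrative.

```python
import colorsys
import numpy as np

def latent_to_rgba(values, alpha=1.0):
    """Map one-dimensional latent predictions to RGBA tuples along the hue wheel."""
    values = np.asarray(values, dtype=float)
    low, high = values.min(), values.max()
    if high > low:
        normalized = (values - low) / (high - low)  # min-max scaling to [0, 1]
    else:
        normalized = np.zeros_like(values)
    colors = []
    for v in normalized:
        # Use only 5/6 of the hue wheel so that both ends remain distinguishable.
        rgb = colorsys.hsv_to_rgb(float(v) * 5.0 / 6.0, 1.0, 1.0)
        colors.append(rgb + (alpha,))
    return colors

predictions = [-2.0, 0.0, 3.0]  # placeholder encoder outputs along a trajectory
colors = latent_to_rgba(predictions)
```

Each tuple can then be drawn at the pixel position associated with the corresponding isovist-sequence on the floor plan.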

We now examine the evenly spaced latent space samples in Figure 17(b) from left to right. Columns 1-5 describe movements along a right-handed wall (e.g., spiral at area D1 in Figure 17(a)). Besides, there is a disruption similar to an approaching wall visible at column 4. Then, uncertainties and thick wall segments are joining in from the right. The RGBA color shifts from a strong red via yellow to a greenish tone. In the following, green to turquoise color tones describe a spatial situation under a lot of changes, like right-handed curves (column 11) or movement towards a wall (12). At column 13, the representation of a movement through completely free space can be found (e.g., area E1 in Figure 17(a)). The following reconstructed sequences, colored in blue tones, cluster movements with left-handed curves. The very narrow environments (areas F4 & G4, blue color, column 19) can be seen just before the corridor widens (column 20). Diagonal wall segments (pixels neighboring each other at 45°) and right-hand side free-space are visible (spiral at areas A2/B2 in Figure 17(a)). The very last sequences of the sampled latent range represent straight corridors of various widths. The colors shift from blue to purple and finally to a pink tone.

In short, our model has been successfully applied to visualize the clustering of movement in varying spatial contexts. The additional latent sampling visualization (Figure (b)) helped to gain insight into the trajectories colored without supervision (Figure (a)). The main findings are the identification of movements through free space (turquoise), along right-hand (red) and left-hand (purple) wall alignment, as well as through narrow corridors (blue).

However, results that are sometimes not directly interpretable indicate the need for further optimization. With a higher-dimensional latent space, the overall generalization is assumed to be less strict, which should yield more meaningful results when colored accordingly or mapped to a semantic label.

5.2. Temporal Layer Validation

We now demonstrate our system’s spatio-temporal clustering capabilities by observing decoder outputs such as the hand-picked examples in Figure 18. Each row represents a reconstructed isovist sequence with a clearly visible movement. Not only the spatial structure at each time step (t) but also the shifted center caused by an imagined virtual agent’s movement over time has been successfully encoded. The second row of Figure 18, for example, can be read as a motion along a corridor approaching a crossing. This demonstration is essential, since it shows that the GRU units work correctly in combination with the convolution and pooling layers.

More importantly, it shows that temporal patterns are encoded, transferred through the small one-dimensional bottleneck layer, and subsequently restored. Furthermore, these images demonstrate successful learning and back-propagation behavior.

Figure 18. VAE samples showing temporal reconstruction along the time steps.

5.3. Influence of Isovist Sequence Lengths

In addition to the former results, we now present the influence of the sequence-length parameter by increasing it to . This also increased the length of our total observation to . Thus, computational costs are almost doubled by the additional CNN and GRU operations in the network structure. The results are pictured in Figure 21(a). Surprisingly, far fewer variations were differentiated in narrow situations (e.g., A3 in Figure 21(a)). On the other hand, areas including any kind of wide free space are colored in more detail, as can be observed in area C2, for example. Completely free space is clearly represented by a purple color. Red and turquoise colors, on the other hand, seem to describe a movement along either a left- or right-handed wall in the context of free space. The regular latent space sampling in Figure 21(b) supports this assumption, as more white pixels are visible than in the earlier latent sampling (Figure (b)).

The network generalized over smaller spatial features while, at the same time, differentiating much more between movements that involve some kind of free space on either or both sides of the imagined agent’s walking direction.

5.4. General Discussion

This section discusses the results presented above.

Our initial idea was to predict multiple isovist-sequences based on trajectories similar to those used in training. Thus, an additional trajectory set of approximately trajectories had been constructed. The creation process was exactly the same as for the training data, i.e., we collected shortest Dijkstra paths from random start to random target coordinates. The emerging problem lies in the unavoidable overlap of such random shortest paths, which results in visually mixed trajectories.
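The generation of random shortest-path trajectories can be illustrated with a small sketch. It assumes a floor plan given as a binary occupancy grid with 4-connected movement and unit step costs; the paper does not state these details, and the names `dijkstra_path` and `random_trajectory` are hypothetical.

```python
import heapq
import random


def dijkstra_path(grid, start, goal):
    """Shortest 4-connected path on a grid; grid[r][c] == 0 is free, 1 is wall."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale heap entry, a shorter route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1  # assumed unit cost per step
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None  # goal is not reachable from start
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def random_trajectory(grid, rng):
    """Pick two distinct random free cells and connect them by a shortest path."""
    free = [(r, c) for r, row in enumerate(grid)
            for c, v in enumerate(row) if v == 0]
    start, goal = rng.sample(free, 2)
    return dijkstra_path(grid, start, goal)


if __name__ == "__main__":
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    print(dijkstra_path(grid, (0, 0), (2, 0)))
```

In this toy grid every path between the two rows must pass through the single gap on the right, so repeated random draws overlap there, which mirrors the overlap problem described above.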

Colorization of floor plans such as shown in Figure 13 was first realized using the winner-takes-all principle. However, this was only applicable in simple, straight corridors, where mostly similar movement types occurred. Crossings or, more generally, spaces that allow various forms of movement suffered from this method. Additionally, it is not possible to plot our system’s prediction onto a single map pixel.
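A winner-takes-all colorization can be sketched as a per-pixel majority vote. The sketch assumes that every prediction has already been reduced to a discrete label and attached to a map pixel; the function name `winner_takes_all` and the labels are hypothetical.

```python
from collections import Counter, defaultdict


def winner_takes_all(observations):
    """Assign each pixel the label that was observed on it most often."""
    votes = defaultdict(Counter)
    for pixel, label in observations:
        votes[pixel][label] += 1
    return {pixel: counts.most_common(1)[0][0] for pixel, counts in votes.items()}


if __name__ == "__main__":
    # Pixel (5, 5) is a crossing visited by two movement types.
    observations = [
        ((5, 5), "straight"), ((5, 5), "straight"), ((5, 5), "left turn"),
        ((5, 6), "straight"),
    ]
    print(winner_takes_all(observations))
```

In the example the crossing pixel is colored as "straight" and the left turn disappears from the map, which is the loss of information described above for spaces that allow several forms of movement.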

A trajectory can basically be seen as a positional change under the influence of time . A subset of a trajectory () follows the same definition . Successively generated isovists for such a -subset of a specific length () result in our isovist-sequence. Thus, neural network predictions (vector of size n) represent spatio-temporal (three-dimensional) data inputs.

Placing such a one-dimensional representation on a single environmental pixel eliminates the context of the encoded movement entirely. Only by marking the center positions of all involved isovists can the movement context be revealed again. Another solution is the prediction of multiple consecutive isovist-sequences that are part of a single trajectory. Using odd sequence lengths while placing the prediction at the center positions where showed the best results.
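Cutting a trajectory into consecutive sequences and attaching each prediction to the center position can be sketched as follows. The sketch assumes a sliding window with a stride of one position, which the text does not specify, and the name `isovist_windows` is hypothetical. Only positions are handled here; the isovist computation itself is omitted.

```python
def isovist_windows(trajectory, length):
    """Yield (center_position, window) for each run of `length` consecutive positions."""
    if length % 2 == 0:
        raise ValueError("use an odd length so that the window has a single center")
    half = length // 2
    for i in range(len(trajectory) - length + 1):
        window = trajectory[i:i + length]
        yield window[half], window


if __name__ == "__main__":
    trajectory = [(0, x) for x in range(6)]  # hypothetical straight path
    for center, window in isovist_windows(trajectory, 3):
        print(center, window)
```

With an odd length each window has exactly one middle position, so every prediction maps to one pixel of the trajectory. The first and last `length // 2` positions never become a center and therefore stay uncolored.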

As stated before, overlapping trajectories cannot be visualized in this way, which is why it was not possible to use the same random trajectory generation method as in training. Instead, custom non-overlapping trajectories were generated by hand as a solution. The selection of these custom trajectories aimed to fulfill two separate goals: “squeezing” as many non-overlapping paths on a floor plan as possible while at the same time having a high variation of spatial configurations, movement situations, and distances to the walls. A drawback of this manual method is the possible introduction of human bias compared to the random Dijkstra’s shortest path based trajectory generation procedure used in training.

As a result, not only does the test environment possess totally new characteristics, the trajectories do as well. As long as there is no applicable metric for measuring and comparing the performance of spatial perception clustering, the chosen method delivers results that are readable and allow valuable insights. This course of action additionally tests the network’s generalization capabilities.

(a) At parameter settings of & , the network generalized over smaller features while, at the same time, differing much more along sequences that involve some kind of free-space on either or both sides.
(b) A Variational Auto-Encoder was trained on isovist-sequences with parameter setting of: & . The resulting latent space was then sampled 25 times on equally spaced positions between a fixed range. The legend for reconstructed isovist-sequences can be found in Figure 14.
Figure 21. Variation of sequence length.

6. Conclusion and Future Work

Our contribution is the design and application of a system that clusters movement along trajectories by spatial perception, using isovists in combination with machine learning techniques. The training data was built upon isovists as introduced by (Benedikt, 1979), which are a reliable and computationally cheap tool to represent and measure human spatial perception in interiors. Through several visualizations we have shown the potential that lies within such systems trained without supervision. Robotic or assistance systems may be supported in understanding movement through spatial structures in unfamiliar environments. Future autonomous mobile robots or SLAM systems can produce similar sensory range data, so our approach may be used not only to annotate trajectories or to help the visually impaired by giving them spatial context, but also to improve existing human-machine interfaces that rely on discrete situation labels such as spoken commands. In the future, our concept could be applied in a three-dimensional domain to measure and process a spatial representation that is closer to the real human world.

Our implementation of the variational auto-encoder was found to generalize along the visited training data. Rather than delivering meaningful human-readable classes, the network seemed to prefer frequently found samples. To match the network’s prediction with human spatial perception and semantic labels, a semi-supervised network seems to be the most promising approach for future application. For that, the VAE decoder could be extended by a softmax classifier and trained on some labeled data. Since this work’s focus was on showing the feasibility of using the respective ML techniques to learn movement in relation to spatial structures, we did not perform or include a fine-tuning of the setup. Future work could cover such an empirical performance study and general concept tuning, including, e.g., switching from RNNs to attention-based networks.
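To make the proposed extension more concrete, the following sketch shows a softmax classification head in plain Python. It is an assumption-laden illustration only: a single linear layer is placed on a one-dimensional latent value, the weights and class names are hypothetical, and no training is shown. How the head would actually be attached to the VAE is left open in the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(z, weights, biases):
    """Linear layer on a scalar latent value z, followed by softmax."""
    logits = [w * z + b for w, b in zip(weights, biases)]
    return softmax(logits)


if __name__ == "__main__":
    classes = ["corridor", "free space"]  # hypothetical semantic labels
    probs = classify(1.0, weights=[2.0, -2.0], biases=[0.0, 0.0])
    print(classes[probs.index(max(probs))])  # corridor
```

In a semi-supervised setup, the weights of such a head would be fitted on the small labeled subset while the encoder is still trained on all data without labels.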

Computational costs that come with the training of neural networks increase with higher resolutions and the overall network depth. As a consequence, separate training and application units with varying processing power are imaginable.

A small IoT-driven robotic unit could be used for online clustering while sending environmental samples to a large-scale GPU- or TPU-based training unit. Such a system would be cheap in application while it keeps learning at the same time. Additionally, the backbone system would benefit from a wide fleet of such cheap robotic units that, all together, collect a vast number of environmental samples. A backbone, once trained, could enable large-scale predictions on laptops or even SoC systems. Another solution presented by (Lacey et al., 2016) would be embedded system acceleration by so-called field programmable gate arrays (FPGAs). The difference to traditional CPU-, GPU-, or TPU-based systems is that FPGA architectures are tailored for application in low-powered systems (Lacey et al., 2016).

FPGAs can be thought of as an NN structure built directly into a chip rather than implemented through algorithms that run on multi-purpose processing units. Such application-specific hardware acceleration modules have recently been embedded in large-scale consumer products (e.g., Google Pixel 2, Apple iPhone X, or Huawei Mate 10 Pro).

These recent developments move machine-learning-enhanced processes into our everyday life.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from External Links: Link Cited by: §4.2.
  • D. Anguelov, D. Koller, E. Parker, and S. Thrun (2004) Detecting and modeling doors with mobile robots. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, Vol. 4, pp. 3777–3784. Cited by: §2.
  • G. H. Ball and D. J. Hall (1965) ISODATA, a novel method of data analysis and pattern classification. Technical report Stanford research inst Menlo Park CA. Cited by: §5.
  • M. Batty (2001) Exploring isovist fields: space and shape in architectural and urban morphology. Environment and planning B: Planning and Design 28 (1), pp. 123–150. Cited by: Figure 3, §3.1, §4.1.
  • M. L. Benedikt (1979) To take hold of space: isovists and isovist fields. Environment and Planning B: Planning and design 6 (1), pp. 47–65. Cited by: §1, §3.1, §4.1, §6.
  • B. Bergström (2015) External Links: Link Cited by: §4.1.
  • B. Bergström (2017) External Links: Link Cited by: §4.1.
  • S. Bhatia, S. K. Chalup, M. J. Ostwald, et al. (2012) Analyzing architectural space: identifying salient regions by computing 3d isovists. In Conference Proceedings. 46th Annual Conference of the Architectural Science Association (AN-ZAScA), Gold Coast, QLD, Cited by: §2.
  • P. Buschka and A. Saffiotti (2002) A virtual sensor for room detection. In Intelligent Robots and Systems, 2002. IEEE/RSJ International Conference on, Vol. 1, pp. 637–642. Cited by: §2.
  • W. Chen, T. Qu, Y. Zhou, K. Weng, G. Wang, and G. Fu (2014) Door recognition and deep learning algorithm for visual based robot navigation. In Robotics and Biomimetics (ROBIO), 2014 IEEE International Conference on, pp. 1793–1798. Cited by: §2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §3.2.2.
  • F. Chollet et al. (2017) External Links: Link Cited by: §4.2.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, Cited by: §3.2.2.
  • C. Cosma, M. Confente, M. Governo, and R. Fiorini (2004) An autonomous robot for indoor light logistics. In Intelligent Robots and Systems, 2004.(IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on, Vol. 3, pp. 3003–3008. Cited by: §1.
  • M. J. De Smith, M. F. Goodchild, and P. Longley (2007) Geospatial analysis: a comprehensive guide to principles, techniques and software tools. Troubador Publishing Ltd. Cited by: §1.
  • L. Deng, D. Yu, et al. (2014) Deep learning: methods and applications. Foundations and Trends® in Signal Processing 7 (3–4), pp. 197–387. Cited by: §3.2.3.
  • E. W. Dijkstra (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1 (1), pp. 269–271. Cited by: §5.
  • V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §3.2.1.
  • M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96, pp. 226–231. Cited by: §5.
  • S. Feld, H. Lyu, and A. Keler (2017) Identifying divergent building structures using fuzzy clustering of isovist features. In Progress in Location-Based Services 2016, pp. 151–172. Cited by: §2.
  • S. Feld, M. Werner, and C. Linnhoff-Popien (2016) Approximated environment features with application to trajectory annotation. In 6th IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2016), Cited by: §2.
  • N. Goerke and S. Braun (2009) Building semantic annotated maps by mobile robots. In Proceedings of the Conference Towards Autonomous Robotic Systems, pp. 149–156. Cited by: §2.
  • B. Hillier and J. Hanson (1989) The social logic of space. Cambridge university press. Cited by: §1.
  • S. Hochreiter and J. Schmidhuber (1997) LSTM can solve hard long time lag problems. In Advances in neural information processing systems, pp. 473–479. Cited by: §3.2.2.
  • A. Hub, J. Diepstraten, and T. Ertl (2004) Design and development of an indoor navigation and object identification system for the blind. In ACM Sigaccess Accessibility and Computing, pp. 147–152. Cited by: §1.
  • J. E. Jackson (2005) A user’s guide to principal components. Vol. 587, John Wiley & Sons. Cited by: §5.
  • H. Kamyshanska and R. Memisevic (2015) The potential energy of an autoencoder. IEEE transactions on pattern analysis and machine intelligence 37 (6), pp. 1261–1273. Cited by: §3.2.3, §3.2.3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §3.2.3, §4.2.
  • G. Lacey, G. W. Taylor, and S. Areibi (2016) Deep learning on fpgas: past, present, and future. arXiv preprint arXiv:1602.04283. Cited by: §3.2.1, §6.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §3.2.1, §3.2.2, §3.2.2, §3.2.3.
  • J. J. Leonard and H. F. Durrant-Whyte (1991) Simultaneous map building and localization for an autonomous mobile robot. In Intelligent Robots and Systems’ 91.’Intelligence for Mechanical Systems, Proceedings IROS’91. IEEE/RSJ International Workshop on, pp. 1442–1447. Cited by: §2.
  • W. Li, H. I. Christensen, A. Oreback, and D. Chen (2004) An architecture for indoor navigation. In Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, Vol. 2, pp. 1783–1788. Cited by: §1.
  • M. Llobera (2003) Extending gis-based visual analysis: the concept of visualscapes. International journal of geographical information science 17 (1), pp. 25–48. Cited by: §3.1, §4.1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §5.
  • O. M. Mozos and W. Burgard (2006) Supervised learning of topological maps using semantic information extracted from range data. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pp. 2772–2777. Cited by: §2.
  • Ó. M. Mozos (2010) Semantic labeling of places with mobile robots. Vol. 61, Springer. Cited by: §2.
  • K. Nagatani, S. Kiribayashi, Y. Okada, K. Otake, K. Yoshida, S. Tadokoro, T. Nishimura, T. Yoshida, E. Koyanagi, M. Fukushima, et al. (2013) Emergency response to the nuclear accident at the fukushima daiichi nuclear power plants using mobile rescue robots. Journal of Field Robotics 30 (1), pp. 44–63. Cited by: §1.
  • K. O’Shea and R. Nash (2015) An introduction to convolutional neural networks. arXiv preprint arXiv:1511.08458. Cited by: §3.2.1, §3.2.1, §3.2.1.
  • Python Software Foundation (2017) External Links: Link Cited by: §4.2.
  • P. Rashidi and A. Mihailidis (2013) A survey on ambient-assisted living tools for older adults. IEEE journal of biomedical and health informatics 17 (3), pp. 579–590. Cited by: §1.
  • P. Ruppel, F. Gschwandtner, C. K. Schindhelm, and C. Linnhoff-Popien (2009) Indoor navigation on distributed stationary display systems. In Computer Software and Applications Conference, 2009. COMPSAC’09. 33rd Annual IEEE International, Vol. 1, pp. 37–44. Cited by: §1.
  • A. Sedlmeier and S. Feld (2018) Discovering and learning recurring structures in building floor plans. In LBS 2018: 14th International Conference on Location Based Services, pp. 151–170. Cited by: §1, §2.
  • C. Tandy (1967) The isovist method of landscape survey. Methods of landscape analysis, pp. 9–10. Cited by: §1, §3.1, §4.1.
  • A. Turner, M. Doxa, D. O’sullivan, and A. Penn (2001) From isovists to visibility graphs: a methodology for the analysis of architectural space. Environment and Planning B: Planning and design 28 (1), pp. 103–121. Cited by: §3.1.
  • J. M. Wiener and G. Franz (2004) Isovists as a means to predict spatial experience and behavior. In International Conference on Spatial Cognition, pp. 42–57. Cited by: §1.