A framework for interactive, real-time 3D scene segmentation
We present an open-source, real-time implementation of SemanticPaint, a system for geometric reconstruction, object-class segmentation and learning of 3D scenes. Using our system, a user can walk into a room wearing a depth camera and a virtual reality headset, and both densely reconstruct the 3D scene and interactively segment the environment into object classes such as 'chair', 'floor' and 'table'. The user interacts physically with the real-world scene, touching objects and using voice commands to assign them appropriate labels. These user-generated labels are leveraged by an online random forest-based machine learning algorithm, which is used to predict labels for previously unseen parts of the scene. The entire pipeline runs in real time, and the user stays 'in the loop' throughout the process, receiving immediate feedback about the progress of the labelling and interacting with the scene as necessary to refine the predicted segmentation.READ FULL TEXT VIEW PDF
A framework for interactive, real-time 3D scene segmentation
Scene understanding, the overarching goal of allowing a computer to understand what it sees in an image and thereby derive some kind of holistic meaning about the scene being viewed, is a fundamental problem that spans the entire field of computer vision. As such, it has been the subject of intense research ever since computer vision was naïvely assigned to Gerald Sussman as a ‘summer project’ in 1966. However, it remains a notoriously difficult problem even to specify, let alone solve in a general sense.
What does it mean for a computer to understand a scene? We might think it desirable for a computer to be able to take in a scene at a glance and infer not only what people and objects are present in the scene and where they are, but what they are doing and their relationships to each other. We might even hope to be able to predict what might happen next. In reality, even the first of these (the problem of determining what is present in the scene and where it is) is far from being solved. In spite of the progress that has been made in recent years, fully understanding an arbitrary scene in an automatic way remains a major challenge. Similarly, although many approaches have taken into account the relationships between objects in a scene (e.g. [CCPS15, LFU13, ZZY15]), we remain far from the overall goal of fully understanding the interactions between them and making predictions about what might be about to happen.
These tasks remain fundamental challenges for the computer vision community as a whole, and we do not pretend to be able to solve them here. We mention them primarily to illustrate the extent to which scene understanding is likely to remain a significant research challenge over the coming years, and to highlight the continuing need for foundational work in this area that other researchers can build upon to help drive the field forwards.
In this report, we present our contribution to this foundational work in the form of an interactive segmentation framework for 3D scenes that we call SemanticPaint [VVC15, GSV15]. Using our framework, a user can walk into a room wearing a depth camera and a virtual reality headset, and both densely reconstruct the 3D scene [NZIS13, KPR15] and interactively segment the environment into object classes such as ‘chair’, ‘floor’ and ‘table’. The user interacts physically with the real-world scene, touching objects and using voice commands to assign them appropriate labels. These user-generated labels are leveraged by an online random forest-based machine learning algorithm, which is used to predict labels for previously unseen parts of the scene. The entire pipeline runs in real time, and the user stays ‘in the loop’ throughout the process, receiving immediate feedback about the progress of the labelling and interacting with the scene as necessary to refine the predicted segmentation.
Since we keep the user in the loop, our framework can be used to produce high-quality, personalised segmentations of a real-world environment. Such segmentations have numerous uses, e.g. (i) we can use them to identify walkable surfaces in an environment as part of the process of generating a navigation map that can provide routing support to people or robots; (ii) we can use them to help partially-sighted people avoid collisions by highlighting the obstacles in an environment; and (iii) in a computer vision setting, we can extract 3D models from them that can be used to train object detectors.
By releasing our work in library form, we hope to make it straightforward for other people to implement ideas such as this on top of our existing work, rather than having to implement the scene labelling pipeline for each new application from scratch.
The high-level architecture of the SemanticPaint framework is shown in Figure 1. It is organised as a set of reusable libraries (rafl, rigging, spaint and tvgutil), together with an application (spaintgui) that illustrates how to make use of these libraries to implement an interactive 3D scene segmentation tool. In terms of functionality, rafl contains our generic implementation of random forests, rigging contains our implementation of camera rigs, spaint contains the main SemanticPaint functionality (sampling, feature calculation, label propagation and user interaction) and tvgutil contains support code.
In this section, we will first take a top-down view of the spaintgui application in order to make clear the way in which the framework is organised, and then take a closer look at the structures of some of the more important libraries. We defer a detailed discussion of the various different parts of the system until later in the report.
As previously mentioned, spaintgui is a tool for interactive 3D scene segmentation. It is structured in much the same way as a conventional video game, i.e. it performs an infinite loop, each iteration of which processes user input, updates the scene and renders a frame (see Figure 2). The application has a number of different modes (see Table 1) that the user will manually switch between in order to effectively reconstruct and label a scene. A typical spaintgui workflow might involve:
Reconstruction. Starting in normal mode, the user reconstructs a 3D model of the scene. Reconstruction in SemanticPaint is performed using the InfiniTAM v2 fusion engine of Kähler et al. [KPR15].
Labelling. The user chooses a method for labelling the scene (see §3.1) and switches to propagation mode. Using their chosen method, they provide ground truth labels at sparse locations in the scene, which are then propagated over smooth surfaces to augment the training data the user is providing (see §3.4). For example, the user might choose to interact with the scene using touch, in which case providing labels would involve physically touching objects in the scene.
Training/Prediction. Having labelled relevant parts of the scene, the user now switches to training mode to learn a model (a random forest) of the object classes present in the scene (see §4.4). Once a model has been learnt, the user switches to prediction mode to predict labels for currently-unlabelled parts of the scene (see §4.5).
Correction. Finally, the user switches to training-and-prediction mode, and corrects the labelling of the scene as necessary using their chosen input method.
The first step in each iteration of the application’s main loop is to process any user input provided, so as to allow the user to move the camera around, label the scene, undo/redo previous labellings, or change the application mode, renderer or current label. The details of how we process the input are largely uninteresting, so we will not dwell on them, but readers who are interested in our implementations of camera control or command management may want to look at Appendices A and B. It should also be noted that parts of the application’s functionality (e.g. changing the application mode or current label) can be controlled via voice commands: details of how voice recognition is implemented can be found in Appendix C.
|Normal||Allows the user to reconstruct a 3D model of the scene and label voxels|
|Propagation||Normal + propagates the current label across smooth surfaces in the scene|
|Training||Normal + trains a random forest to predict labels for unlabelled scene voxels|
|Prediction||Normal + uses the forest to predict labels for unlabelled scene voxels|
|Training-and-Prediction||Normal + interleaves training and prediction on alternate frames|
|Feature Inspection||Normal + allows the user to inspect voxel features (requires OpenCV)|
Our mechanism for updating the scene is structured as a processing pipeline, which we divide into a number of different sections (see Figure 4). In each frame, only some of the sections of the pipeline will be run, depending on the current mode of the system. In most modes, the main section and a section of the pipeline corresponding to the mode in question will be run in sequence. However, in training-and-prediction mode, we interleave training with prediction by running the corresponding sections on alternate frames: this allows us to achieve the effect of performing training and prediction ‘simultaneously’ without having a detrimental impact on the frame rate of the overall system.
The main section of the pipeline performs geometric scene reconstruction and camera pose estimation, and is run on every frame (although it is possible to stop reconstructing the scene and only estimate the pose of the camera in the currently-reconstructed scene if desired). The first step in each frame is to read new depth and colour images from the current image source (generally either a depth camera such as the Asus Xtion or Kinect, or a directory on disk). These are then passed to whichever tracker is in use to allow it to determine an estimate of the camera’s pose in the new frame. If reconstruction is enabled, the next step is then to update the global scene using the information from the depth and colour images (bearing in mind that their location in relation to the scene is now known): this process is generally known asfusion. Finally, a raycast is performed from the current estimated camera pose: this will be needed as an input to the tracker in the next frame. For more details about how InfiniTAM works, please see the original technical report [PKC14] and journal paper [KPR15].
The propagation section of the pipeline spreads the current label outwards from the existing set of voxels it labels until either there is a significant change in surface normal, or a significant change in colour. An example of this process is shown in the Labelling part of Figure 3. Propagation is useful because it allows the user to provide a significant amount of training data to the random forest whilst labelling only a very small part of the scene. However, it is a comparatively brute force process and must be applied cautiously in order to avoid flooding across weak normal or colour boundaries in the scene. For more details about how propagation is implemented, see §3.4.
The training section of the pipeline trains a random forest online over time, using training examples sampled from the current raycast result for each frame. The first step is to sample similar numbers of voxels labelled with each semantic class (e.g. chair, floor, table, etc.) from the raycast result (see §4.4.1). A feature descriptor is then calculated for each voxel (see §4.3), and paired with the voxel’s label to form a training example that can be fed to the random forest. Finally, the random forest is trained by splitting nodes within it that exhibit high entropy (see §4.4.2). For more details about the random forest model and the training process, see §4.
The prediction section of the pipeline uses the existing random forest to predict labels for currently-unlabelled parts of the scene. The first step is to uniformly sample voxels from the current raycast result (see §4.5.1). As during training, a feature descriptor is then calculated for each voxel (see §4.3). The feature descriptors are then fed into the random forest, which predicts labels for the corresponding voxels. Finally, these labels are used to mark the voxels in question. For more details about the prediction process, see §4.5.
The final step in each iteration of the application’s main loop is to render the scene. The spaintgui application currently supports two different renderers: one that performs monocular rendering into a window (which can be made full-screen) and another that performs stereo rendering onto an Oculus Rift head-mounted display. To support both of these use cases without duplicating any code, we have made it possible to render the scene multiple times from different camera poses. The windowed renderer then simply renders the scene directly into a window from a single camera pose. The Rift renderer renders the scene twice (once for each eye) into off-screen textures (using OpenGL’s frame buffer capabilities) and passes these textures to the Oculus Rift SDK.
The actual rendering of the scene (see Figure 5) is split into two parts, the first of which renders a quad textured with a raycast of the reconstructed 3D scene, and the second of which renders a synthetic scene (consisting of the scene axes and the current user interaction tool) over the top of it using OpenGL. In order to achieve the desired result, the correct model-view and projection matrices must be set when rendering the synthetic scene.111An additional consideration when rendering on top of a raycasted scene is that the OpenGL depth buffer will not contain correct depth values for the scene in question. If we wanted to render synthetic objects that were correctly occluded by parts of the raycasted scene, we would therefore need to perform a depth raycast of the reconstructed scene and copy it into the OpenGL depth buffer to obtain accurate depth. However, in our case, the objects we want to render do not need to be occluded by the scene, so we are not currently performing this additional step. The model-view matrix is easily obtained from the current pose of the camera in InfiniTAM. The projection matrix must be calculated from the intrinsic parameters of the camera being used.
As previously mentioned, a key part of our SemanticPaint framework is the way in which we train a random forest online in order to predict object classes for previously-unlabelled parts of the scene. Our random forest implementation is provided as a separate library called rafl to maximise its reusability. For more details about the random forest model, see §4.
As can be seen in Figure 6, rafl’s architecture consists of a number of layers. The lowest layer, base
, contains classes that have few dependencies, such as feature descriptors, histograms and probability mass functions. The next layer,examples, contains classes relating to training examples for random forests. A training example consists of a feature descriptor and a semantic label. Each leaf node in the forest stores a ‘reservoir’ of examples – this is used to keep track of the distribution of examples that have reached that node, whilst limiting the number of examples for each semantic class that are actually stored so as to bound the forest’s memory usage. The decisionfunctions layer contains classes relating to decision functions for random forests. Each branch node in the forest contains a decision function that can be used to route a feature descriptor down either the left or right subtree of the node. The decision functions used can be of various types – for example, a feature-thresholding decision function routes feature descriptors by comparing a particular feature in the descriptor to a fixed threshold. Client code can control the type of decision functions used by a random forest by initialising it with a decision function generator that generates decision functions of the desired type. When a leaf node is split, this decision function generator is used to randomly generate candidate decision functions to split the node’s examples. Finally, the core
layer contains the actual decision tree and random forest classes.
The core implementation of SemanticPaint is provided in the spaint library. As can be seen in Figure 7, this is divided into a number of components, the names of which are fairly self-explanatory. For example, the touch component contains the code to detect when the user physically touches part of the 3D scene. The bits of the code that sample voxels, calculate features and generate appropriate decision functions for the random forest can be found in sampling, features and randomforest respectively. The key bits of the code that facilitate user interaction with the scene can be found in selectors and selectiontransformers. Code to mark voxels with labels is implemented in markers, and code to propagate labels across surfaces can be found in propagation. Details about how the various components work can be found in the rest of this report.
The primary purpose of interaction in SemanticPaint is to allow the user to assign ground truth semantic labels to one or more voxels in the scene. Users can either label the whole scene manually, or the labels they provide can form a basis for learning models of the various semantic classes (e.g. table, floor, etc.) that can be used to propagate the ground truth labels to the rest of the scene.
Our framework allows the user to interact with the scene using a variety of modalities, ranging from clicking on individual voxels in a view of the scene using the mouse (known as picking), to physically touching objects in the real world. In terms of their implementations, these modalities work in very different ways, but at an abstract level they can all be seen as selection generators, or selectors: things that generate selections of voxels in the scene. As a result, we can handle them in a unified way by introducing an abstract interface for selectors, and implementing an individual selector for each modality.
In practice, some modalities (notably picking) initially produce selections that contain too few voxels to provide a useful way of interacting with the scene (it would be tedious for the user to have to label every voxel in the scene individually). To deal with this problem, we introduce the additional notion of a selection transformer, which takes one selection and transforms it into another. For instance, a simple transformer might expand every voxel in an initial selection into a cube or sphere of voxels to make a more useful selection for labelling. A more interesting transformer might take account of the scene geometry and spread a selection of voxels along the surface of the scene. By cleanly separating selection transformation from selection generation, we achieve significant code reuse – in particular, our selectors can limit themselves to indicating the locations in the scene that the user is trying to select, and delegate the concern of producing a useful selection from that to a subsequent generic transformer.
Once a selection has been generated, it can be used to label the scene. (It could also be used for other things, e.g. scene editing, but those are not the focus of the basic framework.) This can be implemented straightforwardly: the user chooses a label (e.g. using the keyboard or a voice command) and that label is written into each voxel in the selection. Figure 8 shows the overall pipeline we use for interactive scene labelling.
The SemanticPaint framework incorporates a variety of selectors that the user can use to select voxels in the scene. All of these selectors share a common interface, making it straightforward to add new interaction modalities to the framework. In the following subsections, we describe the two most important selectors that are currently implemented, namely the picking selector and the touch selector.
The picking selector (see Figure 9) allows the user to select a voxel in the scene by clicking a 2D point on the screen with the mouse. This is equivalent to clicking a 3D point on the image plane of the current camera. To determine which voxel the user wants to select, we conceptually cast a ray from the viewer through the clicked point on the image plane and into the scene.222In practice, InfiniTAM already produces a voxel raycast of the scene from the current camera pose at each frame – this is an image in which each pixel contains the location of the voxel in the scene that would be hit by a ray passing through the corresponding point on the image plane. As a result, the picking selector does not actually need to perform any raycasting itself – instead, it can simply look up the voxel to select in the voxel raycast. We then select the first surface voxel (if any) hit by the ray. See Figure 10 for an illustration.
The picking selector is simple, but it has a number of advantages: in particular, it not only allows the scene to be labelled precisely, it also allows the user to label distant parts of the scene. Its disadvantage is that it is less intuitive than simply touching scene objects.
The touch selector allows the user to select voxels in the scene by physically touching them with a hand, foot or any other interacting object. The locations (if any) at which the user is touching the scene are detected by looking at the differences between the live depth image from the camera and a depth raycast of the scene. An illustration of the touch selector in action can be seen in Figure 3, and a detailed description of the touch detection process can be found in Appendix D.
As previously mentioned, a selection transformer takes one selection of voxels and transforms it into another. A variety of selection transformers are possible, but at present we have only implemented a simple voxel-to-cube transformer, as described in the following subsection. The first step in the process, for any transformer, is to calculate the size of the output selection as a function of the input selection. Different transformers can then determine the voxels in the output selection in whichever way they see fit.
The voxel-to-cube transformer transforms each voxel in the input selection into a fixed-size cube of voxels that is centred on it. To do this, it first specifies the size of the output selection to be the result of multiplying the size of the input selection by the cube size, and then computes each output voxel on a separate thread (see Listing 1).
When a user selects some voxels and chooses a label with which to mark them, actually marking the voxels with the label in question is straightforward. However, manual labelling by the user is not the only way in which voxels can acquire a label: voxels can also be labelled either during label propagation or as the result of predictions by the random forest. Since it is important to ensure that ‘reliable’ labels provided by the user are not overwritten by such automatic processes, this creates the need for a way of distinguishing between labels based on their provenance – in particular, SemanticPaint currently stores a group identifier alongside each label, allowing it to distinguish between user labels, propagated labels, and labels predicted by the forest.333In practice, we bit-pack a label and its group into a single -bit byte in each voxel to save space.
The voxel marker uses these group identifiers to control the labelling process and prevent undesirable overwriting of user labels. More specifically, it considers the old label of a voxel and its proposed new label, and marks the voxel with the new label if and only if either (a) both the old and new labels are non-user labels, or (b) the new label is a user label. One minor complication is that this scheme does not interact well with the undo system – without special handling, it is not possible to undo the overwriting of a non-user label with a user label, since the voxel marker prevents the overwriting of user labels with non-user labels. For this reason, in practice the voxel marker has a special ‘force’ mode that is used when undoing a previous labelling – this bypasses the normal checks and allows any label to overwrite another.
The overall purpose of our system is to make it possible for users to label a 3D scene with a minimum of effort, and to this end, we train a random forest classifier, based on the labelled examples the user provides, that can be used to predict labels for the whole scene. However, users tend to label the scene quite sparsely. Whilst we can certainly train a forest using this sparse data, and even achieve reasonable results, we would naturally expect that much better results can be obtained by training from more of the scene. In practice, therefore, we augment the training examples provided by the user using a process calledlabel propagation before training the random forest.
Label propagation operates on the current raycast result for each frame. It works by deciding, for each voxel in the raycast result, whether or not to set its label to the current semantic label, based on the labels, positions, colours (in the CIELab space) and normals of neighbouring voxels in the raycast result image.444In practice, we do not use the direct neighbours of each voxel in the image, but we elide over this implementation detail to simplify the explanation. A voxel’s label will be updated if some of its neighbouring voxels have the current semantic label and are sufficiently close in position, colour and normal to the voxel under consideration. Over a number of frames, this has the effect of gradually spreading the current semantic label across the scene until position, colour or normal discontinuities are encountered. The implementation of this process is shown in Listing 2, and an illustration showing a label gradually propagating across a scene can be found in Figure 3.
Although label propagation is effective in augmenting the training examples provided by the user, it does have some limitations. The most notable of these is that because it is essentially performing a flood fill on the scene, it is vulnerable to weak property boundaries. Whilst we mitigate this to some extent by using a conjunction of different properties that are less likely to all have a weak boundary in the same place in the scene, it is still entirely possible for propagation to spread the current semantic label into parts of the scene that the user did not intend. We control this in two ways: first, by spreading the label relatively slowly across the scene, we allow the user to stop the propagation at an appropriate point; second, we allow the user to revert back to the sparse labels they originally provided and try the propagation again. In practice, we find that this gives users a reasonable degree of control over the propagation and allows them to make effective use of it to speed up their labelling of the scene.
In this section we briefly describe how we can train a random forest model to predict the probability that a specified voxel takes a particular object label. We use random forests because they have been employed successfully in many computer vision applications [Bre01, CSK12] and are extremely fast at making predictions. Moreover, they can be trained online [VVC15] and support the addition of new object categories on the fly.
A binary decision tree is a classifier that can be used to map vectors in a feature spaceto one of a number of discrete labels in a finite label space .555Note that in general, a decision tree may support arbitrary label spaces such as structured output label spaces [DZ13, KBBP11], since at test time the leaf node reached only depends on the input vector . Each branch node in the tree contains a decision function that can be used to send feature vectors down either the left () or right () subtree of the node. Each leaf node contains a probability mass function that satifies and specifies a probability for each label .
Starting from the root node, we can use such a tree to classify a feature vector by repeatedly testing it against the decision function in the current branch node and then recursing down either the left or right subtree as appropriate. The classification process terminates when a leaf node is reached, at which point an over the probability mass function in the leaf can be used as the classification result.
Binary decision trees are generally trained by selecting decision functions at the branch nodes that are good at dividing the available training examples into different classes. Since the number of different decision functions that could be chosen for a branch node is generally infinite, it is common to randomly generate a moderate, finite number of candidate decision functions for a node and pick one that best divides the available data. This heuristic approach is not guaranteed to find a globally optimal decision function, but will generally give a reasonable split of the examples in practice.
One way of improving the classification performance of a system based on binary decision trees is to combine multiple such trees in an ensemble called a random forest and perform classification by averaging the probability mass functions they yield for each given feature vector (the final label for a vector can be computed as an over this average). The individual trees in the forest are trained on the same examples, but will have different structures because of the way in which the candidate decision functions are randomly generated during training. Training a random forest rather than an individual tree reduces the propensity of the system to overfit to the training data and allows it to better generalise to new examples.
One factor that affects the computational efficiency and classification performance of a decision tree is the specific form of decision function used at the branch nodes. For SemanticPaint, we adopted simple ‘stump’ decision functions of the form described in [CSK12], which work by testing a single feature (i.e. the feature in a feature vector ) against a threshold :
Decision functions of this form are attractive because they are extremely efficient to evaluate whilst still supporting good overall classification performance.
In order to allow the random forest to learn from an incoming stream of training examples and to update itself as new data arrives, we use the ‘streaming decision forest’ algorithm described in [VVC15].666Note that ‘decision forest’ is a synonym for ‘random forest’. The main idea is to use reservoir sampling [Vit85] to maintain an unbiased sample of examples of fixed maximum size at each leaf node. The use of reservoirs bounds the memory usage (allowing larger trees to be grown) in a way that still ensures that there is an equal chance of keeping examples in the reservoir over time. Further details of the algorithm may be found in [VVC15].
The random forests we use in SemanticPaint learn to semantically segment a scene based on features computed from local geometry and colour information around individual voxels. In this section, we describe the features we use (originally developed in [VVC15]), and the way in which they are calculated. The basic requirements we have for features are as follows:
They should capture the local geometric and colour context around voxels.
They should be very fast to compute (since we have to compute them for thousands of voxels per frame).
They should be viewpoint-invariant.
They should be at least somewhat illumination-invariant.
To meet these requirements, [VVC15] introduced voxel-oriented patch (VOP) features. These primarily consist of an colour patch (where defaults to ) around the voxel, extracted from the tangent plane to the surface at that point and rotated to align with the dominant orientation in the patch. The colours are expressed in the CIELab colour space to achieve some level of illumination invariance. The full feature descriptor also includes the surface normal at the voxel to make it easier to classify key elements in the scene such as the floor and walls.
The pipeline we use for computing features is shown in Figure 11. The first step is to calculate surface normals for all the target voxels for which we want to compute features. We use these to construct an arbitrary 2D coordinate system in the tangent plane to the surface at each voxel.777The method we use to generate an arbitrary 2D coordinate system in a plane is described in §G.3 of [Gol06]. For each target voxel, we then imagine placing a grid aligned with the voxel’s coordinate system over the scene, and read an RGB patch from the voxels surrounding the target voxel on that grid. Next, we compute a histogram of gradients for each of these patches, and use it to find the dominant gradient orientation around each voxel. We then rotate the coordinate system for each voxel to align it with the voxel’s dominant gradient orientation, and read another RGB patch from the voxel’s new coordinate system. Finally, the RGB patches in this second batch are converted to the CIELab colour space and combined with the surface normals to form the full feature descriptors.
Having looked at how we can compute features for individual voxels, we are now in a position to describe how we train our random forests. The training process is online and spread across multiple frames. At each frame, we sample a number of voxels from the reconstructed scene. For each such voxel, we then construct a training example and add it to the forest. Finally, the random forest is trained by choosing leaf nodes in the various trees to split based on their entropy. In the following subsections, we discuss in more detail the way in which we sample voxels from the scene and the way in which we use training examples constructed from the sampled voxels to train the forest.
To acquire training data with which to train the random forest, we randomly choose voxels from the scene to serve as examples of particular semantic labels. These voxels are passed to the feature calculator (see §4.3), which in turn produces feature descriptors that can be used to form training examples. The input to the training sampler is the current voxel raycast of the scene: that is, at each frame we sample from the voxels that are visible from the live camera position.
The goal of the training sampler is to produce relatively even numbers of voxels from the different semantic classes so as to train a forest that does not unfairly prioritise one class over another. To do this, it first groups the voxels in the current voxel raycast by semantic label to form a set of candidate voxels for each label, and then randomly chooses similar numbers of voxels from the candidate sets to produce a final set of sampled voxels for each label.
Although this is conceptually quite a simple process, in practice the implementation is complicated by the need to sample voxels as quickly as possible to achieve real-time frame rates. To do this, we parallelise the entire process, as illustrated in Figure 12. The first step is to produce a voxel mask for each semantic class that is in use: the mask for a class contains a bit for each voxel in the current raycast result, denoting whether or not that voxel is labelled with the class in question. Having generated the masks, the goal is to write the voxels for each class whose mask bits are set to into an array of candidate voxels for that class from which voxels can then be randomly sampled for training purposes. To allow these voxels to be written into the candidate voxel arrays in parallel, however, the threads doing the writing need to know the array locations into which to write the voxels. To calculate these locations, we resort to a standard parallel programming trick. Starting from the voxel masks, we first efficiently compute their prefix sums, as described in [HSO07]: these tell us the locations into which to write the voxels. Each thread then writes its voxel into the candidate voxel array at the specified location if and only if the corresponding mask bit is set to . Having generated arrays containing candidate voxels for each semantic class, the only thing that remains is to then sample from them randomly to produce the final output.
Having sampled voxels for each semantic class, it is then straightforward to make training examples from them with which to train the forest. To do this, we simply calculate a feature descriptor for each sampled voxel (see §4.3) and pair each feature descriptor with its voxel’s label to make a training example.
These training examples can then be added to the forest, a process that involves separately adding all of the examples to each decision tree. To add an example to a tree (see Listing 3), we simply pass it down the tree to an appropriate leaf node by testing its feature descriptor against the decision functions in the branch nodes it encounters on the way. Once it reaches a leaf, we add it to the leaf’s example reservoir (as mentioned in §4.2, the use of reservoirs bounds the memory usage of the forest by limiting the number of examples that are explicitly stored in each leaf).
Training the forest involves splitting its leaf nodes so as to separate the examples they contain based on their semantic classes. Given a set of examples that have accumulated in a particular leaf node, the task of splitting the node is to find a decision function that splits those examples into sets and , typically in a way that maximises the amount of information gained in the process. (Intuitively, the idea is that a ‘good’ decision function will try and separate examples with distinct class labels.) The leaf can then be replaced with a branch node containing the decision function, and two child nodes that each contain some of the examples from the old leaf node.
In practice, it is typical to choose a decision function from among a number of different candidates, scored using a cost function that encourages the average entropy of the example distributions in the new child nodes to be lower than the entropy of the distribution in the original leaf. Formally, the score of the split induced by a particular decision function can be calculated using the information gain objective
in which denotes the Shannon entropy of the distribution of class labels in .
Candidate decision functions are generated randomly, generally from a parameterised family of functions with a predefined structure. In our case, we are using decision functions that test an individual feature against a threshold (see §4.1), so we can generate candidate decision functions simply by randomly generating pairs that determine the functions in question. When splitting a leaf, we first generate a moderate number of candidate decision functions, and then choose a candidate that maximises the information gain in (2) as the decision function for the new branch node.
We have seen how to split an individual leaf node, but the question of how to schedule such splits so as to train the overall forest remains. In practice, since we are trying to train our forest online whilst maintaining a real-time frame rate, a convenient way to implement the scheduling process is to maintain a priority queue of nodes to be split for each tree in the forest. We assign each node in each tree a splittability score and use the priority queues to rank them in non-increasing order of splittability. The splittability score of a node with example set is defined to be
in which is a tunable constant. This captures the intuitions that (a) nodes with a higher number of examples will contain more accurate probability mass functions, and (b) nodes with a higher entropy should be split first, since they are more likely to achieve a high information gain when split. At each frame, we split nodes in each tree, starting from the most splittable, until we either run out of nodes to split or exceed a specified split budget. This process allows us to focus attention on the nodes we are most keen to split whilst maintaining a real-time frame rate.
Having trained a random forest model for the scene from sparse user labels as described in §4.4, we would like to use it to predict semantic labels for the remaining voxels in the scene. In practice, however, there are far too many voxels in the scene as a whole to simultaneously predict labels for all of them in real time. Instead, we adopt a strategy of gradually predicting labels for voxels over time. At each frame, we randomly sample a proportion of the voxels in the current view888The current default is voxels, but this is configurable. and predict their labels; over time, as the camera moves around the scene, this results in us labelling all of the surface voxels in the scene. In the following subsections, we first describe how we sample voxels from the scene, and then describe how we use the random forest to predict labels for the sampled voxels.
By contrast with the process of sampling voxels for training, where it is important to sample balanced numbers of voxels from each semantic class, the sampling process for prediction is extremely straightforward. Indeed, since we are primarily trying to predict labels for currently-unlabelled voxels in the scene, one might think it would suffice simply to randomly sample unlabelled voxels from the current raycast result. In practice, we simplify things even further by allowing labelled voxels to be predicted more than once: this reduces the process to one of uniformly sampling voxels with replacement from the current raycast result, which can be straightforwardly implemented by generating random pixel indices on the CPU. Moreover, this scheme allows previously-labelled voxels to have their predictions updated as the forest changes over time.
The slight drawback of this approach is that it can take some time to fill in the last few unlabelled voxels in the current view. For the purposes of visual feedback, this can be mitigated by applying median filtering when rendering the scene to remove impulsive noise. For the purposes of further computation, however, it would be better to focus attention on the unlabelled voxels in the scene before repredicting previously-labelled voxels: we plan to improve the sampler to account for this in future iterations of our framework.
To predict a label for each sampled voxel, we first calculate a feature descriptor for it as described in §4.3. We pass this descriptor down each decision tree in the forest until we reach a leaf, using the find_leaf function shown in Listing 3
. Each of the leaves found by this process contains a probability mass function specifying the probabilities that the voxel should be labelled with each of the various semantic classes that are being used to label the scene. To determine an appropriate label for the voxel, we first average these probability distributions to produce a probability mass function for the forest as a whole, and then deterministically choose a label with the greatest probability. An illustration of how label prediction works can be seen in Figure13, and the corresponding implementation can be found in Listing 4.
The multi-class classification performance of our random forest library was evaluated on the UCI Poker dataset [Lic13], since it contains categories with a large imbalance in the number of samples per class, as often encountered in the real-world usage of SemanticPaint. For instance, it is common for the user to point the camera at certain parts of the scene for longer than others, which results in gathering more samples for some classes than others.
With the exact same data splits and evaluation procedure as described in [SSK13], we achieved an average accuracy of (better than , the accuracy achieved by always predicting the ‘nothing in hand’ class). However, since the prior distribution of the categories in Poker is highly non-uniform, the percentage of correctly-classified examples may not be an appropriate measure of overall classification performance [EGW10]
. After normalising the confusion matrix to account for the imbalance in the distribution of classes, the multi-class accuracy was(random chance is ).
In the presence of unbalanced data, the multi-class prediction accuracy may be improved by weighting each category with the inverse class frequencies observed in the training data , . Moreover, these weights may be considered in the computation of the (weighted) information gain [KBBP11]. We found that reweighting the probability mass functions during learning and prediction [KBBP11] doubled the normalised classification accuracy on Poker to .
Having discussed some of the details of how our framework is implemented, we now present a few qualitative examples to illustrate its performance on some real-world scenes. In Figure 14, we show the user interacting with the 3D world in order to assign labels to object categories such as ‘coffee table’, ‘book’ and ‘floor’. Note that after initially labelling the scene using touch interaction, the labels are propagated across smooth surfaces in the scene as described in §3.4. After training the random forest on the propagated user labels, the labels of the remaining voxels in the scene are predicted in real time by the forest, as illustrated in Figure 15.
In this report, we have described SemanticPaint [VVC15, GSV15], our interactive segmentation framework for 3D scenes, built on top of the InfiniTAM v2 3D reconstruction engine of Kähler et al. [KPR15]. Using our framework, a user can quickly segment a reconstructed 3D scene by providing ‘ground truth’ semantic labels at sparse points in the scene and then taking advantage of label propagation and an online random forest model to efficiently label the rest of the scene.
Our framework supports a number of different modalities for user interaction, ranging from selecting scene voxels using the mouse to physically touching objects in the real world. Moreover, it can be extended straightforwardly to support new modalities.
By keeping the user ‘in the loop’ during the segmentation process, we allow any initial classification errors made by the random forest to be quickly corrected. This is important, because automatic segmentation is in general an ill-posed problem: different users want different results, so in the absence of user input it is unclear what the desired output of the segmentation process should look like.
From an architectural perspective, our SemanticPaint framework has been implemented as a set of portable libraries that currently work on Windows, Ubuntu and Mac OS X. Although we have not currently done so, it would be reasonably straightforward to port them to further platforms if needed. We very much hope that they can form the basis for further research in this area.
Camera control in SemanticPaint is implemented in the rigging library. This contains a hierarchy of camera classes, as shown in Figure 16. The abstract Camera class, at the root of the hierarchy, represents a camera in 3D space using its position and a set of orthonormal axes (see Figure 17). The abstract MoveableCamera class derives from this, adding member functions that allow a camera to be translated and rotated in space. There are then three concrete camera classes: SimpleCamera, DerivedCamera and CompositeCamera. A SimpleCamera is a concrete implementation of MoveableCamera that stores explicit position and axis vectors. A DerivedCamera is a non-moveable camera whose position and orientation are based on those of a base camera: it stores a rotation and translation in camera space, and its position and orientation are computed on-demand relative to the position and orientation of the base camera. This is useful, because it allows rigid camera rigs to be represented in a straightforward way. Finally, a CompositeCamera is a moveable rig of several cameras: it consists of a single primary camera that controls the position and orientation of the rig itself, and a number of secondary cameras that are generally directly or indirectly based on that camera. An example showing how these classes can be used together to represent a simple stereo rig can be seen in Figure 18.
To make it possible for users to undo and redo their actions, SemanticPaint incorporates a command system of the form described in [Gol06] and originally based on [Suf04]. We briefly discuss the workings of this system here for completeness.999We borrow from [Gol06] for this purpose, with the permission of the original author.
The command system is primarily implemented in the tvgutil library. As illustrated in Figure 19, it consists of classes relating to the commands themselves, and a command manager.
The root of the command class hierarchy is an abstract base class called Command. The interface of this class allows commands to be both executed and undone (see Listing 5).
Atomic actions in an application using this system can be represented using classes that derive directly from Command and implement execute and undo member functions. Composite actions (those that involve executing a sequence of smaller actions) can be represented using an instance of SeqCommand, a helper class whose execute implementation executes a known sequence of commands in order, and whose undo implementation undoes the same commands in reverse order.
The command manager provides the mechanism by which commands can be undone and redone. It is implemented as a pair of stacks (both initially empty), one of commands that the user has executed/redone and the other of those that have been undone. Its interface (see Table 2) allows clients to execute commands, undo and redo commands provided the relevant stacks are not empty, check the sizes of the stacks, and clear both of them if desired.
|CommandManager||consists of||(executed, undone)|
|execute_command(c)||ensures||executed = c : old(executed)|
|ensures||executed = head(old(undone)) : old(executed)|
|undone = tail(old(undone))|
|ensures||executed = tail(old(executed))|
|undone = head(old(executed)) : old(undone)|
Executing a command involves clearing the stack of undone commands, calling the execute member function of the new command and adding it to the stack of executed commands. Undoing a command moves the top element of the executed stack (if any) across to the undone stack and calls its undo member function. Redoing a command moves the top element of the undone stack (if any) across to the executed stack and calls its execute member function. See Figure 20 for an illustration of the undo/redo process.
To allow the user to interact verbally with the SemanticPaint engine (e.g. to choose a semantic label with which to label the scene), we implemented a simple, Java-based voice command server called Voice Commander using CMU’s popular Sphinx-4 speech recognition engine [LKG03]. This listens to voice input from a microphone and attempts to recognise predefined commands spoken by the user. It accepts connections from any number of clients and notifies them whenever a new command is recognised. The main SemanticPaint engine initially connects to the server, and then checks for new commands once per frame. Detected commands are used to immediately update the engine’s state, e.g. the command ‘label chair’ causes the semantic label with which the user is labelling the scene to be set to ‘chair’. A diagram illustrating the architecture of our voice recognition subsystem is shown in Figure 21.
The detection of touch interactions is a core part of the SemanticPaint framework, and is motivated by the potential for humans to teach a machine about various object categories in a natural way, just by touching objects in the scene. We allow a user to touch objects using their hands, feet or any other object they wish to use for interaction. The wide range of objects with which the user may interactively label the 3D world stems from a basic assumption that anything moving in front of the static scene is a candidate for touch interaction.
To detect the places in which the user is touching the scene, we look at the differences between the scene at the time of reconstruction, as captured by a depth raycast of our model from the current camera position, and the scene at the current point in time, as captured by the raw depth input from the camera. If we assume that the scenes we are labelling are mostly static, then these differences will tend to consist of a combination of user interactions with the scene and unwanted noise. By postprocessing the differences appropriately, we aim to identify which of them correspond to genuine user interactions with the scene, and label the scene accordingly.
The pipeline we use to detect user interactions is shown in Figure 22. We first construct an image containing the differences between the reconstructed and current scenes, and threshold this to make a binary change mask. We then find the connected components in this mask, and filter them in order to determine the component (if any) that is most likely to represent a user interaction with the scene. Finally, we again look at the differences between the reconstructed and current scenes, and identify the points within this component that correspond to places in which the user is touching the scene. The details of how the various stages of the pipeline work can be found in the following subsections.
The inputs to the touch detector are the current raw depth captured by the camera and a voxel raycast of the scene from the current camera position. The raw depth image is thresholded to ignore parts of the scene that are greater than m away from the camera (see Figure 23(a)) . This threshold was chosen to be less than the maximum depth integrated into the scene by InfiniTAM [KPR15], and to be greater than the maximum expected distance from the camera to a point of interaction (i.e. we do not expect that the user’s hand or foot will extend to more than m away from a camera positioned close to the user).
Based on the voxel raycast, we use orthographic projection to calculate a depth raycast of the reconstructed scene from the current camera position (see Figure 23(b)). Pixels for which no valid depth exists are set to an arbitrarily large distance away from the camera (e.g. m). The alternative approach of ignoring invalid regions would have the undesirable effect of breaking up connected touch components if they occlude parts of the scene that are too far away to be reconstructed. In the next subsection, the thresholded raw depth and depth raycast will be used to find the parts of the scene that have changed.
The purpose of change detection is to find image regions that have changed since the reconstructed scene was last updated. Let the thresholded raw depth image be denoted by and the depth raycast of the reconstructed scene be denoted by . We use these to calculate a binary change mask as follows:
Here, denotes pixel in image , and defines a threshold (by default mm) below which the thresholded raw depth and the depth raycast of the reconstructed scene are assumed to be equal. We test against because a pixel in may be if the corresponding pixel in the raw depth image was either invalid or greater than the specified m threshold.
After computing the change mask , we apply a morphological open operation [GW06] to it to reduce noise, with the result as shown in Figure 24(a). This has the added advantage of reducing the number of connected components that we will calculate in the next subsection, which in turn reduces the computational cost of subsequent image processing stages.
The next step is to calculate (using standard methods [GW06]) a connected component image from the denoised change mask. This is an image in which the pixels in each component share the same integer value (see Figure 24(b), in which we have applied a colour map to make the different components easier to visualise).
The connected components represent the changing parts of the scene. They are subsequently filtered to find the component (if any) that is most likely to represent a touch interaction. The initial filtering stage removes candidate components that are either too large or too small to represent viable interactions. The probability that each connected component that survives this stage is a touch region is then calculated using a random forest trained on unnormalised histograms of connected component difference images (such as the one shown in Figure 25(a)) for which the ground truth is known.
Let the trained random forest be represented by a function mapping a feature vector for each connected component to the probability that the connected component region is part of a touch interaction:
The connected component most likely to be a touch interaction can then be calculated as follows:
The most likely connected component is classified as being a touch interaction iff . If a touch interaction is detected, the final stage of our pipeline involves finding the points (if any) in that component that are touching a surface, as described in the next subsection.
To extract a set of touch points from a component that has been classified as a touch interaction, we first construct an image containing the differences between the thresholded raw depth and depth raycast for just those pixels that are within the component (see Figure 25(a)). We threshold this image to find points that are within a certain distance of the surface, producing a result as shown in Figure 25(b).
Formally, let the absolute element-wise subtraction between the thresholded raw depth and the depth raycast be an image . Keeping only the pixels relevant to the selected connected component, we set , where is a binary mask indicating which elements of have value , and is the element-wise multiplication operator (i.e. it applies the binary mask). The parts of the touch component that are touching the scene may be extracted by setting a threshold below which it is assumed that the depth difference is small enough to consider that the touch object is in contact with the scene. The touch point mask is then , in which the logical and comparative operators are performed element-wise on matrix . As a final step, since the number of touch points may be quite large, we spatially quantise the image by nearest-neighbour down-scaling to extract a smaller set of representative touch points.
The touch detection algorithm described above is quick enough to run as part of a real-time system, taking approximately ms on each frame, but it has several downsides. If a hand interacting with the scene hovers over part of the scene that contains inaccurate depth measurements, those noisy areas may become part of the same connected component as the hand, since no shape constraints are imposed on the interacting object. Furthermore, the random forest used in the touch detector does not perfectly filter out candidate connected components that are not touch interactions. We think that the detections may be improved by augmenting the unnormalised histograms of depth difference with more complex connected component features. Frequent sources of noise arise from inaccurate depth measurements, caused by objects that absorb/reflect infrared light in ways that disturb the reflected patterns needed by structured light sensors. Moreover, inserting an object into the scene may interfere with the iterative closest point (ICP) camera pose tracking, causing unwanted depth differences at the edges of objects. Finally, the assumption that the moving parts of the scene may be extracted by taking the difference between the thresholded raw depth and the depth raycast rests upon an assumption of highly-accurate camera pose tracking, which does not always hold. The effects of these nuisance factors limit the reliability of the touch detector in anything other than ideal capturing conditions. In future versions of our framework, we hope to implement alternative approaches to touch detection to account for these difficulties.
Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning.Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227, 2012.