Zero-shot Video Object Segmentation via Attentive Graph Neural Networks (ICCV2019 Oral)
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS). The suggested AGNN recasts this task as a process of iterative information fusion over video graphs. Specifically, AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges. The underlying pair-wise relations are described by a differentiable attention mechanism. Through parametric message passing, AGNN is able to efficiently capture and mine much richer and higher-order relations between video frames, thus enabling a more complete understanding of video content and more accurate foreground estimation. Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case. To further demonstrate the generalizability of our framework, we extend AGNN to an additional task: image object co-segmentation (IOCS). We perform experiments on two famous IOCS datasets and again observe the superiority of our AGNN model. The extensive experiments verify that AGNN is able to learn the underlying semantic/appearance relationships among video frames or related images, and discover the common objects.
Automatically identifying the primary objects in videos is an important problem that could benefit a wide variety of applications, by reducing or eliminating manual effort needed to process and understand video. However, discovering the most prominent and distinct objects across video frames without having prior knowledge of what those foreground objects are is a challenging task.
Traditional methods tend to tackle this issue using handcrafted or learnable features in a local or sequential manner. For instance, handcrafted-feature based methods use objectness, motion boundary, and saliency cues over a few successive video frames, or explore trajectories and link optical flow over multiple frames to capture long-term motion information. These are typically non-learning methods working in a purely unsupervised
manner. Recent deep learning based methods learn more powerful video object features from large-scale training data, yielding a zero-shot solution (still no annotation used for any testing frame). Many of these [7, 57, 21, 58, 31, 55]
employ two-stream networks to combine local motion and appearance information, and apply recurrent neural networks to model the dynamics in a frame-by-frame manner.
Though these methods greatly promoted the development of this field and gained promising results, they generally suffer from two limitations. First, they focus primarily on the local pair-wise or sequential relations between successive frames, while ignoring the ubiquitous, high-order relationships among the frames (since frames from the same video are usually correlated). Second, since they do not fully leverage the rich relationships, they fail to completely capture the video content and hence may easily get inferior foreground estimates. From another perspective, as video objects usually suffer from underlying object occlusions, huge scale variations and appearance changes (Fig. 1 (a)), it is difficult to correctly infer the foreground when only considering successive or local pair-wise relations in videos.
To alleviate these issues, we need an effective framework that can comprehensively model the high-order relationships among video frames within modern neural networks. In this work, an attentive graph neural network (AGNN) is proposed to address zero-shot video object segmentation (ZVOS), recasting ZVOS as an end-to-end, message passing based graph information fusion procedure (Fig. 1 (b)). Specifically, we construct a fully connected graph where video frames are represented as nodes, and the pair-wise relation between two frames is described by the edge between their corresponding nodes. The correlation between two frames is efficiently captured by an attention mechanism, which avoids time-consuming optical flow estimation [7, 57, 21, 58, 31]. By using recursive message passing to iteratively propagate information over the graph (i.e., each node receives information from the other nodes), AGNN can capture higher-order relationships among video frames and obtain better results from a global view. In addition, as video object segmentation is a per-pixel prediction task, AGNN has a desirable, spatial information preserving property, which significantly distinguishes it from previous fully connected graph neural networks (GNNs).
AGNN operates on multiple frames, bringing the added advantage of natural training data augmentation, as the combination candidates are numerous. In addition, since AGNN offers a powerful tool for representing and mining much richer and higher-order relationships among video frames, it brings a more complete understanding of video content. More significantly, due to its recursive property, AGNN is flexible enough to process variable numbers of nodes during inference, enabling it to consider more input information and gain better performance (Fig. 1 (c)).
We extensively evaluate AGNN on three widely-used video object segmentation datasets, namely DAVIS16, Youtube-Objects and DAVIS17, showing its superior performance over current state-of-the-art methods.
AGNN is a fully differential, end-to-end trainable framework that allows rich and high-order relations among frames (images) to be captured and is highly applicable to spatial prediction problems. To further demonstrate its advantages and generalizability, we apply AGNN to an additional task: image object co-segmentation (IOCS), which aims to extract the common objects from a group of semantically related images. It also gains promising results on two popular IOCS benchmarks, PASCAL VOC  and Internet , compared to existing IOCS methods.
Experiments on the ZVOS and additional IOCS tasks clearly demonstrate that AGNN is able to not only capture the relationships among correlated video frame images, but also mine the semantics among semantically related static images. Notably, this work can be viewed as a very early attempt to apply and extend GNNs for pixel-wise prediction tasks, which provides an effective video object segmentation solution and new insight into this task.
GNNs were first proposed, and later further developed, to handle the underlying relationships among structured data. Early work used recurrent neural networks to model the state of each node, with the underlying correlations between nodes learned via parameterized message passing over neighbors. Li et al. further adapted GNNs to sequential outputs, and Gilmer et al. later formulated the message passing module in GNNs as a learnable neural network. Recently, GNNs have been successfully applied in many fields, including molecular biology [48, 71, 76]. Another popular trend in GNNs is to generalize the convolutional architecture over arbitrary graph-structured data [10, 40, 26], which is called the graph convolutional neural network (GCNN).
The proposed AGNN falls into the former category; it is a message passing based GNN, where the nodes, edges, and message passing functions are all parameterized by neural networks. It shares the general idea of mining relationships over graphs but has significant differences. First, AGNN is unique in its spatial information preserving nature, as opposed to conventional fully connected GNNs, which is crucial for per-pixel prediction tasks. Second, to efficiently capture the relationship between two frames, we introduce a differentiable attention mechanism that attends to the correlated information and produces more discriminative edge features. Third, as far as we know, there is no prior attempt to explore GNNs in ZVOS.
Traditional ZVOS methods typically rely on handcrafted features and certain heuristic assumptions about the foreground (e.g., local motion differences, background priors). Some others explore more efficient object representations, such as dense point trajectories [41, 42, 66] or object proposals [74, 27, 23, 36]. Most of these methods work in a purely unsupervised manner, without using any training data.
Recently, with the renaissance of deep learning, more research effort has been devoted to tackling this problem within deep learning frameworks, leading to zero-shot solutions [13, 21, 58, 7, 30, 31, 29, 37]. For instance, a multi-layer perceptron based detector was designed to detect moving objectness. Li et al. integrated deep learning based instance embeddings and motion saliency to boost performance. Some others turned to fully convolutional networks (FCNs) [3, 34, 77]. They introduced two-stream networks to fuse appearance and motion information [29, 21, 7]
, or explored more efficient feature extraction models and LSTM variants, to better locate the foreground objects.
The differences from previous methods are multifold: our AGNN 1) provides a unified, end-to-end trainable, graph model based ZVOS solution; 2) efficiently mines diverse and high-order relations within videos, through iteratively propagating and fusing messages over the graph; and 3) utilizes a differentiable attention mechanism to capture the correlated information between frame pairs.
IOCS [50, 39, 18] aims to jointly segment common objects belonging to the same semantic class in a given set of related images. Early methods usually formulate IOCS as an energy function defined over the whole or a part of the image set, considering both intra- and inter-image cues [64, 25, 52, 65]. To capture the relationships between images, some methods applied scene matching techniques, global appearance models, discriminative clustering methodologies, manifold ranking, or saliency heuristics [16, 56]. There are only very few deep IOCS models [4, 32], mainly due to the lack of a proper, end-to-end modeling strategy for this problem. [4, 32] tackled IOCS through a pair-wise comparison protocol and employed a Siamese network to capture the similarity between two related images. Our AGNN based IOCS solution is significantly different from [4, 32]. First, [4, 32] consider IOCS as a pair-wise image matching problem, while we formulate IOCS as an information propagation and fusion process among multiple images, meaning our model can capture richer relations from a global view. Second, Siamese network based systems only handle pair-wise relations, while our message passing based iterative inference can learn higher-order relations among multiple images. Third, our method is based on a graph model, yielding a more general and elegant framework for modeling IOCS.
Before elaborating on our proposed AGNN (§3.2), we first give a brief introduction to generic formulations of GNN models (§3.1). Finally, in §3.3, we provide detailed information on our network architecture.
Based on deep neural networks and graph theory, GNNs are powerful for collectively aggregating information from data represented in graph domains [53, 14]. Specifically, a GNN model is defined over a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$. Each node $v_i\in\mathcal{V}$ is associated with an initial node representation (or node state, node embedding) $\mathbf{h}_i^0$. Each edge $e_{i,j}\in\mathcal{E}$ is a pair $(v_i,v_j)$, with an edge representation $\mathbf{e}_{i,j}$. For each node $v_i$, we learn an updated node representation $\mathbf{h}_i^k$ by aggregating the representations of its neighbors. The final representation is used to produce an output $\hat{\mathbf{y}}_i$, e.g., a node label. More specifically, GNNs map the graph $\mathcal{G}$ to the node outputs through two phases. First, a parametric message passing phase runs for $K$ steps, recursively propagating messages and updating node representations. At the $k$-th iteration, each node $v_i$ updates its state according to its received message $\mathbf{m}_i^k$ (i.e., summarized information from its neighbors $\mathcal{N}_i$) and its previous state $\mathbf{h}_i^{k-1}$:

$$\mathbf{m}_i^k=\textstyle\sum_{v_j\in\mathcal{N}_i}M(\mathbf{h}_j^{k-1},\mathbf{e}_{i,j}),\qquad \mathbf{h}_i^k=U(\mathbf{h}_i^{k-1},\mathbf{m}_i^k),\tag{1}$$

where $M$ and $U$ are the message function and state update function, respectively. After $k$ iterations of aggregation, $\mathbf{h}_i^k$ captures the relations within the $k$-hop neighborhood of node $v_i$. Second, a readout phase maps the node representation $\mathbf{h}_i^K$ of the final iteration to a node output through a readout function $R$:

$$\hat{\mathbf{y}}_i=R(\mathbf{h}_i^K).\tag{2}$$

The message function $M$, update function $U$, and readout function $R$ are all learned differentiable functions.
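As a concrete illustration, this two-phase scheme can be sketched in a few lines (a schematic sketch only; the dictionaries and the placeholder functions `M` and `U` stand in for the learned neural message and update networks):

```python
def message_passing(h0, edges, M, U, K):
    """Generic GNN inference: K rounds of message passing.

    h0    : dict node_id -> initial state
    edges : dict (i, j) -> edge feature, read as "message to i from j"
    M     : message function M(h_j, e_ij) -> message
    U     : update function U(h_i, m_i) -> new state
    """
    h = dict(h0)
    for _ in range(K):
        new_h = {}
        for i in h:
            # aggregate messages from all neighbours j of node i
            m_i = sum(M(h[j], e) for (t, j), e in edges.items() if t == i)
            new_h[i] = U(h[i], m_i)
        h = new_h
    return h
```

With two fully connected nodes, one round of averaging-style updates already mixes the two states, which is the intuition behind the "global view" argument above.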
Next, we present our AGNN based ZVOS solution, which essentially extends traditional fully connected GNNs to (1) preserve spatial features; and (2) capture pair-wise relations (edges) via a differentiable attention mechanism.
Problem Definition and Notations. Given a set of training samples and an unseen test video $\mathcal{I}=\{I_1,\dots,I_N\}$ with $N$ frames in total, the goal of ZVOS is to generate a corresponding sequence of binary segment masks $\{\hat{S}_1,\dots,\hat{S}_N\}$. To achieve this, AGNN represents the video as a directed graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where node $v_i\in\mathcal{V}$ represents the $i$-th frame $I_i$, and edge $e_{i,j}\in\mathcal{E}$ indicates the relation from $v_i$ to $v_j$. To comprehensively capture the underlying relationships between video frames, we assume $\mathcal{G}$ is fully connected and includes a self-connection at each node (see Fig. 2 (a)). For clarity, we refer to $e_{i,i}$, which connects a node to itself, as a loop-edge, and to $e_{i,j}$ ($i\neq j$), which connects two different nodes $v_i$ and $v_j$, as a line-edge.
The core idea of our AGNN is to perform $K$ message propagation iterations over $\mathcal{G}$ to efficiently mine rich and high-order relations within the video. This helps to better capture the video content from a global view and obtain more accurate foreground estimates. We then read out the segmentation predictions from the final node states $\{\mathbf{h}_i^K\}_i$. Next, we describe each component of our model in detail.
FCN-Based Node Embedding. We leverage DeepLabV3, a classical FCN based semantic segmentation architecture, to extract effective frame features as node representations (see Fig. 2 (b) and Fig. 3 (a)). For node $v_i$, its initial embedding is computed as:

$$\mathbf{h}_i^0=F_{\text{DeepLab}}(I_i)\in\mathbb{R}^{W\times H\times C},\tag{3}$$

i.e., a 3D tensor feature with a $W\times H$ spatial resolution and $C$ channels, which preserves spatial information as well as high-level semantic information.
Intra-Attention Based Loop-Edge Embedding. A loop-edge $e_{i,i}$ is a special edge that connects a node to itself. The loop-edge embedding $\mathbf{e}_{i,i}^k$ is used to capture the intra relations within the node representation (i.e., within a single frame representation). We formulate $\mathbf{e}_{i,i}^k$ as an intra-attention mechanism [61, 70], which has been proven complementary to convolutions and helpful for modeling long-range, multi-level dependencies across image regions. In particular, the intra-attention calculates the response at a position by attending to all the positions within the same node embedding (see Fig. 2 (c) and Fig. 3 (b)):

$$\mathbf{e}_{i,i}^k=\gamma\,\text{softmax}\big((W_q*\mathbf{h}_i^k)(W_k*\mathbf{h}_i^k)^{\top}\big)(W_v*\mathbf{h}_i^k)+\mathbf{h}_i^k,\tag{4}$$

where '$*$' represents the convolution operation, the $W$s indicate learnable convolution kernels, and $\gamma$ is a learnable scale parameter. Eq. 4 makes each position of the output $\mathbf{e}_{i,i}^k$ encode contextual information as well as its original information, thus enhancing the representability.
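For intuition, this intra-attention can be sketched on a flattened feature map (an illustrative NumPy sketch; plain matrix products stand in for the learned convolutions, and `Wq`, `Wk`, `Wv`, `gamma` are hypothetical stand-ins):

```python
import numpy as np

def intra_attention(x, Wq, Wk, Wv, gamma):
    """Self-attention over spatial positions of one node embedding.

    x  : (HW, C) node feature flattened over the W x H grid
    Wq, Wk, Wv : (C, C) projection matrices (1x1 convs in the paper)
    gamma : scalar mixing weight of the learnable residual
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = q @ k.T                                  # (HW, HW) position affinities
    att = np.exp(att - att.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)     # row-wise softmax
    return gamma * (att @ v) + x                   # residual, as in the text
```

With `gamma = 0` the residual path dominates and the input is returned unchanged, mirroring the common practice of initializing the attention scale at zero.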
Inter-Attention Based Line-Edge Embedding. A line-edge $e_{i,j}$ ($i\neq j$) connects two different nodes $v_i$ and $v_j$. The line-edge embedding $\mathbf{e}_{i,j}^k$ is used to mine the relation from node $v_i$ to $v_j$ in the node embedding space (see Fig. 2 (b)). Here we compute an inter-attention mechanism to capture the bi-directional relations between the two nodes (see Fig. 2 (c) and Fig. 3 (c)):

$$\mathbf{e}_{i,j}^k=\mathbf{h}_i^k\,W_c\,(\mathbf{h}_j^k)^{\top}\in\mathbb{R}^{WH\times WH},\tag{5}$$

where $\mathbf{e}_{i,j}^k$ indicates the outgoing edge feature and $\mathbf{e}_{j,i}^k$ the incoming one, for node $v_i$. $W_c\in\mathbb{R}^{C\times C}$ indicates a learnable weight matrix, and $\mathbf{h}_i^k$ and $\mathbf{h}_j^k$ are flattened into $WH\times C$ matrix representations. Each element in $\mathbf{e}_{i,j}^k$ reflects the similarity between a row of $\mathbf{h}_i^k$ and a column of $(\mathbf{h}_j^k)^{\top}$. As a result, $\mathbf{e}_{i,j}^k$ can be viewed as the importance of node $v_j$'s embedding to $v_i$, and vice versa. By attending to each node pair, the edge embedding explores their joint representations in the node embedding space.
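A minimal sketch of this bilinear inter-attention, on flattened embeddings (illustrative only; `hi`, `hj` and `W` are hypothetical stand-ins for the node embeddings and the learned weight matrix):

```python
import numpy as np

def inter_attention(hi, hj, W):
    """Bi-directional inter-attention between two node embeddings.

    hi : (P, C) flattened embedding of node v_i (P spatial positions)
    hj : (Q, C) flattened embedding of node v_j
    W  : (C, C) learnable weight matrix
    Returns (e_ij, e_ji); every entry scores one position pair.
    """
    e_ij = hi @ W @ hj.T   # similarity of each position of v_i vs. v_j
    return e_ij, e_ij.T    # the incoming edge is the transpose
```

Note that both directions come from a single matrix product, which is why the bilinear form is cheap compared to optical-flow-based correspondence.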
Gated Message Aggregation. In our AGNN, for the message passed in the self-loop, we view the loop-edge embedding itself as a message (see Fig. 3 (b)), since it already contains the contextual and original node information (see Eq. 4):

$$\mathbf{m}_{i,i}^k=\mathbf{e}_{i,i}^k.\tag{6}$$

For the message passed from node $v_j$ to $v_i$ (see Fig. 3 (c)), we have:

$$\mathbf{m}_{j,i}^k=\text{softmax}(\mathbf{e}_{i,j}^k)\,\mathbf{h}_j^k\in\mathbb{R}^{WH\times C},\tag{7}$$

where softmax($\cdot$) normalizes each row of the input. Thus, each row (position) of $\mathbf{m}_{j,i}^k$ is a weighted combination of the rows (positions) of $\mathbf{h}_j^k$, where the weights come from the corresponding row of the normalized edge embedding. In this way, the message function assigns node $v_j$'s edge-weighted feature (i.e., its message) to the neighbor node $v_i$. Then, $\mathbf{m}_{j,i}^k$ is reshaped back to a 3D tensor of size $W\times H\times C$.
In addition, because some frames are noisy due to camera shift or out-of-view objects, their messages may be useless or even harmful. We apply a learnable gate $g$ to measure the confidence of a message $\mathbf{m}_{j,i}^k$:

$$g_{j,i}^k=\sigma\big(\text{GAP}(\mathbf{m}_{j,i}^k)*\mathbf{w}_g+b_g\big),\tag{8}$$

where GAP($\cdot$) indicates global average pooling, used to generate channel-wise responses, $\sigma$ is the logistic sigmoid function, and $\mathbf{w}_g$ and $b_g$ are the trainable convolution kernel and bias. The messages from the neighbor nodes and the self-loop are then aggregated as:

$$\mathbf{m}_i^k=\textstyle\sum_{v_j\in\mathcal{V}}g_{j,i}^k\circ\mathbf{m}_{j,i}^k,\tag{9}$$

where '$\circ$' denotes the channel-wise Hadamard product. Here, the gate mechanism is used to filter out irrelevant information from noisy frames. See §4.3 for a quantitative study of this design.
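Eqs. 7 and 8 can be sketched together as follows (a NumPy sketch under simplifying assumptions: the gate's convolution is reduced to an element-wise weight `wg` and bias `bg`, which are hypothetical stand-ins):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_message(hj, e_ij, wg, bg):
    """Message from sender node v_j to receiver v_i with a channel gate.

    hj   : (Q, C) flattened embedding of the sender node v_j
    e_ij : (P, Q) inter-attention edge feature (receiver x sender positions)
    wg, bg : (C,) per-channel gate weight and bias
    """
    att = np.exp(e_ij - e_ij.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)   # row-wise softmax (Eq. 7)
    m = att @ hj                                 # edge-weighted message, (P, C)
    g = sigmoid(m.mean(axis=0) * wg + bg)        # GAP -> channel-wise gate
    return m * g                                 # Hadamard-gated message
```

With a zero-initialized gate the sigmoid outputs 0.5, so every channel is passed at half strength until training learns which senders to trust.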
ConvGRU Based Node-State Update. In step $k$, after aggregating all the information from the neighbor nodes and itself (Eq. 9), node $v_i$ gets a new state by taking into account its previous state $\mathbf{h}_i^{k-1}$ and its received message $\mathbf{m}_i^k$. To preserve the spatial information conveyed in $\mathbf{h}_i^{k-1}$ and $\mathbf{m}_i^k$, we leverage ConvGRU to update the node state (Fig. 2 (e)):

$$\mathbf{h}_i^k=\text{ConvGRU}(\mathbf{h}_i^{k-1},\mathbf{m}_i^k).\tag{10}$$

ConvGRU was proposed as a convolutional counterpart to the earlier fully connected GRU, introducing convolution operations into the input-to-state and state-to-state transitions.
Readout Function. After $K$ message passing iterations, we obtain the final state $\mathbf{h}_i^K$ for each node $v_i$. Finally, in the readout phase, we get a segmentation prediction map $\hat{S}_i$ from $\mathbf{h}_i^K$ through a readout function (see Fig. 2 (f)). Slightly different from Eq. 2, we concatenate the final node state $\mathbf{h}_i^K$ and the original node feature $\mathbf{h}_i^0$ and feed the combined feature into $R$:

$$\hat{S}_i=R\big([\mathbf{h}_i^K,\mathbf{h}_i^0]\big).\tag{11}$$

Again, to preserve spatial information, the readout function is implemented as a small FCN, which has three convolution layers with a sigmoid function to normalize the prediction to $[0,1]$.
The convolution operations in the intra-attention (Eq. 4) and update function (Eq. 10) are realized with small convolutional layers, and the readout function (Eq. 11) consists of two convolutional layers cascaded by a final convolutional layer. As a message passing based GNN model, these functions share weights among all the nodes. Moreover, all the above functions are carefully designed to avoid disturbing spatial information, which is essential for ZVOS since it is a pixel-wise prediction task.
Our whole model is end-to-end trainable, as all the functions in AGNN are parameterized by neural networks. We use the first five convolution blocks of DeepLabV3 as our backbone for feature extraction. For an input video, each frame $I_i$ is represented as a node $v_i$ in the video graph and associated with an initial node state $\mathbf{h}_i^0$. Then, after a total of $K$ message passing iterations, for each node $v_i$, we use the readout function in Eq. 11 to obtain a corresponding segmentation prediction map $\hat{S}_i$. More details on the training and testing phases are provided below.
[Table 1: comparison on the DAVIS16 val-set against KEY, MSG, NLC, CUT, FST, SFL, MP, FSEG, LVO, ARP, PDB, MOA and AGS; numeric entries omitted here.]
[Table 2: per-category results on Youtube-Objects for Airplane (6), Bird (6), Boat (15), Car (7), Cat (16), Cow (20), Dog (27), Horse (14), Motorbike (10), Train (5), and the average; numeric entries omitted here.]
Training Phase. As we operate on batches of $N$ frames (where $N$ is allowed to vary, depending on GPU memory), we leverage a random sampling strategy to train AGNN. Specifically, we split each training video into $N$ segments and randomly select one frame from each segment. We then feed the $N$ sampled frames into a batch and train AGNN, so the relationships among all the sampled frames in each batch are represented by an $N$-node graph. Such a sampling strategy provides robustness to variations and enables the network to fully exploit all frames; the diversity among the samples helps our model better capture the underlying relationships and improves its generalizability. Let $S_n$ and $\hat{S}_n$ denote the ground-truth segmentation mask and predicted foreground map for a training frame $I_n$. Our model is trained through the weighted binary cross entropy loss (see Fig. 2):

$$\mathcal{L}=-\textstyle\sum_{x}\big((1-\eta)\,S_n(x)\log\hat{S}_n(x)+\eta\,(1-S_n(x))\log(1-\hat{S}_n(x))\big),\tag{12}$$

where $\eta$ indicates the foreground-background pixel number ratio in $S_n$. It is worth mentioning that, since AGNN handles multiple video frames at the same time, it offers a remarkably efficient training data augmentation strategy, as the combination candidates are numerous. In our experiments, during training, we randomly select 2 videos from the training set and sample 3 frames per video ($N=6$), due to computation limitations. In addition, we set the total number of message passing iterations to $K=3$. Quantitative studies of these settings can be found in §4.3.
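The weighted loss above can be sketched as follows (our reading of the weighting; the exact placement of the ratio `eta` is an assumption, noted in the comments):

```python
import numpy as np

def weighted_bce(pred, gt, eps=1e-7):
    """Weighted binary cross entropy for foreground/background imbalance.

    pred : predicted foreground probabilities in (0, 1)
    gt   : binary ground-truth mask
    The foreground term is down-weighted by (1 - eta) and the background
    term by eta, with eta the foreground pixel fraction (an assumption
    about the exact weighting used in the paper).
    """
    pred = np.clip(pred, eps, 1 - eps)         # numerical safety for log
    eta = gt.sum() / gt.size                   # foreground pixel fraction
    loss = -((1 - eta) * gt * np.log(pred)
             + eta * (1 - gt) * np.log(1 - pred))
    return loss.mean()
```

A perfect prediction drives the loss toward zero, while an uninformative constant prediction keeps it bounded away from zero, which is the property the training signal relies on.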
Testing Phase. After training, we can apply the learned AGNN to perform per-pixel object prediction over unseen videos. For an input test video with $N$ frames, we split it into $N/N'$ subsets $\{\mathcal{I}_1,\dots,\mathcal{I}_{N/N'}\}$, where each subset contains $N'$ frames sampled with an interval of $N/N'$ frames: $\mathcal{I}_k=\{I_k,\,I_{k+N/N'},\,I_{k+2N/N'},\dots\}$. We then feed each subset into AGNN to obtain the segmentation maps of all the frames in that subset. In practice, we set $N'=5$ during testing; we quantitatively study this setting in §4.3. As AGNN does not require time-consuming optical flow computation and processes $N'$ frames in one feed-forward propagation, it achieves fast per-frame inference. Following the widely used protocol [58, 57, 55], we apply CRF as a post-processing step, which adds a small per-frame overhead. More implementation details can be found in §4.1.1.
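The strided test-time split can be sketched as follows (`split_into_subsets` is a hypothetical helper; its second argument is the number of subsets, i.e. the temporal stride):

```python
def split_into_subsets(num_frames, num_subsets):
    """Strided test-time batching of frame indices.

    Subset k holds frames k, k + num_subsets, k + 2 * num_subsets, ...,
    so each batch spans the whole video at a uniform temporal stride
    instead of covering one contiguous chunk.
    """
    return [list(range(k, num_frames, num_subsets))
            for k in range(num_subsets)]
```

For a 10-frame video processed 5 frames at a time (N' = 5, hence 2 subsets), `split_into_subsets(10, 2)` yields `[[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]`, so every batch sees content from across the video.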
We first report performance on the main task: unsupervised video object segmentation (§4.1). Then, in §4.2, to further demonstrate the advantages of our AGNN model, we test it on an additional task: image object co-segmentation. Finally, we conduct an ablation study in §4.3.
Datasets and Metrics: We use three well-known datasets:
DAVIS16 is a challenging video object segmentation dataset, consisting of 50 videos in total (30 for training and 20 for validation) with pixel-wise annotations for every frame. Three evaluation criteria are used on this dataset: region similarity $\mathcal{J}$ (Intersection-over-Union), boundary accuracy $\mathcal{F}$, and time stability $\mathcal{T}$.
Youtube-Objects comprises 126 video sequences belonging to 10 object categories, containing more than 20,000 frames in total. Following its protocol, we use $\mathcal{J}$ to measure the segmentation performance.
DAVIS17 consists of 60 videos in the training set, 30 in the validation set and 30 in the test-dev set. Different from DAVIS16 and Youtube-Objects, which only focus on object-level video object segmentation, DAVIS17 provides instance-level annotations.
Implementation Details: Following [44, 55], both static data from image salient object segmentation datasets (MSRA10K, DUT) and video data from the training set of DAVIS16 are used iteratively to train our model. In a 'static-image' iteration, we randomly sample 6 images from the static training data to train our backbone network (DeepLabV3) to extract more discriminative foreground features; to this end, a convolution layer with a sigmoid function is appended as an intermediate output layer, which can access the static-image supervision signal. This is followed by a 'dynamic-video' iteration, in which we use the sampling strategy described in §3.3 to sample 6 video frames and train the whole AGNN model. The 'static-image' and 'dynamic-video' iterations are executed alternately. To apply the trained AGNN model to DAVIS17, we first use a category-agnostic Mask R-CNN to generate instance-level object proposals for each frame. We then run AGNN on the whole video to generate a coarse mask for the primary objects in each frame. The object-level masks are used to filter out background proposals and highlight foreground ones. By combining the instance bounding-box proposals and the coarse masks, we obtain the instance-level mask for each primary object. Finally, to link multiple instances across frames, we use overlap ratio and optical flow as an association metric to match instance-level masks.
Val-set of DAVIS16. We compare the proposed AGNN with the top ZVOS methods from the DAVIS16 benchmark (https://davischallenge.org/davis2016/soa_compare.html, deadline: Mar. 2019). Table 1 shows the detailed results. Our AGNN outperforms the best reported result (i.e., AGS) on the DAVIS16 benchmark by a significant margin in terms of mean $\mathcal{J}$ (80.7 vs. 79.7) and mean $\mathcal{F}$ (79.1 vs. 77.4). Compared to PDB, which uses the same training protocol and training datasets, our AGNN yields significant gains of 3.5 and 4.6 in terms of mean $\mathcal{J}$ and mean $\mathcal{F}$, respectively.
Youtube-Objects. Table 2 gives the detailed per-category performance and average results on Youtube-Objects. As can be seen, our AGNN performs favorably according to the mean $\mathcal{J}$ criterion. Furthermore, unlike other methods whose performance fluctuates across categories, AGNN maintains stable performance, which further proves its robustness and generalizability.
Test-dev set of DAVIS17. In Table 3 we report a comparison with the recent instance-level ZVOS method RVOS on the DAVIS17 test-dev set. AGNN significantly outperforms RVOS on most evaluation criteria.
Fig. 4 depicts visual results of the proposed AGNN on two challenging video sequences, soapbox and judo, from DAVIS16 and DAVIS17, respectively. In soapbox, the primary objects undergo huge scale variation, deformation and view changes, yet AGNN still generates accurate foreground segments. AGNN also handles judo well, although the different foreground instances have similar appearance and undergo rapid motion.
Our AGNN can be viewed as a general framework for capturing high-order relations among images (or frames). To demonstrate its generalizability, we extend AGNN to the IOCS task. Rather than extracting the foreground objects across multiple, relatively similar frames of a video, IOCS needs to infer the common objects from a group of semantically related images.
Datasets and Metrics: We perform experiments on two well-known IOCS datasets:
[Table 4: comparison with GO-FMR, FCNs, CA and AGNN; numeric entries omitted here.]
[Table 5: comparison with FCA, CSA, DOCS and AGNN; numeric entries omitted here.]
[Per-class comparison with DC, Internet, TDK, GO-FMR, DDCRF, CA, FCA, CSA, DOCS, CoA and AGNN; numeric entries omitted here.]
Implementation Details: Following [4, 32], we employ PASCAL VOC to train our model. In each iteration, we randomly sample a group of images that belong to the same semantic class, and feed two such groups with randomly selected classes (6 images in total) to the network. All other experimental settings are the same as for ZVOS.
After training, we evaluate our method on the test sets of the PASCAL VOC and Internet datasets. When processing an image, IOCS must leverage information from the whole image group (as the images are typically different and some are irrelevant) [49, 65]. To this end, for each image $I_t$ to be segmented, we uniformly split the other images into several groups. We first feed $I_t$ together with the first group as a batch and store the resulting node state of $I_t$. We then feed the next group along with the stored node state of $I_t$ to obtain a new state of $I_t$. After iterating over all groups, the final state of $I_t$ encodes its relationships to all the other images and is used to produce its final co-segmentation result.
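This incremental, group-by-group inference can be sketched as follows (illustrative only; `run_agnn` and its signature are hypothetical stand-ins for one AGNN forward pass that updates the target's node state):

```python
def group_inference(target, others, group_size, run_agnn):
    """Incremental co-segmentation inference for one target image.

    target     : the image whose node state we accumulate
    others     : the remaining images of the set
    run_agnn(batch, state) -> new state of `target` after message
    passing over `batch`, carrying `state` in from the previous group.
    """
    state = None
    for s in range(0, len(others), group_size):
        batch = [target] + others[s:s + group_size]
        state = run_agnn(batch, state)   # fold in one more image group
    return state
```

With a stub `run_agnn` that just accumulates batch sizes, seven other images in groups of three produce three forward passes, showing how the target's state folds in the whole set without ever batching all images at once.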
PASCAL VOC. It is very challenging to segment the common objects in this dataset, since the objects undergo drastic variations in scale, position and appearance; in addition, some images contain multiple objects belonging to different categories. On this dataset, we compare AGNN with six representative methods, including Siamese network based co-segmentation methods [4, 32] as well as deep semantic segmentation models (e.g., FCNs).
Table 4 shows detailed results in terms of mean $\mathcal{J}$. FCNs segment each image individually (without considering other related images) and thus give poor performance. The Siamese-based methods [4, 32] consider pairs of images and gain better results. Our AGNN achieves the best performance because it considers high-order information from multiple images during inference, enabling it to capture richer semantic relations within the image groups.
Internet. We evaluate our model (pre-trained on PASCAL VOC) on the Internet dataset [4, 49]. The quantitative results in Table 5 again demonstrate the superiority of AGNN (a 4.5% gain over the second-best method). AGNN scores higher than the compared methods on three classes: Car (84.0%), Horse (72.6%) and Airplane (76.1%).
Fig. 5 shows some sample results. Specifically, the first four images in the top row belong to the Cat category (red circle), while the last four images contain the Person category (yellow circle) with significant intra-class variation. For both cases, our AGNN successfully detects the common object instances amongst background clutter. For the second row, AGNN also performs well in the cases with remarkable intra-class appearance change.
[Table 6 (excerpt): full model (3 iterations, $N'=5$): mean $\mathcal{J}$ 80.7; w/o gated message aggregation (Eq. 9): 80.1 ($-$0.6).]
We perform an ablation study on DAVIS16 to investigate the effect of each essential component of AGNN.
Effectiveness of Our AGNN. To quantify the contribution of AGNN, we derive a baseline, w/o. AGNN, which reports the results of our backbone model, DeepLabV3, alone. As shown in Table 6, AGNN indeed brings a significant performance improvement (72.2 → 80.6 in terms of mean $\mathcal{J}$).
Gated Message Aggregation Strategy. In Eq. 9, we equip the message passing with a channel-wise gating mechanism to decrease the negative influence of irrelevant frames. To evaluate this design, we offer a baseline, w/o. Gated Message, which aggregates messages directly. A performance drop is observed after excluding the gates.
Message Passing Iterations $K$. To investigate the effect of the number of message passing iterations, we report performance as a function of $K$. We find that more iterations yield better results, with performance converging at $K=3$.
Number of Nodes During Inference. To evaluate the impact of the number of nodes $N'$ during inference, we report performance for different values of $N'$. We observe that, with more input frames, the performance rises accordingly; when even more frames are considered, the final performance does not change noticeably. This may be due to the redundant content in video sequences.
This paper proposes a novel AGNN based ZVOS framework for capturing the relations among video frames and inferring the common foreground objects. It leverages an attention mechanism to capture the similarity between nodes and performs recursive message passing to mine the underlying high-order correlations. We further demonstrate the generalizability of AGNN by extending it to the IOCS task. Extensive experiments on three ZVOS and two IOCS datasets show that AGNN performs favorably against current state-of-the-art methods, illustrating its ability to capture diverse relations among similar video frames or semantically related images.
Acknowledgements This work was supported in part by ARO grant W911NF-18-1-0296, Beijing Natural Science Foundation under Grant 4182056, CCF-Tencent Open Fund, Zhijiang Lab’s International Talent Fund for Young Professionals, and the National Science Foundation (CAREER IIS-1253549).