1 General description of the system
The main pipeline of our system is depicted in Fig. 1. It comprises the following steps:

For each clinical case, a dermoscopic image feeds a Lesion Segmentation Network that generates a binary mask outlining the area of the image that corresponds to the lesion. This module is described in Section 2.

Each clinical case, now defined by an image-mask pair, goes through the Data Augmentation Module. This module extends the initial visual support of the lesion by generating new views corresponding to different rotations and cropped areas. Hence, the output of this module is an extended set of images related to the lesion. Section 3 provides a detailed description of this data augmentation process.

The next step in the process is the Structure Segmentation Network. It segments each view of the lesion into a set of eight global and local structures that have proven to be very important for dermatologists in their daily diagnosis. Examples of these structures are dots/globules, regression areas, and streaks. Hence, the output of this module is a set of 8 segmentation maps, each one associated with a particular structure of interest. This module is introduced in Section 4.

Finally, the augmented set is passed to the Diagnosis Network, which is in charge of providing the final diagnosis for the clinical case. The description of this network can be found in Section 5.
2 Lesion Segmentation Network
The Lesion Segmentation Network has been developed by training a Fully Convolutional Network (FCN) (Shelhamer et al., 2016). FCNs have achieved state-of-the-art results on the task of semantic segmentation of general-content images, as demonstrated in the PASCAL VOC Segmentation challenge (Everingham et al., 2015). In order to train a network for our particular task of lesion/skin segmentation, we have used the training set of the lesion segmentation task of the 2017 ISBI challenge. Let us note that the goal of this module is not to generate very accurate segmentation maps of a lesion, but to broadly identify the area of the image that corresponds to the lesion, yielding a binary map for each clinical case.
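As a rough illustration of this "broad identification" goal (not the authors' pipeline; the helper name is ours and SciPy is assumed), an FCN probability map could be reduced to a single binary lesion map like this:

```python
import numpy as np
from scipy import ndimage

def prob_map_to_mask(prob, threshold=0.5):
    """Turn an FCN lesion-probability map into one binary mask.

    Thresholds the map, keeps only the largest connected component
    and fills its holes, so the mask broadly outlines a single lesion
    region rather than an accurate pixel-perfect segmentation.
    """
    binary = prob >= threshold
    labels, n = ndimage.label(binary)
    if n == 0:
        return np.zeros_like(binary, dtype=bool)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0  # index 0 is the background label; ignore it
    largest = sizes.argmax()
    return ndimage.binary_fill_holes(labels == largest)
```
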
3 Data Augmentation Module and Normalized Polar Coordinates
It is well known that data augmentation notably boosts the performance of deep neural networks, especially when the amount of training data is limited. Among all the potential image variations and artifacts, invariance to orientation is probably the main requirement of our method, as dermatologists do not follow a specific protocol during the capture of a lesion. Other more complex geometric transformations, such as affine or projective transforms, are less interesting here, as the dermatoscope is normally placed directly over and orthogonally to the lesion surface. The particular process of data augmentation is described next:

First, starting from the image-mask pair, we generate a set of rotated versions.

As rotating an image without losing any visual information requires incorporating new areas that were not present in the original view, we find and crop the largest inner rectangle whose pixels all belong to the original image.

Finally, as our subsequent CNNs (Structure Segmentation and Diagnosis) require square input images of 256×256 pixels, we perform several square crops, which are in turn resized to the required dimensions.
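The rotate-then-crop steps above can be sketched with the well-known closed-form solution for the largest axis-aligned rectangle inscribed in a rotated image (the function name is ours and the paper's exact cropping may differ):

```python
import math

def rotated_rect_with_max_area(w, h, angle):
    """Width/height of the largest axis-aligned rectangle that fits
    entirely inside a w x h image rotated by `angle` (radians), i.e.
    a crop guaranteed to contain only original pixels."""
    if w <= 0 or h <= 0:
        return 0.0, 0.0
    sin_a, cos_a = abs(math.sin(angle)), abs(math.cos(angle))
    side_long, side_short = max(w, h), min(w, h)
    if side_short <= 2.0 * sin_a * cos_a * side_long or abs(sin_a - cos_a) < 1e-10:
        # Constrained case: the inner rectangle touches the short side.
        x = 0.5 * side_short
        wr, hr = (x / sin_a, x / cos_a) if w >= h else (x / cos_a, x / sin_a)
    else:
        cos_2a = cos_a * cos_a - sin_a * sin_a
        wr = (w * cos_a - h * sin_a) / cos_2a
        hr = (h * cos_a - w * sin_a) / cos_2a
    return wr, hr
```

The resulting crop would then be resized to the 256×256 input expected by the subsequent networks.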
Considering the aforementioned rotations and crops, for each given clinical case, we generate an augmented set of $V = 24$ images. In addition, for each generated view, we compute the Normalized Polar Coordinates from the lesion mask. The goal of these alternative coordinates is to support subsequent processing blocks by providing invariance against shifts, rotations, changes in size and even irregular shapes of the lesions. To do so, we transform pixel Cartesian coordinates $(x, y)$ into normalized polar coordinates $(\rho, \phi)$, where $\rho$ and $\phi$ stand for the normalized radius and angle, respectively. The process to compute this transformation is as follows: first, the mask of the lesion is approximated by an ellipse with the same second-order moments. Then, we compute the affine matrix that transforms this ellipse into a unit-radius circle centered at (0,0). Figure 2 shows an example of a rotated and cropped view of a lesion, and its corresponding normalized polar coordinates.
4 Structure Segmentation Network
The goal of this module is, given an input view of the lesion, to provide a corresponding segmentation into a predefined set of textural patterns and local structures that are of special interest for dermatologists in their diagnosis. In particular, we have considered a set of eight structures: 1) dots, globules and cobblestone pattern, 2) reticular patterns and pigmented networks, 3) homogeneous areas, 4) regression areas, 5) blue-white veil, 6) streaks, 7) vascular structures, and 8) unspecific patterns.
The main challenge in developing this module is the generation of a strongly-labeled training dataset, in which each image has an associated ground-truth pixel-wise segmentation. This kind of annotation is often hard to obtain, as it requires a huge effort from dermatologists to manually outline the segmentations. Alternatively, providing weak image-level labels indicating only which structural patterns are present in each lesion is much easier for dermatologists and therefore becomes more realistic. Hence, following this latter approach, we asked dermatologists of a collaborating medical institution, the Hospital Doce de Octubre in Madrid, to annotate the ISIC 2016 training dataset with the presence or absence of the 8 considered structures. In particular, we asked them to provide one label for each structure: 0 if the structure is not present, 1 if it is locally present, and 2 if it is present and large enough to be considered a global pattern in the lesion.
Given this weakly-annotated dataset, we have built our approach on the work of (Pathak et al., 2015), where the authors introduced a novel constrained optimization for weakly-labeled segmentation using CNNs. The output of this network is a reduced version of the input image (64×64 in our case) where, for each pixel location $i$, a softmax is used to transform the net outputs $f_i(c; \theta)$ into probabilities as follows:

$p_i(c \mid X; \theta) = \dfrac{\exp\left(f_i(c; \theta)\right)}{Z_i(\theta)}$   (1)

where $\theta$ represents the parameters of the CNN, and $Z_i(\theta) = \sum_{c'} \exp\left(f_i(c'; \theta)\right)$ is the partition function at location $i$. The presence or absence of a class, as well as an estimate of its size in the image, lead to particular constraints over the probability $P(c) = \sum_i p_i(c \mid X; \theta)$ accumulated over all pixel locations in the segmentation map:
If a structure is not present in an image, the constraint acts as an upper bound on the accumulated probability $P(c)$, which has to be nearly zero.

If a structure is local in an image, we impose a lower and upper bound on the accumulated probability in the image to control the total area of the structure in the lesion.

If a structure is global in an image, we impose a lower bound on the accumulated probability in the image to ensure a minimum area corresponding to the structure.
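The three constraints above could be relaxed into simple hinge-style penalties on the accumulated probability mass, as sketched below (the bound values `lo`, `hi` and `eps` are illustrative placeholders of ours; the paper follows the constrained dual formulation of Pathak et al., not this relaxation):

```python
import numpy as np

def constraint_penalty(prob_map, label, lo=50.0, hi=400.0, eps=1.0):
    """Penalty on the probability mass of one structure, accumulated
    over all locations of its 2-D segmentation map.

    label 0: absent -> mass must stay near zero (upper bound).
    label 1: local  -> mass bounded below and above (controls area).
    label 2: global -> mass bounded below (minimum area).
    """
    mass = float(prob_map.sum())
    if label == 0:
        return max(0.0, mass - eps)                      # upper bound ~0
    if label == 1:
        return max(0.0, lo - mass) + max(0.0, mass - hi)  # two-sided
    return max(0.0, lo - mass)                            # label == 2
```
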
In order to adapt this approach to our particular scenario, we have developed a set of modifications over the original approach, namely:

We observed that using the standard softmax function led to situations in which many constraints over local structures were satisfied by assigning some residual probability to every location in the segmentation map. From our point of view, this is an undesired behavior, as one would rather expect a small set of pixels showing large probabilities of belonging to the structure of interest. To overcome this limitation, we have used a parametric softmax in which the net outputs are scaled by a parameter $\beta$ before normalization. This parameter yields a soft approximation towards the max function: large values of $\beta$ lead to scenarios in which each location shows high probability for only a very reduced set of structures. In our case, we have used a large fixed value of $\beta$.
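A minimal sketch of such a parametric softmax (the actual value of the parameter used in the paper is not reproduced here):

```python
import numpy as np

def parametric_softmax(scores, beta):
    """Softmax with a sharpness parameter beta, applied per location.

    scores: array of shape (..., C) with one activation per class.
    Large beta approaches the max function, concentrating each
    location's probability mass on very few structures.
    """
    z = beta * scores
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```
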

We added a new constraint that helps to learn structures that tend to appear at specific spatial locations of the lesion: e.g., streaks tend to appear at the borders of a lesion. To that end, we accumulate probabilities only at those locations that are likely to contain the intended structure. We have defined these areas of interest over the Normalized Polar Coordinates described in Section 3, which are more adequate for this purpose than the original Cartesian coordinates.
We have implemented this module by taking the well-known VGG very-deep network (Simonyan and Zisserman, 2014) (the same network used as initialization for the lesion segmentation module), removing the top layers, and using the ISIC 2016 training dataset with the described constrained optimization over weak annotations (Pathak et al., 2015). The output of this module is, for each view of a clinical case, a tensor that contains the 8 probability maps of the considered structures.
5 Diagnosis Network
The Diagnosis Network gathers the information from the previous modules in order to generate a diagnosis for each clinical case. As in the previous modules, our approach takes a well-known CNN as a starting point and modifies the top layers to better adapt it to our problem.
The network chosen as basis is ResNet-50 (He et al., 2015), which uses residual layers to avoid the degradation problem that appears when more and more layers are stacked in a network. When applied to our 256×256 images, the last convolutional block (conv_5x) of this network produces a tensor of activations that hopefully behaves as a detector of high-level concepts (objects in ImageNet, the dataset for which it was originally designed).
In the original work, an average pooling layer transformed this tensor into a single value per channel and image, which was followed by a fully connected layer and a softmax to generate the final probabilities of the image containing the classes being detected. Hence, the goal of the average pooling was to fuse detections at various locations of the input image and to generate a unified score for each high-level concept.
In our approach, however, we have modified the structure of the top layers of the network, yielding the structure presented in Figure 3. We basically subdivide the top fully-connected layer providing the lesion diagnosis into three arms: a) the original arm, with an average pooling followed by a fully connected layer (FC1); b) a second arm that performs a normalized polar pooling ($R$ rings by $A$ angles) followed by a fully connected layer (FC2); and c) a third arm that estimates the asymmetry of the lesion based on the previous polar pooling and then applies a fully connected layer (FC3). The results of the three arms are then linearly combined using a Sum block. We next describe the novel blocks that are required in this new structure and that have been specifically developed in this work:

Modulation block: The goal of this block is to take advantage of the previous segmentations of the lesion into global and local structures, which are of great interest for dermatologists in their daily diagnosis. To do so, this block fuses the previous structure segmentation maps with the filter outputs of the conv_5x layer in ResNet-50. In particular, we modulate the outputs of the layer (2048 channels in our case) using the probabilities of the 8 local and global structures described in Section 4. By concatenating the resulting modulation with the original set of outputs, we finally generate a set of channels 9 times the size of the original one (18432 in our case).
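The modulation step can be sketched in NumPy with toy channel counts (the helper name is ours):

```python
import numpy as np

def modulate(features, structure_probs):
    """Fuse structure segmentation maps with convolutional features.

    features:        (C, H, W) activations of the last conv block.
    structure_probs: (S, H, W) probability maps of the S structures.
    Returns (C * (S + 1), H, W): each structure map multiplies every
    channel, and the result is concatenated with the original features.
    """
    c, h, w = features.shape
    s = structure_probs.shape[0]
    modulated = features[None, :, :, :] * structure_probs[:, None, :, :]
    modulated = modulated.reshape(s * c, h, w)
    return np.concatenate([features, modulated], axis=0)
```

With C=2048 and S=8 this yields the 9×2048 = 18432 channels mentioned in the text.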

Polar Pooling: This block performs pooling operations over the data (average or max pooling) but, rather than using rectangular spatial regions, employs sectors defined in polar coordinates. Hence, this block is defined for a given number of radial rings $R$ (radius ranging from 0 to 1) and angular sectors $A$ (angles ranging between 0 and $2\pi$), producing an output of size $R \times A$ per channel. Furthermore, in order to adapt to the irregular shapes of the lesions, we use the normalized polar coordinates described in Section 3. Since, depending on the shape of the lesion and the size of the tensor being pooled, some ring-sector combinations may not contain any pixels of the image, we can also define overlaps between adjacent rings and angular sectors to regularize the outputs. In addition, the division of the lesion into rings is non-uniform and ensures that every ring contains the same number of pixels for a perfectly circular lesion.
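A rough NumPy sketch, under simplifications of our own (uniform rings, average pooling only, and a moment-based whitening to obtain the normalized coordinates of Section 3):

```python
import numpy as np

def normalized_polar_coords(mask):
    """Per-pixel normalized polar coordinates (rho, phi) derived from a
    binary lesion mask: the mask is approximated by an ellipse with the
    same second-order moments, which is mapped to a unit circle."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)
    cov = np.cov((pts - center).T)
    # Whitening sends the moment-matched ellipse to a circle.
    w_inv = np.linalg.inv(np.linalg.cholesky(cov))
    h, wdt = mask.shape
    gy, gx = np.mgrid[0:h, 0:wdt]
    rel = np.stack([gx - center[0], gy - center[1]], axis=-1)
    uv = rel @ w_inv.T
    rho = np.hypot(uv[..., 0], uv[..., 1])
    rho = rho / max(rho[mask > 0].max(), 1e-9)  # rho = 1 at the border
    phi = np.arctan2(uv[..., 1], uv[..., 0])
    return rho, phi

def polar_average_pool(x, rho, phi, n_rings=2, n_sectors=4):
    """Average-pool a 2-D map over R x A sectors of the normalized
    polar coordinates (uniform rings here; the paper's rings are
    non-uniform, equalizing pixel counts for a circular lesion)."""
    ring = np.minimum((rho * n_rings).astype(int), n_rings - 1)
    sector = ((phi + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    out = np.zeros((n_rings, n_sectors))
    for r in range(n_rings):
        for a in range(n_sectors):
            sel = (ring == r) & (sector == a) & (rho <= 1.0)
            out[r, a] = x[sel].mean() if sel.any() else 0.0
    return out
```
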
Asymmetry: This block computes metrics that evaluate the asymmetry of a lesion for a given angle. In particular, given the polar division of the lesion into sectors, we compute the asymmetry for a set of candidate angles by folding the lesion over each angle and computing the accumulated absolute difference between corresponding sectors.
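A sketch of this folding metric over an already polar-pooled map of shape (R, A) (the sector-mirroring convention is our assumption):

```python
import numpy as np

def fold_asymmetry(sectors, fold):
    """Asymmetry of a polar-pooled map of shape (R, A) for one folding
    axis: fold over the axis through angular index `fold` and
    accumulate absolute differences between mirrored sectors."""
    n_rings, n_angles = sectors.shape
    total = 0.0
    for a in range(n_angles):
        mirrored = (2 * fold - a) % n_angles
        total += np.abs(sectors[:, a] - sectors[:, mirrored]).sum()
    return 0.5 * total  # each mirrored pair is visited twice

def asymmetry_profile(sectors):
    """Asymmetry value for every candidate folding angle."""
    return np.array([fold_asymmetry(sectors, k)
                     for k in range(sectors.shape[1])])
```
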
As shown in Figure 3, we combine these modules to generate a final output for each considered view of a clinical case.
Finally, in order to generate a final output for each clinical case, we assume independence between the views, leading to the factorization:

$p(d \mid \{X_v\}_{v=1}^{V}) \propto \prod_{v=1}^{V} p(d \mid X_v)$   (2)

where $d$ denotes the diagnosis and $X_v$ the $v$-th view of the case.
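Under this independence assumption, the per-view class probabilities can be fused as a renormalized product, computed in log space for numerical stability:

```python
import numpy as np

def fuse_views(view_probs):
    """Combine per-view class probabilities into one case-level
    distribution assuming independent views: the product of the
    per-view probabilities, renormalized."""
    logp = np.log(np.asarray(view_probs) + 1e-12).sum(axis=0)
    logp -= logp.max()  # numerical stability before exponentiation
    p = np.exp(logp)
    return p / p.sum()
```
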
It is also worth noting that our final submission incorporated into this factorization an extra classifier that depends only on external information about the clinical case, such as patient gender and age, and lesion area.
6 Code
The code implementing this paper, including the Lesion Segmentation and Diagnosis Networks, is provided at the following link: https://github.com/igondia/matconvnetdermoscopy.
Acknowledgments
We kindly thank the dermatologists of Hospital 12 de Octubre of Madrid for their invaluable help in annotating the data with the weak labels of structural patterns. This work was supported in part by the National Grants TEC2014-53390-P and TEC2014-61729-EXP of the Spanish Ministry of Economy and Competitiveness. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X GPU used for this research.
References

Everingham et al. [2015] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.

He et al. [2015] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

Pathak et al. [2015] D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.

Shelhamer et al. [2016] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016. URL http://arxiv.org/abs/1605.06211.

Simonyan and Zisserman [2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.