The skin cancer death rate has escalated sharply in the USA, Europe and Australia. However, with proper early detection, the survival rate after surgery (wide excision) reaches 98%. For this reason, the research community has put significant effort into the early detection of skin cancer through the inspection of images. Recently, the best results have been achieved using transfer learning on Convolutional Neural Networks (e.g., [10, 13, 11]).
As reported by Brinker et al., regardless of the similarities in terms of sensitivity, specificity, and ROC AUC, those works are hardly comparable with each other because they are all based on different datasets (often proprietary), and use different CNN architectures and hyper-parameters.
Hence, given a new dataset, characterised for example by its own resolution, settings (lenses and light conditions), type (dermoscopic or clinical), and ethnicity (caucasian, asiatic, worldwide), choosing the best CNN architecture and hyper-parameters is not straightforward. For example, one of the most influential recent works (Esteva et al.) showed a CNN that matches the accuracy of expert dermatologists when trained on more than 126k images. However, Fujisawa et al. showed that, with a higher image augmentation (24x) and image resolution (1k), the same performance can be achieved using fewer than 5000 images. This matters for application areas in which only little training data is available.
Also, we have reports from the pre-CNN era, when features were extracted manually and image pre-processing was required, about the importance of extended segmentation and color filtering for improving classification performance. However, such techniques have not been applied in conjunction with deep learning approaches, where performance gains are mostly pursued by increasing the size of the training sets.
We try to enhance the infrastructure for research based on previous implementations [22, 18] to explore a number of options that require a considerable amount of software development. To address the two above-mentioned issues (lack of cross-CNN comparison and lack of integration of pre-processing techniques) we develop a software platform for the easy and systematic exploration of (hyper-)parameters governing the performances of image augmentation and CNNs.
Our platform (which will be released as open-source software once out of beta stage) has two target user groups: Developers, who need a structured and extensible software architecture to experiment with new image processing techniques and CNN architectures, and Practitioners in the field of dermatology, who do not necessarily have the competencies to script new software, but do need to explore the performances of existing techniques when new datasets become available.
2 Training and Testing Pipeline
The figure shows the pipeline for the training and testing procedure of the proposed architecture. Starting from the left, a dataset is chosen, all images are (optionally) segmented, and the mask is extended. The segmented images then go through an augmentation procedure, which includes the possibility to resize (using different filters), transform (flip/rotate), modulate brightness and saturation, and change the color space. Images are augmented on-the-fly. The augmented images are then sent to a CNN, which can output a binary classification (e.g., malignant vs. benign lesion), a multi-class probability distribution, or a pixel-level mask for the identification of featured image areas.
Training Input The source data consist of a single CSV (comma-separated values) file, which is thus easily manageable as a spreadsheet with MS Excel or OpenOffice. Table 1 shows an input example.
| method | dataset | split | epochs | segment | imgaug | batch size | img size | resize filter | color space | class weights |
The method column specifies the CNN configuration. The dataset column contains the name of another CSV file, which has one column with the image file names and a second column with the ground-truth labels. In future versions, we plan to simplify this procedure further, bringing it to the level of filesystem management: the input will be a folder whose subfolders represent the classes to predict. The split column specifies whether the validation and test sets are pre-split into different files (with an extra suffix) or should instead be sampled from the training set. The segment column is a float value that, if positive, enables segmentation and specifies the extension factor of the masking area. The epochs column specifies the number of training epochs, while the imgaug column contains a preset for image augmentation; both affect training time. The batch size column is the training batch size, while img size is the (square) resolution to which each input image will be rescaled; both affect the amount of GPU RAM needed. The resize filter column specifies the resize sampling strategy (nearest, bilinear, bicubic, or lanczos). The color space column specifies whether the images should be kept in their original RGB format or converted into HSV, LAB, or YCbCr. Finally, class weights specifies the weight factor for each class, used to compensate for unbalanced datasets. Such weights can also be computed automatically from the input dataset.
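As an illustration of the input format described above, the following minimal sketch parses such a CSV file and derives class weights from label counts. The column names follow the Table 1 header; the example values, the function names, and the "balanced" weighting formula (total samples divided by the number of classes times the class count) are assumptions made for this sketch and are not taken from the platform itself.

```python
import csv
import io
from collections import Counter
from typing import Dict, List

# Hypothetical input: column names follow the Table 1 header, values are made up.
EXAMPLE_CSV = """method,dataset,split,epochs,segment,imgaug,batch size,img size,resize filter,color space,class weights
VGG16,isic.csv,sample,6,0.0,hflip_rot24,32,450,nearest,RGB,auto
SC19,isic.csv,sample,11,1.2,hflip_rot4,64,227,bicubic,LAB,auto
"""

INT_COLUMNS = ("epochs", "batch size", "img size")
FLOAT_COLUMNS = ("segment",)


def parse_runs(csv_text: str) -> List[dict]:
    """Parse each CSV row into a dict, converting the numeric columns."""
    runs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        run = dict(row)
        for name in INT_COLUMNS:
            run[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            run[name] = float(row[name])
        # A positive segment value enables segmentation (see the text above).
        run["segmentation enabled"] = run["segment"] > 0
        runs.append(run)
    return runs


def balanced_class_weights(labels: List[str]) -> Dict[str, float]:
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


if __name__ == "__main__":
    for run in parse_runs(EXAMPLE_CSV):
        print(run["method"], run["img size"], run["segmentation enabled"])
    print(balanced_class_weights(["benign"] * 9 + ["malignant"]))
```

With this heuristic, the rarer class receives the larger weight, so errors on it contribute more to the training loss.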
Segmentation This is the process of automatically detecting the contour separating the lesion from the surrounding skin.
Masking out surrounding skin regions, together with a procedural mask extension, has the potential to improve classification results. This processing step is optional because there is no guarantee that the segmentation will be correct. (Footnote: in line with interactive machine learning (IML) goals, we plan to implement the detection of “anomalies” (e.g., too many contours in the same image, which blocks shape detection, or areas that are too small or too big) to proactively warn the user to manually correct the pipeline.)
On a pixel-by-pixel test, we achieved 73% sensitivity, 98% specificity, and a Jaccard Index (aka Intersection over Union, IoU) of 0.69. This compares well with the top results of the ISIC 2017 challenge (83% sensitivity, 98% specificity, and 0.76 Jaccard Index).
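The pixel-by-pixel metrics reported above can be computed from a predicted binary mask and a ground-truth mask as in the following minimal sketch. Masks are represented here as nested lists of 0/1 values purely for illustration; the function name and the toy masks are assumptions and not part of the platform.

```python
from typing import Dict, Sequence

Mask = Sequence[Sequence[int]]  # rows of 0/1 pixels (1 = lesion)


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def pixel_metrics(pred: Mask, truth: Mask) -> Dict[str, float]:
    """Pixel-wise sensitivity, specificity, and Jaccard index (IoU)."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),   # lesion pixels correctly found
        "specificity": _ratio(tn, tn + fp),   # skin pixels correctly excluded
        "jaccard": _ratio(tp, tp + fp + fn),  # intersection over union
    }


if __name__ == "__main__":
    truth = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
    pred = [[0, 1, 0, 0],
            [0, 1, 1, 1]]
    print(pixel_metrics(pred, truth))
```

Note that the Jaccard index ignores true negatives, which is why it is typically much lower than specificity when the lesion covers a small part of the image.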
Data Augmentation This is the process of procedurally deriving several plausible-looking alterations of an image in order to augment the original dataset. Our image augmentation module has been implemented using the Decorator design pattern. An abstract ImageProvider class exposes the methods to query the number of available images and to get an image by integer index. A direct concrete subclass, DiskImageProvider, reads images from files, optionally changing the color space and resizing each image. The ImageAugmenter abstract class (a subclass of ImageProvider) provides the base for augmentation. Several augmenters (HFlip, Rotation, Brightness, Saturation) can be concatenated in any order to provide a custom and controlled augmentation chain.
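The Decorator structure described above can be sketched as follows. This is a simplified illustration, not the platform's code: images are toy nested lists instead of real image objects, ListImageProvider is a hypothetical in-memory stand-in for DiskImageProvider, Rot90 is a reduced version of the Rotation augmenter limited to multiples of 90 degrees, and the method names are assumed.

```python
from abc import ABC, abstractmethod
from typing import List

Image = List[List[int]]  # toy stand-in for a real image: rows of pixel values


class ImageProvider(ABC):
    @abstractmethod
    def num_images(self) -> int: ...

    @abstractmethod
    def get_image(self, index: int) -> Image: ...


class ListImageProvider(ImageProvider):
    """Hypothetical in-memory stand-in for DiskImageProvider."""

    def __init__(self, images: List[Image]) -> None:
        self._images = images

    def num_images(self) -> int:
        return len(self._images)

    def get_image(self, index: int) -> Image:
        return self._images[index]


class ImageAugmenter(ImageProvider):
    """Decorator base: wraps a provider and multiplies its image count."""

    factor = 1  # number of variants produced per source image

    def __init__(self, source: ImageProvider) -> None:
        self._source = source

    def num_images(self) -> int:
        return self._source.num_images() * self.factor

    def get_image(self, index: int) -> Image:
        # The quotient selects the variant, the remainder the source image.
        variant, source_index = divmod(index, self._source.num_images())
        if not 0 <= variant < self.factor:
            raise IndexError(index)
        # Variants are computed lazily, i.e. on-the-fly when requested.
        return self.transform(self._source.get_image(source_index), variant)

    @abstractmethod
    def transform(self, image: Image, variant: int) -> Image: ...


class HFlip(ImageAugmenter):
    factor = 2

    def transform(self, image: Image, variant: int) -> Image:
        return [row[::-1] for row in image] if variant == 1 else image


class Rot90(ImageAugmenter):
    factor = 4

    def transform(self, image: Image, variant: int) -> Image:
        for _ in range(variant):  # rotate 90 degrees clockwise per step
            image = [list(row) for row in zip(*image[::-1])]
        return image


if __name__ == "__main__":
    chain = Rot90(HFlip(ListImageProvider([[[1, 2], [3, 4]]])))
    print(chain.num_images())  # one source image, 2 x 4 variants
    print(chain.get_image(2))
```

Because every augmenter is itself an ImageProvider, augmenters can be nested in any order, and the consumer of the chain only ever sees the two-method provider interface.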
A Factory Method provides a mapping between a mnemonic name and an augmentation configuration. For example, we are experimenting with three augmentation presets: with hflip, every image is flipped horizontally, thus doubling the number of images; with hflip_rot4, every image is flipped and also rotated by 0, 90, 180, and 270 degrees (8x augmentation); finally, hflip_rot24 (following the schema of Fujisawa et al.) leads to an augmentation factor of 48x.
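The mapping from preset names to augmentation configurations, and the resulting augmentation factors, can be sketched as below. The AugmentStep type, the function names, and the assumption that the 24 rotations of hflip_rot24 are evenly spaced (15-degree steps) are illustrative choices; only the preset names and the factors 2x, 8x, and 48x come from the text.

```python
from math import prod
from typing import Dict, List, NamedTuple


class AugmentStep(NamedTuple):
    name: str
    variants: int  # number of images produced per input image


# Preset names as used in the text; the step encoding is hypothetical.
PRESETS: Dict[str, List[AugmentStep]] = {
    "hflip": [AugmentStep("hflip", 2)],
    "hflip_rot4": [AugmentStep("hflip", 2), AugmentStep("rotation", 4)],
    "hflip_rot24": [AugmentStep("hflip", 2), AugmentStep("rotation", 24)],
}


def make_preset(name: str) -> List[AugmentStep]:
    """Factory: resolve a mnemonic preset name to its augmentation steps."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(f"unknown augmentation preset: {name!r}") from None


def augmentation_factor(name: str) -> int:
    """Total number of images generated per original image."""
    return prod(step.variants for step in make_preset(name))


def rotation_angles(variants: int) -> List[float]:
    """Assumed schema: evenly spaced angles over a full turn."""
    return [i * 360 / variants for i in range(variants)]


if __name__ == "__main__":
    for preset in PRESETS:
        print(preset, augmentation_factor(preset))
    print(rotation_angles(4))
```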
The image augmentation is performed entirely on the CPU, possibly on multiple threads, hence leaving the GPU for the training task only. Whether this is an advantage or not depends on other training parameters. For example, when training on images at 227x227 resolution on a machine with only 4 cores, the augmentation process is actually a bottleneck. However, when training at 450x450 resolution on an 8-core machine, the CPUs are only about 20% loaded while the GPU is 100% loaded with training.
New CNN architectures can be inserted into the main software by implementing the abstract class Classifier and giving a concrete implementation for the method def build_model() -> keras.Model. By design choice, the hyper-parameters of the specific architecture, such as the learning rate or the type of optimiser and its parameters, are left to the software engineers and hence to the Python code. In future versions, meta-learning frameworks can be added, or AutoML systems that continuously improve over time. Only specific presets are visible to the end user. Again, a Factory Method manages the mapping between a mnemonic string and a CNN architecture and some of its parameters.
We are currently experimenting with transfer learning using the VGG16 network, pre-trained on ImageNet, in which we replaced the last layers with two fully connected layers of 2048 units each and a final softmax. The default optimizer is SGD, but other configurations are available: VGG16_Nadam, VGG16_Adadelta, and VGG16_RMSProp. Soon, we will perform more tests using InceptionV3. Also, we prepared two non-pre-trained networks, one based on VGG16 (VGG16_random) and the second (SC19) as a custom modified version of AlexNet on the GitHub distribution. For the feature extraction, we are preparing a CNN based on U-Net, trained on the ISIC 2018 challenge dataset, that is able to extract masks for five features (pigment network, negative network, streaks, milia-like cysts, globules).
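A minimal sketch of how a Classifier subclass and the name-to-architecture factory could look is given below. It is an illustration under several assumptions rather than the platform's actual code: the constructor arguments, the Flatten layer, the ReLU activations, the loss function, and the registry layout are all hypothetical; only the configuration names, the 2x 2048 fully connected layers, the final softmax, and the ImageNet pre-training come from the text. TensorFlow is imported lazily inside build_model so that the factory itself can be used without it.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Classifier(ABC):
    """Base class: subclasses fix the architecture and optimiser in code."""

    def __init__(self, img_size: int, num_classes: int = 2) -> None:
        self.img_size = img_size
        self.num_classes = num_classes

    @abstractmethod
    def build_model(self) -> "keras.Model": ...


class VGG16Classifier(Classifier):
    def __init__(self, img_size: int, num_classes: int = 2,
                 optimizer: str = "sgd", pretrained: bool = True) -> None:
        super().__init__(img_size, num_classes)
        self.optimizer = optimizer
        self.pretrained = pretrained

    def build_model(self) -> "keras.Model":
        from tensorflow import keras  # lazy import, only needed for training

        base = keras.applications.VGG16(
            weights="imagenet" if self.pretrained else None,
            include_top=False,  # drop the original classification head
            input_shape=(self.img_size, self.img_size, 3),
        )
        model = keras.Sequential([
            base,
            keras.layers.Flatten(),
            keras.layers.Dense(2048, activation="relu"),  # activation assumed
            keras.layers.Dense(2048, activation="relu"),
            keras.layers.Dense(self.num_classes, activation="softmax"),
        ])
        model.compile(optimizer=self.optimizer,
                      loss="categorical_crossentropy", metrics=["accuracy"])
        return model


# Factory: mnemonic configuration name -> constructor taking the image size.
CLASSIFIERS: Dict[str, Callable[[int], Classifier]] = {
    "VGG16": lambda size: VGG16Classifier(size, optimizer="sgd"),
    "VGG16_Nadam": lambda size: VGG16Classifier(size, optimizer="nadam"),
    "VGG16_Adadelta": lambda size: VGG16Classifier(size, optimizer="adadelta"),
    "VGG16_RMSProp": lambda size: VGG16Classifier(size, optimizer="rmsprop"),
    "VGG16_random": lambda size: VGG16Classifier(size, pretrained=False),
}


def make_classifier(name: str, img_size: int) -> Classifier:
    try:
        return CLASSIFIERS[name](img_size)
    except KeyError:
        raise ValueError(f"unknown method: {name!r}") from None
```

A call such as make_classifier("VGG16_Nadam", 227).build_model() would then return a compiled model; keeping the optimiser choice inside the registry mirrors the design choice of hiding hyper-parameters from the end user.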
Training Output The output of a training session is written into a directory in which a train_output.csv file contains the same columns as the input file plus a number of columns with the output information, such as the size of the validation and test sets, class proportions, training time, accuracy, specificity and sensitivity for both validation and test sets, and the ROC AUC for the test set. An additional column is filled with an error message if an exception occurred during the training for the input line (e.g., out of GPU memory). Additionally, for each input line, the system generates plots of the validation and training losses as a function of the training epoch, together with ROC graphs for both the validation and the test sets.
Implementation The whole architecture is implemented in Python and uses the Keras framework (https://keras.io/, 23 May 2019) with the TensorFlow backend (https://www.tensorflow.org/, 23 May 2019). All image processing is based on the Pillow package (https://pillow.readthedocs.io/, 23 May 2019). The reference hardware for our experiments is an 8-core i9-9900K CPU, 64GB RAM, and an 11GB NVIDIA RTX 2080 Ti GPU.
3 Preliminary results and Lessons Learned
We ran experiments on the ISIC dataset, as retrieved in February 2019, from which we removed the SONIC subset (whose images contain coloured markers) and the “2018 JID Editorial Images” subset (very high resolution, lossless). The resulting dataset counts 12319 images.
Experiments at 227x227 pixel resolution With the VGG16 model, SGD optimiser, an image resolution of 227x227 pixels (no cropping, only scaling of the full original image), and image augmentation hflip_rot24 (48x augmentation), we achieved 0.649 specificity, 0.813 sensitivity, and 0.819 ROC AUC. The training of two epochs lasted about four and a half hours. This result is already satisfactory compared to other state-of-the-art approaches such as Esteva et al., who reached a ROC AUC of 0.96 but trained on a dataset of 120k+ images with 720x augmentation, and Fujisawa et al., who reached 0.895 specificity and 0.963 sensitivity but used 1000x1000 pixel resolution images and 24x augmentation. With all the other optimisers (Nadam, Adadelta, RMSProp) the network was not training properly. More tests are needed to tune the parameters of the optimisers in combination with the other parameters of the pipeline.
Experiments without transfer learning Since pre-training on a dataset like ImageNet requires months of training on high-end hardware, transfer learning might not be an option when investigating new, possibly simpler architectures specialised for the skin lesion domain. To have a baseline without transfer learning, we trained the VGG16_random configuration. The network, however, was not able to converge after 10 epochs, suggesting that pre-training is not only an option to speed up training, but a necessary condition (at least with this dataset size).
The randomly initialised SC19 architecture showed worse results than VGG16, lacking in sensitivity, with 0.845 specificity, 0.492 sensitivity, and 0.803 ROC AUC. The training lasted about 21 hours for 7 epochs. However, the loss plot suggested possible overfitting already during the first epoch. By switching to the lower augmentation policy hflip_rot4 (8x augmentation), the results improved to 0.814 specificity, 0.674 sensitivity, and 0.835 ROC AUC in 11 epochs, using 1/6th of the computational resources.
Experiments at 450x450 pixel resolution We increased the image size to 450x450 pixels, which corresponds to the smallest image height found among the ISIC images. Results improved, achieving 0.763 specificity, 0.798 sensitivity, and 0.862 ROC AUC. The drawback is an increased training time of 27 hours for 6 epochs. The same configuration (VGG16, 450px, 48x augmentation) was tested against human performance using the 100-image MClass-D test set, on which dermatologists reached 0.600 specificity, 0.741 sensitivity, and 0.671 ROC AUC. Our system performed better, reaching 0.762 specificity, 0.850 sensitivity, and 0.846 ROC AUC. Changing the operating value (malignant vs. benign threshold) from 0.5 to 0.6 leads to 0.862 specificity and 0.750 sensitivity. These latest results closely match the performance measured by Brinker et al. themselves with their own CNN on the same test set.
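The effect of moving the operating value can be illustrated with a small sketch that computes sensitivity and specificity from predicted malignancy probabilities at a given threshold. The scores and labels below are made-up toy values, not results from our experiments, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(scores: Sequence[float], labels: Sequence[int],
                            threshold: float = 0.5) -> Tuple[float, float]:
    """labels: 1 = malignant, 0 = benign; score >= threshold predicts malignant."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy predictions: four malignant cases followed by four benign cases.
    scores = [0.95, 0.70, 0.55, 0.40, 0.65, 0.52, 0.30, 0.10]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    for threshold in (0.5, 0.6):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        print(f"threshold={threshold}: sensitivity={sens}, specificity={spec}")
```

Raising the threshold makes the classifier more conservative in calling a lesion malignant, which trades sensitivity for specificity, in line with the shift reported above.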
No impact of image resize filters Reducing the size of the ISIC dermoscopic images to 227x227 or 450x450 pixels implies information loss. We investigated the impact of the resizing filter on the classification results. For the following three conditions, VGG16 at 227x227, SC19 at 227x227, and VGG16 at 450x450 pixels, we tried four rescaling filters: nearest, bilinear, bicubic, and lanczos (as exposed by the PIL Python package). Our results show no significant difference in any of the metrics, suggesting that the resizing filter can be left at the default (nearest) for the sake of performance.
We described a software toolbox for the configuration of deep neural networks in the domain of skin cancer classification. The results suggest that interactive machine learning (IML) design principles should be applied to train effective models in an explorative way. We provided means for the research community to quickly refine skin cancer classification pipelines by tuning (hyper-)parameters and to get feedback as quickly as possible. Interface components need to be simple so that end user groups can remain focussed on the machine learning problem at hand. The software platform can be used for other (medical) image processing tasks as well, where iterative processes are needed and where the users’ control over the behaviour of the learning system and its latency are critical for training. In the future, we will investigate the visualisation of image features (the aforementioned pigment network, negative network, streaks, milia-like cysts, and globules). Alternatively, future work can explore meta-learning frameworks, or AutoML systems that continuously improve over time.
- (2014-09) Two Systems for the Detection of Melanomas in Dermoscopy Images Using Texture and Color Features. IEEE Systems Journal 8 (3), pp. 965–979.
- (2019-05) Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task. European Journal of Cancer 113, pp. 47–54.
- Comparing artificial intelligence algorithms to 157 German dermatologists: the melanoma classification benchmark. European Journal of Cancer 111, pp. 30–37.
- (2018-10) Skin Cancer Classification Using Convolutional Neural Networks: Systematic Review. Journal of Medical Internet Research 20 (10), pp. e11936.
- (2018-08) Rethinking Skin Lesion Segmentation in a Convolutional Classifier. Journal of Digital Imaging 31 (4), pp. 435–440.
- (2019-03) Dermoscopy Image Analysis: Overview and Future Directions. IEEE Journal of Biomedical and Health Informatics 23 (2), pp. 474–478.
- (2018-04) Skin lesion analysis toward melanoma detection: A challenge at the 2017 International symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC). In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, pp. 168–172.
- (2019-02) Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC). arXiv:1902.03368 [cs].
- (2009-06) ImageNet: A large-scale hierarchical image database. In , Miami, FL, pp. 248–255.
- (2017-01) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, pp. 115.
- (2018-09) Deep-learning-based, computer-aided classifier developed with a small dataset of clinical images surpasses board-certified dermatologists in skin tumour diagnosis. British Journal of Dermatology.
- (1994) Design patterns: elements of reusable object-oriented software. Addison-Wesley.
- (2018-08) Man against machine: diagnostic performance of a deep learning convolutional neural network for dermoscopic melanoma recognition in comparison to 58 dermatologists. Annals of Oncology 29 (8), pp. 1836–1842.
- (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
- (2013) Computer Aided Diagnostic Support System for Skin Cancer: A Review of Techniques and Algorithms. International Journal of Biomedical Imaging 2013, pp. 1–22.
- (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Vol. 9351, pp. 234–241.
- (2014-09) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs].
- (2017) Fine-tuning deep CNN models on specific MS COCO categories. CoRR abs/1709.01476.
- (2016-06) Rethinking the Inception Architecture for Computer Vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018-08) The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data 5, pp. 180161.
- (2019-03) Automatic skin lesion segmentation with fully convolutional-deconvolutional networks. IEEE Journal of Biomedical and Health Informatics 23 (2), pp. 519–526. arXiv:1703.05165.
- (2018) A survey on deep learning toolkits and libraries for intelligent user interfaces. CoRR abs/1803.04818.