Recently, medical imaging tasks such as classification, segmentation and registration have been successfully carried out with state-of-the-art performance by deep learning models, which have found their way into a plethora of Computer Assisted Diagnosis and Intervention (CAD/I) Systems which aid physicians. However, medical imaging datasets utilized to train such models are often characterized by large class variability, severe class imbalance, outliers, inter-observer variability, ambiguity and most prominently limited data. The aforementioned problems hinder the training of neural networks and lead to sub-optimal and overfit solutions. Moreover, deep learning models deployed by physicians in a CAD/I system must be thoroughly evaluated, with respect to not only their generalizability, i.e. performance on data originating from a given test set, but also their behavior on data corrupted by noise, unknown transformations and outliers, which can be described by the term robustness. Data augmentation describes the act of increasing the size and variance of a given dataset to train a machine learning model, in order to achieve better generalizability and capture a better understanding of the underlying distribution of the training data. The manifold of a class learned by a classifier can be perceived as the space that represents the distribution of the training data.
In this work our contribution is two-fold. First, we propose a novel data augmentation technique, utilizing an exhaustive manifold-exploration method, that increases the performance of a deep learning model on the provided test set and significantly improves its robustness to random geometric transformations. Second, we provide quantitative measures to assess a classifier’s robustness. Such measures are a significant step towards a thorough evaluation of machine learning models, which is highly valuable for the safe and successful deployment of trained models by physicians in real-world scenarios involving patient diagnosis and treatment.
ManiFool Augmentation is performed by populating the training dataset for a given task with samples transformed with optimized affine geometric transformations. The method is outlined in Fig. 1, where it is contrasted with traditional data augmentation performed with random transformations. The algorithm used to craft samples for data augmentation is inspired by ManiFool [1] (discussed in Section 2), and the intuition behind it is rather simple: move an image iteratively via affine geometric transformations towards a classifier’s decision boundary by following the direction that maximizes the gradient. After every step, project the calculated movement back onto the original training manifold of the class of the image being transformed. This process is repeated until either a transformation is found that causes the network to misclassify the transformed sample or a pre-defined maximum number of steps is reached. In case of misclassification, we have crossed the decision boundary and stepped onto the manifold of another class. We then backtrack to the manifold of the original class and use the calculated transformation for data augmentation during training.
Contrary to traditional augmentation methods with random transformations, ManiFool Augmentation ensures that the space explored by the network during training is not limited to the local vicinity of a training sample. Instead, augmentations are found globally, up to the edges of each class manifold, for the whole training set, as can be seen in Fig. 1. An effective augmentation technique should ensure that the samples used to increase the population of the training dataset originate from the same manifold as the original data. Augmenting the training dataset with samples from a different distribution would not necessarily help the model learn a better embedding for each of the classes, but would rather encourage it to map the same class to two different sub-spaces, one for each training manifold.
Exhaustive experimentation on two challenging medical datasets shows that the proposed augmentation technique not only increases the robustness of a model to geometric transformations, but also significantly improves its performance on the original test data. This is additionally highlighted by cross-dataset testing, where networks trained with ManiFool Augmentation were able to better capture the underlying distribution of the training data.
Many have taken steps to address the problem of limited data in deep learning applications in order to improve model accuracy without carrying the burden of costly data acquisition. Approaches range from elastic transformations [2] and noise generation in a learned feature space [3] to repeat, rotate and infill approaches, whereby a known sample is scaled and rotated in a grid pattern and background consistency is ensured [4]. Fawzi et al. proposed an augmentation algorithm which can be integrated into the process of stochastic gradient descent and seeks an augmented sample with the greatest loss within a constrained exploration space or “trust region” [5].
Data augmentation has also been extensively formulated as a learning task. Antoniou et al. [6] show significant improvement in the accuracy of hand-written digit classification with a method deploying DAGAN. AutoAugment [7] formulates the augmentation task as a discrete search problem in which the search algorithm itself is based on a reinforcement learning approach that strives to “learn” how to maximize the total classification accuracy via augmentation.
Specifically in the field of medical deep learning applications, creative augmentation approaches are necessary to combat the extreme lack of annotated data. Bowles et al. [8] employed augmented samples and annotations generated via GANs to improve CT brain segmentation under severe lack of training data. Frid-Adar et al. [9] reported improved accuracy for liver lesion classification by employing DCGANs for data augmentation.
ManiFool  is an iterative algorithm that can be applied to any differentiable classifier . In this Section we will discuss the mathematical operations that generate a geometrically transformed example leveraged for data augmentation.
For an image with ground truth label and a binary classifier an iterative process of steps is initialized and the original image can be defined as . Initially, ManiFool finds the movement direction towards the decision boundary of , by following the opposite of the gradient, . The gradient at the step for the image is the projection of onto the tangent space and can be calculated utilizing the pseudoinverse operation:
is the Jacobian matrix and the calculated is the direction towards the decision boundary for step .
To improve the accuracy and convergence speed during the calculation of a manifold optimization technique similar to  has been adopted:
where is the calculated step size of the iteration and is a constant momentum.
Mapping onto the original manifold
After the movement direction is calculated it is mapped back onto the manifold of the ground truth class. Following , this mapping is performed using retraction , where is the affine transformation calculated as:
are the basis vectors of the Lie group of the calculated affine geometric transformation. There are two conditions for the termination of the algorithm, namely the misclassification of the calculated transformed image by the model or reaching the maximum number of allowed iterations . After steps the accumulated affine transformations applied to to generate the ManiFool sample are given by:
The extension of the method from binary to multi-class classifiers is straightforward: We generate a ManiFool sample for each of the remaining classes, starting from the ground truth and based on the geodesic distance of the transformed to the original image we leverage the sample with the smallest transformation . The class with the smallest geodesic distance between the transformations can be found by:
In the following subsections we discuss how the distance is calculated and the significant role it plays as a measure of robustness for neural networks.
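The retraction step described above can be illustrated with a minimal sketch. It assumes, which the text does not spell out, that the retraction composes the current transformation with the matrix exponential of a linear combination of generators of the 2D affine group in homogeneous coordinates. The names `AFFINE_GENERATORS`, `matrix_exp` and `retract`, the choice of basis, and the truncated power series are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def _generator(row, col):
    """Return a 3x3 matrix with a single 1 at (row, col); the bottom row stays zero."""
    g = np.zeros((3, 3))
    g[row, col] = 1.0
    return g


# Six generators of 2D affine maps in homogeneous coordinates
# (hypothetical basis choice, ordered row by row over the top two rows).
AFFINE_GENERATORS = [_generator(r, c) for r in range(2) for c in range(3)]


def matrix_exp(a, terms=30):
    """Approximate the matrix exponential with a truncated power series."""
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms):
        term = term @ a / k
        result = result + term
    return result


def retract(tau, u):
    """Compose the transformation tau with exp(sum_j u_j * G_j)."""
    step = sum(uj * g for uj, g in zip(u, AFFINE_GENERATORS))
    return tau @ matrix_exp(step)
```

With a zero coefficient vector the transformation is unchanged, and a coefficient on the third generator alone yields a pure translation along the first axis, so every retracted result remains a valid affine matrix.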
2.1 Invariance to Geometric Transformations
Geodesic Distance Between Transformations
The geodesic distance between two transformations and is the length of the shortest curve between and . However, since the metric space of the manifold of the training data is unknown we have to acquire a metric in the Riemannian space by mapping the Lie group to the differentiable image manifold of and , which inherits the Riemannian metric from [11, 12]. After this mapping, the geodesic distance between and is equal to the shortest path connecting and , formulated as:
Geodesic Distance Between Original and ManiFool Samples
Having explained how to calculate the distance between two transformations and two transformed images, we can now show how to measure the geodesic distance between the original samples of our training dataset and the ones generated with ManiFool. The initial untransformed image can be considered the initial point of the aforementioned curve if we define its transformation as the identity one. Thus, the distance between the original sample and , can be calculated by the distance between the identity transformation and the final aggregated one :
Normalization of the distance by the norm of the image is crucial, to ensure generalizability of the distance measure.
Robustness to Geometric Transformations
Since every computed ManiFool example originates from the edge of a class manifold, measuring the aforementioned distance between an original image and its respective transformed sample can act as a measure for the robustness of a classifier. Specifically, networks that have learned a high-dimensional embedding space characterized by high class compactness and maximized distance between decision boundaries will require a larger average to transform a sample from one class to another. In this work we compute the average distance of all the ManiFool samples as:
where is the number of crafted samples. acts as a quantitative measure of robustness of a neural network to geometric transformations, that can be used to compare the robustness of different deep model architectures or models trained with different augmentation techniques.
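As a minimal sketch of this averaging step, the snippet below assumes that the geodesic distance between each original image and its ManiFool sample has already been computed elsewhere; it only applies the normalization by the image norm and takes the mean over all crafted samples. The function names `normalized_distance` and `average_robustness` are hypothetical.

```python
import numpy as np


def normalized_distance(distance, image):
    """Divide a precomputed geodesic distance by the L2 norm of the original image."""
    return distance / np.linalg.norm(image)


def average_robustness(distances, images):
    """Average the normalized distances over all crafted samples."""
    scores = [normalized_distance(d, x) for d, x in zip(distances, images)]
    return float(np.mean(scores))
```

A larger returned value would indicate that, on average, larger transformations were needed to reach the decision boundary.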
Another measure to quantify the robustness of a classifier is , given by Equation 9. assesses a model’s performance when it is evaluated on randomly transformed images. Specifically, for a range of given geodesic distances we craft samples transformed with random transformations and measure the misclassification rate of .
where is a user defined threshold. A robust model can maintain higher classification accuracy for images that have larger geodesic distance from the originals.
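The evaluation described above can be sketched as follows, under the assumption that each randomly transformed sample comes with a precomputed geodesic distance score, a ground truth label and a model prediction. The function `misclassification_rate` and its `low`/`high` range parameters are hypothetical and only illustrate measuring the error rate within one distance range; they are not a reproduction of Equation 9.

```python
import numpy as np


def misclassification_rate(distances, y_true, y_pred, low, high):
    """Fraction of misclassified samples whose distance score lies in [low, high)."""
    distances = np.asarray(distances, dtype=float)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    in_range = (distances >= low) & (distances < high)
    if not in_range.any():
        # No samples fall into this distance range.
        return float("nan")
    return float(np.mean(y_true[in_range] != y_pred[in_range]))
```

Evaluating this function over consecutive distance ranges gives a curve of misclassification rate against geodesic distance, which can be compared across models.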
2.2 ManiFool Augmentation
A significant difference between our approach and the original ManiFool work is that our purpose is not to fool a deep neural network and craft an adversarial example [13], but rather to utilize the transformed images for data augmentation. Therefore, once we compute the affine transformation that crosses the decision boundary and fools , we backtrack onto the original class manifold via an iterative reduction of the final step size.
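The search-then-backtrack logic can be sketched in a simplified setting. The snippet assumes the transformation is represented by a plain parameter vector, that `classify` and `direction` are user-supplied callables, and that the final step is reduced by repeated halving; the halving schedule, the function name `explore_to_boundary` and all default values are illustrative assumptions. The projection onto the class manifold is omitted for brevity.

```python
import numpy as np


def explore_to_boundary(params, label, classify, direction, step=0.3,
                        max_iters=50, max_backtracks=20):
    """Walk toward the decision boundary, then back off onto the original class."""
    params = np.asarray(params, dtype=float)
    for _ in range(max_iters):
        move = step * direction(params)
        candidate = params + move
        if classify(candidate) != label:
            # Crossed the boundary: shrink the final step until the label is kept.
            for _ in range(max_backtracks):
                move = move / 2.0
                candidate = params + move
                if classify(candidate) == label:
                    return candidate
            # Fall back to the last correctly classified parameters.
            return params
        params = candidate
    # Maximum number of iterations reached without misclassification.
    return params
```

The returned parameters are still classified as the original label but lie close to the boundary, which is the region the augmentation samples are meant to come from.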
Initially, for all the images in the training set of the given dataset, we create ManiFool Augmentation samples that reside around the edges of the class manifolds with an independent black-box classifier . Afterwards, we mix the generated samples with the original data in an equal ratio and train a model from scratch. An alternative approach would have been to utilize all the geometrically transformed images at every step towards the decision boundary for data augmentation. However, it was crucial to maintain an equal ratio of transformed and original samples in the final dataset, so that models utilizing it for training would not be biased to geometrically transformed images, due to an imbalanced amount of samples. Hence, we only utilized the transformed samples in the vicinity of the decision boundary, to provide the maximum possible variance to the models during training. Samples crafted with ManiFool Augmentation are presented in Fig. 2.
3 Experimental Setup
ManiFool Augmentation has been validated on two challenging, public medical imaging classification datasets, namely the Digital Database for Screening Mammography (DDSM) [14], [15] and Dermofit [16]. DDSM consists of 11,617 expert-selected regions of interest (ROI) of mammograms from 1861 patients, annotated as normal, benign or malignant by radiologists. Dermofit is an image library consisting of 1300 high-quality dermatoscopic images with histologically validated, fine-grained expert annotations (10 classes). Both datasets were split at patient level into non-overlapping folds (70% training and 30% testing).
, were used for the evaluation. All networks were initialized with ImageNet weights, therefore appropriate resizing and normalization of the input were performed. The loss function selected for the aforementioned classification problems was weighted Cross Entropy, since the selected datasets are characterized by severe class imbalance. Class weights were computed with median frequency balancing, as described in [19]. The models were optimized with the Adam optimizer with an initial learning rate of across the board. The experiments were implemented in the deep learning framework PyTorch [20] and an NVIDIA Titan Xp was used to train the models for epochs.
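As a rough illustration of median frequency balancing for a classification dataset, the sketch below assumes each class weight is the median class frequency divided by that class's own frequency, computed from image-level labels. The function name `median_frequency_weights` is hypothetical, and this simplified per-image formulation is an assumption rather than the exact procedure of the cited work.

```python
import numpy as np


def median_frequency_weights(labels, num_classes):
    """Weight each class by median(frequency) / frequency of that class."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    freqs = counts / counts.sum()
    present = freqs > 0
    weights = np.zeros(num_classes)
    # Classes absent from the labels keep a weight of zero.
    weights[present] = np.median(freqs[present]) / freqs[present]
    return weights
```

The resulting vector can be converted to a tensor and passed as the `weight` argument of PyTorch's `torch.nn.CrossEntropyLoss`, so that rare classes contribute more to the loss than frequent ones.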
To validate the proposed contributions we perform not only ablative studies but also a comparison against other widely used augmentation techniques. ManiFool Augmentation was compared with models trained without any augmentation (referred to as “None” in the following Section) and models trained with traditional random augmentation (“Random”), i.e. rotation and horizontal flipping. The proposed method (noted as “ManiFool” in the tables of results) was also evaluated against Random Erasing [21] (“Erasing”), a commonly used and fast augmentation technique that replaces random patches of the image with Gaussian noise, and data augmentation with images synthesized by GANs (“DCGAN”), following the method described in [22].
ManiFool Augmentation Crafting
A noteworthy implementation detail is that, for the crafting of the ManiFool Augmentation samples, black-box state-of-the-art models were utilized as the differentiable classifier described in Section 2. Those models were previously trained on the given datasets but are not utilized in the evaluation phase of this work, to avoid any bias and to ensure that the dataset is previously unseen by all the evaluated models.
4 Results and Discussion
In this Section the detailed results of the ablative evaluation, as well as the baseline comparisons, will be discussed, along with the effects of the proposed method on the performance and robustness of the models.
Performance improvement with ManiFool Augmentation
Tables 1 and 2 report the results of the ablative and baseline evaluation of the proposed ManiFool Augmentation method for the Dermofit and DDSM datasets. First, it can be observed that the performance of models trained without any augmentation is significantly lower, due to overfitting and limited manifold exploration. Random Augmentation provides an improvement in performance but offers no guarantee regarding the increase in the variance that the model is exposed to during training. Moreover, random augmentation can result in out-of-distribution samples, which could hinder model training. Augmented samples created by ManiFool are guaranteed to originate from the same distribution as the original training data, a trait particularly crucial in the setting of medical applications, where misclassifications can have severe and undesired outcomes. Furthermore, ManiFool Augmentation, with its improved exploration capabilities, increases the accuracy by across both datasets and model architectures. Additionally, ManiFool Augmentation consistently outperforms Random Erasing, Random Augmentation and GAN Augmentation by approximately across datasets and models.
Limitations of Augmentation with GANs
Generating synthetic images with GANs has been widely investigated recently, as discussed in Section 1. However, GANs have limitations for medical imaging. In most cases the resolution of the synthetic images is low, leading to a substantial loss of information and quality. Furthermore, GANs trained on the entire dataset do not provide the ground truth label of the generated samples. Therefore, in order to use synthetic images with their respective labels for data augmentation, we have to train conditional GANs , where represents the number of classes. This is both time-consuming and sometimes unachievable due to limited data.
For example, some classes of the Dermofit dataset only have 23 samples for training, making the training of a conditional GAN on 23 images extremely challenging, if at all possible. Attempts have been made to solve the GAN labelling problem in the medical context [8] by generating brain CT scans along with a paired segmentation label map. However, this approach does not offer any guarantee on the correctness of the label maps, and though the performance increase on the test set looks promising, mislabeling could induce ambiguity during training and jeopardize the robustness of the model.
Additionally, compared to ManiFool Augmentation, augmentation with GANs does not guarantee an increase in the variance to which the model is exposed, since images are sampled randomly from the training data distribution and not from the outer regions of the manifold, as can be seen in Fig. 1.
Robustness on Random Geometric Transformations
A noteworthy finding highlighted in Tables 1 and 2 is the significant increase in the robustness of models trained with ManiFool Augmentation to random transformations. The improvement is impressive not only because it ranges from to , but also because, even though the proposed augmentation exclusively utilized affine transformations, the robustness to projective ones was drastically improved as well. The remaining evaluated augmentation techniques, i.e. Random Erasing and GAN augmentation, provided much lower, if any, improvement in the robustness of the networks in comparison to the standard random augmentation.
Another experiment evaluating the effect of ManiFool Augmentation on the robustness of the trained models is shown in Fig. 3. As described in Section 2, Equation 9 evaluates the misclassification rate of a classifier for samples transformed with random affine transformations for a given range of geodesic distance scores.
In Fig. 4 we show images generated within a range of for Dermofit and that were used to evaluate the misclassification rates of the evaluated models. As can be seen in Fig. 3, the models trained with ManiFool Augmentation achieve significantly lower misclassification rates for larger values of the geodesic distance .
Effect on Cross-Dataset Performance
In order to showcase the improved robustness provided by ManiFool Augmentation, we perform cross-dataset evaluation between Dermofit and HAM10000 [23], which consists of 10,000 skin lesion images and shares 7 classes with Dermofit. Notably, all models trained with the proposed method achieve higher accuracy on the unseen dataset, as can be observed in Table 3. This supports the hypothesis that ManiFool Augmentation improves the model’s understanding of the underlying data distribution and increases the model’s robustness not only to geometric transformations, but also to unseen test samples.
Robustness of Different Architectures
After we utilize a classifier to craft ManiFool Augmentation samples, we can calculate the average geodesic distance between the original and transformed samples (Equation 8). This measure can quantify the robustness of a machine learning model, since it implicitly measures the distance between the learned decision boundaries. Therefore, models that achieve higher robustness will be characterized by a larger geodesic distance between classes. In previous works, such as [24], attempts have been made to evaluate the robustness of a classifier utilizing adversarial examples. However, such examples cannot appear naturally and no quantitative measures have been given regarding the robustness. In this work, after we generated the ManiFool Augmentation samples, we calculated the robustness scores for the given classifiers, which can be seen in Table 4. This experiment shows how the robustness of different architectures can fluctuate according to the given dataset. Therefore, it is not sufficient to select a state-of-the-art architecture based on its results on an independent dataset, since its robustness can vary significantly. In our case, InceptionV3 was the most robust model for the Dermofit dataset, while ResNet18 achieved the highest robustness score for DDSM.
In this paper we proposed a novel data augmentation technique based on affine geometric transformations and quantified the robustness of machine learning classifiers. Experiments on challenging medical imaging tasks, namely fine-grained skin lesion classification and mammogram tumor classification, showcased the advantages of the proposed ManiFool Augmentation. On the one hand, the performance achieved by the evaluated models increased for the original test set and outperformed other commonly used data augmentation techniques. On the other hand, the robustness of the models trained with the proposed augmentation scheme increased both for random affine and projective transformations and across datasets, in an unseen test scenario. Furthermore, a quantitative measure for the robustness of machine learning classifiers was calculated and showcased the variations in the robustness of state-of-the-art models for different datasets. Future work includes the extension of ManiFool Augmentation to a wider range of transformations for a variety of medical imaging tasks.
- [1] C. Kanbak, S.-M. Moosavi-Dezfooli, P. Frossard. Geometric robustness of deep networks: analysis and improvement. In CVPR, 2017
- [2] S. C. Wong, A. Gatt, V. Stamatescu, M. D. McDonnell. Understanding Data Augmentation for Classification: When to Warp? In DICTA, 2016
- [3] T. DeVries, G. W. Taylor. Dataset Augmentation in Feature Space. In CoRR abs/1702.05538, 2017
- [4] E. Okafor, L. Schomaker, M. A. Wiering. An analysis of rotation matrix and colour constancy data augmentation in classifying images of animals. In Journal of Information and Telecommunication, 2018
- [5] A. Fawzi, H. Samulowitz, D. Turaga, P. Frossard. Adaptive data augmentation for image classification. In IEEE Int. Conf. on Image Processing (ICIP), 2016
- [6] A. Antoniou, A. Storkey, H. Edwards. Data Augmentation Generative Adversarial Networks. In CoRR abs/1711.04340, 2017
- [7] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, Q. V. Le. AutoAugment: Learning Augmentation Policies from Data. In CoRR abs/1805.09501, 2018
- [8] C. Bowles, L. Chen, R. Guerrero, P. Bentley, R. N. Gunn, A. Hammers, D. A. Dickie, M. C. Valdés Hernández, J. M. Wardlaw, D. Rueckert. GAN Augmentation: Augmenting Training Data using Generative Adversarial Networks. In CoRR abs/1810.10863, 2018
- [9] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan. Synthetic Data Augmentation using GAN for Improved Liver Lesion Classification. In IEEE International Symposium on Biomedical Imaging (ISBI), 2018
- [10] P.-A. Absil, R. Mahony, R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008
- [11] E. Kokiopoulou, P. Frossard. Minimum distance between pattern transformation manifolds: algorithm and applications. In IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), 2009
- [12] L. W. Tu. Differential Geometry. In Graduate Texts in Mathematics, 2017
- [13] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014
- [14] M. Heath, K. Bowyer, D. Kopans, R. Moore, W. P. Kegelmeyer. The Digital Database for Screening Mammography. In the International Workshop on Digital Mammography, M. J. Yaffe, ed., 212-218, Medical Physics Publishing, 2001
- [15] M. Heath, K. Bowyer, D. Kopans, W. P. Kegelmeyer, R. Moore, K. Chang, S. MunishKumaran. Current status of the Digital Database for Screening Mammography. In Digital Mammography, 457-460, Kluwer Academic Publishers, 1998
- [16] L. Ballerini, R. B. Fisher, R. B. Aldridge, J. Rees. A Color and Texture Based Hierarchical K-NN Approach to the Classification of Non-melanoma Skin Lesions. In Color Med.IA., Lecture Notes in Comp. Vision and Bio-mechanics 6, 2013
- [17] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In CoRR abs/1409.1556, 2014
- [18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In CVPR, 2016
- [19] A. G. Roy, S. Conjeti, D. Sheet, A. Katouzian, N. Navab, C. Wachinger. Error Corrective Boosting for Learning Fully Convolutional Networks with Limited Data. In MICCAI, 2017
- [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer. Automatic differentiation in PyTorch. In the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017
- [21] Z. Zhong, L. Zheng, G. Kang, S. Li, Y. Yang. Random erasing data augmentation. In CoRR abs/1708.04896, 2017
- [22] A. Radford, L. Metz, S. Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In 4th International Conference on Learning Representations (ICLR), 2016
- [23] P. Tschandl, C. Rosendahl, H. Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. In Sci. Data 5, 2018
- [24] M. Paschali, S. Conjeti, F. Navarro, N. Navab. Generalizability vs. Robustness: Investigating Medical Imaging Networks Using Adversarial Examples. In MICCAI, 2018