Finding Strong Gravitational Lenses Through Self-Attention

10/18/2021 ∙ by Hareesh Thuruthipilly, et al. ∙ NCBJ Jagiellonian University

Upcoming large-scale surveys are expected to find approximately 10^5 strong gravitational lens systems by analyzing data sets many orders of magnitude larger than those of the current era. In this scenario, non-automated techniques will be highly challenging and time-consuming. We propose a new automated architecture based on the principle of self-attention to find strong gravitational lenses. The advantages of self-attention-based encoder models over convolutional neural networks (CNNs) are investigated, and the encoder models are analyzed to optimize performance. We constructed 21 self-attention-based encoder models and four CNNs trained to identify gravitational lenses from the Bologna Lens Challenge. Each model was trained separately using 18,000 simulated images, cross-validated using 2,000 images, and then applied to a test set of 100,000 images. We used four metrics for evaluation: classification accuracy, the area under the receiver operating characteristic curve (AUROC), the TPR_0 score and the TPR_10 score. The performance of the self-attention-based encoder models is compared with that of the CNNs that participated in the challenge. The encoder models performed better than the CNNs and surpassed the CNN models that participated in the Bologna Lens Challenge by a wide margin in TPR_0 and TPR_10. In terms of the AUROC, the encoder models matched the top CNN model while using only one-sixth of its parameters. Self-attention-based models have a clear advantage over simpler CNNs. Their low computational cost and complexity make them a highly competitive alternative to the currently used residual neural networks. Moreover, introducing the encoder layers can also tackle the over-fitting problem present in CNNs by acting as effective filters.




1 Introduction

Strong gravitational lensing is the phenomenon in which a distant galaxy or quasar produces multiple, highly distorted images because of the gravitational field of a foreground galaxy or a nearby massive astronomical body. Finding and analyzing these Strong Lenses (SLs) has diverse applications in cosmology and astrophysics, ranging from estimating the universe’s dark matter distribution to constraining cosmological models (2009ApJ...691..531C; 2017MNRAS.465.4914B; Koopmans_2006; Bonvin_2016; Collett_2014; Cao_2015). Consequently, current and upcoming surveys have given significant attention to detecting strong gravitational lensing systems. For a detailed review of the applications of strong lensing, please refer to 2010ARA&A..48...87T and 1992ARA&A..30..311B.

However, all these analyses require a large sample of SLs, and unfortunately only a few hundred lensing systems have been detected and confirmed by astronomical surveys so far. One of the largest lens catalogues currently available is from the Sloan Lens ACS Survey (SLACS), with only 130 observed lenses (2008ApJ...682..964B). With the upcoming era of advanced missions such as Euclid (scaramella2021euclid) and LSST (Ivezi__2019; verma2019strong), the number of observable SLs is expected to grow by orders of magnitude, and these lenses will have to be identified within a far larger population of observed objects. The number of new SLs expected from the Square Kilometre Array (SKA) survey is of a similar order of magnitude (2015aska.confE..84M). To analyze the enormous amount of data produced by present and future large-scale surveys, various methods have been tried, ranging from crowd science (2016MNRAS.455.1171M) to semi-automated methods such as arc detectors (2004A&A...416..391L; 2007A&A...461..813C). However, these methods had only minor success and were too time-consuming to be a practical proposition. Hence, the situation demands better and more effective alternative approaches to detect SLs in future large-scale surveys.

Against this background, it is worth mentioning that advancements in Artificial Intelligence (AI) technology have opened up a plethora of opportunities and have been widely applied in astronomy and astrophysics (e.g. galaxy classification by 2019PASP..131j8002P, supernova classification by 2017ApJ...836...97C, and lens modelling by 2019MNRAS.488..991P). In recent times, a particular class of deep learning models known as Convolutional Neural Networks (CNNs) has been shown to work exceptionally well for finding SLs. Hence, the development of deep-learning-based algorithms to detect SLs in large-scale surveys is being actively investigated (Lanusse_2017; Schaefer_2018; 2019PhDT.......108P; 2019MNRAS.487.5263D; 2020MNRAS.496..381C). For instance, 2017MNRAS.471..167J applied CNNs to data from the Canada-France-Hawaii Telescope Legacy Survey (CFHTLS) to find SLs, and numerous other successful attempts at finding potential strong lensing candidates in the Kilo Degree Survey (KiDS) have been reported (2017MNRAS.472.1129P; Petrillo_2018; 2019MNRAS.484.3879P; He_2020; Li_2020). Likewise, various groups have successfully used CNNs to identify galaxy-scale strong lens systems in large-scale surveys such as the Dark Energy Survey (DES) (2019ApJS..243...17J; rojas2021strong), the Dark Energy Spectroscopic Instrument Legacy Imaging Surveys (Huang_2020; Huang_2021), Pan-STARRS (Lens_CNN&A...644A.163C) and the VOICE survey (2021arXiv210505602G).

An attractive feature of CNNs is that they can take the image directly as input and learn the image features, which makes them one of the most popular and robust architectures in use. Generally, the learning capacity of a neural network increases with the number of layers in the network. The network can then learn low-level features in the first layers and progressively more complex features with increasing depth (russakovsky2015imagenet; simonyan2015deep). However, increasing the number of layers in a neural network results in higher complexity, which in turn may lead to overfitting that cannot be cured entirely by dropout layers or other regularisation techniques (doi:10.1021/ci0342472). In addition, the gradient of the cost function decreases exponentially and eventually vanishes in very deep networks, which is commonly called the vanishing gradient effect (Hochreiter:91; Hochreiter:01book). Because of these two problems, creating very deep convolutional networks was a challenging task (2015arXiv150500387S).

However, the recently introduced idea of residual learning tackles these problems by introducing skip connections between the input and output of a few convolution layers (he2015deep). As a result, the CNN learns the difference between the inputs and outputs rather than their direct mapping. Thanks to the skip connections, the gradients can reach deeper layers, which mitigates the vanishing gradient effect. The authors of he2015delving were able to build models as deep as 1000 layers while increasing classification accuracy for the ImageNet Large-Scale Visual Recognition Challenge 2015. However, the scientific community is constantly looking for alternative, simpler solutions that can outperform the existing models at a reasonable computational cost.

Recently, there was a breakthrough in natural language processing (NLP) with the introduction of a new self-attention-based architecture known as the Transformer. The basic idea behind the transformer architecture is the attention mechanism, which has found a wide variety of applications in machine learning (guan2018diagnose; zhang2019selfattention; fu2019dual). Since then, there have been attempts to adapt the idea of self-attention to build better image processing models (ramachandran2019standalone; zhao2020exploring; tan2021explicitly). Recently, Facebook Inc. (carion2020endtoend) and Google Brain (dosovitskiy2020image) have been able to surpass existing image recognition models with transformer-based architectures. To the best of our knowledge, transformer-based models have not yet been employed in astrophysics or physics. In this paper, we explore the possibilities of this new architecture for detecting strongly gravitationally lensed systems.

We implemented various self-attention-based encoder models (Transformer Encoders) to find the gravitational lenses from the Bologna Lens Challenge and compared their performance with that of simple CNNs we created and of the CNNs that participated in the challenge. The main objective of our study was to explore how well transformer encoders are suited for finding strong lenses and how to optimize their performance. From our analysis, we found that the encoder models perform better than the CNN models: we were able to beat the top TPR_0 and TPR_10 scores (two evaluation metrics of the Bologna challenge) by a wide margin and to reach the top AUROC score reported during the challenge.

The paper is organized as follows. In Section 2, we give a brief description of the data we used to train our models. Section 3 gives a brief overview of the methodology used in our study, including the architecture of the models and information on how they were trained. The results of our analysis are presented in Section 4, and a detailed discussion of the results, with a brief comparison between the encoder models and the CNN models that participated in the challenge, is presented in Section 5. Section 6 concludes our analysis by highlighting the advantages of the encoder models over CNN models.

2 Data

The data used in this study are from the Bologna Strong Gravitational Lens Finding Challenge (Metcalf_2019). The challenge consisted of two separate challenges that could be entered independently. The first was designed to mimic the data set from surveys such as Euclid, which consists of single-band images. The second was designed to resemble data from ground-based detectors with multiple bands and was roughly modelled on the data from the Kilo-Degree Survey (KiDS) reported in 2013ExA....35...25D. However, the simulated images did not strictly mock the surveys; the surveys were only employed as references to set noise levels, pixel sizes, sensitivities, and other parameters. A mock simulated lens image from the challenge is shown in Fig. 1. The challenge opened on November 25, 2016, and closed on February 5, 2017. Surprisingly, automated methods such as CNNs and support vector machines (SVMs) showed far better results than human inspection. During the challenge, these methods were able to classify with high confidence images about which a human would have been in doubt.

Figure 1: Image of a Mock simulated Lens for the Challenge

An important result reported from the challenge was that colour information is crucial for finding strong lenses. All the methods that participated in the challenge performed better on the data from the ground-based observatories, which had four photometric bands (u, g, i, r), than on the data from the space-based detectors, which had a single band. Consequently, Metcalf_2019 advised adding even low-resolution information from other instruments or telescopes to higher-resolution single-band data in order to improve detection rates significantly. In other words, multiple bands make a significant difference, and upcoming surveys will perform better if they provide information in multiple bands. Hence, for our study, we chose the data from the ground-based observatories with four photometric bands (u, g, i, r). Since we are also interested in optimizing the transformer architecture and analyzing its performance, this richer data set was preferred for the study.

The mock images for the ground-based detector were created using the Millennium simulation and the GLAMER lensing code (2009MNRAS.398.1150B; 2014MNRAS.445.1942M). Sources from the Hubble Ultra Deep Field (UDF), decomposed into shapelet functions, were used to create the background objects that are lensed. There were 9,350 such sources with redshifts and separate shapelet coefficients in four bands. The visible galaxies associated with the lens were simulated using an analytic model for their surface brightness, namely a Sérsic profile. The parameters employed to simulate the galaxies were the total magnitude, the bulge-to-disc ratio, the disc scale height, and the bulge effective radius. The magnitude and bulge-to-disc ratio are functions of the passband. Each galaxy is given an inclination angle within a restricted range and a random orientation. An elliptical Sérsic profile with an axis ratio randomly sampled between 0.5 and 1 describes the bulge. The Sérsic index, $n$, is a function of x, a uniform random number between -1 and 1, and of the bulge-to-total flux ratio.
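For reference, the Sérsic surface-brightness profile is commonly written as I(R) = I_e exp{-b_n [(R/R_e)^(1/n) - 1]}. The sketch below evaluates this standard form. It assumes a circular profile and the common approximation b_n ≈ 2n - 1/3; the exact parameterization and parameter values used for the challenge simulations are not given in this text, and the function name `sersic_profile` is illustrative.

```python
import math

def sersic_profile(r, i_e, r_e, n):
    """Surface brightness at radius r for a circular Sersic profile.

    i_e : brightness at the effective radius r_e
    n   : Sersic index (n = 1 is an exponential disc, n = 4 a de Vaucouleurs bulge)
    """
    # Approximation b_n ~ 2n - 1/3 (an assumption of this sketch).
    b_n = 2.0 * n - 1.0 / 3.0
    return i_e * math.exp(-b_n * ((r / r_e) ** (1.0 / n) - 1.0))

# The profile equals i_e at the effective radius and falls off outwards.
print(sersic_profile(1.0, i_e=1.0, r_e=1.0, n=4.0))
print(sersic_profile(2.0, i_e=1.0, r_e=1.0, n=4.0))
```

A larger index n gives a more centrally concentrated core with more extended wings, which is why bulges are usually modelled with higher n than discs.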

The ground-based images consisted of simulated images in four bands (u, g, r, and i), with r as the reference band. In the challenge set, 85% of the images were purely simulated, and the other 15% were actual images chosen from a preliminary sample of bright galaxies taken directly from the KiDS survey. These real images were added to the challenge set for more realism. Some images had masked regions where stars, cosmic rays, and bad pixels had been removed. The noise for the mock images was simulated by adding normally distributed numbers with the variance given by the weight maps from the KiDS survey. For a detailed description of how the data were created, please refer to the challenge paper (Metcalf_2019).


2.1 Data Pre-Processing

The simulated data sets of the ground-based Bologna Strong Gravitational Lens Finding Challenge were in the FITS format and publicly available for download. The challenge data sets contained potential strong lens candidates, and the training set contained 20,000 images along with information about the total lens with noise, the foreground galaxies with noise, and the lensed background source without noise. In this work, we did not use this additional information and trained the models on the images alone. We used whole images (101 × 101 pixels) in all four photometric bands (u, g, i, r) as the input to the model, and the label stating whether a lens is present or not as the desired output. During training, the 20,000 images were split into two parts: 18,000 images were used to train the network, and the remaining 2,000 were used for validation. Before training the models, each image was re-scaled, and rotated copies of the images were used to augment the data.
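The pre-processing steps can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the min-max rescaling, the use of rotations by multiples of 90 degrees, and the random shuffling before the split are assumptions, since the text does not specify the rescaling or the rotation angles. Images are plain lists of rows for a single band; the same functions would be applied to each of the four bands.

```python
import random

def rescale(image):
    """Min-max rescale a 2D image (list of rows) to the range [0, 1]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for constant images
    return [[(v - lo) / span for v in row] for row in image]

def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def augment(image):
    """Return the image together with its 90, 180 and 270 degree rotations."""
    out = [image]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out

def train_val_split(items, n_train, seed=0):
    """Shuffle and split a list into training and validation parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items[:n_train], items[n_train:]

# 20,000 image indices split into 18,000 for training and 2,000 for validation.
train_idx, val_idx = train_val_split(range(20000), n_train=18000)
print(len(train_idx), len(val_idx))
```

Splitting indices rather than images keeps the split cheap and reproducible; the augmentation would then be applied to the training part only.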

3 Methodology

3.1 Convolution Neural Networks

The concept of using Convolutional Neural Networks to analyze image-like data was first proposed by Lecun. However, the breakthrough for image recognition with CNNs did not happen until Imagenet introduced an architecture that won the ImageNet Large-Scale Visual Recognition Challenge 2012. Since then, CNNs have been extensively employed in various research disciplines. A regular CNN can be thought of as consisting of two parts: the convolution layers, followed by the Fully Connected (FC) layers, which resemble the usual artificial neural networks (ANNs). The main advantage of convolution layers is that they can learn local spatial correlations in the data, so using multiple convolution layers helps to detect features in the data independently of their position.

During training, the input image is convolved with a number of small kernels (also called filters, typically of dimension 3 × 3) to produce feature maps, and the weights of these kernels are optimized during training. The final part of the CNN consists of the FC layers, the classic ANN neuron layers, which consolidate the information contained in the feature maps to generate the output. However, the spatial information a convolutional neural network can collect from the data is restricted by the size of its kernels, so it may miss global information. Since CNN models have high complexity and a large number of trainable parameters, they are usually prone to overfitting. In addition, the depth of CNN models is restricted by the vanishing gradient effect, in which the gradients vanish as the network gets deeper.
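To make the convolution step concrete, here is a minimal sketch of a single 3 × 3 kernel sliding over a small image without padding. The kernel values are hand-picked for illustration, whereas in a CNN they are learned; `conv2d_valid` is a hypothetical helper name.

```python
def conv2d_valid(image, kernel):
    """Slide a small kernel over a 2D image (no padding, stride 1).

    As in most deep-learning libraries, this is technically a
    cross-correlation: the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 3 x 3 vertical-edge kernel (hand-picked; in a CNN its weights are learned).
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
# A 4 x 5 image with a step edge between columns 1 and 2.
image = [[1, 1, 0, 0, 0]] * 4
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

Each output value depends only on a 3 × 3 neighbourhood of the input, which illustrates why a single convolution layer captures local correlations but not global structure.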

3.2 Self-Attention

Attention mechanisms have the potential to revolutionize machine learning, and they have proven particularly useful in natural language processing. Depending on the task at hand, various types of attention mechanisms can be employed. Among them, self-attention is one of the most widely used for image analysis. For a review of the various attention mechanisms, please refer to Yang_2020 and NIU202148. If each point in the feature map generated by the convolution layer is considered a random variable and the pairwise covariances are determined, then the value of each prediction can be enhanced or suppressed based on its similarity to other points in the feature map. In other words, the central idea of self-attention is to assign relative importance to the features of the input based on the input itself.

Figure 2: Multi head Attention Layer.

In general, the attention function can be defined mathematically as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V,$$

where $Q$, $K$ and $V$ are vectors and $d_k$ is the dimension of the key vector ($K$). When we compute the normalized dot product between the query ($Q$) and the key ($K$), we get a tensor ($QK^{T}$) that encodes the relative importance of the features in the key to the query (vaswani2017attention). For self-attention, the vectors ($Q$), ($K$) and ($V$) are identical. Hence multiplying the tensor ($QK^{T}$) with the vector ($V$) results in a vector that encodes the relative importance of features inside the input vector.

Physically, self-attention applied to feature vectors can be thought of as filtering the input features based on the correlations within the input. The structure of a multi-head attention layer is given in Fig. 2. Self-attention can be made more powerful by creating several attention layers and dividing the input vector into smaller parts (H, the number of heads). Each attention layer is then called a head and applies self-attention to one part of the divided input.
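The scaled dot-product attention of vaswani2017attention and the splitting of the input across heads can be sketched in a few lines. This is a simplified illustration under stated assumptions: the learned linear projections that real multi-head layers apply to obtain the query, key and value, and to mix the concatenated heads, are omitted, and the function names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of numbers."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors (one row per sequence element).
    """
    d_k = len(K[0])
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        for q in Q
    ]
    out = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return out, weights

def multi_head_self_attention(X, num_heads):
    """Split each feature vector into num_heads chunks and attend per chunk.

    Simplification: no learned projections (an assumption of this sketch).
    """
    d = len(X[0])
    assert d % num_heads == 0, "feature dimension must be divisible by num_heads"
    step = d // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [x[h * step:(h + 1) * step] for x in X]
        head_out, _ = attention(chunk, chunk, chunk)  # self-attention: Q = K = V
        heads.append(head_out)
    # Concatenate the head outputs back to the original feature dimension.
    return [sum((head[i] for head in heads), []) for i in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, 0.5],
     [1.0, 1.0, 0.0, 1.0]]
Y = multi_head_self_attention(X, num_heads=2)
print(len(Y), len(Y[0]))  # 3 4
```

Each output row is a weighted average of the input rows, with weights given by their mutual similarity, which is the filtering behaviour described above.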

3.2.1 Positional Encoding

If we pass the input directly to the attention layers, the input order, or positional information, is lost, since transformer models are permutation invariant. To preserve the information regarding the order of the features, we use positional encoding; without it, the performance of a transformer model is lower. Following the work of vaswani2017attention, we used fixed positional encoding defined by the functions

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right),$$

where $pos$ is the position, $i$ is the dimension and $d_{\mathrm{model}}$ is the dimension of the input feature vector. That is, each dimension of the positional encoding corresponds to a sinusoid function. For a detailed description of positional encoding and its importance, please refer to vaswani2017attention; liutkus2021relative; su2021roformer; chen2021demystifying.
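A minimal sketch of the fixed sinusoidal encoding of vaswani2017attention is given below: even feature indices use a sine and odd indices a cosine, with a wavelength that grows with the feature index. The function name and the toy sizes are illustrative, and a one-dimensional position index is assumed (for an image feature map this would be the index of the flattened location).

```python
import math

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encoding: one row per position, one column per dimension."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            # Columns 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=8)
# Position 0 alternates sin(0) = 0 and cos(0) = 1 along the feature dimension.
print(pe[0])
```

Because the encoding is a fixed function of the position, it adds no trainable parameters and is simply summed with the feature vectors before the encoder.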

3.3 Transformer Encoder

The Transformer models we constructed to detect strongly lensed gravitational systems were inspired by the DEtection TRansformer (DETR) model employed by Facebook (carion2020endtoend). As shown in Fig. 3, the transformer encoder has a very simple architecture and contains three main parts, which are listed below.

Figure 3: Architecture of the Transformer Encoder
  1. The first component of the architecture is a simple CNN that extracts the features from the image. The output of the CNN backbone is a feature map with dimensions H × W × D, where D is the number of filters in the last convolution layer. The encoder requires a sequence as input; hence we reshape the output of the CNN into a D × HW feature map.

  2. As mentioned earlier, the transformer architecture is permutation-invariant; hence we add fixed positional encoding to the output of the CNN backbone before passing it to the transformer encoder layer. After the CNN backbone come the self-attention-based encoder layers, which process and filter the relevant features extracted by the CNN. The encoder layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN).

  3. The final part of the model is a feed-forward network (FFN), similar to that of regular CNNs, which learns from the features filtered by the encoder layers. The model’s output is a single neuron with a sigmoid activation function that predicts the probability that the input image contains a lens.
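The hand-over from the CNN backbone to the encoder described in the first two steps can be sketched as follows. This is an illustration with toy sizes, not the authors' implementation: the H × W × D feature map is flattened into a sequence of H·W vectors of length D and the positional encoding is added element-wise. The encoding used here is a placeholder that merely marks the position index.

```python
def flatten_feature_map(fmap):
    """Turn an H x W x D feature map into a sequence of H*W vectors of length D."""
    return [pixel for row in fmap for pixel in row]

def add_encoding(sequence, encoding):
    """Element-wise sum of a sequence of feature vectors and a positional encoding."""
    return [[f + e for f, e in zip(feat, enc)]
            for feat, enc in zip(sequence, encoding)]

# Toy 2 x 3 feature map with D = 2 channels per location.
H, W, D = 2, 3, 2
fmap = [[[float(10 * i + j), 1.0] for j in range(W)] for i in range(H)]
sequence = flatten_feature_map(fmap)
# Placeholder encoding that simply marks the position index (illustrative only).
encoding = [[0.0, 0.1 * p] for p in range(H * W)]
encoder_input = add_encoding(sequence, encoding)
print(len(encoder_input), len(encoder_input[0]))  # 6 2
```

Flattening in row-major order means that location (i, j) of the feature map ends up at sequence index i·W + j, which is the position the encoding refers to.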

We created 21 encoder models with different structures to study how the hyperparameters of the encoder affect the model’s performance. We used the Exponential Linear Unit (ELU) function as the activation function for all the layers in these models. We initialized the weights of our models with the Xavier uniform initializer, and all layers were trained from scratch by the ADAM optimizer with the default exponential decay rates (Glorot2010UnderstandingTD; kingma2017adam).
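For readers unfamiliar with these choices, the ELU activation and the Xavier (Glorot) uniform initializer can be written out directly. This is a generic sketch with illustrative layer sizes, not the authors' code. ELU is the identity for positive inputs and alpha·(exp(x) - 1) otherwise, and Xavier uniform draws weights from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The default decay rates suggested in the Adam paper are 0.9 and 0.999; that these are the values meant in the text is an assumption.

```python
import math
import random

def elu(x, alpha=1.0):
    """Exponential Linear Unit: identity for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
weights = xavier_uniform(fan_in=128, fan_out=64, rng=rng)  # layer sizes are illustrative
limit = math.sqrt(6.0 / (128 + 64))
print(elu(2.0), elu(0.0))  # 2.0 0.0
```

Scaling the initial weights by the layer width keeps the variance of the activations roughly constant from layer to layer at the start of training.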

3.4 Lens Detector

Figure 4: Architecture of the Lens Detector 15

Among the encoder models we created, the best performance was given by the model that uses a CNN backbone similar to the ’LASTRO-EPFL’ model from the Bologna Lens Challenge (Metcalf_2019). Based on this architecture, we present the two best architectures, Lens Detector 15 and Lens Detector 16, which outperformed all the other models. The architecture of Lens Detector 15 is given in Fig. 4.

Figure 5: Architecture of the Lens Detector 16

The model Lens Detector 15 was first trained for 300 epochs from an initial learning rate and then for another 100 epochs starting from a new learning rate. This version of the Lens Detector gave high scores in all three evaluation metrics of the challenge. Lens Detector 16 was created by stacking two Lens Detector 15 models in parallel and combining their outputs through an additional dense layer connected to a single output neuron. The architecture of Lens Detector 16 is given in Fig. 5. Lens Detector 16 was first trained for 100 epochs from an initial learning rate and then for another 100 epochs starting from a new learning rate. It was subsequently trained for 50 epochs and then for another 200 epochs, again with a separate learning rate for each stage.

3.5 Metrics for Evaluation

We start with a brief overview of the metrics we used to quantify the performance of the lens classification. We used classification accuracy as the metric for comparing the transformer models we created. The classification accuracy is calculated as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. Apart from the classification accuracy, a popular figure of merit for a classifier is the Area Under the Receiver Operating Characteristic curve (AUROC) (Metcalf_2019). The receiver operating characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) as a function of the threshold. The TPR is the ratio of detected lenses to the total number of lenses:

$$\mathrm{TPR} = \frac{TP}{TP + FN}.$$

The TPR measures how well the classifier detects lenses from the whole population of objects. The FPR can be understood as a contamination rate in the classification and is defined as the fraction of non-lens images wrongly identified as lenses:

$$\mathrm{FPR} = \frac{FP}{FP + TN}.$$

The AUROC score assesses the overall ability of a classifier to distinguish between classes. A perfect classifier has AUROC = 1.0, with TPR = 1.0 and FPR = 0.0 for any threshold, whereas a random classifier has AUROC = 0.5, with TPR = FPR for any threshold. For the Bologna Lens Challenge, the participants were instructed to optimize the AUROC rather than the accuracy. In addition, two more figures of merit were considered for the competition: TPR_0 and TPR_10. TPR_0 is defined as the highest TPR reached, as a function of the probability threshold p, before a single false positive occurs in the test set of 100,000 cases. This is the point where the ROC curve meets the FPR = 0 axis. If the classifier assigns a high lens probability to even one non-lensed image, TPR_0 will be low. Similarly, TPR_10 is defined as the TPR at the point where fewer than ten false positives have been made. High TPR_0 and TPR_10 values indicate that the classifier distinguishes well between lensed and non-lensed images.
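The metrics of this section can be computed from a list of true labels and predicted lens probabilities, as sketched below. The helper names and the toy data are illustrative and not taken from the challenge code. The AUROC is computed as the probability that a randomly chosen lens receives a higher score than a randomly chosen non-lens, which equals the area under the ROC curve, and TPR_0 and TPR_10 are obtained by walking down the ranked list until the first or tenth false positive. Ties in score are not treated specially.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count TP, TN, FP, FN for binary labels (1 = lens) at a given threshold."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def tpr(tp, fn):
    return tp / (tp + fn)

def fpr(fp, tn):
    return fp / (fp + tn)

def auroc(labels, scores):
    """Probability that a random lens scores higher than a random non-lens.

    O(n^2) pair counting; fine for a small example.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_before_nth_fp(labels, scores, n):
    """TPR reached while fewer than n false positives have occurred.

    n = 1 gives TPR_0 and n = 10 gives TPR_10.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    total_pos = sum(labels)
    tp = false_pos = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            false_pos += 1
            if false_pos == n:
                break
    return tp / total_pos

labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
tp, tn, fp, fn = confusion_counts(labels, scores)
print(accuracy(tp, tn, fp, fn), tpr(tp, fn), fpr(fp, tn))  # 0.875 1.0 0.25
print(auroc(labels, scores), tpr_before_nth_fp(labels, scores, n=1))  # 0.9375 0.75
```

In the toy data a single non-lens with score 0.70 outranks one of the four lenses, which is enough to pull TPR_0 down to 0.75 although the AUROC stays high; this is why TPR_0 is such a demanding metric.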

4 Results

We created four convolution models to use as backbones for the encoder models and 21 encoder models to study how the hyperparameters of the encoder layer affect the performance. Since each architecture was implemented as a regression model, a probability of 0.5 was set as the threshold for classifying an image as a lens or not. Thus, input images with a prediction value below 0.5 were classified as non-lensed (label 0) and the remaining images as lensed (label 1). Since the total classification accuracy is one of the most common metrics for evaluating a model, we used accuracy to compare the created models. Table 1 lists the architecture, total accuracy, AUROC, TPR_0, and TPR_10 of all created models.

| Model Name | Model Structure | Accuracy (%) | AUROC | TPR_0 | TPR_10 |
|---|---|---|---|---|---|
| CNN 1 | 5 CNN Layers | 88.21 | 0.951 | 0.000 | 0.07 |
| CNN 2 | 4 CNN Layers | 86.74 | 0.915 | 0.000 | 0.4 |
| CNN 3 | 8 CNN Layers | 88.51 | 0.968 | 0.033 | 0.37 |
| CNN 4 | 3 CNN Layers | 88.49 | 0.956 | 0.000 | 0.68 |
| Lens Detector 1 | CNN 1 + 1 H + 1 (E) | 89.57 | 0.961 | 0.000 | 0.100 |
| Lens Detector 2 | CNN 2 + 1 H + 1 (E) | 88.13 | 0.502 | 0.001 | 0.001 |
| Lens Detector 3 | CNN 2 + 2 H + 1 (E) | 88.00 | 0.962 | 0.018 | 0.018 |
| Lens Detector 4 | CNN 2 + 2 H + 1 (E) | 88.12 | 0.952 | 0.121 | 0.124 |
| Lens Detector 5 | CNN 2 + 4 H + 2 (E) | 88.46 | 0.955 | 0.125 | 0.133 |
| Lens Detector 6 | CNN 2 + 4 H + 4 (E) | 89.51 | 0.957 | 0.003 | 0.004 |
| Lens Detector 7 | CNN 3 + 8 H + 2 (E) | 91.45 | 0.968 | 0.000 | 0.410 |
| Lens Detector 8 | CNN 4 + 2 H + 2 (E) | 89.43 | 0.954 | 0.000 | 0.758 |
| Lens Detector 9 | 3 CNN Layers + 2 H + 2 (E) | 89.61 | 0.959 | 0.000 | 0.789 |
| Lens Detector 10 | 5 CNN Layers + 8 H + 2 (E) | 90.58 | 0.970 | 0.180 | 0.23 |
| Lens Detector 11 | 5 CNN Layers + 8 H + 4 (E) | 90.45 | 0.966 | 0.219 | 0.34 |
| Lens Detector 12 | 8 CNN Layers + 8 H + 4 (E) | 89.82 | 0.960 | 0.040 | 0.680 |
| Lens Detector 13 | 8 CNN Layers + 8 H + 4 (E) | 91.94 | 0.975 | 0.175 | 0.525 |
| Lens Detector 14 | 8 CNN Layers + 8 H + 4 (E) | 91.95 | 0.975 | 0.002 | 0.539 |
| Lens Detector 15 | 8 CNN Layers + 8 H + 4 (E) | 92.99 | 0.978 | 0.140 | 0.48 |
| Lens Detector 16 | 16 CNN Layers + 8 H + 8 (E) | 90.97 | 0.962 | 0.225 | 0.24 |
| Lens Detector 17 | 16 CNN Layers + 8 H + 8 (E) | 92.19 | 0.973 | 0.00 | 0.717 |
| Lens Detector 18 | 16 CNN Layers + 8 H + 8 (E) | 92.09 | 0.976 | 0.113 | 0.590 |
| Lens Detector 19 | 16 CNN Layers + 16 H + 8 (E) | 90.03 | 0.961 | 0.114 | 0.115 |
| Lens Detector 20 | 25 CNN Layers + 8 H + 4 (E) | 91.26 | 0.972 | 0.212 | 0.223 |
| Lens Detector 21 | 8 CNN Layers + 8 H + 4 (E) | 92.79 | 0.98 | 0.00 | 0.64 |

Table 1: Architecture, accuracy, AUROC, TPR_0, and TPR_10 of all the models in chronological order of creation. The encoder models are named ’Lens Detector’ followed by a number. The model structure indicates whether the model uses transfer learning in the CNN backbone or not. The term ’8 H’ in the model structure means there are eight heads with dimension 128 in one encoder, and E represents the number of encoders.
Figure 6: ROC curve and confusion matrix of Lens Detector 15 on the challenge data set. Class 0 represents the non-lensed images and Class 1 represents the lensed images.

Among the created encoder models, the highest accuracy was achieved by Lens Detector 15, and the highest AUROC, TPR_0, and TPR_10 were achieved by Lens Detector 21, Lens Detector 16, and Lens Detector 9, respectively. Of the models presented here, we would like to highlight Lens Detector 15 as the best model, since it performs well in all categories and has the highest classification accuracy. The receiver operating characteristic (ROC) curve and the confusion matrix of Lens Detector 15 are given in Fig. 6. Similarly, Lens Detector 13, Lens Detector 18 and Lens Detector 20 can also be considered among the better classifiers, since they have an AUROC equivalent to that of the second-best model that participated in the challenge and better TPR_0 and TPR_10 than all the models that participated in the challenge.

5 Discussion

5.1 Transformers and Models from Bologna Lens Challenge

The Bologna Lens Challenge was intended to improve the efficiency and reduce the biases of tools for finding strong gravitational lenses on galactic scales. It was clear from the challenge that automated methods such as CNNs and SVMs have a clear advantage over conventional methods. The performance of all these models was evaluated using the AUROC, TPR_0, and TPR_10 scored on the challenge set. We compare the performance of these models with that of the encoder models to show the advantages of encoder models over CNN and SVM models.

During the challenge, TPR_0 was used to heavily penalize classifiers with discrete ranking, because their highest classification level was not conservative enough to eliminate all false positives. In this category, an SVM model named ’Manchester SVM’ won the competition with a score of 0.22, and the second-best model was a CNN model named ’CMU-DeepLens-Resnet-ground3’ with a score of 0.09 (Metcalf_2019; Hartley_2017; Lanusse_2017). As for the other models that participated in the challenge, maximizing TPR_0 was a tough task for the encoder models. Nevertheless, the encoder models performed very well compared to the CNN models that participated in the challenge. The TPR_0 results for the top three encoder models and the top three models that participated in the challenge are listed in Table 2. Notably, we would like to highlight Lens Detector 16, which achieved a TPR_0 of 0.225, and Lens Detector 11, which achieved a TPR_0 of 0.219; both values are very high compared to those of the CNNs that participated in the challenge.

| Name | AUROC | TPR_0 | TPR_10 | Model Type |
|---|---|---|---|---|
| Lens Detector 16 | 0.962 | 0.225 | 0.24 | Transformer |
| Manchester SVM | 0.93 | 0.220 | 0.35 | SVM/Gabor |
| Lens Detector 11 | 0.966 | 0.219 | 0.34 | Transformer |
| Lens Detector 15 | 0.978 | 0.140 | 0.48 | Transformer |
| CMU-DeepLens Resnet-ground3 | 0.98 | 0.09 | 0.45 | CNN |
| LASTRO EPFL | 0.97 | 0.07 | 0.11 | CNN |

Table 2: Comparison of the encoder models and the models that participated in the Bologna Lens Challenge, listed in decreasing order of TPR_0.
Figure 7: Variation of the loss function with epochs for Lens Detector 13 and CNN 3, respectively. Lens Detector 13 uses CNN 3 as its CNN backbone.

The next metric used to evaluate the models in the challenge was TPR_10, for which the CNN model named ’CMU-DeepLens-Resnet-ground3’ scored the highest during the challenge with a score of 0.45. In this category, the encoder models clearly outperformed all the models that participated in the challenge. In particular, Lens Detector 9 achieved TPR_10 = 0.79. The TPR_10 results for the top three encoder models and the top three models that participated in the challenge are listed in Table 3. Three of our models scored a TPR_10 above 0.70, which is very high compared to the top TPR_10 reported during the challenge. From Table 1, it is clear that most encoder models achieved a higher score in this category than the models that participated in the challenge.

| Name | AUROC | TPR_0 | TPR_10 | Model Type |
|---|---|---|---|---|
| Lens Detector 9 | 0.959 | 0.00 | 0.789 | Transformer |
| Lens Detector 8 | 0.954 | 0.00 | 0.758 | Transformer |
| Lens Detector 17 | 0.973 | 0.00 | 0.717 | Transformer |
| CMU-DeepLens | 0.98 | 0.09 | 0.45 | CNN |
| Manchester SVM | 0.93 | 0.220 | 0.35 | SVM/Gabor |
| LASTRO EPFL | 0.97 | 0.07 | 0.11 | CNN |

Table 3: Comparison of the encoder models and the models that participated in the Bologna Lens Challenge, listed in decreasing order of TPR_10.

Turning to the third figure of merit used in the Bologna Lens Challenge, the AUROC, we can see that Lens Detector 21 was able to reach the highest AUROC reported in the challenge (Metcalf_2019). The top three encoder models and the top three models from the challenge by AUROC are listed in Table 4. The winning model, CMU-DeepLens, had an AUROC of 0.98 and a TPR_10 of 0.45, which were the highest in their respective categories. However, CMU-DeepLens was a 50-layer-deep ResNet with around 23 million parameters (Lanusse_2017), whereas Lens Detector 21 had only 3 million parameters and achieved an AUROC of 0.9809, which is very close to the performance of CMU-DeepLens (AUROC = 0.9814).

Name AUROC TPR_0 TPR_10 Model Type
CMU-DeepLens 0.981 0.09 0.45 CNN
Lens Detector 21 0.981 0.00 0.64 Transformer
CMU-DeepLens 0.980 0.02 0.10 CNN
Lens Detector 15 0.978 0.140 0.48 Transformer
Lens Detector 18 0.976 0.113 0.59 Transformer
LASTRO EPFL 0.97 0.07 0.11 CNN
Table 4: Comparison of the encoder models and the models that participated in the Bologna Lens Challenge, listed in decreasing order of AUROC.
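
For reference, the AUROC values in this comparison measure the probability that a randomly chosen lens receives a higher score than a randomly chosen non-lens. A minimal sketch of that computation, using the rank-sum (Mann-Whitney) form so that it scales to a test set of 100 000 images, is given below. The function name `auroc` and the toy data are illustrative assumptions, not the evaluation code used for the tables.

```python
import numpy as np

def auroc(scores, labels):
    """Area under the ROC curve from the rank-sum statistic.

    Tied scores receive their average rank, so a tie between a lens and a
    non-lens counts as half a correct ordering.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    # 1-based average ranks in ascending order of score.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)                 # rank of the last member of each tie group
    average_rank = last_rank - (counts - 1) / 2.0
    ranks = average_rank[inverse]

    # Number of (lens, non-lens) pairs that are ordered correctly.
    u_statistic = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_statistic / (n_pos * n_neg))

# Toy example: eight images, four of them lenses.
toy_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.20]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

Because the AUROC averages over all thresholds, two models with nearly identical AUROC can still differ strongly in TPR_0 and TPR_10, which only probe the most confident end of the ranking.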

5.2 Insights into Transformers

An initial glance at the results shows that the encoder models perform better than the CNN models. However, a closer look indicates that an encoder model is only about as good as its CNN backbone, although always slightly better. This is because the encoder models rely on the CNN backbone to extract the features, so their performance depends on the backbone. For example, the lowest accuracy among the encoder models was obtained by Lens Detector 3, which still performed better than its backbone, CNN 2. A similar trend can be observed for the other encoder models that use trained CNNs as their backbone. These observations show that the encoder models improve on the accuracy of their CNN backbones by a small percentage.

Analyzing the results from the encoder models, we can see that increasing the number of heads and the depth of the encoder increases the model’s performance. During training, it was also found to speed up the learning process. This points to an interesting aspect of the encoder models: the encoder’s performance increases with the number of trainable parameters in the encoder layer, specifically in the multi-head attention layer. The higher the number of trainable parameters, the better the learning curve and the performance. However, for a given CNN backbone, the performance saturates beyond a certain limit.
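
To make the link between the number of heads and the number of trainable parameters concrete, the following is a minimal NumPy sketch of a single multi-head self-attention layer acting on a set of feature vectors such as those produced by a CNN backbone. It is not the implementation used for the Lens Detectors. It assumes that each head has a fixed dimension `head_dim`, so that adding heads adds parameters, and it omits the feed-forward sub-layer, layer normalization and residual connections of a full encoder layer. All names and sizes are hypothetical.

```python
import numpy as np

def init_attention(d_model, num_heads, head_dim, rng):
    """Random weights for one attention layer (fixed per-head dimension assumed)."""
    inner = num_heads * head_dim
    def dense(n_in, n_out):
        return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out))
    return {
        "wq": dense(d_model, inner), "bq": np.zeros(inner),
        "wk": dense(d_model, inner), "bk": np.zeros(inner),
        "wv": dense(d_model, inner), "bv": np.zeros(inner),
        "wo": dense(inner, d_model), "bo": np.zeros(d_model),
    }

def count_params(params):
    """Total number of trainable scalars in the layer."""
    return sum(p.size for p in params.values())

def self_attention(x, params, num_heads):
    """Apply multi-head self-attention to x of shape (n_tokens, d_model)."""
    n_tokens = x.shape[0]

    def split_heads(t):
        # (n_tokens, heads * head_dim) -> (heads, n_tokens, head_dim)
        return t.reshape(n_tokens, num_heads, -1).transpose(1, 0, 2)

    q = split_heads(x @ params["wq"] + params["bq"])
    k = split_heads(x @ params["wk"] + params["bk"])
    v = split_heads(x @ params["wv"] + params["bv"])

    # Scaled dot-product attention; softmax over the key axis.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output token is a weighted mixture of all input features.
    mixed = (weights @ v).transpose(1, 0, 2).reshape(n_tokens, -1)
    return mixed @ params["wo"] + params["bo"], weights

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 64))   # 16 feature tokens of width 64 (hypothetical)
params_4 = init_attention(d_model=64, num_heads=4, head_dim=16, rng=rng)
params_8 = init_attention(d_model=64, num_heads=8, head_dim=16, rng=rng)
output, attention = self_attention(features, params_4, num_heads=4)
```

With these hypothetical sizes, going from 4 to 8 heads raises `count_params` from 16 640 to 33 216, and stacking layers multiplies the count by the encoder depth. The attention weights form a row-normalized mixing matrix per head, which is the sense in which the layer re-weights, or filters, the features extracted by the backbone.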

Another striking feature is that, unlike for convolution layers, increasing the number of layers and the number of parameters in the encoder has only a very slight effect on over-fitting. Since the self-attention layers act as filters for the features extracted by the CNN, an increase in the number of parameters in the encoder layers helps the model to filter the features faster and more effectively without causing over-fitting. The filtering and smoothing effect of the self-attention layers can also be seen by comparing the loss curves of the CNN and encoder models. The largest encoder model we created had a total of 25 convolution layers in its backbone. We trained this network without using skip connections in the convolution layers, which would have been very difficult for a conventional CNN.

Figure 8: Comparison of the prediction capacity of the CNN and the Lens Detector. In this histogram, values close to zero correspond to a prediction that no lens is present in the image, and values close to one to a prediction that a lens is present.

It is worth pointing out that the encoder models can distinguish SL from non-SL systems better than the CNN. The distribution of the predicted lens probability for the challenge data set is depicted in Fig. 8. The encoder model assigns a probability for an input to be a lens or a non-lens with greater confidence than the CNN. Furthermore, from Fig. 8 it is clear that the transformer models approximately mimic a perfect classifier by assigning a probability of 0 to non-lenses and a probability of 1 to lensed images. This feature of the encoder model will be useful in the upcoming large-scale surveys for narrowing down the potential lensing systems with high confidence. Hence, the encoder models are strong competitors to other machine learning models.

We also tried transfer learning by using an already trained CNN as the backbone of the encoder model. Surprisingly, the encoder models that do not use transfer learning performed slightly better than those that do. Since a trained CNN has already learnt to extract specific features of an image, the encoder model built on that backbone is restricted to minimizing the loss function in only a small part of the parameter hyperspace. The self-attention layers can then only filter the features and improve the accuracy by a small percentage (e.g. CNN 2: 86.74% and Lens Detector 2, with CNN 2 as the backbone: 88.13%). For a model without transfer learning, by contrast, the CNN part of the encoder model can learn more features of the image than a standalone CNN and improve the accuracy further (e.g. Lens Detector 7, with a CNN 3 backbone: 91.45%, and Lens Detector 15, with CNN 3 as the backbone without transfer learning: 92.99%). However, this result cannot be generalized, since it also depends on the trained CNN backbone.

Recently, updated versions of the CNN models that participated in the challenge were reported by magro2021comparative, with better scores in every category than the previous versions. They used the same CNNs that participated in the challenge and retrained the networks for different numbers of epochs. Even though the scores improved, it was evident from the report that the performance of the models depends strongly on the number of epochs. In other words, the CNNs reported in the Bologna Lens Challenge were less stable, and their training has to be monitored carefully in order to achieve good results. In contrast, the encoder models are highly stable: as we can see from Fig. 7, we were able to train the encoder models for up to 2000 epochs without any sign of over-fitting, and the validation loss remained stable up to the end.

Figure 9: Images of False positives in four bands (u, g, r, and i respectively) reported by the encoder models. Image ID from the test data is given below for each set of images.
Figure 10: Images of False negatives in four bands (u, g, r, and i respectively) reported by the encoder models. Image ID from the test data is given below for each set of images.

Even though the encoder models perform better than the convolution models and the other models that participated in the challenge, the encoder models trained here still fall slightly short of a perfect classifier. We carefully examined the frequent false positives and false negatives reported by the various encoder models. Some of the images identified as false positives and false negatives are shown in Fig. 9 and Fig. 10. Looking at these cases, we can see that the encoder models check whether the input image contains an arc-like structure or multiple distorted images. If the input image shows either of these characteristics in at least one of the bands, the detector identifies the image as a strong lens. Conversely, if both features are missing, the detector classifies the image as a non-lens. In order to improve the performance of the models, they need to be trained on more realistic and complex data.

We would also like to comment on another CNN model reported for the Bologna Lens Challenge, LensCNN, which achieved a total accuracy of 0.8749 (TP: 0.8817 and TN: 0.8682). It is the only CNN model for which a classification accuracy on the Bologna Lens Challenge has been reported (2019MNRAS.487.5263D). Looking at our results, we can see that all of our encoder models surpass LensCNN in total accuracy. Furthermore, LensCNN reported an AUROC of 0.96 on the challenge data, which is surpassed by most of our encoder models. In this context, it is worth mentioning that LensCNN had approximately 10 million parameters, and we were able to outperform it with just 3 million parameters.
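
As a quick consistency check, the total accuracy quoted for LensCNN can be recovered from its true-positive and true-negative rates if the evaluation set contains equal numbers of lenses and non-lenses. That balance is an assumption made here for illustration and is not stated in the cited work; the function name `total_accuracy` is likewise hypothetical.

```python
def total_accuracy(tp_rate, tn_rate, lens_fraction=0.5):
    """Overall accuracy as the class-weighted mean of the per-class rates.

    lens_fraction is the fraction of lenses in the evaluation set; the
    default of 0.5 is an assumption, not a reported value.
    """
    if not 0.0 <= lens_fraction <= 1.0:
        raise ValueError("lens_fraction must lie in [0, 1]")
    return lens_fraction * tp_rate + (1.0 - lens_fraction) * tn_rate

# Rates quoted above for LensCNN.
lenscnn_accuracy = total_accuracy(tp_rate=0.8817, tn_rate=0.8682)
```

Under this assumption the result is 0.87495, in line with the quoted total accuracy of 0.8749.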

6 Conclusion

We have presented a novel machine learning approach, based on Transformers, to detect strong gravitational lenses in the Bologna Lens Challenge data. We have explored the possibilities of this new architecture to understand better how to apply transformer models to image analysis. Currently, most of the automated techniques employed to find strong lenses are based on convolutional neural networks (CNNs). However, as noted in Metcalf_2019, CNNs are prone to overfitting the training set. Here we showed that attention-based architectures provide better stability and are less likely to overfit than CNNs. Another main advantage of an attention-based encoder over a CNN is that it performs better with fewer parameters. Hence, self-attention based encoder models can be considered a better alternative to CNNs and other automated methods.

We created 21 encoder models to study the Transformer architecture. We present the three best models, which perform more reliably than the models that participated in the Bologna Lens Challenge. Lens Detector 21 scored an AUROC of 0.9809, which is equivalent to the top AUROC score achieved in the challenge. Similarly, Lens Detector 16 scored a TPR_0 of 0.225, higher than any model that participated in the challenge, and surpassed the top TPR_0 scored by the CNNs by a wide margin. We consider Lens Detector 15 the best encoder model, as it scored 0.14 and 0.48 for TPR_0 and TPR_10 respectively, outperforming the CNN models by a wide margin, while also scoring an AUROC of 0.9783, which is very close to the top AUROC score.

Our analysis shows that the encoder models are more stable than CNNs, which minimizes the need for human interaction or monitoring. Similarly, the encoder models were found to be better than the CNN models at separating lenses from non-lenses, assigning confident probability scores to both the lens and the non-lens systems. In addition, the architecture we proposed here is simple and robust and has a high resistance to overfitting. We could train models as deep as 25 layers, and for 2000 epochs, without any sign of overfitting. Depending on the problem, a residual neural network (ResNet) can also be used instead of a plain CNN, enabling deeper networks. As noted in Schaefer_2018, the simulated images are not very complex, and for that reason we did not use a deeper ResNet as the backbone of the encoder model. With a simple eight-layer deep CNN, we were able to surpass the performance of a 50-layer deep ResNet and to surpass all the other models to a great extent.

In the upcoming era of big data in astronomy, automated methods are expected to play a crucial role. Better and alternative automated methods have to be investigated continually to advance scientific study in this scenario. Our study indicates that the search for strong lenses in current and upcoming wide-field surveys such as KiDS (2019A&A...625A...2K), HSC (2019PASJ...71..114A), DES (2021), LSST (Ivezi__2019; verma2019strong), Euclid (scaramella2021euclid) and WFIRST (koekemoer2019ultra) can be carried out using self-attention based encoder models with better performance and lower computational cost than CNNs.