Phishing is a fraudulent attempt to obtain users’ sensitive information, such as usernames, passwords, and credit card numbers. Specifically, an adversary may build a fake website whose domain name and web page layout are similar to those of the real one. The adversary then spreads the fake domain link (FDL) to benign users (e.g., through SMS messages, emails, or social websites). Some users may treat an FDL as genuine, follow it, and get directed to the fake website. After entering their credentials to log into the fake website, users leak their sensitive information to the adversary.
Two-factor authentication (2FA) is a user authentication technique that requires end users to hold at least two types of information that can confirm their claimed identities, based on something they know, something they have, or something they are. With 2FA, even if users’ passwords are stolen, their accounts remain protected as long as the attacker is unable to obtain the second authentication factor. Therefore, 2FA has seen increasing deployment recently, with 53% of surveyed users adopting it for account protection in 2019 (33).
While 2FA can significantly improve account security, the arms race with phishing has never stopped. Traditional 2FA is still vulnerable to phishing because a deceived user may input their second factor information (e.g., PINs received through emails or SMS) into the fake website, which defeats the second layer of protection. Moreover, in the past an adversary had to manually design the web page layout to mimic the real one, which was time-consuming. Recently, new real-time phishing (RTP) tools, such as "Evilginx" (12), have made the situation even worse. Now, an adversary only has to download the tool and run it with a proper configuration to automatically replicate the real website. In other words, RTP tools have significantly lowered the technical barriers for adversaries to launch more powerful phishing attacks.
Several methods have been proposed to detect traditional phishing by measuring the similarity of web page elements (e.g., image size, position) (Medvet et al., 2008) or of the tree structures of two websites (Rosiello et al., 2007). Unfortunately, such methods do not work against RTP, because the fake website keeps replicating its content from the real website through a reverse proxy. For the same reason, human interactive proof based methods, which rely on human users to determine whether the dynamically generated image surrounding the login window matches the image generated locally by the browser extension (Dhamija and Tygar, 2005b, a), will fail. Finally, phishing detection algorithms based on suspicious domains (Aggarwal et al., 2012) fall short because an adversary can change its domains frequently.
Recently, new 2FA systems have been introduced and deployed, such as Duo Push (11) and U2F (34). The research community has also proposed novel proof-of-concept 2FA systems (Karapanos et al., 2015; Czeskis et al., 2012; Shirvanian et al., 2014). However, most of them require special devices or hardware configurations (34; Karapanos et al., 2015), and some are still vulnerable to RTP attacks (11; Karapanos et al., 2015).
In this work, focusing on defeating advanced RTP attacks, we propose a new 2FA system called PhotoAuth. Here, after a user passes the first factor (e.g., password) authentication, the web server requires the user to take a photo of the web browser, with the domain name visible in the address bar, as the second authentication factor. The phone automatically uploads the photo to the web server through a web app invoked in the phone’s browser. The web server extracts the domain name from the photo using Optical Character Recognition (OCR), and then determines whether the user is visiting this website or a fake one, thus defeating RTP attacks, in which an adversary must set up a fake website with a different domain.
Compared with many other 2FA methods, PhotoAuth has several advantages. First, PhotoAuth can counter RTP attacks that traditional 2FA cannot handle. Second, PhotoAuth does not need any special device other than smartphones that are commonly used. No additional extension/plugin or Bluetooth is required for the browser, so users can even log into the web server securely on a public computer through PhotoAuth.
This work presents the design of PhotoAuth on both the phone side and the web server side. To build an efficient and attack-resilient PhotoAuth, we train a deep learning model for address bar detection based on transfer learning, and combine it with OCR output to extract the domain names correctly. We tested the accuracy and efficiency of PhotoAuth under different environments. The results show that PhotoAuth is an effective technique with good scalability, and that it is readily deployable.
In summary, we make the following contributions:
We propose PhotoAuth, a novel 2FA mechanism based on browser photos to counter real-time phishing attacks, homographic phishing attacks, and domain injection attacks.
Neither the computer nor the phone needs any pre-installed software or app to complete 2FA, which differentiates PhotoAuth from many other 2FA systems. In particular, no Bluetooth or special device is needed, and PhotoAuth is compatible with most legacy devices.
We propose an address bar content recognition mechanism to prevent adversaries from placing fake domain names in page titles or web content. We labeled and trained the first address bar detection model with 16,454 items.
Note that PhotoAuth is designed and implemented for the case where a user logs into a website from a PC browser and uses his phone as the second factor device. We understand that nowadays many people also use the same phone for website login. We provide a variation of our method to defeat RTP attacks in that case, as elaborated in the Appendix. The main body of our presentation focuses on the first case, which involves both a PC and a phone.
2.1. Real-time Phishing Workflow
Figure 1 shows how a real-time phishing (RTP) attack works, where the adversary sits between the benign user and the real website. The detailed steps are as follows.
Step 1: The adversary sets up a fake website (e.g., microsoft1.com), which replicates a target website (e.g., microsoft.com), with a mature RTP tool (e.g., Evilginx (12)). With proper settings, the RTP tool can establish the fake website automatically and make it a man-in-the-middle web proxy for microsoft.com. Then the adversary distributes the URL of the fake website to users through phishing channels.
Step 2: A user (referred to as Bob) does not pay close attention to the domain name and treats the fake website as the real one because of the same web-page layout. In this example, Bob inputs his Microsoft user name and password to microsoft1.com, the fake Microsoft phishing site.
Step 3: The adversary gets Bob’s credentials after Bob submits them to the fake website. Then the RTP tool opens a new session to access the real Microsoft website, and enters Bob’s credentials to log in. Now the adversary impersonates Bob on the Microsoft website. During the process, the RTP tool automatically modifies the appropriate message fields when relaying them between the benign user and the real website, so that neither the user nor the web server notices any difference from normal use cases.
Step 4: To verify the login, the Microsoft website sends a one-time password (OTP) to Bob’s phone.
Step 5: Bob gets the OTP from the Microsoft website and inputs it into the fake website for authentication.
Step 6: The adversary gets Bob’s OTP and inputs it into his own login session to the Microsoft website as Bob. Finally, he successfully logs into Bob’s account. The Microsoft website sees a valid login request from the user Bob, without knowing that an adversary sits in the middle.
2.2. Security Model
On the server side, we assume a 2FA mechanism has been implemented. The server first verifies the username and password provided by the user. If they are correct, the server challenges the user, through his phone, with something only the user has, to finish the second authentication. The target users of our new 2FA system are ordinary people, like the users of other 2FA systems. They know basic things about web apps, such as clicking a link to open it in a browser. They are also familiar with 2FA.
Certainly, no 2FA system is attack-proof if we assume the second factor can also be compromised. In our system, we assume neither the user’s PC browser nor his smartphone is compromised by the adversary. The link (referred to as the phone link) between the user’s smartphone and the website is also secure from interception (e.g., man-in-the-middle attack (21)). We consider the RTP attack alone. That is, we do not consider phishing mixed with other attacks like DNS spoofing (9), domain hijacking (10), browser hijacking (4), or system compromises.
2.3. Design Goals
We have the following design goals for our new 2FA system.
High Compatibility The system should be compatible with the traditional 2FA system to support most legacy devices. The traditional 2FA system can be upgraded to the new 2FA system easily without changing much of the 2FA workflow.
High Usability For pervasive deployment of our 2FA system, no special hardware other than smartphones will be required. To support 2FA, we will not require the installation of a browser plugin/extension; otherwise, users of a public computer (e.g., a public library computer) will not be able to use it. Moreover, some non-tech-savvy users may not know how to install browser plugins/extensions.
During the 2FA, the system will challenge a user for something beyond username and password. The instructions of the challenge should be as simple and intuitive as possible, so that a user does not need any special background knowledge in a specific area to answer the challenge. Verbose and complicated challenges that are time consuming to answer or hard to understand will easily kill usability, so they should be avoided in our design.
High Accuracy The system should provide high accuracy for authentication. In other words, it should incur very low false positive rate and low false negative rate at the same time. False positives happen when benign users failed to pass the 2FA even though they followed the procedure; false negatives happen when the adversary was able to pass the 2FA while impersonating a benign user. As our system assumes the first (i.e., password-based) authentication factor is unreliable, here the high accuracy requirement is only upon our second authentication factor, which should be very difficult or not possible for the adversary to forge.
3. System Architecture and Design
3.1. Design Considerations
To defeat the RTP attack, the first step is to detect it by distinguishing benign users from adversaries. Generally, there are three aspects on which such a distinction can be made.
The first aspect is based on the fact that benign users and adversaries have different IP addresses. If the web server has obtained an authenticated list of IP addresses used by each user (e.g., through explicit/implicit registration or valid historical use), it may base the authentication on IP address. However, as roaming is so common these days with mobile devices (e.g., with laptops and smartphones), many users do not have a fixed IP address. As such, this approach will not work well in reality, although IP address information may still be leveraged to assist in one way or another.
The second aspect is that the real web server and the fake one have different SSL/TLS certificates. However, a browser only checks whether a website is absent from its block list and has a valid certificate; it cannot determine whether the website is a phishing site. If a website has a valid certificate and hence encrypts the traffic between the users and the website, the browser will show a green lock icon. When users see such an icon, they may believe they are visiting a trusted website and that no one can eavesdrop on the communication. What users do not know is that an RTP tool like Evilginx is able to automatically obtain a valid SSL/TLS certificate for free (e.g., from Let’s Encrypt) and respond to ACME challenges (23) using its built-in HTTP server.
The third aspect is that the real website and the fake one have different domain names. While the adversary cannot register the same domain name that is owned by the real website, he can register a very similar one (e.g., microsoft1.com) or a seemingly valid one (e.g., microsoft.com.jp) to confuse the users.
As the RTP attack is getting so advanced, in our design we do not rely on end users to detect the attack by themselves. That is, we do not assume users are able to detect a phishing website based on domain names or browser green lock icons. Instead, it is the job of the web server’s side to distinguish benign users from RTP adversaries. In a nutshell, our system, like FIDO2 (13), leverages the third aspect, i.e., domain names, to distinguish benign users from adversaries, but does so on the server’s side, with a software 2FA mechanism involving smartphones only. No hardware device or pre-installed app is needed. Technically speaking, the main task is to deliver (the domain name part of) the website URL in the browser address bar to the authenticated web server in a convenient and secure way. If usability is not a concern, there could be many ways to achieve this goal. For example, with built-in browser support customized for U2F, the browser passes the URL to the locally mounted USB U2F device for signing and then transfers it to the web server for verification.
3.2. System Overview
In Steps 9 and 10, the PhotoAuth web app guides the user to take a photo and automatically uploads it to the server for processing. In Step 11, the PhotoAuth module extracts from the photo the domain name in the address bar of the PC browser, based on deep learning and OCR techniques. If the domain name does not match the server’s own, the user must be visiting a fake website and the authentication request must come from an adversary, so the module notifies the server to deny the authentication request. In the rare case of a false positive caused by poor photo quality, the user may retry the login process and provide a different photo.
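The comparison in Step 11 can be illustrated with a minimal Python sketch. The function names `extract_host` and `is_own_domain` are hypothetical, and accepting subdomains of the server's own domain is an assumption made here for illustration; the paper does not specify the matching rule beyond requiring the extracted domain to match the server's.

```python
from urllib.parse import urlsplit


def extract_host(address_bar_text: str) -> str:
    """Return the lower-cased host part of the text read from the address bar."""
    text = address_bar_text.strip().lower()
    # Browsers often hide the scheme, so add one to let urlsplit find the host.
    if "://" not in text:
        text = "https://" + text
    try:
        return urlsplit(text).hostname or ""
    except ValueError:
        # Malformed OCR output (e.g. stray brackets) is treated as "no host".
        return ""


def is_own_domain(address_bar_text: str, own_domain: str) -> bool:
    """Accept only the server's own domain or (assumption) one of its subdomains."""
    host = extract_host(address_bar_text)
    own = own_domain.lower()
    return host == own or host.endswith("." + own)


if __name__ == "__main__":
    for seen in ("login.microsoft.com/oauth", "microsoft1.com", "microsoft.com.jp"):
        print(seen, is_own_domain(seen, "microsoft.com"))
```

Note that the look-alike domains from Section 3.1 fail this check because the comparison is anchored at the end of the host name rather than being a substring search.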
3.3. User Side Design
On the user’s phone, there are generally three ways to receive notifications from the server: through a push message, SMS, or an email link. When the user clicks the notification on the phone, the default browser is launched (if it is not already running), which in turn loads the PhotoAuth web app (from the server) to handle the server challenge. After that, the user only needs to click one button to take a photo, which is then uploaded in the background.
In particular, image pre-processing resizes the raw photo into a smaller, lower-resolution image, and further converts the RGB image into a gray-scale one. By leveraging the computational resources of edge devices, we can reduce the bandwidth overhead and hence increase the scalability and throughput of the entire system without sacrificing accuracy. Details of the compression ratio and bandwidth savings can be found in the evaluation section.
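As a rough illustration of the two pre-processing steps, the following Python sketch computes the downscaled size and converts a single RGB pixel to gray scale. It is illustrative only: the actual pre-processing runs in the web app on the phone, the 1920-pixel target width is taken from the evaluation section, and the BT.601 luma weights are an assumption, since the paper does not state which gray-scale conversion is used. The camera resolutions in the usage lines are hypothetical.

```python
from typing import Tuple


def target_size(width: int, height: int, target_width: int = 1920) -> Tuple[int, int]:
    """Scale (width, height) down to target_width, keeping the aspect ratio."""
    if width <= target_width:
        return width, height  # never upscale a photo that is already small
    return target_width, round(height * target_width / width)


def to_gray(r: int, g: int, b: int) -> int:
    """Convert one RGB pixel to an 8-bit luma value (assumed BT.601 weights)."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


if __name__ == "__main__":
    print(target_size(4032, 3024))  # a hypothetical 4:3 camera frame
    print(target_size(4000, 2250))  # a hypothetical 16:9 camera frame
    print(to_gray(255, 0, 0))
```

Gray-scale conversion keeps one channel instead of three, which is one reason the uploaded file can be much smaller than the raw photo.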
3.4. Server Side Design
After the server receives the preprocessed photo from a user’s phone, it should accurately extract the domain name in the browser address bar. There are two design options here. First (O1), it may extract all the text contents from the entire photo based on Optical Character Recognition (OCR), and then predict which texts are from the correct domain name (that is, located inside the browser address bar). Second (O2), it may first predict the address bar from the entire photo, and then apply OCR to extract the texts from the predicted address bar only.
Which of the two options is better? It depends on two factors. First, O2 sounds more appealing than O1 because it can reduce the workload of OCR by focusing only on a very small region, so that the output of OCR is precisely the domain name. In practice, however, this may not be the case, because the time complexity of OCR is not linear in the area of the region. Indeed, the saved time could be very limited. Second, whatever technique we use, address bar prediction is not going to be perfect. As a result, the bounding box of a predicted address bar may not cover all its texts (e.g., the top or bottom side of the texts may be cut off), causing the OCR to fail to output the correct domain name. Figure 3(a) shows such an example.
On the other hand, O1 has a performance advantage over O2, because parallelization is possible. Specifically, for each received photo, the server can first make a copy, and then perform OCR and address bar prediction in parallel. Their results can be combined to predict the texts in the address bar based on the coverage rate of text bounding boxes by address bar bounding box. For the above reasons, our final choice will be option O1.
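The parallel structure of option O1 can be sketched in Python with a thread pool. The functions `run_ocr` and `detect_address_bars` are hypothetical placeholders that return fixed values; in a real deployment they would call the OCR engine and the detection model. Because Python `bytes` objects are immutable, both tasks can share the photo without the explicit copy mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_ocr(photo: bytes):
    """Placeholder for the OCR engine; returns (text, box) pairs."""
    return [("microsoft.com", (300, 80, 620, 110))]


def detect_address_bars(photo: bytes):
    """Placeholder for the detection model; returns predicted address bar boxes."""
    return [(120, 70, 1500, 120)]


def process_photo(photo: bytes):
    """Run OCR and address bar detection concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        ocr_future = pool.submit(run_ocr, photo)
        bar_future = pool.submit(detect_address_bars, photo)
        # The slower of the two tasks determines the total processing time.
        return ocr_future.result(), bar_future.result()


if __name__ == "__main__":
    texts, bars = process_photo(b"fake-photo-bytes")
    print(texts, bars)
```

Threads are a reasonable fit when both tasks mostly wait on I/O or on a GPU; a process pool would be the alternative if both were CPU-bound Python code.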
In either option, we need to correctly locate the address bar inside a photo, not only for accuracy reasons, but more importantly for security reasons. In particular, we deal with two types of domain name injection attacks. Figure 3(b) shows the photograph of a simple webpage with three fields where a domain name may appear. Figure 3(c) is the corresponding OCR result. The first three lines include all the texts extracted from the photo and the fourth line represents the coordinates of the bounding box for all these texts. The following three lines show the texts and area locations of the title area, address bar area, and web page content area, respectively. Here OCR successfully extracts the texts and outputs their locations. However, it does not tell us which text comes from the address bar. As such, an adversary may try to get around PhotoAuth by injecting the valid domain name either in the title of the webpage or in its content.
To prevent the above injection attack, we need to make sure the correct domain is extracted from the OCR texts inside the predicted address bar area. Let $A_{\mathrm{ocr}}$ and $A_{\mathrm{bar}}$ denote the area of a bounding box from OCR and that of a predicted address bar, respectively. Then we can define the metric cover rate (CR) in the following formula, which basically indicates how much a text bounding box resulting from OCR overlaps with the predicted address bar.
$$\mathrm{CR} = \frac{|A_{\mathrm{ocr}} \cap A_{\mathrm{bar}}|}{|A_{\mathrm{ocr}}|}$$
The user is accepted only if CR is above a threshold value (to be determined in the evaluation section) and the extracted domain name matches the server’s. In all other cases, the user needs to retake a photo and try again.
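A minimal Python sketch of the cover rate follows. It assumes axis-aligned boxes given as corner coordinates and assumes that CR is normalised by the area of the OCR text box, which is our reading of the definition; the `Box` type and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box with (x1, y1) top-left and (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def cover_rate(text_box: Box, bar_box: Box) -> float:
    """Fraction of the OCR text box that lies inside the predicted address bar."""
    overlap = Box(
        max(text_box.x1, bar_box.x1),
        max(text_box.y1, bar_box.y1),
        min(text_box.x2, bar_box.x2),
        min(text_box.y2, bar_box.y2),
    )
    text_area = area(text_box)
    return area(overlap) / text_area if text_area > 0 else 0.0


if __name__ == "__main__":
    bar = Box(100, 50, 1500, 100)       # predicted address bar
    url = Box(300, 60, 600, 90)         # text fully inside the bar
    title = Box(300, 10, 600, 60)       # tab title that only touches the bar
    content = Box(300, 400, 600, 430)   # text in the page body
    for name, box in (("url", url), ("title", title), ("content", content)):
        print(name, cover_rate(box, bar))
```

Normalising by the text box rather than by the address bar means a short URL inside a wide address bar still scores close to 1, while text injected into the title or page content scores close to 0.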
The other injection attack happens when an attacker embeds a fake browser address bar as an image inside the actual browser, the so-called "picture-in-picture" phishing attack (Jackson et al., 2007) (a similar case arises when the user has multiple browsers open). As far as we are aware, no real website embeds a second address bar in its login page, so this is certainly a very suspicious case. Therefore, in our system, whenever the server detects more than one address bar, it rejects the result and warns the user about a possible phishing attack. Meanwhile, it requests the user to retake a photo with only one address bar, which shows the website the user is actually visiting.
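The acceptance rules described in this section can be summarised in a short Python sketch. The `Decision` names and the `decide` function are hypothetical, and the default threshold of 0.8 is the value reported later in the whole system evaluation.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    RETAKE = "retake"                    # deny this attempt and ask for a new photo
    WARN_AND_RETAKE = "warn_and_retake"  # possible picture-in-picture phishing


def decide(num_address_bars: int, best_cover_rate: float,
           domain_matches: bool, cr_threshold: float = 0.8) -> Decision:
    """Combine the detection, cover rate and domain checks into one decision."""
    if num_address_bars > 1:
        # More than one address bar is treated as suspicious by design.
        return Decision.WARN_AND_RETAKE
    if num_address_bars == 0:
        return Decision.RETAKE
    if best_cover_rate > cr_threshold and domain_matches:
        return Decision.ACCEPT
    return Decision.RETAKE


if __name__ == "__main__":
    print(decide(1, 0.99, True))
    print(decide(2, 0.99, True))
    print(decide(1, 0.32, True))
```

Checking the number of address bars first keeps the suspicious multi-bar case from ever reaching the domain comparison.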
Currently, there is no tool for address bar identification, especially for address bars in photographed browser screens. Address bar identification is not a trivial task because users may take photos of different types of browsers at different angles, different distances, and under different illuminations. Simply applying heuristics-based algorithms will not work well. Thus, we choose to train a deep learning based object detection model to predict the address bars.
4. Prototype Implementation
On the server side, we choose Apache 2.4.39 as the web server, MySQL 5.7.26 as the database, and PHP 7.2.18 as the scripting language. The hardware configuration of the server is: CPU (i7-8700K, up to 4.7 GHz), GPU (GTX 1080 Ti, 11 GB GDDR5X), and RAM (32 GB, 3200 MHz); it runs Windows 10. We use a Samsung S10 phone with 6 GB RAM and Android 10.0.
4.2. Website Front-end Implementation
4.3. Back-end Implementation
We use Python to develop the server side back-end logic. Specifically, we use concurrent programming and multi-threading to achieve multi-tasking for better performance. For OCR, we directly use the Google Vision API, mainly as a proof-of-concept. In practice, the server should deploy its own local OCR software (e.g., the open-source PaddleOCR (26)) for better privacy and efficiency. We use "from google.cloud import vision" to upload a photo to the Google Vision server for OCR and get the result.
For address bar detection, our main job here is to train a deep learning model. The question is: what are the important features of an address bar that differentiate it from any other text-filled rectangular object? Our intuition is to leverage the commonly displayed icons, including the backward/forward arrows, reload, and other icons on the left of an address bar. Therefore, for training our model, we need to manually label the address bars together with such surrounding areas as the ground truth. Manual labelling of address bars, however, is a time-consuming process. Therefore, for proof-of-concept purposes, instead of labelling a huge dataset to train an entirely new deep learning model, we adopt transfer learning with a relatively small training set of 15,062 photos.
All parameters and layers of the pre-trained model are kept as is. We use 50 epochs for transfer learning with learning rate 1e-3, batch size 32, and the Adam optimizer. In this stage, we only update the weights in the last three layers, freezing the weights in all the other layers. Then, in the next 50 epochs, we fine-tune the model with learning rate 1e-4, batch size 8, and the Adam optimizer, while unfreezing all the layers to update all weights.
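The two-stage schedule above can be written down as a small, framework-agnostic Python sketch. The `Stage` type and the `trainable_mask` helper are hypothetical, and the layer count used in the usage example is an arbitrary placeholder, since the text does not state how many layers the pre-trained model has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    epochs: int
    learning_rate: float
    batch_size: int
    optimizer: str
    trainable_last_n: Optional[int]  # None means every layer is trainable


# Hyperparameters as stated in the text; the structure itself is illustrative.
SCHEDULE = (
    Stage(epochs=50, learning_rate=1e-3, batch_size=32, optimizer="adam", trainable_last_n=3),
    Stage(epochs=50, learning_rate=1e-4, batch_size=8, optimizer="adam", trainable_last_n=None),
)


def trainable_mask(num_layers: int, stage: Stage) -> List[bool]:
    """Return one flag per layer: True if its weights are updated in this stage."""
    if stage.trainable_last_n is None:
        return [True] * num_layers
    frozen = max(0, num_layers - stage.trainable_last_n)
    return [False] * frozen + [True] * (num_layers - frozen)


if __name__ == "__main__":
    for stage in SCHEDULE:
        print(stage, trainable_mask(10, stage))  # 10 layers is a placeholder count
```

In a concrete framework, the mask would be applied by toggling each layer's trainable flag before compiling or building the optimizer for that stage.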
As address bar detection is one type of object detection, we also use the Intersection over Union (IoU) metric to determine whether an address bar is detected correctly. IoU reflects how much the ground truth area and the predicted area overlap. During the training process (and also the testing process), the IoU threshold is an input parameter. If one sets the IoU threshold too high (for example, above 0.8), a prediction is considered correct only when the two areas fit each other very well; thus, the precision of the model will be very low. On the other hand, if the IoU threshold is too low (e.g., 0.2), then the predicted area might be too large (and even cover some title areas). Therefore, in our model training, we use the default threshold of 0.5, as used by other object detection models.
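For reference, IoU for two axis-aligned boxes can be computed as in the following Python sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with the origin at the top-left corner; this representation is an assumption for illustration.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = (0, 0, 10, 10)
    predicted = (5, 0, 15, 10)
    print(iou(truth, predicted))
```

Unlike the cover rate, IoU is symmetric: it penalises a predicted box that is too large as much as one that is too small.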
In PhotoAuth, authentication accuracy and scalability are very important performance metrics. Next, we evaluate bandwidth use, user non-action waiting time, PhotoAuth’s OCR accuracy, address bar detection accuracy, and whole system accuracy.
5.1. Dataset Composition
To train our transfer-learning based address bar detection algorithm, we took a total of 16,454 photos. Our data set is very diverse. First, there are two classes of devices initiating login requests, "desktop" and "laptop". "Desktop" includes monitors of different aspect ratios (e.g., 16:9, 21:9). "Laptop" includes different screen sizes (e.g., 14 inch, 15 inch). Second, because different browsers have different fonts and different styles of address bars, our data set covers many popular browsers, including Chrome, Firefox, Edge, IE, and Opera. Third, there are numerous browser skins available (e.g., the Chrome web store provides many themes), so it is too complicated to enumerate all browser skins. We chose two different Windows color modes, the "light" and "dark" themes. These themes change browser skins accordingly. Note that the PhotoAuth web app converts the color photos into gray-scale ones before uploading them to the server for OCR. In gray scale, a personalized skin differs little from the light theme or the dark theme.
Fourth, the tilting angles and turning angles from the perfect shot angles are within the range and the shot distances are between 30 and 50 cm. Note that if we requested users to take photos at a very close distance (e.g., 10 cm) and cover only the address bar region, the detection accuracy would be close to perfect. In our experiment, however, to make it more challenging, we took the photos from much farther away, and the photos covered a very large portion of the PC screens, if not all of it. Furthermore, we also took photos in different environments and illumination settings. Finally, all photos in our training and testing are resized to a resolution of 1920x1080, which is supported by most cameras today. Therefore, if a user sets his camera to a higher resolution, the detection accuracy will be about the same because of this resizing.
Among the 16,454 photos, we randomly picked 15,062 photos as the training set, 792 photos as the validation set, and 600 as one part of the test set. To test the transferability of our address bar detection model, we then took 248 photos of the 360jisu and Brave browsers as the second part of the test set. Our trained model has not seen these two browsers before. In the end, the test dataset contains 848 browser photos, which cover not only the login page of our own test website but also 57 other popular websites (e.g., Chase, Bank of America). Figure 6 shows the composition of our test dataset.
5.2. Bandwidth Use
To save bandwidth for photo uploading, in our prototype the PhotoAuth web app compresses the photos into small, gray-scale ones. As OCR and object detection algorithms do the same to their input, compression does not introduce errors into our system as long as the resolution of the compressed photos is good enough, while it saves bandwidth.
The question is: what is a good photo resolution? To answer this question, we performed an empirical study by taking photos of monitor screens at different distances. We found that, at a capturing distance of about 1 meter, increasing the picture resolution beyond 1920x1440 (when the camera aspect ratio is 4:3) or 1920x1080 (when the camera aspect ratio is 16:9) made no difference in terms of OCR accuracy and address bar detection accuracy. In practice, people would most likely capture photos of screens at a shorter distance, say within 50 cm, the typical distance between one’s body and the PC. In this case, we found that a resolution of 800x600 is often sufficient. To be more conservative, we recommend and adopt the 1920x1440 or 1920x1080 resolution in our study, depending on the camera aspect ratio. The size of a JPEG file at this resolution is around 160KB in our study.
5.3. User’s Non-action Waiting Time
To end users, the latency of our 2FA system is an important performance metric, because it reflects the usability of our system. As users may spend different amounts of time taking photos, we do not consider the part of the latency due to the user’s action. Instead, we test the user’s non-action waiting time, which is mainly composed of photo uploading time and server side processing time. On the server side, OCR and address bar prediction are carried out in parallel. Here we only count the OCR time, because our tests showed that it was always longer than the address bar prediction time. We ignore the other types of waiting time in the system, because they are negligible compared to the two listed above.
Figure 7 shows the uploading transmission time and overall non-action waiting time in both WiFi and LTE settings. The bandwidth of LTE is typically smaller than that of WiFi, so the transmission time of LTE (in our experiment the median is 1393 ms) is higher than that of WiFi (median 80 ms). Moreover, because many LTE users share the same base station, LTE transmission time fluctuates more than in the WiFi case. The server side photo processing time does not depend on the uploading channel, whether LTE or WiFi. In the WiFi setting, the processing time dominates the overall waiting time, whereas in the LTE setting, the processing time accounts for roughly half of the waiting time.
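The composition of the non-action waiting time can be made concrete with a small Python sketch. The upload medians (80 ms and 1393 ms) and the median detection time (71 ms) are taken from the text; the OCR time of 1200 ms is a hypothetical value chosen only to be consistent with the statement that OCR takes at least 1 second, and `non_action_wait_ms` is a hypothetical helper name.

```python
def non_action_wait_ms(upload_ms: float, ocr_ms: float, detection_ms: float) -> float:
    """Upload time plus the slower of the two parallel server-side tasks."""
    return upload_ms + max(ocr_ms, detection_ms)


if __name__ == "__main__":
    OCR_MS = 1200        # hypothetical; the text only says OCR takes at least 1 second
    DETECTION_MS = 71    # median address bar detection time reported in the text
    print("WiFi:", non_action_wait_ms(80, OCR_MS, DETECTION_MS), "ms")
    print("LTE: ", non_action_wait_ms(1393, OCR_MS, DETECTION_MS), "ms")
```

Because OCR and detection run in parallel, the detection time does not add to the total as long as it stays below the OCR time, which is why only the OCR time is counted above.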
5.4. Domain Name OCR Precision and Recall
PhotoAuth relies on accurate text output from OCR (specifically the texts from the address bar) to decide whether to authorize an authentication request. In this section, we test the precision and recall of OCR under various real-world settings to see how robustly our system extracts domain names. Note that imperfect precision or recall does not mean an adversary can bypass our system. We will elaborate on this point in Section 5.6.
In traditional object detection, an area is called a predicted area if the confidence score of detection is above a threshold. If the IoU (Intersection over Union) of the predicted area and the ground-truth area is above a specific threshold, it is a true positive; otherwise it is a false positive. If no area is predicted, it is a false negative.
Domain name OCR is different from traditional object detection in that it has two stages: detection and recognition. A true positive occurs only when both the right area (in our case, the address bar area) in an image is detected and the domain name (not the entire URL) inside the area is correctly recognized. Otherwise, we may have false negatives (when the area is not detected) or false positives (when the domain names are wrongly recognized).
Figure 7(a) shows the results with the 848 images (including 56 unique domains) in our test set. The recall was 100%. This means that, as long as there was an address bar in the photo, OCR could detect the text area accurately. The OCR generated 35 false positives. There were two types of false positives. First, the dot ‘.’ in the domain names was too small to be recognized. Second, certain letters were mis-recognized, e.g., in one case ‘o’ was recognized as ‘a’ and in another case ‘o’ as ‘e’. The overall precision was 95.87%.
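The reported figures can be cross-checked with a few lines of Python. The true positive count is not stated explicitly; the sketch infers it under the assumption that each of the 848 test images yields exactly one outcome.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported domain names that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of address bar texts that were detected at all."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    images = 848
    false_positives = 35
    false_negatives = 0  # implied by the 100% recall
    # Inferred, assuming one outcome per test image.
    true_positives = images - false_positives - false_negatives
    print(f"precision = {precision(true_positives, false_positives):.2%}")
    print(f"recall    = {recall(true_positives, false_negatives):.2%}")
```

With 813 inferred true positives, the computed precision rounds to the 95.87% reported above.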
Figure 7(b) shows a failing example of OCR with the dark color mode of a browser. This is because the browser automatically made the ‘‘www.’’ part of a hostname (e.g., ‘‘www.google.com’’) darker, causing the OCR to miss ‘.’. Note that even in this challenging setting, this type of error only happened occasionally.
Note that the precision of domain name OCR can be greatly improved with better quality photos. We believe that in practice, once users understand that the 2FA is based on the address bar content, they will naturally take photos at smaller distances while focusing on the address bar instead of the entire PC screen. We did an additional test with the domain names of the Alexa top 50 websites (2). We randomly changed 11 (out of 37) ‘o’s into ‘0’s and 5 (out of 17) ‘l’s into ‘1’s in the names. This time we took photos at a distance of about 20 cm and focused on the top-left corner. In the end, among the 527 characters in all these domains, there was only a single recognition error: one ‘1’ was recognized as ‘l’. The accuracy can improve further with a smaller distance.
5.5. Address Bar Detection Precision and Recall
With the dataset introduced in Section 5.1, we train our address bar detection model and measure its performance. Figure 8(a) shows that the precision and recall of address bar detection for known browsers (i.e., those covered in the training set) are 98.22% and 93.56%, respectively. The precision and recall of address bar detection for unknown browsers are 93.81% and 83.83%, respectively, which look reasonably good. This relatively lower accuracy is not surprising because the address bar features of Brave and 360jisu are different from those of the other five browsers. For wider deployment of our system, we believe a better approach is to train a model with additional types of browsers, for example, the top 10 browsers.
Figure 8(b) shows an example output. The green bounding box is the labelled ground truth address bar area, and the blue one is the address bar area predicted by our deep learning model. One may notice that the ground truth area covers not only the address bar, but also commonly displayed icons on the left of the address bar as they provide the important features for address bar detection.
Because this is an object detection task rather than an object classification task, our model either outputs a predicted address bar or outputs nothing. It does not know the ground truth area, although in our evaluation we manually labelled the ground truth areas to measure the detection accuracy. When no address bar is predicted in our testing, it is clearly a false negative. When an address bar is predicted, it can be either a real one (IOU score above 0.5) or a false one (IOU score below 0.5). In the example in Figure 8(b), the actual IOU is 0.77, which is above the threshold, so it is a true positive case. If the actual IOU is under 0.5, which means the predicted area differs substantially from the ground truth area, it is counted as a false positive case with respect to object detection.
In practice, both false positive and false negative errors could prevent the server from extracting domain names correctly, and hence photo retakes would be necessary. Also, in practice, our system may request users to take a photo that focuses only on the top-left corner of the browser at a closer distance. In this case, the address bar would be much easier to detect.
Based on the test dataset, we found that the median address bar detection time for one photo is 71 ms (maximum 88 ms). Address bar detection therefore does not slow down the whole system when run in parallel with OCR, because OCR takes at least 1 second to return its result.
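The counting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the box format (x_min, y_min, x_max, y_max), the function names, and the one-outcome-per-photo aggregation are hypothetical choices made for this sketch and are not details of the evaluated prototype. Only the 0.5 IOU threshold comes from the text.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify(pred: Optional[Box], truth: Box, threshold: float = 0.5) -> str:
    """Label one photo: no prediction is FN, IOU above threshold is TP, else FP."""
    if pred is None:
        return "FN"
    return "TP" if iou(pred, truth) > threshold else "FP"


def precision_recall(outcomes: List[str]) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn = (outcomes.count(k) for k in ("TP", "FP", "FN"))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, a prediction shifted by half its width relative to the ground truth has an IOU of 1/3 and would therefore be counted as a false positive under this rule.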
5.6. Whole System Evaluation
In the whole system evaluation, we combine the detection results of OCR and address bar detection and report the final results. Figure 8(b) shows an example with five bounding boxes for OCR texts, all drawn as red rectangles: two for the texts in top titles, one for the URL in the address bar, and two others for texts in page content. Here we do not show the areas for texts extracted from the web page. The blue rectangle is the predicted address bar. For each red rectangle above, we calculated the cover rate (CR) (defined in Section 3.4), and obtained 0.32 for the top-right title, 0.99 for the URL in the address bar, and 0 for the rest. After analyzing all the data, we found that a CR threshold of 0.8 can distinguish texts in webpage content/title from texts in the address bar very well.
Finally, we measure the error of our entire system with a metric named retake rate. Regardless of the cause of an error, as long as the server failed to recognize the domain name correctly, in our measurement it was counted as a retake case, where the user is requested to take a photo again. We used the 600 photos in our test set to measure the retake rate with the CR threshold set to 0.8. The retake rate was 6.83%. It is relatively high mainly because of the low quality of the photos (Figure 10 shows an example), which introduced errors into OCR and address bar detection. In practice, a user can easily fix the problem by taking a photo at a closer distance, at the right angle, and focusing on the top-left corner. In our testing, with the CR threshold of 0.8, there was not a single case where a title or any text inside a webpage was misidentified as a domain name. Only the address bar areas were detected, which shows that the system worked as expected.
We also conducted a preliminary user study of a demo version of PhotoAuth with 33 participants (IRB approved). The participants showed a positive attitude toward the usability of PhotoAuth (e.g., over 50% considered it as convenient as the 2FA they had used before). Due to the page limit, we present the details of our user study in the appendix.
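The filtering step above can be illustrated with a short Python sketch. CR is defined in Section 3.4; the sketch assumes that it is the fraction of an OCR text box's area that falls inside the predicted address bar box, which is consistent with the values reported above but is an assumption here. The function names and all coordinates are hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def cover_rate(text_box: Box, bar_box: Box) -> float:
    """Fraction of the text box area inside the address bar box (assumed definition)."""
    w = min(text_box[2], bar_box[2]) - max(text_box[0], bar_box[0])
    h = min(text_box[3], bar_box[3]) - max(text_box[1], bar_box[1])
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def texts_in_address_bar(ocr_results: List[Tuple[str, Box]],
                         bar_box: Box,
                         threshold: float = 0.8) -> List[str]:
    """Keep only the OCR texts whose cover rate reaches the threshold."""
    return [text for text, box in ocr_results
            if cover_rate(box, bar_box) >= threshold]


# Hypothetical detections for one photo.
bar = (100.0, 50.0, 900.0, 90.0)
ocr = [
    ("example.com/login", (120.0, 55.0, 400.0, 85.0)),   # inside the bar
    ("Example - Sign in", (700.0, 20.0, 1000.0, 60.0)),  # title, partial overlap
    ("Welcome back", (100.0, 300.0, 500.0, 340.0)),      # page content
]
print(texts_in_address_bar(ocr, bar))
```

With these made-up boxes, only the text lying inside the predicted address bar survives the 0.8 threshold; the partially overlapping title and the page content are discarded.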
Evasion Attacks: An attacker may attempt to evade our system in different ways. First, as PhotoAuth relies on OCR to extract correct domain names, an attacker may register visually similar domain names (e.g., g0og1e.com, which replaces ‘o’ with ‘0’ and ‘l’ with ‘1’; such names are also called typosquatting domain names). This is one type of homograph attack (17). In Section 5.4, we have already shown that, with better-taken photos, the chance for this attack to succeed is very low (1 error out of 527 characters in the Alexa top 50 domain names). As OCR techniques advance, such errors will be further reduced. Moreover, even if the attacker has successfully tricked a user into trusting his website (e.g., through phishing emails), he has no control over how the user takes photos; therefore, an OCR error will rarely result in a valid domain name, let alone turn exactly into the target domain name. In addition, the server may configure its web app to output a photo with a resolution higher than 1920x1080, as a tradeoff between communication time and OCR accuracy.
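To make the digit-for-letter substitution concrete, the following Python sketch compares an OCR-extracted domain with the expected one. It is not part of the PhotoAuth design described in this paper, which relies on exact matching of the recognized domain name; the extra ‘‘lookalike’’ label, the confusable table, and the function names are hypothetical additions for illustration only.

```python
# Hypothetical confusable pairs, limited to the two substitutions in the text.
CONFUSABLE = str.maketrans({"0": "o", "1": "l"})


def skeleton(domain: str) -> str:
    """Map confusable digits to the letters they imitate."""
    return domain.lower().translate(CONFUSABLE)


def check_domain(ocr_domain: str, expected: str) -> str:
    """Return 'match', 'lookalike', or 'mismatch' for an OCR-extracted domain."""
    got, want = ocr_domain.lower(), expected.lower()
    if got == want:
        return "match"
    if skeleton(got) == skeleton(want):
        # Differs only by 0/o or 1/l: either a typosquatting domain
        # or an OCR confusion; in both cases a retake is warranted.
        return "lookalike"
    return "mismatch"


print(check_domain("g0og1e.com", "google.com"))
```

Under exact matching, both ‘‘lookalike’’ and ‘‘mismatch’’ lead to the same outcome (the login is not approved); the distinction only helps the server choose a more specific warning.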
Another type of homograph attack (Gabrilovich and Gontmakher, 2002; 17) exploits Unicode for better success. For example, "apple.com" is different from "аpple.com". Even though they look the same, the letter "a" in the former is the ASCII letter (U+0061), whereas the letter "а" in the latter is the Cyrillic letter (U+0430). Not only is it hard for a human user to recognize such Unicode letters, but state-of-the-art OCR tools cannot recognize them correctly either. Fortunately, all major browsers only allow ASCII letters in the address bar, and they automatically convert names containing such Unicode letters into an ASCII form (Punycode). For example, Figure 11 shows that ‘‘аpple.com’’ is converted to ‘‘http://xn--pple-43d.com’’ in the address bar. For this reason, such homograph attacks will not succeed.
The second type of evasion attack, which is specific to our system, is the domain name injection attack with multiple address bars. As mentioned in Section 3.4, by design our system addresses this attack by requesting users to retake a photo containing only one address bar when the server detects more than one address bar in the photo.
The third type of evasion attack is a potential redirection attack against the workflow of our system. Specifically, after the user inputs his login credentials (Step 5 in Figure 3), an attacker may try to redirect the user to microsoft.com, so that the user with PhotoAuth will take the picture on the genuine domain. However, this attack will not work. Recall that in Step 3, the real server (here ‘‘microsoft.com’’) sends a web cookie to the attacker. If the attacker replays (forwards) this cookie to the user’s browser (Step 4), the browser will store it for the fake web server because it was received from the fake server instead of the real one. Based on the same origin policy (SOP), the user’s browser will not send this web cookie to the real web server if the user is instead redirected to the real web page that displays the same login user interface. As a result, the real server will not receive a (valid) cookie from the user’s browser and hence will deny the login.
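The Punycode conversion mentioned above can be reproduced with Python's built-in `idna` codec. This is a minimal sketch for illustration; the helper name `to_ascii_hostname` is hypothetical, and browsers use their own IDN handling rather than this codec.

```python
def to_ascii_hostname(hostname: str) -> str:
    """Convert a hostname to its ASCII (Punycode) form using the stdlib codec."""
    return hostname.encode("idna").decode("ascii")


genuine = "apple.com"        # starts with ASCII 'a' (U+0061)
spoofed = "\u0430pple.com"   # starts with Cyrillic 'а' (U+0430)

print(genuine == spoofed)            # the two strings differ
print(to_ascii_hostname(genuine))    # an all-ASCII name is left unchanged
print(to_ascii_hostname(spoofed))    # the spoofed name gets an xn-- label
```

Once the spoofed name is shown in its ASCII form, it no longer resembles the genuine domain, so both the user and the OCR engine see a clearly different string.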
When the web server determines that the detected domain name does not match the real one, there is a small chance that this is caused by a detection error. Since the error rate is very low in our system, we may set a threshold (e.g., up to five) for the maximum number of retakes, while warning the user about potential phishing attacks and asking them to check the correctness of the domain name. Two factors may cause a legitimate user to fail the 2FA process: poor photo quality (e.g., not focused, large angle, too dark) and a poor network connection with the server. In either case, the user needs to redo the 2FA by taking a better photo or moving to a location with a stronger wireless signal. The server can give the user some feedback on the cause, e.g., poor quality or timeout. For example, based on frequency information obtained with the Fast Fourier Transform (FFT), it is easy to automatically detect whether an image is blurry (6).
Finally, just like in many other 2FA systems, a user should register alternative authentication mechanisms (e.g., SMS one-time passwords), although they may provide weaker security guarantees. Certainly, caution has to be taken before allowing the system to fall back into a weaker mode. The user will need to be alerted about possible phishing attacks and check the running environment.
Possible Limitations: Users who have never used any 2FA system before, or who do not even know what a browser address bar is, will first need to spend a few minutes to understand the workflow of PhotoAuth (e.g., by watching a short tutorial video, as provided in our user study) before logging into a PhotoAuth-enabled server. Otherwise, they may fall victim to various phishing attacks made possible through social engineering or other types of human error. Clearly, protecting all users, both tech-savvy and non-tech-savvy, from such online phishing requires a joint effort from multiple parties (e.g., web servers, web browsers, ISPs).
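The retake policy described above could look roughly as follows on the server side. This is a sketch under assumptions: the function name `run_2fa`, the message texts, and the representation of a failed extraction as `None` are hypothetical, and the limit of five retakes is simply the example value mentioned in the text.

```python
from typing import Iterable, List, Optional, Tuple


def run_2fa(expected: str,
            photos: Iterable[Optional[str]],
            max_retakes: int = 5) -> Tuple[str, List[str]]:
    """Process the domain extracted from each photo (None if extraction failed).

    The first photo plus at most `max_retakes` retakes are considered.
    """
    messages: List[str] = []
    for attempt, detected in enumerate(photos):
        if attempt > max_retakes:
            break
        if detected == expected:
            return "accepted", messages
        if detected is None:
            # Poor photo quality or a timeout: tell the user why a retake is needed.
            messages.append("Photo unreadable (poor quality or timeout); please retake.")
        else:
            # A readable but wrong domain: warn about possible phishing.
            messages.append(
                f"Address bar shows '{detected}', not '{expected}': "
                "possible phishing. Check the address bar and retake."
            )
    return "rejected", messages


status, notes = run_2fa("example.com", [None, "example.com"])
print(status, len(notes))
```

A legitimate user with one blurry photo is accepted on the retake, whereas a session that keeps showing a wrong domain is rejected once the retake budget is exhausted.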
7. Related Work
7.1. Industrial Solutions
Google 2-Step Verification (15) is a phone application that generates a time-based one-time code for the user every 30 seconds. No network connection is required between the app and the server. The mechanism requires the user to manually enter the one-time password into the browser. Duo Push (11) is also a phone application; it receives a push notification when the user sends a request from the browser login page, and the user taps a button to respond. Unfortunately, both are vulnerable to RTP attacks. In the second-factor authentication phase, the adversary can deceive the user into passing on his one-time password or pressing the ‘‘Approve’’ button in the above two applications. The user may think the 2FA authenticates himself to the website, but in fact it authenticates the adversary.
Recently, Google released a new software-based 2FA tool leveraging the phone’s built-in security key (16). It requires a pre-installed phone app to generate the key, special built-in browser support, and Bluetooth (or NFC) to establish a secure channel between the computer and the phone. Such requirements could restrict its usability. In 2017 the FIDO Alliance proposed a Universal 2nd Factor (U2F) protocol (34), where end users carry a single U2F device which works with any relying party supporting the protocol. Later, the FIDO Alliance proposed FIDO 2 (13), by integrating its Client-to-Authenticator Protocol (CTAP) with W3C’s Web Authentication (WebAuthn). Users may log into internet accounts using their preferred devices. Web services and apps can turn on this functionality via biometrics, mobile devices and/or FIDO security keys. U2F/FIDO2, based on public-key cryptography, can counter RTP attacks very well.
While these industrial solutions look promising for countering RTP attacks, they may take a long time to be widely deployed because of factors such as cost and usability issues (32). For instance, U2F devices are not free, commonly ranging from 20 to 60 dollars, which is a non-trivial cost overhead for either end users or companies. Other options may require pairing between phone and PC, or Bluetooth or NFC. The concepts and procedures for deploying U2F/FIDO2 could still look complicated to some non-tech-savvy users because of the needed registration, installation, or configuration. Therefore, alternative secure and user-friendly solutions are still much needed.
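To see why such codes can be relayed, consider a time-based one-time password in the style of RFC 6238 (TOTP). The sketch below assumes HMAC-SHA1, a 30-second step, and 6 digits, which are common defaults; the paper does not state the exact algorithm used by the application, and the function name `totp` and the shared secret are illustrative.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238-style code: HMAC-SHA1 over the 30-second time counter."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


secret = b"12345678901234567890"  # test secret from the RFC 6238 appendix
# The code depends only on the shared secret and the current time window.
print(totp(secret, 30) == totp(secret, 59))  # same window, same code
```

Nothing in the computation refers to the website the user is looking at, so a code typed into a fake page remains valid at the real server for the rest of the time window, which is exactly what an RTP proxy exploits.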
7.2. Academic 2FA Solutions
Dhamija and Tygar (2005b, a) proposed Dynamic Security Skins (DSS), which allow the server to prove its identity based on a visual hash generated by the browser and the server. DSS has two major weaknesses: it relies on users to determine genuineness, and it cannot prevent RTP attacks. Shirvanian et al. (2014) proposed a 2FA system based on mix-bandwidth devices. Even though the system can improve the usability of 2FA, it cannot be widely deployed because most devices do not meet its requirements. Czeskis et al. (2012) proposed a 2FA system named PhoneAuth. Its overall protocol shares the same spirit as the U2F protocol, except that it involves a smartphone instead of a USB dongle. Parno et al. (2006) proposed a system that establishes a secure bookmark on the phone side to control the authentication. Azimpourkivi et al. (2017) introduced a camera-based 2FA system called Pixie, which establishes trust between a user and his web server based on both the knowledge and possession of an arbitrary physical object (a trinket). However, the lack of a binding between the trinket and the website the user is visiting leaves the system vulnerable to RTP attacks.
Karapanos et al. (2015) proposed a 2FA system based on ambient sound. Basically, both the user’s browser and the user’s phone record the ambient sound at the same time. If the sound signals differ substantially, it is likely an attack. The system can handle RTP attacks with the support of Bluetooth and microphone recording. Ulqinaku et al. (2019) proposed 2FA-PP, which leverages the Web Bluetooth API to create a secure Bluetooth connection between a website and the user’s smartphone that runs a special mobile app. To defeat phishing attacks, it leverages network latency measurements to tell whether the user is connected to the legitimate server or to the attacker’s site. The system has high accuracy when the attackers are not located within the same region as the victim.
8. Conclusion
In this paper, we proposed PhotoAuth, a 2FA system to defend against real-time phishing (RTP) attacks. In PhotoAuth, a user takes a photo of the PC browser that includes the address bar area and uploads the photo to the server. The server automatically extracts the domain name information from the address bar and detects fake domain names. PhotoAuth is easy to use and is also compatible with traditional 2FA systems to support most legacy devices. It does not require special hardware (except the user’s phone). We prototyped the system and tested it in various environment settings and with multiple types of browsers. The results showed that PhotoAuth is able to effectively prevent and detect attacks.
References
- (1) PhishAri: automatic realtime phishing detection on twitter. In 2012 eCrime Researchers Summit. Cited by: §1.
- (2) Alexa top 50 sites (freely accessible list). Note: https://www.alexa.com/topsites Cited by: §5.4.
- (3) Camera based two factor authentication through mobile and wearable devices. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1 (3). Cited by: §7.2.
- (4) Browser hijacking. Note: https://en.wikipedia.org/wiki/Browser_hijacking Cited by: §2.2.
- (5) Strengthening user authentication through opportunistic cryptographic identity assertions. In Proceedings of the ACM CCS. Cited by: §1, §7.2.
- (6) Detect if an image is blurry. Note: https://stackoverflow.com/questions/7765810/is-there-a-way-to-detect-if-an-image-is-blurry Cited by: §6.
- (7) Phish and hips: human interactive proofs to detect phishing attacks. In International Workshop on Human Interactive Proofs. Cited by: §1, §7.2.
- (8) The battle against phishing: dynamic security skins. In Proceedings of the 2005 symposium on Usable privacy and security. Cited by: §1, §7.2.
- (9) DNS spoofing. Note: https://en.wikipedia.org/wiki/DNS_spoofing Cited by: §2.2.
- (10) Domain hijacking. Note: https://en.wikipedia.org/wiki/Domain_hijacking Cited by: §2.2.
- (11) Duo push. Note: https://duo.com/product/trusted-users/two-factor-authentication Cited by: §1, §7.1.
- (12) Evilginx. Note: https://github.com/kgretzky/evilginx2/ Cited by: §1, 1st item.
- (13) FIDO 2. Note: https://fidoalliance.org/fido2/ Cited by: §3.1, §7.1.
- (14) The homograph attack. Commun. ACM 45 (2). Cited by: §6.
- (15) Google 2-step verification. Note: https://www.google.com/landing/2step/ Cited by: §7.1.
- (16) Google phone’s built-in security key. Note: https://support.google.com/accounts/answer/9289445 Cited by: §7.1.
- (17) IDN homograph attack. Note: https://en.wikipedia.org/wiki/IDN_homograph_attack Cited by: §6, §6.
- (18) An evaluation of extended validation and picture-in-picture phishing attacks. In Financial Cryptography and Data Security, 11th International Conference, FC 2007, and 1st International Workshop on Usable Security, USEC, 2007, S. Dietrich and R. Dhamija (Eds.), Lecture Notes in Computer Science, Vol. 4886. Cited by: §3.4.
- (19) Sound-proof: usable two-factor authentication based on ambient sound. In 24th USENIX Security Symposium (USENIX Security 15). Cited by: §1, §7.2.
- (20) Hindsight: understanding the evolution of UI vulnerabilities in mobile browsers. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017. Cited by: §9.2.
- (21) Man-in-the-middle attack. Note: https://en.wikipedia.org/wiki/Man-in-the-middle_attack Cited by: §2.2.
- (22) Visual-similarity-based phishing detection. In Proceedings of the 4th international conference on Security and privacy in communication netowrks. Cited by: §1.
- (23) Next generation of phishing 2fa tokens. Note: https://breakdev.org/evilginx-2-next-generation-of-phishing-2fa-tokens/ Cited by: §3.1.
- (24) IPhish: phishing vulnerabilities on consumer electronics. In Usability, Psychology, and Security, UPSEC, Proceedings. Cited by: §9.2.
- (25) Open texting online. Note: https://www.opentextingonline.com/ Cited by: §4.2.
- (26) Paddle ocr. Note: https://github.com/PaddlePaddle/PaddleOCR Accessed June 22, 2021. Cited by: §4.3.
- (27) Phoolproof phishing prevention. In International Conference on Financial Cryptography and Data Security. Cited by: §7.2.
- (28) You only look once: unified, real-time object detection. Cited by: §4.3.
- (29) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §4.3.
- (30) A layout-similarity-based approach for detecting phishing pages. In SecureComm. Cited by: §1.
- (31) Two-factor authentication resilient to server compromise using mix-bandwidth devices. In NDSS. Cited by: §1, §7.2.
- (32) The authentication solution government has been slow to adopt. Note: https://www.fifthdomain.com/civilian/2018/02/01/the-authentication-solution-government-has-been-slow-to-adopt/ Cited by: §7.1.
- (33) U2F. Note: https://duo.com/blog/the-2019-state-of-the-auth-report-has-2fa-hit-mainstream-yet Cited by: §1.
- (34) U2F. Note: https://www.yubico.com/services-with-yubikey/fido-u2f/ Cited by: §1, §7.1.
- (35) 2FA-pp: 2nd factor phishing prevention. In Proceedings of the ACM WiSec. Cited by: §7.2.
- (36) YoLo v3 weights. Note: https://pjreddie.com/media/files/yolov3.weights Cited by: §4.3.
9.1. A Preliminary User Study
A comprehensive user study is impractical at this stage of our research for various reasons. Ideally, we would set up a working system for each participant to gain first-hand experience. However, in this case a user’s opinion about our system may be greatly influenced by the details of our implementation, such as the performance of the system at the specific time of use, the design of the web GUI, etc. Indeed, in our prototype we use a free online service for delivering SMS messages to mobile phones. Such a service is not stable (in one case it took five minutes to deliver a message), and occasionally SMS messages may even be blocked by cellular service providers. Too many differences between our prototype system and a real-world commercial system that would deploy PhotoAuth would prevent us from obtaining meaningful user feedback.
As such, at this stage, we only conducted a preliminary user study regarding the overall workflow of the system. We made a short live tutorial video (around 1 minute 30 seconds) to demonstrate the 2FA process step by step. It starts with the username/password login page, followed by several UIs on both the PC and the phone (including those UIs in Figure 5). Each UI in the video has a one- or two-sentence subtitle (instead of voice) to explain the high-level idea or procedure. As a result, users do not learn exactly how our technique works from this video tutorial. We chose to do so because our study cares more about the overall usability of the system for ordinary users. After that, a user is asked to answer a questionnaire with three simple questions. We obtained IRB approval from our university for this anonymous user study, which does not ask for or collect any personal information from participants.
We tried to recruit participants with various technical backgrounds from our personal social networks. As the survey was anonymous and participation was entirely voluntary, we do not know who submitted the survey and do not know the demographic information of the participants. In the end, we collected 33 survey responses.
Figure 11(a) shows that most participants have used a 2FA system before, with SMS and Duo Mobile being the most popular, so they have a basic understanding of 2FA. Figure 11(b) shows that slightly over 50% of users considered PhotoAuth ‘‘as convenient as the 2FA system I have used before’’, 24.2% of users rated it ‘‘a little bit more difficult than the 2FA system I have used before but intuitive to use’’, and 21.2% of users rated it ‘‘a little barrier at the first time, but it becomes intuitive to use afterward’’. Most users held a positive attitude regarding the usability of PhotoAuth.
Clearly, this is only a preliminary user study. In our future work, we will consider designing and conducting a more comprehensive user study.
9.2. Alternative Implementations
An alternative idea is to apply PhotoAuth in a similar way by requesting a user to take a screenshot of his mobile browser (instead of taking a photograph of his PC browser). Since screenshots guarantee picture quality regardless of factors such as lighting and angle, this seems to be a good idea. However, there are two drawbacks. First, mobile browsers do not have APIs to support screenshotting. Therefore, the web app will not be able to take a screenshot with a button in the web page. While users can manually take a screenshot on their phone at any time by pressing certain hardware buttons in combination, there is no convenient way to pass the screenshot to the web app without additional steps. Second, due to the small screen size, the mobile browser may not be able to display the full domain name, making the system vulnerable to URL-truncation-based phishing attacks (Niu et al., 2008; Luo et al., 2017) that use a very long sub-domain name to match the full domain name of a victim site. For these reasons, we recommend our web-cookie-based design for mobile logins, which is both simpler and more secure.
More Secure Implementation Choices: In our system, we assume the phone link between the user’s smartphone and the website is secure from interception (e.g., man-in-the-middle attacks) or interruption attacks by the adversary, whether it is through SMS, an email link, or a push message. In our implementation, we used SMS as the channel to pass the notification messages from the server to the client phone. One advantage is that it can be directly upgraded from the existing common SMS-based 2FA methods. Another advantage is that it is platform independent. That is, no matter which OS the client phone is using, either Android or iOS, our web-based PhotoAuth runs the same. The implementation would be almost the same if we used emails to pass the link.
An alternative is to use push messages. In this case, the server provides its own mobile app (TLS/SSL protected) for users to install, and each user registers an account and logs into it before receiving push messages. While this method may provide better protection of the phone link than SMS and email do, it is platform dependent. That is, the server needs to provide different mobile apps for different phone operating systems. Moreover, as one of our design goals is to offer high usability, which means not requiring users to download and install any additional software on their phones, we did not choose this alternative in our prototype. However, for applications where such a usability issue is not a concern, our system can be easily adapted to use push messages. The main difference is that the mobile app will receive the notification messages instead.