Soccer is among the most popular sports in the world and attracts a very large audience. Many studies aim to support the sport and to meet the needs of soccer clubs and media. This research mainly focuses on estimating team tactics, tracking players or the ball on the field, detecting events that occur in a match [4, 5, 6, 7, 8], summarizing soccer matches [9, 10, 11], and estimating ball possession statistics. These studies are carried out using various methods and techniques, including machine learning (ML).
Artificial intelligence (AI), and more specifically ML, can support this research and yield better results. ML can be implemented in a variety of ways, including deep learning (DL). Event detection in soccer matches is an active research field, and various AI methods are now used for it [4, 7]. The use of AI in this area can help achieve higher detection accuracy.
Detecting the events of a soccer match has several applications. For instance, event detection can be used to obtain match statistics. Counting the number of free kicks, fouls, tackles, etc. in a soccer game can be done manually, but manual counting is costly, time consuming, and prone to errors. With intelligent systems based on event detection, these statistics can be calculated automatically. Another application of event detection is the summarization of a soccer match. The summary of a soccer match consists of the important events that took place during the match. To prepare a useful summary, the events must be identified correctly, so a method that identifies events with high accuracy can improve the quality of the summarization.
The purpose of the current study is to detect events in soccer matches. To this end, various deep learning architectures have been developed. Deep learning methods always require datasets for training. Therefore, a rich visual dataset is first collected for our task, covering the events penalty kick, corner kick, free kick, tackle, to substitute, and yellow and red cards, together with images of the sides and center of the field. This image dataset is then used to train the proposed networks. Event detection faces several challenges, including the similarity of some images, such as those of yellow and red cards, which makes the two groups difficult to separate. As a result, a classifier has trouble distinguishing between yellow and red cards and produces incorrect detections.
Another problem in detecting events in a soccer match is the issue of no highlights. Not all events happening in a soccer match can be covered by the specific event classes given to the network for training, so events that the network has never seen may occur in a match. In such cases, the network must be able to handle these images or videos without mistakenly assigning them to one of the defined events. We address the detection of no highlight images by using a VAE, setting a threshold, and using additional classes in the image classification module.
The proposed method for detecting soccer match events uses two convolutional neural networks (CNNs) and a variational autoencoder (VAE). The current study focuses on resolving the problem of no highlight frames, which may be wrongly considered as one of the events, as well as the problem of similarity between red and yellow card frames. Experiments demonstrate that the proposed method improves the accuracy of image classification from 88.93% to 93.21%. The main reason for this increase is the use of a new fine-grain classification network to classify yellow and red cards. The proposed method also detects no highlight frames with high precision. All datasets and implementations are publicly available at https://github.com/FootballAnalysis/footballanalysis.
The rest of the paper is organized as follows. Section II provides a literature review and examines the drawbacks of existing works in this area. Section III describes the proposed algorithm, its structure, and how it works. Section IV introduces the datasets collected for this study. Section V presents the experimental results and compares them with those of other papers. Finally, Section VI concludes the paper.
II Related Work
The research of Duan et al. is one of the first works in this area and implements supervised learning for top-down video shot classification. Another work presents a method for detecting soccer events using a Bayesian network. These early methods suffered from low accuracy until DL-based methods were proposed. Agyeman et al. present a method for summarizing a soccer match based on event detection that combines a convolutional network with an LSTM network and considers five events: corner kicks, free kicks, goal scenes, centerline, and throw-in. This study uses the 3D-ResNet34 architecture in the convolutional network structure. One limitation of this work is that the number of events is small and no highlights are not taken into account. Jiang et al. first perform feature extraction using a convolutional network and then perform event detection by combining it with an RNN model. This method is limited to four events: goal, goal attempt, corner, and card. Sigari et al. employ a fuzzy inference system. Their algorithm works based on replay detection, logo detection, view type recognition, and audience excitement. This method is also limited to three events: penalty, corner, and free kick. Another work classifies 11 events, which covers a good number of events; however, it cannot distinguish between red and yellow card events because of the high similarity between them. In general, the methods presented in this field work on image, video [8, 9], or audio signals [18, 19]. Some methods employ two signals, i.e., audio and video, simultaneously [10, 16].
Recently presented methods utilize DL architectures as the main tool for feature extraction. Among the DL architectures proposed for feature extraction, the most relevant to this work is EfficientNet, which is available in eight versions. These versions generally offer higher performance than earlier models [21, 22, 23], while having fewer parameters and occupying less memory.
One of the challenges of event detection in soccer matches that has not been addressed adequately in the literature is the existence of events that look very similar but are in fact distinct. For instance, in images of yellow and red cards, only the color of the card differs and the rest of the image is the same. Although both are card events and may appear almost identical, they affect the course of the game very differently. In the literature, yellow and red card events are treated as a single event, which limits event detection. The reason is the very high similarity between the images of these two events, which makes them very challenging to distinguish. In this paper, fine-grained image classification is used instead of common feature extraction architectures to solve this problem.
Fine-grained image classification is one of the challenges in machine vision; it categorizes images that belong to the same general category but to different subcategories. For example, individual faces, different breeds of dogs, or different species of birds share many structural similarities yet belong to different subcategories. As another example, the California gull and the ring-billed gull are two similar birds that differ only in beak pattern and belong to two separate subclasses. The main difficulty with this type of classification is that the differences are usually subtle and local. Finding the discriminating regions between two subcategories is the central challenge of these methods.
The work of Lin et al. is one of the deep learning based studies in the field of fine-grained image classification. In this model, two neural networks are used simultaneously. The outer product of the outputs of these two networks is then mapped to a bilinear vector. Finally, a softmax layer specifies the class of the image. The accuracy of this method on the CUB-200-2011 dataset is 84.1%, whereas using only a single network of this architecture yields a maximum accuracy of 74.7%. Fu et al. introduce a recurrent attention CNN framework, which receives the image at its original size and passes it through a classification network to extract the probability of each category. At the same time, after the convolutional layers of the classifier, an attention proposal network produces region parameters that are used to zoom in on the image and crop it. The resulting image is then fed into a network in the same way as the original image, and another attention proposal network extracts a further region from it. This method reaches an accuracy of 85.3% on the Birds dataset. Another work proposes a multi-attention convolutional neural network framework with an accuracy of 86.5%. In 2018, Sun et al. presented an attention-based CNN architecture that learns multiple attention region features per image through the one-squeeze multi-excitation (OSME) module and then applies the multi-attention multi-class constraint (MAMC). Thanks to this structure, the method improves on the accuracy of previous methods to some extent. One of the latest methods consists of three steps. In the first step, the instances are detected and instance segmentation is carried out. In the second step, a set of complementary parts is created from the original image. In the third step, a CNN is applied to each resulting image, and the outputs of these CNNs are given to LSTM cells for all images. The accuracy of the best model in this method on the Birds dataset reaches 90.4%. In general, the limitation of the methods presented in this section is their accuracy, which newer methods attempt to improve.
Another issue is that a soccer match includes various scenes that are not specific events, such as scenes in which players are walking or moments when the game is stopped. If such images are given as input to the classification network, the network will mistakenly place them in one of the defined categories. The reason is that the network is trained only to categorize images among events; such a network is called a traditional classification network. In traditional classification, only known classes are used during training, and only images of known classes are expected during testing. Otherwise, the network has trouble detecting the image category, and an image that should not be placed in any category is incorrectly assigned to one of the defined categories. This type of categorization does not suffice for the problem under study because, as explained, the input images may not belong to any of the categories. Thus, we need a network that specifies the category of an input image if it falls into one of the seven categories and otherwise rejects it rather than mistakenly placing it in one of the defined categories. In other words, open set recognition is required to address this problem.
Open set recognition can be implemented using different methods. Cevikalp et al. perform it based on a support vector machine (SVM). Other works are based on deep neural networks. Today, generative models are increasingly used in this area, and they are divided into two categories: instance generation and non-instance generation. Finding a suitable method is still challenging, and the literature on event detection has not addressed this issue in depth.
III Proposed Method
This section describes the proposed method. First, the general procedure is explained. Then the three main parts of the method are introduced: an image classification module for detecting images of the defined events, a fine-grain classification module for classifying yellow and red card images, and a variational autoencoder for detecting no highlights.
III-A The Proposed Algorithm
As depicted in Fig 1, the received video is first split into frames according to its length and frame rate, and each frame is passed separately through a variational autoencoder. If the loss value of the VAE network for the input frame is smaller than a specified threshold, the frame is considered a candidate event frame and is given to the image classification module, which classifies images into nine classes. If the input image belongs to the center circle, right penalty area, or left penalty area category, it is not treated as an event and is classified as no highlight. If it belongs to one of the five events penalty kick, corner kick, tackle, free kick, or to substitute, that event is recorded for the image. If it is a card event, the image is given to the fine-grain classification module, which determines whether the card is yellow or red. Finally, to detect events in a soccer match, every window of 15 consecutive frames (the current frame, the seven frames before it, and the seven frames after it) is examined, and if more than half of the frames belong to one event, those 15 frames (half a second) are tagged as that event. An event cannot be counted more than once within 10 s; if it is repeated within that interval, only one occurrence is counted toward the number of events. In the following, each module is described in detail.
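The temporal voting step described above can be illustrated with a minimal sketch. The function name `vote_events`, the use of `None` for no highlight frames, the default of 30 frames per second, and the choice to measure the 10 s gap from the last counted occurrence are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import Counter

WINDOW_RADIUS = 7        # seven frames before and seven after the current frame
MIN_GAP_SECONDS = 10.0   # the same event is counted at most once per 10 s


def vote_events(frame_labels, fps=30.0):
    """Return (time_in_seconds, event) pairs from per-frame labels.

    frame_labels holds one label per frame; None marks a no highlight frame.
    """
    events = []
    last_time = {}  # event name -> time of its last counted occurrence
    for i in range(WINDOW_RADIUS, len(frame_labels) - WINDOW_RADIUS):
        window = frame_labels[i - WINDOW_RADIUS:i + WINDOW_RADIUS + 1]
        counts = Counter(label for label in window if label is not None)
        if not counts:
            continue
        label, count = counts.most_common(1)[0]
        if count * 2 <= len(window):
            continue  # the event must cover more than half of the 15 frames
        t = i / fps
        if label in last_time and t - last_time[label] < MIN_GAP_SECONDS:
            continue  # repeated within 10 s, so it is not counted again
        last_time[label] = t
        events.append((t, label))
    return events
```

With this sketch, a run of 30 consecutive corner kick frames yields a single counted event, because every later window falls inside the 10 s suppression interval.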
III-B No Highlight Detection Module (Variational Autoencoder)
A soccer match can include various scenes that are not specific events, such as scenes in which the director shows the faces of players on the field or on the bench, or in which players are walking or the game is stopped. These scenes are not categorized as events of a soccer match.
In general, to separate the images of the defined events from the rest of the images, three complementary actions are performed to detect no highlights:
The use of the VAE network to identify whether the input images are similar to the soccer event (SEV) dataset images.
The use of three additional categories, namely left penalty area, right penalty area, and center circle, in the image classification module, given that most free kick images are similar to images from these categories (without these categories, images of the wings of the field would usually receive a high score in the free kick category).
Applying the best threshold to the prediction value of the last layer of the EfficientNetB0 feature extraction network.
The second and third actions are applied in the image classification module and are described later. In the first action, the VAE architecture shown in Fig 2 is employed to identify images that do not fall into any of the event categories. To this end, all images of the soccer training dataset are used to train the VAE network. A threshold is then set on the reconstruction loss: images whose reconstruction loss is higher than the fixed threshold are not considered soccer images, whereas images whose reconstruction loss is lower than the threshold are categorized as soccer game images and given to the image classification module. In other words, the VAE acts as a two-class classifier that separates soccer images from non-soccer images. The reconstruction loss is obtained from the difference between the input image and the reconstructed image. The further an input image lies from the training distribution, the higher its reconstruction loss; an image drawn from the same distribution as the training data yields a lower loss.
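The thresholding rule on the reconstruction loss can be sketched as follows. This is only an illustration: the sum of squared pixel differences is an assumed form of the loss (the text only says it is obtained from the difference between input and reconstruction), images are represented as flat lists of floats, and `reconstruct` is a hypothetical stand-in for a trained VAE.

```python
def reconstruction_loss(image, reconstruction):
    """Sum of squared pixel differences (an assumed form of the loss)."""
    return sum((x - y) ** 2 for x, y in zip(image, reconstruction))


def is_soccer_frame(image, reconstruct, threshold):
    """Accept the frame when its reconstruction loss is below the threshold.

    reconstruct is any callable that maps an image to its reconstruction,
    for example the encode-then-decode pass of a trained VAE.
    """
    return reconstruction_loss(image, reconstruct(image)) < threshold


# Toy stand-in for a VAE that can only reproduce mid-grey images.
def mid_grey(image):
    return [0.5] * len(image)


print(is_soccer_frame([0.5, 0.6, 0.4, 0.5], mid_grey, threshold=0.1))
print(is_soccer_frame([0.0, 1.0, 0.0, 1.0], mid_grey, threshold=0.1))
```

The first toy image is close to what the stand-in can reconstruct and is accepted; the second is far from it and is rejected. The appropriate threshold depends on how the loss is scaled, so the toy value 0.1 has no relation to the threshold used in the experiments.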
III-C Image Classification Module
The EfficientNetB0 architecture, shown in Fig 3, is used to categorize images. This network classifies images into nine classes. For an image to be placed in one of these classes, its prediction value at the last layer must be higher than the threshold, which is set to 0.9. Otherwise, the image is labeled as a no highlight frame.
If the output of this network is left penalty area, right penalty area, or center circle, the image is not treated as an event and is included in the no highlight category. If the output is penalty kick, corner kick, tackle, free kick, or to substitute, the event is finalized and the decision is made. Finally, if the image is classified in the card category, it is given to the fine-grain classification module, where the color of the card is determined.
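The decision rules of this module can be summarized in a short sketch. The function `decide`, the dictionary of class probabilities, and the callable `classify_card` (a stand-in for the fine-grain classification module) are hypothetical names introduced here for illustration only.

```python
SCENE_CLASSES = {"left penalty area", "right penalty area", "center circle"}
CONFIDENCE_THRESHOLD = 0.9
NO_HIGHLIGHT = "no highlight"


def decide(probabilities, classify_card):
    """Map the class probabilities of one frame to a final label.

    probabilities: dict from class name to prediction value.
    classify_card: callable returning "yellow card" or "red card"
                   (hypothetical stand-in for the fine-grain module).
    """
    label = max(probabilities, key=probabilities.get)
    if probabilities[label] <= CONFIDENCE_THRESHOLD:
        return NO_HIGHLIGHT      # prediction value not above the threshold
    if label in SCENE_CLASSES:
        return NO_HIGHLIGHT      # scene classes are never reported as events
    if label == "card":
        return classify_card()   # card color is decided by the fine-grain module
    return label
```

For example, a frame scored 0.95 for corner kick is reported as a corner kick, whereas a frame scored 0.97 for center circle is reported as no highlight.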
III-D Fine-grain Classification Module
The only difference between red and yellow card images is the color of the card; otherwise, the images are essentially the same. Thus, both belong to the card category but to separate subcategories. The main classifier does not distinguish these two categories well, so the two cards are merged into one category when training the image classification module, and a separate subclassification that focuses on fine details is used to separate yellow and red cards. The final architecture employed in this section can be seen in Fig 4. It follows an existing fine-grained classification architecture from the literature, except that the ResNet50 backbone used there is replaced by EfficientNetB0, and the network is trained using the yellow and red card data. The input images to this network are the images that were categorized as cards in the image classification module. The output of this network determines whether these images are in the yellow or red card category.
In this paper, two datasets, namely the soccer event (SEV) dataset and the test event dataset, have been collected. The two datasets were collected in two ways:
By crawling Internet websites, images of different games were collected. These images are unique and are not consecutive frames.
By using videos of soccer games from the last few years of the prestigious European leagues and extracting images related to events.
IV-B SEV Dataset
This dataset includes images of soccer match events that have been collected specifically for this study. In the SEV dataset, a total of 60,000 images were collected in 10 categories. The images of this dataset, as described, were collected in two different ways. Seven of the ten image categories are related to the soccer events defined in this paper, and the remaining categories are used for no highlight detection so that the images related to them are not mistakenly included in the seven main categories. Table I shows how the SEV dataset is divided into train, validation, and test sets. Samples of the SEV dataset are shown in Fig 5.
IV-C Test Event Dataset
The test event dataset is used to evaluate the proposed method. Samples of this dataset are shown in Fig 6. The dataset consists of three classes. The first class contains images of the defined events selected from the SEV dataset, with an equal number of images (200 instances) from each event category. The second class includes other soccer images in which none of these seven events appear. The third class includes images that are not related to soccer. Details of the number of images in this dataset are given in Table II.
V Experiment and Performance Evaluation
All three networks are trained independently; end-to-end training is not employed. The training of the VAE network, the image classification network, and the yellow and red card classifier network is explained in the following subsections, respectively.
To train the VAE network, the images of the seven defined events are selected from the SEV dataset and used as training data. The test and validation data of these seven categories are used to evaluate the network. The simulation parameters used for this network are summarized in Table III.
V-A2 Image Classification (9 Class)
The image classification network is first pretrained on the ImageNet dataset with input dimensions of 224 × 224 × 3. Then, using transfer learning, the network is retrained and fine-tuned on the SEV dataset, whose input images also have dimensions of 224 × 224 × 3. The network is trained for 20 epochs with the simulation parameters specified in Table IV. The yellow and red card classes are merged, and their 5,000 images are used for network training together with the images of the other SEV dataset classes.
V-A3 Fine-grain image classification
The network shown in Fig 4 is trained using the red card and yellow card classes of the SEV dataset. The data from each category is partitioned into train, test, and validation sets with 5000, 500, and 500 images, respectively. The simulation parameters used for this network are given in Table V.
V-B Evaluation Metrics
Different metrics are used to evaluate the networks. To evaluate and compare different architectures for the EfficientNetB0 image classification and fine-grain module networks, accuracy is used as the main metric; recall and F1-score are also used to determine the appropriate threshold value of the EfficientNetB0 network. In addition, precision is used to evaluate the performance of the proposed method in detecting events.
The accuracy metric can be used to determine how accurately the trained model predicts and, as described in this paper, to compare different architectures and hyperparameters in the EfficientNetB0 and Fine-grain module networks.
The F1 score metric considers both the recall and precision criteria together; the value of this criterion is one at the best-case scenario and zero at the worst-case scenario.
Precision is a metric that helps to determine how accurate the model is when making a prediction. This metric has been used as a criterion in selecting the appropriate threshold.
Recall refers to the percentage of actual instances of a class that are correctly identified.
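For reference, the four metrics described above are commonly written in terms of true positives, true negatives, false positives, and false negatives. The symbols $TP$, $TN$, $FP$, and $FN$ are the standard notation and are assumed here; they do not appear in the text above.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{Recall}    &= \frac{TP}{TP + FN}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{align}
```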
The various parts of the proposed method have been evaluated to obtain the best model for detecting events in a soccer match. In the first step, the algorithm should be able to classify the images of the defined events correctly. In the next step, the ability of the network to detect no highlights is examined, and the best possible model is selected. Finally, the performance of the proposed algorithm on soccer video is examined and compared with other state-of-the-art methods.
V-C1 Classification evaluation
The image classification module is responsible for classifying images. To test and evaluate this network, as shown in Fig 7, different architectures were trained and different hyperparameters were tested for these models. As shown in Fig 6(a), the EfficientNetB0 model has the best accuracy among the models.
If, in the above model, the card images are divided into yellow and red card categories and the network is trained on all 10 categories of the SEV dataset, the accuracy on the test data drops from 94.08% to 88.93%. The reason is the confusion between yellow and red card predictions in this model. Consequently, yellow and red card detection is assigned to a subclassification, and only the merged card category is used in the EfficientNetB0 model.
For the subclassification of yellow and red card images, various fine-grained methods were evaluated, and the results are shown in Table VI. The bilinear CNN method using the EfficientNet architecture achieves 66.86% accuracy, and the OSME method using the EfficientNet architecture reaches 79.90% accuracy, which is higher than the other methods. By comparison, the main architecture (EfficientNetB0) used in the image classifier reaches 62.02% accuracy, a difference of 17.88 percentage points.
Table VII compares the accuracy of the combined image classification and fine-grain classification modules with that of other models. As shown in Table VII, the accuracy of the proposed method for image classification is 93.21%, the best among all models. The proposed method is also faster than the other models except MobileNet and MobileNetV2. To support this, the execution time of each model is measured on 1400 images, and the average is reported as the mean inference time in Table VII.
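A mean inference time of this kind can be measured with a simple timing loop. The sketch below is an illustration under assumptions: `predict` is a hypothetical callable standing in for a model's forward pass, and images are processed one at a time, which may differ from the batching used in the reported experiments.

```python
import time


def mean_inference_time(predict, images):
    """Return the average wall-clock time per image, in seconds."""
    images = list(images)
    if not images:
        raise ValueError("at least one image is required")
    start = time.perf_counter()
    for image in images:
        predict(image)  # the result is discarded; only the time is measured
    return (time.perf_counter() - start) / len(images)
```

In practice, a few warm-up calls before timing help to exclude one-off costs such as model loading from the average.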
As shown in Table VIII, the problem of card overlap is also resolved, and the proposed method separates the two card categories reasonably well in the image classification stage.
V-C2 Known or Unknown Evaluation
To determine the threshold on the output of the last layer of the EfficientNetB0 network, different threshold values were tested and evaluated, as reported in Table IX. The value 0.90, which gives the highest sum of F1-scores for detecting events and the best recall for detecting no highlight images, was selected as the threshold for the last layer of the network.
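A threshold sweep of this kind can be sketched as follows. This is a simplified illustration, not the authors' procedure: it reduces the problem to two classes (event versus no highlight), scores each candidate by the event F1-score plus the no highlight recall, and uses made-up toy data. The function names and the scoring rule are assumptions.

```python
def f1_and_no_highlight_recall(samples, threshold):
    """samples: (confidence, is_event) pairs.

    A frame is accepted as an event when its confidence exceeds the threshold.
    """
    tp = sum(1 for c, e in samples if c > threshold and e)
    fp = sum(1 for c, e in samples if c > threshold and not e)
    fn = sum(1 for c, e in samples if c <= threshold and e)
    tn = sum(1 for c, e in samples if c <= threshold and not e)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    recall_no_highlight = tn / (tn + fp) if (tn + fp) else 0.0
    return f1, recall_no_highlight


def pick_threshold(samples, candidates):
    """Return the candidate with the largest F1 plus no highlight recall."""
    return max(candidates,
               key=lambda t: sum(f1_and_no_highlight_recall(samples, t)))


# Toy data: three event frames and three no highlight frames.
toy = [(0.99, True), (0.95, True), (0.92, True),
       (0.85, False), (0.70, False), (0.97, False)]
print(pick_threshold(toy, [0.5, 0.8, 0.9, 0.98]))
```

On the toy data, low thresholds accept too many no highlight frames and very high thresholds reject true events, so the intermediate candidate scores best.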
Different threshold values for the loss of the VAE network were also examined, as shown in Fig 8; a threshold of 328 gives the best separation between categories.
The test event dataset is used to examine how the network handles no highlight images and images of the defined events. This dataset shows how many images of the main events are incorrectly classified as unrelated to soccer, how many soccer-related images are classified as main events, and how many are placed in the no highlight category. It also shows how many images of real events are assigned to the correct event and how many are not recognized as any of the main events. The results of this evaluation are provided in Table X.
V-C3 Final Evaluation
Ten soccer matches from the UEFA Champions League were downloaded, and event detection was carried out on them using the proposed method. In this evaluation, the events that occur in each soccer game are examined; that is, the number of events correctly detected and the number of events incorrectly detected by the network are determined. Details of the results are given in Table XI and compared with other state-of-the-art methods.
VI Conclusions and Future Work
In this paper, two novel datasets for soccer event detection have been presented. One is the SEV dataset, which includes 60,000 images in 10 categories, seven of which are related to soccer events and three to soccer scenes; these were used to train the image classification networks. The images of this dataset were taken from the top five leagues in Europe and the European Champions League. The other is the test event dataset, which contains 4200 images in three categories: the first consists of the events defined in the paper, the second comprises other images of a soccer match apart from the first category events, and the third includes images from outside the soccer field. This dataset was used to examine the ability of the network to distinguish between images with highlights and those with no highlight.
Furthermore, a method for soccer event detection is proposed. The proposed method employs the EfficientNetB0 network in the image classification module to detect events in a soccer match, and a fine-grain image classification module to differentiate between red and yellow cards. Without this module, red and yellow cards would be categorized in the image classification module, and the classification accuracy would be 88.93%; with the fine-grain image classification module, the accuracy increases to 93.21%. To handle images other than those of the defined events, a VAE with a threshold on its loss was employed, and several additional image classes, other than those of the defined events, were used to better distinguish the images of the defined events from other images.
-  G. Suzuki, S. Takahashi, T. Ogawa, and M. Haseyama, “Team tactics estimation in soccer videos based on a deep extreme learning machine and characteristics of the tactics,” IEEE Access, vol. 7, pp. 153 238–153 248, 2019.
-  M. Manafifard, H. Ebadi, and H. A. Moghaddam, “A survey on player tracking in soccer videos,” Computer Vision and Image Understanding, vol. 159, pp. 19–46, 2017.
-  P. Kamble, A. Keskar, and K. Bhurchandi, “A deep learning ball tracking system in soccer videos,” Opto-Electronics Review, vol. 27, no. 1, pp. 58–69, 2019.
-  Y. Hong, C. Ling, and Z. Ye, “End-to-end soccer video scene and event classification with deep transfer learning,” in 2018 International Conference on Intelligent Systems and Computer Vision (ISCV). IEEE, 2018, pp. 1–4.
-  B. Fakhar, H. R. Kanan, and A. Behrad, “Event detection in soccer videos using unsupervised learning of spatio-temporal features based on pooled spatial pyramid model,” Multimedia Tools and Applications, vol. 78, no. 12, pp. 16 995–17 025, 2019.
-  A. Khan, B. Lazzerini, G. Calabrese, and L. Serafini, “Soccer event
4th International Conference on Image Processing and Pattern Recognition (IPPR 2018). AIRCC Publishing Corporation, 2018, pp. 119–129.
-  M. Z. Khan, S. Saleem, M. A. Hassan, and M. U. G. Khan, “Learning deep c3d features for soccer video event detection,” in 2018 14th International Conference on Emerging Technologies (ICET). IEEE, 2018, pp. 1–6.
-  H. Jiang, Y. Lu, and J. Xue, “Automatic soccer video event detection based on a deep neural network combined cnn and rnn,” in 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2016, pp. 490–494.
-  R. Agyeman, R. Muhammad, and G. S. Choi, “Soccer video summarization using deep learning,” in 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, 2019, pp. 270–273.
-  M. Sanabria, F. Precioso, and T. Menguy, “A deep architecture for multimodal summarization of soccer games,” in Proceedings of the 2nd International Workshop on Multimedia Content Analysis in Sports, 2019, pp. 16–24.
-  M. Rafiq, G. Rafiq, R. Agyeman, S.-I. Jin, and G. S. Choi, “Scene classification for sports video summarization using transfer learning,” Sensors, vol. 20, no. 6, p. 1702, 2020.
-  S. Sarkar, A. Chakrabarti, and D. Prasad Mukherjee, “Generation of ball possession statistics in soccer using minimum-cost flow network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
-  H. M. Zawbaa, N. El-Bendary, A. E. Hassanien, and T.-h. Kim, “Event detection based approach for soccer video summarization using machine learning,” International Journal of Multimedia and Ubiquitous Engineering, vol. 7, no. 2, pp. 63–80, 2012.
-  L.-Y. Duan, M. Xu, Q. Tian, C.-S. Xu, and J. S. Jin, “A unified framework for semantic shot classification in sports video,” IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1066–1083, 2005.
-  M. Tavassolipour, M. Karimian, and S. Kasaei, “Event detection and summarization in soccer videos using bayesian network and copula,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 2, pp. 291–304, 2013.
-  M.-H. Sigari, H. Soltanian-Zadeh, and H.-R. Pourreza, “Fast highlight detection and scoring for broadcast soccer video summarization using on-demand feature extraction and fuzzy inference,” International Journal of Computer Graphics, vol. 6, no. 1, pp. 13–36, 2015.
-  J. Yu, A. Lei, and Y. Hu, “Soccer video event detection based on deep learning,” in International Conference on Multimedia Modeling. Springer, 2019, pp. 377–389.
-  H. Duxans, X. Anguera, and D. Conejero, “Audio based soccer game summarization,” in 2009 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting. IEEE, 2009, pp. 1–6.
-  A. Raventos, R. Quijada, L. Torres, and F. Tarrés, “Automatic summarization of soccer highlights using audio-visual descriptors,” SpringerPlus, vol. 4, no. 1, pp. 1–19, 2015.
-  M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv preprint arXiv:1512.03385, 2015.
-  X. Dai, S. Gong, S. Zhong, and Z. Bao, “Bilinear cnn model for fine-grained classification based on subcategory-similarity measurement,” Applied Sciences, vol. 9, no. 2, p. 301, 2019.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji, “Bilinear cnn models for fine-grained visual recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1449–1457.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” 2011.
-  J. Fu, H. Zheng, and T. Mei, “Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4438–4446.
-  H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5209–5217.
-  M. Sun, Y. Yuan, F. Zhou, and E. Ding, “Multi-attention multi-class constraint for fine-grained image recognition,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 805–821.
-  W. Ge, X. Lin, and Y. Yu, “Weakly supervised complementary parts models for fine-grained image classification from the bottom up,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3034–3043.
-  C. Geng, S.-j. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
-  H. Cevikalp, “Best fitting hyperplanes for classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1076–1088, 2016.
-  M. Hassen and P. K. Chan, “Learning a neural-network-based representation for open set recognition,” in Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 2020, pp. 154–162.
-  A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1563–1572.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.
-  B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” arXiv preprint arXiv:1602.07261, 2016.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.