More than 65% of Americans play video games on at least one type of device [stats]. Furthermore, the combined gaming industry generated $120B in revenue in 2019, a 4% increase over 2018 [takahashi_2020]. For games to be immersive, gameplay must remain high-quality and error-free. While games may contain various defects and issues, graphical corruption and visual artifacts are among the most common user complaints [taxonomy]. These artifacts typically arise from software or hardware errors that alter the visual appearance of the game or its individual frames. Figure 1 shows examples of such corruptions observed in real games.
In this paper, we conducted a proof-of-concept study to automate the detection of graphical artifacts. This is a novel problem that, if solved, would lead to significant quality improvements and an enhanced experience for consumers of images and video. Currently, artifact detection is a labor-intensive process in which glitches are reported individually by the users who experience them [glitch_arxiv]. This process is manual and time-consuming, and many users choose not to report glitches, leaving a number of issues unreported and unresolved. Furthermore, even when glitches are reported at scale, sorting and cataloguing them still takes considerable human effort. The first implication of automating this process is that no human intervention is needed: once a glitch occurs, it is automatically captured and sent to the responsible company for correction. Second, with increased knowledge of the source and cause of glitches, such automated software could catch and correct a glitch before it is displayed to the user, resulting in an uninterrupted, smooth gameplay experience.
To the best of the authors’ knowledge, no existing work has systematically synthesized glitches in images, cataloged and labeled them, and proposed an automated solution for detecting them.
The contributions of this paper are as follows:
Creation of the open-source software Glitchify for reproducing a basket of common gaming artifacts.
Generation of a large labeled dataset consisting of 50,000 normal and glitched gaming images.
Application of dimensionality reduction and feature extraction techniques to gaming images, and construction of an ensemble model that classifies gaming images to automate the process of artifact detection.
2 Gaming Artifact Creation
Automating artifact detection, like any other classification task, requires large amounts of data (e.g. labelled corrupted frames). No publicly available large-scale database provides this. Therefore, we created a large dataset of real images from gameplay and injected different types of graphics artifacts to obtain glitched images. Specifically, we focused on artifacts caused by software defects, since hardware artifacts are highly content-related, produced during rendering on the GPU, and thus much more challenging to reproduce. This data generation was implemented in the open-source software Glitchify, which is publicly available at https://github.com/AMD-RIPS/ST-2019.
The prototype for our synthetic data was a limited collection of corrupted images provided by AMD, representative of graphics artifacts observed frequently during gameplay. The Glitchify software then artificially generated images with different types of corruption, designed to closely mimic what was observed in the sample set. Unlike the artifacts shown in Figure 1, some images in the sample also appeared to have content-related artifacts. These content-related glitches depend on the objects and their interactions within scenes and are thus difficult to represent and resolve. Since detecting them would further require video segmentation and object recognition techniques, they are not addressed in this paper.
2.1 Types of Reproduced Artifacts
To reproduce different artifacts, we defined 10 classes of corruption based on their appearance. Gameplay data was obtained from publicly available YouTube videos, and all frames extracted from these videos were resized to a fixed resolution. The different kinds of artifacts are described below.
2.1.1 Shader Artifacts
The shader program within a GPU performs frame rendering and determines various surface properties such as texture, reflection, and lighting [Shader]. For our work, shader artifacts are marked by the presence of polygonal shapes of different colors that either blend together or fade gradually in certain directions. This glitch (Figure 2) was reproduced by choosing a random number of points at random positions in the image as starting points, and setting a random number of edges to form a polygon at each. We then chose a random color close to that of the initial point and reassigned the pixel values within the polygon, slowly changing the color and reducing its intensity.
2.1.2 Shapes Artifacts
Random monocolor polygonal shapes are also common in video games, especially in first-person shooters. Shapes artifacts tend to appear in the darker parts of frames. We reproduced this artifact (Figure 3) by choosing a random starting point in the darkest rectangular region (of random size within a certain range) of the image, and drawing a random number of dark thin polygons out of that point.
2.1.3 Discoloration Artifacts
Discoloration artifacts manifest as bright spots in images that are colored differently. This artifact (Figure 4) was reproduced by changing the color of pixels. For example, one could set the red component of the pixel to a predefined value above or below a certain threshold.
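A minimal numpy sketch of this kind of discoloration injection is shown below. The patch location, patch size, and the red-channel value are illustrative assumptions, not Glitchify's exact parameters.

```python
import numpy as np

def add_discoloration(frame, value=250, seed=0):
    """Sketch of the Discoloration artifact: force the red channel of a
    random rectangular patch to a predefined value. Patch geometry and
    `value` are illustrative, not the exact Glitchify settings."""
    rng = np.random.default_rng(seed)
    out = frame.copy()
    h, w = frame.shape[:2]
    # Pick a random top-left corner for a quarter-size patch.
    y = rng.integers(0, h // 2)
    x = rng.integers(0, w // 2)
    out[y:y + h // 4, x:x + w // 4, 0] = value  # saturate the red channel
    return out
```

Applied to a clean frame, this produces a localized, abnormally red region while leaving the rest of the image untouched.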
2.1.4 Morse Code Pattern
The Morse code pattern (shown in Figure 5) appears when memory cells on a graphics card become stuck and display their stuck values on the screen rather than the true image. Running a GPU at a higher speed than it was designed for, or at a high temperature, may result in such corruption. To reproduce this type of artifact, we add Morse-code-like patterns at random locations in the frame.
2.1.5 Dotted Lines Artifacts
Dotted Lines artifacts are often hard to recognize unless one magnifies the corrupted image. The dotted lines either have random slopes and positions in the input frame, or they are radial lines emanating from a single point. To generate the random Dotted Lines artifact (Figure 6), a random color is first chosen; dotted line segments of this color then replace the original pixel values of the image, with the number of lines drawn from a uniform distribution. We chose the starting point of each line segment such that it does not lie close to the edges of the image. The radial Dotted Lines differ from the random Dotted Lines in that they originate from a single point, with the number of radial segments again drawn from a uniform distribution.
2.1.6 Parallel Lines
Parallel Lines artifacts are visually discernible: the corrupted image contains a set of parallel lines, where the color of each line is the pixel color at the line's starting point. To reproduce the Parallel Lines artifact (Figure 7), a number of parallel lines in the image are replaced with new pixel values; the number of lines, the angle between the lines and the horizontal axis, the starting point of each line, and the line thickness are all drawn from uniform distributions over fixed ranges.
2.1.7 Triangulation
Triangulation typically occurs in intensive 3D games, where surfaces are rendered by small triangles that form triangle meshes. Due to graphics defects, such triangle meshes are displayed at a coarse resolution and incorrectly colored, instead of being smoothly rendered. To reproduce this artifact (Figure 8), we divide the image into triangular sections and color each triangle with the average of all the pixels within it.
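A simplified flat-shading version of this idea can be sketched with numpy: split the image into grid cells, cut each cell along its diagonal into two triangles, and fill each triangle with its mean color. The fixed grid and cell size are simplifying assumptions; the actual artifact uses arbitrary triangle meshes.

```python
import numpy as np

def triangulate(frame, cell=16):
    """Coarse Triangulation sketch: grid cells are split into two
    triangles along the diagonal, and each triangle is flat-shaded
    with its mean colour. `cell` is an illustrative parameter."""
    out = frame.astype(float).copy()
    h, w = frame.shape[:2]
    yy, xx = np.mgrid[0:cell, 0:cell]
    upper = xx >= yy            # boolean masks for the two triangles
    lower = ~upper
    for y in range(0, h - cell + 1, cell):
        for x in range(0, w - cell + 1, cell):
            block = out[y:y + cell, x:x + cell]
            for mask in (upper, lower):
                block[mask] = block[mask].mean(axis=0)  # one colour per triangle
    return out.astype(frame.dtype)
```

Every pixel inside a triangle ends up with the same averaged color, mimicking the coarse, incorrectly shaded meshes described above.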
2.1.8 Line pixelation
Line pixelation is characterized by noisy stripes with random orientations and positions on the image. To capture the pixel-intensity variations of this artifact, we insert a random number of noisy stripes with random orientations and positions, in addition to randomly positioned halos around some pixels. This artifact appears at the very bottom of the left image in Figure 9.
2.1.9 Screen Stuttering
Screen stuttering occurs when neighboring columns and rows of the image are swapped. We reproduce the Screen stuttering artifact (Figure 10) by swapping neighboring columns of the image and then swapping neighboring rows.
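The pairwise swap can be sketched as an index permutation in numpy; applying it along the column axis and then the row axis reproduces the stuttered look. This is a minimal illustration, not Glitchify's exact routine.

```python
import numpy as np

def swap_pairs(n):
    """Index permutation that swaps each pair of neighbours: 0<->1, 2<->3, ..."""
    idx = np.arange(n)
    even = n - n % 2  # leave a trailing odd element in place
    idx[:even] = idx[:even].reshape(-1, 2)[:, ::-1].ravel()
    return idx

def stutter(frame):
    """Sketch of Screen Stuttering: swap neighbouring columns, then rows."""
    out = frame[:, swap_pairs(frame.shape[1])]  # swap neighbouring columns
    out = out[swap_pairs(frame.shape[0]), :]    # then neighbouring rows
    return out
```

Because each swap is its own inverse, applying `stutter` twice recovers the original frame, which is a handy sanity check.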
2.1.10 Screen tearing
Screen tearing occurs when two consecutive frames in a video are rendered in the same image. Therefore, part of the image shows the scene at a certain point in time, while the other part of the same image shows that scene at a later time. To reproduce this artifact, we select two frames in a video that are 100 frames apart, and then randomly replace some rows (or columns) of the first frame with the corresponding rows (or columns) of the second frame (Figure 11).
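The row-replacement step can be sketched as below; the band boundaries are drawn randomly, and the choice of a contiguous band (rather than scattered rows) is a simplifying assumption.

```python
import numpy as np

def tear(frame_a, frame_b, seed=0):
    """Sketch of Screen Tearing: replace a random contiguous band of rows
    of an earlier frame with the corresponding rows of a later frame."""
    rng = np.random.default_rng(seed)
    out = frame_a.copy()
    h = frame_a.shape[0]
    start = rng.integers(0, h // 2)          # band start in the upper half
    stop = rng.integers(start + 1, h)        # band end strictly below start
    out[start:stop] = frame_b[start:stop]    # splice in the later frame
    return out
```

The result shows two time slices of the scene in one image, separated by a horizontal tear line.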
3 Feature Extraction
Since we extracted high-quality colored images, their dimensions were too large to be used directly by machine learning algorithms. Therefore, the following methods were used to extract low-dimensional features from the images.
3.1 Discrete Fourier Transform
The two-dimensional Discrete Fourier Transform (DFT) is often used to process two-dimensional discrete signals such as images. Given a one-channel signal $f(x, y)$ with dimensions $M \times N$, its DFT is given by
$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-2\pi i \left( \frac{ux}{M} + \frac{vy}{N} \right)},$$
where $u$ and $v$ are the spectral coordinates. The original signal can be recovered from $F(u, v)$ via the inverse Fourier transform:
$$f(x, y) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F(u, v)\, e^{2\pi i \left( \frac{ux}{M} + \frac{vy}{N} \right)}.$$
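The definition above can be checked directly against numpy's FFT, which uses the same convention; the brute-force loop below is only for small signals.

```python
import numpy as np

def dft2(f):
    """Direct evaluation of the 2-D DFT (for a small one-channel signal).
    np.fft.fft2 computes the same transform via the FFT."""
    M, N = f.shape
    u, v = np.mgrid[0:M, 0:N]
    F = np.zeros((M, N), dtype=complex)
    for x in range(M):
        for y in range(N):
            # Accumulate the contribution of sample f[x, y] to all (u, v).
            F += f[x, y] * np.exp(-2j * np.pi * (u * x / M + v * y / N))
    return F
```

In practice `np.fft.fft2` and `np.fft.ifft2` are used directly; the naive loop is O((MN)^2), while the FFT is O(MN log(MN)).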
The rationale for using the DFT as a feature for graphics artifact classification is that several types of artifacts (e.g. Morse code and Parallel Lines) exhibit periodic patterns that are best identified in the frequency domain. Previous studies provide evidence of successful application of this technique to the detection of periodic patterns in signals [russians]. Additionally, most graphics corruptions have sharp edges and fine structure, which is reflected in the high-frequency components.
3.2 Histogram of Oriented Gradients (HoG)
Histogram of oriented gradients is a feature used in computer vision to detect edges in an image. An $M \times N$ color image can be represented by three functions that map each coordinate to the corresponding red, green, and blue intensity values, respectively. The gradient of each function can be approximated by applying discrete derivative masks at each coordinate.
The image is then divided into small patches, and the magnitude and orientation of the gradients within each patch are computed and summarized by a histogram of gradients containing $B$ bins corresponding to evenly spaced orientations. For each gradient with magnitude $m$ and orientation $\theta$, we select the two consecutive bins (here the last bin and the first bin are considered consecutive) whose angles $\theta_1$ and $\theta_2$ bracket $\theta$; the gradient then contributes $m \cdot \frac{\theta_2 - \theta}{\theta_2 - \theta_1}$ to the first bin and $m \cdot \frac{\theta - \theta_1}{\theta_2 - \theta_1}$ to the second. Finally, we normalize the histograms and concatenate them to form a feature descriptor of the entire image.
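The soft binning for a single cell can be sketched as below. The bin count of 9 and the 180° unsigned-orientation range are common HOG defaults, assumed here rather than taken from the paper.

```python
import numpy as np

def hog_cell_histogram(mag, ang, n_bins=9):
    """Soft-binned orientation histogram for one HOG cell: each gradient's
    magnitude is split linearly between the two nearest bins, with the
    first and last bins treated as consecutive. Angles in [0, 180) degrees;
    the bin count is an illustrative default."""
    bin_width = 180.0 / n_bins
    hist = np.zeros(n_bins)
    pos = ang / bin_width - 0.5          # fractional position between bin centres
    lo = np.floor(pos).astype(int)
    frac = pos - lo
    for m, l, t in zip(mag.ravel(), lo.ravel(), frac.ravel()):
        hist[l % n_bins] += m * (1 - t)  # share for the lower bin (wraps around)
        hist[(l + 1) % n_bins] += m * t  # share for the upper bin
    return hist
```

The split preserves total gradient magnitude, and a gradient aligned exactly with a bin center falls entirely into that bin.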
3.3 Pixel-wise Anomaly Measure
Given an image, we approximate the distribution of red, green, and blue intensities, and then assign each individual pixel an anomaly score based on how much the pixel's intensity deviates from the estimated global distribution [RX_detector]. This can be done using the graph-based method described below [graph_lap].
Consider an undirected, weighted graph $G = (V, E)$ whose vertex set $V$ corresponds to the three color channels and whose edge set $E$ carries a weight $w_{ij}$ between vertices $i$ and $j$. In our case, the edge weights are a decreasing function of the difference between average channel intensities, e.g. $w_{ij} = \exp\!\left(-(\mu_i - \mu_j)^2\right)$, where $\mu_R$, $\mu_G$, and $\mu_B$ are the average red, green, and blue intensities in the image, respectively. From the adjacency matrix $W$, the combinatorial graph Laplacian $L = D - W$ can be computed, where $D$ is the diagonal degree matrix with entries $D_{ii} = \sum_j w_{ij}$. Finally, we normalize the Laplacian matrix and define an anomaly measure for each pixel $p$ in the image by the quadratic form
$$s(p) = x_p^{\top} L\, x_p,$$
where $x_p$ is the vector of color intensities of pixel $p$.
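A compact numpy sketch of this measure follows. The Gaussian edge-weight kernel and the symmetric normalization are assumptions, since the paper's exact formulas are not reproduced here.

```python
import numpy as np

def pixel_anomaly(frame):
    """Sketch of the graph-based anomaly measure: build a 3-vertex graph
    over the mean R, G, B intensities, form its normalized Laplacian, and
    score each pixel by the Laplacian quadratic form of its colour vector.
    The Gaussian edge-weight kernel is an assumption."""
    mu = frame.reshape(-1, 3).mean(axis=0)       # mean R, G, B intensities
    diff = mu[:, None] - mu[None, :]
    W = np.exp(-diff ** 2)                       # assumed Gaussian weights
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # combinatorial Laplacian
    d = np.sqrt(np.diag(D))
    Ln = L / np.outer(d, d)                      # symmetric normalization
    x = frame.reshape(-1, 3).astype(float)
    scores = np.einsum('pi,ij,pj->p', x, Ln, x)  # x_p^T L x_p per pixel
    return scores.reshape(frame.shape[:2])
```

Pixels whose colour vector deviates from the global channel statistics receive higher scores; a perfectly uniform frame scores zero everywhere.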
3.4 Randomized Principal Component Analysis
Principal component analysis (PCA) is a commonly used dimensionality reduction technique in machine learning and statistics [pca_reduction]. It finds directions, or principal components, that maximize the variance of the projected data. Data projected onto the space spanned by the first several principal components serve as a low-dimensional representation of the original data matrix.
Principal components are often computed via the singular value decomposition (SVD) [pca_via_svd], since the principal components are exactly the normalized right singular vectors of the data matrix. However, computing the exact SVD of an $m \times n$ data matrix takes $O(mn \min(m, n))$ time, which is computationally infeasible for our high-dimensional data. Instead, we apply the randomized power-iteration SVD algorithm described in [randomized_svd, random_svd2].
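The randomized power-iteration idea can be sketched in a few lines of numpy: project onto a random low-dimensional range estimate, sharpen it with alternating power iterations, and solve the small SVD exactly. Parameter choices here are illustrative, not the paper's exact settings.

```python
import numpy as np

def randomized_svd(A, k, n_iter=4, seed=0):
    """Randomized power-iteration SVD sketch: estimate a rank-k range of A
    with a random projection, refine it by power iterations, then compute
    the exact SVD of the small projected matrix."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((A.shape[1], k))   # random test matrix
    for _ in range(n_iter):                    # power iterations sharpen the range
        Q, _ = np.linalg.qr(A @ Q)
        Q, _ = np.linalg.qr(A.T @ Q)
    Q, _ = np.linalg.qr(A @ Q)                 # orthonormal range estimate of A
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small, s, Vt
```

For a matrix of exact rank k, the random projection captures the range and the factorization reconstructs the matrix to machine precision.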
4 Classification Models
For each artifact, we explored various combinations of feature representations and classification algorithms, and picked the best-performing combination as a specialized classifier for that artifact. We then combined these specialized classifiers into an ensemble model, as ensembles are shown to perform better than any single individual classifier (Figure 12) [why_ensemble]. The following section provides a brief overview of the justification behind the feature/model combinations that we hand-selected.
4.1 Feature and Model Selection
Given the nature of each artifact, we only considered feature representation/classifier combinations that seemed capable of capturing that artifact. For example, there are no repetitive patterns in Screen Tearing artifacts, so we did not use the Fourier transform to represent them. The set of potential classifiers consisted of Convolutional Neural Networks (CNN), Logistic Regression (LR), Random Forest (RF), Support Vector Classifier (SVC), and Linear Discriminant Analysis (LDA). We used accuracy, recall, and precision, in that order, as the evaluation metrics for our models. Additionally, if two models performed similarly, we picked the one with the lower training time. Note that if a feature/classifier combination reached accuracy, recall, and precision above 90% on the test set for a given artifact, we did not try further combinations on that artifact.
Resize → CNN: Convolutional Neural Networks are deep learning classifiers for computer vision tasks. CNNs are typically not capable of performing well on very large images, so we had to resize the images. Our CNN had the architecture Convolution → Maxpool → Convolution → Maxpool → Softmax. We did not perform hyperparameter tuning; however, it is a promising future direction. We tried this combination on artifacts that remained detectable after resizing: Parallel Lines, Shapes, Shader, and Discoloration. For artifacts with a repetitive pattern, we added a Fourier transform step prior to resizing, since the transformed image has distinctive features that remain dominant after resizing. The Fourier Transform → Resize → CNN combination was used on every artifact except Parallel Lines, Screen Tearing, and Discoloration.
Fourier Transform → Resize → (PCA) → SVC, LDA, LR: Given that the classification models in this category are shallow and do not take images as input, we either used PCA as a dimensionality reduction technique or flattened the image into a one-dimensional vector before handing it to the classifiers. Also, since the transformed images were still too large, we downsized them before applying PCA or flattening. The Fourier transform was used as a first step because, for most artifacts, the transformed image has more distinctive and less localized visual features than the original. We used this combination on every artifact except Screen Tearing and Discoloration.
Resize → (HOG) → SVC, LDA, LR: This is similar to the above combination, except that HOG is used instead of PCA. We tried this combination on artifacts with distinctive straight edges, which HOG is good at capturing: Screen Tearing, Shapes, Shader, and Line Pixelation.
Anomaly Measure → (Dilation) → Threshold: This computationally cheap combination produced good results on artifacts whose pixels typically have sharp color contrast with their neighbors. We tried it on all artifacts except Triangulation, Screen Tearing, and Discoloration.
Table 2 lists the best-performing feature representation and classifier combination used for each artifact. We found that even though CNNs and Random Forests are widely and successfully used in computer vision and classification tasks, respectively, Logistic Regression outperformed them in most of our artifact detection tasks, especially given its relatively low training time. We used the Logistic Regression model provided by the sklearn package [scikit-learn] with its default arguments, i.e. the $\ell_2$ norm for penalization and a regularization strength of $C = 1.0$. As with LR, we used the sklearn SVC model (RBF kernel) to perform binary classification on the extracted features.
5 Training and Testing of the Ensemble
Here we describe the data used for our experiments and the three-stage training process of the ensemble. In training stage I we trained the specialized classifiers for each glitch, i.e. the components of our ensemble. For combining the outputs of the specialized classifiers, OR logic (the most basic combining method) did not seem reasonable for our application, so we used a Logistic Regression instead [LR_ensemble_1, LR_ensemble_2]. If some of the specialized classifiers share a pattern of misclassification (for example, when the Screen Tearing and Shader models tend to label a normal image as corrupted while no other specialized classifier does), a Logistic Regression can capture that pattern much better than OR logic. We therefore used training stage II to train the ensemble Logistic Regression. Finally, to test the generalizability of our classifier to new games, we tested the ensemble on a dataset of games that had never been seen by either the specialized classifiers or the ensemble Logistic Regression. These three stages are explained in detail in the following paragraphs.
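The stacking step above can be illustrated with sklearn, which the paper uses for its LR models. The synthetic per-classifier probabilities below stand in for real specialized-classifier outputs; their number and distribution are assumptions for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of training stage II: the ensemble LR consumes the concatenated
# output probabilities of the specialized classifiers. Synthetic
# probabilities stand in for real classifier outputs here.
rng = np.random.default_rng(0)
n, n_specialists = 400, 10
y = rng.integers(0, 2, n)  # 0 = normal, 1 = corrupted
# Each specialist emits a probability loosely correlated with the label.
probs = np.clip(0.5 * y[:, None] + rng.random((n, n_specialists)) * 0.6, 0, 1)

ensemble = LogisticRegression()  # sklearn defaults, as in the paper
ensemble.fit(probs[:300], y[:300])
accuracy = ensemble.score(probs[300:], y[300:])
```

Unlike OR logic, the learned weights can discount specialists that systematically fire together on normal images.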
5.1 Dataset
We applied Glitchify to generate the artifacts described previously and obtain a large training dataset. The initial, uncorrupted images were collected by downloading long (2-6 hour) gameplay videos from 30 different games. We extracted around 2,000 images from each game, for a total of 50,000 images, and visually inspected all of them to make sure they appeared glitch-free. We then added the 12 different types of artifacts to half of the images, leaving the other half unchanged and labeled as normal (glitch-free) images.
We split this dataset of 50,000 images into three parts, A, B, and C, consisting of approximately 35,000, 7,500, and 7,500 images, respectively. Dataset A, the largest of the three, consisted of gaming images from 24 games and was used to train the specialized classifiers, each of which is responsible for capturing one type of graphic corruption. Dataset B contains gaming images from 3 games distinct from those in dataset A and is used to train a Logistic Regression model that combines the outputs of the specialized classifiers into a final prediction of whether an input image is corrupted. Dataset C is the holdout dataset reserved for testing the ensemble model. The assignment of games to datasets is shown in Table 1.
| # | Game | Dataset | Stage of Usage |
|---|------|---------|----------------|
| 3 | Detroit: Become Human | A | Training Stage I |
| 4 | Devil May Cry 5 | A | Training Stage I |
| 5 | Dirt Rally 2 | A | Training Stage I |
| 6 | Far Cry 5 | A | Training Stage I |
| 8 | Hollow Knight (2d) | A | Training Stage I |
| 11 | League of Legends | A | Training Stage I |
| 13 | Star Control: Origins | A | Training Stage I |
| 14 | Total War: 3 Kingdoms | A | Training Stage I |
| 15 | Star Craft 2 | A | Training Stage I |
| 19 | Need For Speed Payback | A | Training Stage I |
| 20 | Mutant Year Zero | A | Training Stage I |
| 24 | The Sinking City | A | Training Stage I |
| 26 | DOTA 2 | B | Training Stage II |
| 29 | Crackdown 3 | C | Ensemble Testing |
5.2 Training Stage I
During this stage, we trained the specialized classifiers. For this procedure, we extracted images from 24 games (dataset A in Table 1), labeled 2,400 of these images as normal (artifact-free), and applied Glitchify to the rest, obtaining around 1,500 corrupted frames per artifact. Note that the training sets of the specialized classifiers were mutually exclusive (they did not share any normal images). The models introduced in section 4 were paired with features from section 3 and trained as binary classifiers for each artifact type, using a train-test split of the data. The best-performing feature/model combinations were chosen based on accuracy and recall and are recorded in Table 2. The performance of these specialized classifiers on the test set (i.e. on familiar games) is shown in Figure 13.
5.3 Training Stage II
After training the specialized classifiers, we form an ensemble by training an LR model that takes the concatenated outputs of the specialized classifiers as input and outputs 0 (normal) or 1 (corrupted). To ensure generalizability across games, the dataset for this training stage comprised 3 games (dataset B in Table 1) never seen by the specialized classifiers, consisting of 1,650 normal images and 150 images of each type of artifact. A train-test split was applied to the data. At this stage, all of the specialized classifiers are shown normal images as well as all types of artifacts. The performance of the specialized classifiers on this test set (i.e. on new games) is reported in Figure 14. Given that the real-world application of this ensemble is expected to work on games it has not seen before, we did not train the Logistic Regression on familiar games.
5.4 Testing the Ensemble
The goal of this stage is to measure the generalizability of the ensemble model as a whole across different games. The dataset for this testing stage comprised 3 games never seen by the ensemble (dataset C in Table 1), with 1,650 normal images and 1,800 corrupted images (150 per artifact type). The results are reported in Figure 15.
| Artifact | Feature Representation | Classifier |
|----------|------------------------|------------|
| Shapes | FT + Resize | LR |
| Line pixelation | Anomaly Measure + Dilation | Threshold |
| Shader | Resize + HOG | LR |
| Morse code | FT + Resize | SVC |
| Parallel Lines | FT + Resize + PCA | LR |
| Dotted line | FT + Resize + PCA | LR |
| Stuttering | FT + Resize + PCA | LR |
| Triangulation | FT + Resize + PCA | LDA |
| Discoloration | FT + Resize + PCA | LR |
| Screen tearing | Resize + HOG | LR |
6 Results
6.1 Evaluation Metrics
To evaluate the performance of our models, we used three different metrics: accuracy, precision, and recall. We refer to corrupted images as positive and normal images as negative.
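With that labeling convention, the three metrics can be computed as follows; this is a standard definition, shown here only to make the positive/negative convention concrete.

```python
import numpy as np

def accuracy_precision_recall(y_true, y_pred):
    """Accuracy, precision, and recall with corrupted = positive (1)
    and normal = negative (0)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # corrupted, flagged
    fp = np.sum((y_true == 0) & (y_pred == 1))  # normal, wrongly flagged
    fn = np.sum((y_true == 1) & (y_pred == 0))  # corrupted, missed
    accuracy = np.mean(y_true == y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall
```

Recall is the fraction of corrupted frames that are caught, which is why it is the metric emphasized later in the discussion.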
6.2 Performance of our Model on Familiar Games
We first tested the performance of the specialized classifiers and the ensemble model on familiar games. Figure 13 illustrates the performance of the specialized classifiers on familiar games, using the held-out test set from training stage I (dataset A). Figure 15 illustrates the performance of the ensemble Logistic Regression on familiar games, using the held-out test set from training stage II (dataset B).
6.3 Performance of our Model on New Games
An important metric in assessing artifact detection models is generalizability. In our case, generalizability refers to the ability of the model to perform well on images extracted from games that have not been encountered before. We will measure both the generalizability of the ensemble Logistic Regression and that of the specialized classifiers.
In training stage 2, even though the Logistic Regression model is trained using the games in dataset B, the specialized classifiers have never seen any image from dataset B. Therefore we used images from dataset B to test and evaluate the generalizability of the specialized classifiers. We also tested the generalizability of the ensemble model on dataset C which neither the ensemble model nor the individual classifiers have seen before, and obtained an accuracy of 69%. Figure 14 shows the testing results of each individual classifier.
7 Discussion
We now discuss possible interpretations of our results, along with their limitations and sources of possible bias. One fundamental concern relates to our generated dataset and how well it represents the actual graphics corruptions that occur during gameplay. First, our approach in developing the Glitchify program consisted of classifying artifacts into categories based on their appearance, but the viability of this procedure is debatable. Although the structure and design of Screen tearing and Stuttering were sufficiently understood and formally expressed, our definitions of the Shader, Shapes, and Line pixelation artifacts might be too narrow to capture all of the variations of these corruptions seen in reality. This discrepancy might be responsible for some bias present in our artificial dataset, which in turn affects all further results. Second, another data-related concern comes from the frames we extracted from games to feed into Glitchify. We require that all these frames are normal, i.e. do not contain any corruption, before the application of Glitchify. To ensure this, we manually inspected the collected images as a first-order quality check. However, due to the surrealistic content of most games, working with individual frames rather than continuous, dynamic gameplay makes it harder, if not impossible, to correctly discriminate between unwanted artifacts and intentional design. Figure 16 displays examples of such images.
Regarding the individual models, accuracy on the test set drops from training stage 1 (Figure 13) to training stage 2 (Figure 14). This is mainly because the models in training stage 2 have never seen the games they are tested on; in other words, the number and variety of the games in dataset A were not sufficient to ensure generalizability of the specialized classifiers across different games. One reason for the high number of false negatives (contributing to low recall and accuracy) is that some artifacts added through Glitchify are too subtle, as we can see in Figure 17.
In this application, however, recall is the most important metric to consider, since during gameplay it is crucial to detect as many corrupted images as possible so that they can be fixed. In training stage 2, most models produce good recall on games they have never seen, showing a good degree of generalizability. The Discoloration and Screen tearing artifacts have relatively low recall in training stage 2 (0.66 and 0.5, respectively) versus 0.95 and 0.80 in training stage 1, indicating that those models overfit the training data and do not generalize to new games. These two artifacts are especially challenging because, as shown in Figure 18, images can contain a natural separation line or color gradient that confuses the Screen tearing and Discoloration classifiers, respectively.
We speculate that the low accuracy score obtained on the held-out test set is due to the “glitchy” look of one of the games included in dataset C (Crackdown 3).
8 Conclusion
In this proof-of-concept study, we developed a set of algorithms and software that automatically detects graphics corruption in frames from video games. Based on a sample of screen corruption examples provided by AMD, 10 of the most common content-unrelated artifacts were selected, described, and then recreated with the Glitchify program. With the help of Glitchify, a dataset of 50,000 images was created by adding these artifacts to normal frames extracted from 30 modern video games. Each of the 10 forms of corruption was used to train a basket of models that included Logistic Regression, Support Vector Machines, and Linear Discriminant Analysis. The output probabilities of these individual classifiers were used to train a mixture-of-experts Logistic Regression model that carries out the final classification decision. The overall accuracy on the Glitchify-produced test set based on games unseen during training is 69%.
Overall, this study has demonstrated several important results. The accuracy of the models trained on Glitchify data indicates that synthetic generation of defects can be an efficient mechanism for training models to identify real corruption. Put another way, the simple models of synthetic corruption described previously accurately represented some of the actual defects that occur in software or hardware. This hints that synthetic data can be used to efficiently train machine learning models for visual corruption detection. This result should be explored more generally: if it holds, it provides an effective means of generating labelled data en masse, circumventing the need for large, manually labelled datasets.
Another important contribution of this work is demonstrating that a basket of models can be used as an efficient classifier of visual defects. No single model outperformed the others for identifying all forms of corruption. The efficacy of ’mixture of experts’ approaches is well understood in other machine learning domains but, to the authors’ knowledge, had not previously been demonstrated in the visual corruption domain. We note that Generative Adversarial Networks (GANs) could be used for artifact creation in the future. We did not use them in this work owing to the small number of real-world artifact examples available to us, and because GANs are computationally expensive and require large amounts of data to achieve superior performance.
There are a few points that were not the main focus of this paper but could improve the results. First, contrary to our expectations, LR (a shallow model) outperformed CNNs (deep learning models for image classification) in most, if not all, cases. We expect that a larger training dataset, together with more exhaustive hyperparameter tuning, would prove beneficial to the use of CNNs. Second, we did not perform significance testing to determine which feature/model combination performs best; adding this step would make the analysis more rigorous. Lastly, we did not consider the possibility of multiple kinds of artifacts appearing in the same image. For example, Stuttering and Screen Tearing can appear in a single frame, but catching such combinations was not among the primary goals of this paper. While this is preliminary work, we believe it is a good starting point for future research on automating artifact detection, which would in turn lead to significant quality improvements in video games and, consequently, increased gaming revenue. In addition to individual frames, our model can work on videos and capture glitches in real-time gameplay, as demonstrated in a short demo here: https://youtu.be/AiZ0Dae7jW4. Further work is needed to improve the accuracy of our classifier and to potentially fix artifacts before they are displayed to the user.