Quality control is a fundamental process in the manufacturing pipeline. Since the ’80s, automating the quality control task has offered the potential to overcome the limitations of manual inspection. As a consequence, successful applications of automatic visual inspection have been emerging year after year, and nowadays inspection systems are employed in a vast number of industries, from food and fabrics to railways and reconstruction.
In this regard, a standard visual inspection hardware setup is typically composed of a digital camera, optics, and an illumination system. The hardware setup is usually coupled with a customised software that controls the acquisition, evaluates the captured images, and eventually takes decisions based on the evaluated results. Hence, hardware selection is a fundamental task in the design of an automatic visual inspection system and is essentially driven by the characteristics of the object to be inspected .
For a given manufactured object, countless different models might exist in production having various material properties (specular, diffusive, directional, transparent) or geometrical shape (flat, curved, prismatic). The surface to be inspected might also contain patterns and adornments which should be distinguished from the undesirable irregularities (see Fig. 1). We define the object to be inspected, which is the subject of this work, as a complex-object if its variable surface characteristics cannot be determined a priori, e.g. it can appear highly reflective and curved in one instance and opaque and prismatic in a different instance. This situation is not uncommon when inspecting assembled and/or decorated products, which can have custom finishing, based on customer requests.
[Footnote 1] Due to Non-Disclosure Agreement (NDA) restrictions in place, we cannot reveal the identity of the object inspected in this study.
In this context, the standard illumination techniques, comprising ‘front lighting’, ‘back lighting’, ‘diffuse lighting’, ‘bright-field lighting’ and ‘dark-field lighting’, are individually not sufficient for this task, as each of them is suitable for inspecting only a few specific surface characteristics. Additionally, the surface attributes are not the only factor driving the choice of the illumination setup. In fact, the immediate inspection environment has been named as one of the three factors for an optimal lighting solution, and object geometry and its support structure have been introduced as two critical factors for the design of lighting solutions that may even limit the choice of standard illumination techniques.
In this work, we aim to propose an illumination system which is capable of dealing with the challenges of automatic visual inspection of the complex-objects, and to define a methodology for analyzing the effect of the proposed illumination system on the final defect detection performance. In particular, we seek to study the impact of the proposed multi-lighting system when deployed in training phase only or in both training and evaluation phases. The first case is specifically relevant in the common situation where deployment of a novel acquisition system cannot be accomplished on the customer site, either due to industrial constraints or technical specifications. To summarize, our contributions are as follows:
We propose an acquisition setup composed of a multi-illumination system (diffused, dark-field and frontal illumination techniques) to guarantee high defect visibility (over 99%, as reported by the annotators) on a wide selection of instances of the complex-object.
We conduct exhaustive experiments to demonstrate the importance of the multi-lighting system, even when it is deployed only in the training phase.
We experimentally show that the multi-lighting setup deployment in the evaluation phase, when coupled with late-fusion of detections in each single-lighting conditions, leads to the highest defect detection rate of the system.
II Related work
The list of successful applications of the visual inspection systems in the case of non-complex objects is long and in many cases the deployment of standard illumination techniques leads to significant improvements. For instance,  addresses touch panel glass defect detection using dark-field illumination coupled with image processing techniques achieving accuracy on edge defect type, and an ad-hoc illumination technique such as injecting light beams perpendicularly in the glass achieves performance in scratch defect detection and its discrimination from dust .
Inspection of non-regular objects, however, has always been considered a challenging task where a combination of hardware and software techniques was required to achieve the desired outcome. For the inspection of silver halide films, adopting dark-field illumination led to the best results in detecting scratches and dust. In certain inspection scenarios, such as small defects in automotive components, standard illumination techniques were not found suitable. Thus, to ensure that defects are identified even when they are undetectable to the naked eye, x-ray imaging was proposed and achieved quality performance using a linear SVM classifier. Yet, in a similar case for detecting small defects on automobile casting aluminum parts, deployment of x-ray imaging together with the most recent algorithms such as Feature Pyramid Networks leads to mAP in the best case scenario .
In this paper, firstly we present a custom-designed illumination system comprising several heterogeneous lighting techniques including diffused, dark-field, and front lighting, under various camera exposure values to illuminate numerous defect types on a wide range of surface characteristics that a complex-object might be made of. In addition, we discuss that collecting data under various illumination configurations can be understood as representing an artifact in different modalities, although all the modalities are in practice offered in a single data format as the RGB image. As also suggested in , hereafter we will thus regard images acquired under different illumination configurations as different modalities.
Secondly, we provide exhaustive analysis on the potentials of the proposed system to be utilized in either training or evaluation phase. In many cases, the illumination system cannot be arbitrarily chosen or modified due to, for example, out of reach system specifications or cost related issues, especially in customer site (evaluation phase). Hence, we will investigate performance improvement brought by the developed system only in the data collection phase for training of algorithms only in provider site. Further, we experimentally demonstrate that mutual processing of multiple modalities in the form of late-fusion of single detections in each modality leads to considerable improvements in the performance of defect detection algorithms if employed in both training and evaluation phases, thus justifying the suitability of the proposed pipeline.
In this regard, the work most similar to ours detects and classifies defects on a smartphone surface by taking several images with various cameras and light sources, to ensure the visibility of defects in at least some of the images. However, differently from our proposal, where images are taken with a single camera under varying illumination conditions, in that work the images are taken with different cameras placed in different locations, so the mutual processing of the collected images does not occur.
Our motivations for proposing our design are threefold: first, in our proposal, only one camera is embedded, leading to a more cost-effective setup. Second, our proposed setup is designed to have a moderate physical weight enabling it to be carried e.g. with a robotic arm to spin around the complex-object and acquire images at different positions of it. Third, and possibly of more interest to the pattern recognition community, our proposed setup provides multiple instances of the same defective region that we empirically demonstrate to have a large improving impact on defect detection procedure.
III Acquisition setup
Our proposed lighting setup is composed of five flat-dome lights that alternately activate and deactivate in different combinations. The light positioning has been empirically studied so as to reproduce diffused, dark-field and front lighting techniques, while producing the least possible glare on specular surfaces. Our proposed setup can be seen in Fig. 2.
A dome light offers diffused, shadow-less, and uniform illumination even on shiny, curved, and uneven surfaces. Flat-dome lights provide the same characteristics as dome lights, with the additional advantage of occupying less volume, comparable to that of a standard LED light. To minimize the reflectivity of the lighting system, which would make it visible when acquiring highly specular surfaces, we covered all the white flat-dome lights with dark collimator filters.
We identified four lighting configurations which allow the system to produce front lighting (Fig. 2.C) and dark-field lighting in vertical (Fig. 2.UD), horizontal (Fig. 2.LR) and all lateral (Fig. 2.UDLR) directions. Front lighting is mostly suitable for detecting color irregularities or flat defects, while dark-field lighting is extremely useful for acquiring effective images of defects related with surface irregularities such as scratches, bumps, or missing pieces.
In addition to the four modalities and to ensure the appropriate illumination level of the acquired images of any surface independently from their reflective characteristics, each light configuration is activated for 3 different time lengths, mimicking 3 different camera shutter speeds (low, medium, high). Camera exposure time is set to be constant and longer than the maximum time of light activation. Trigger controls are configured such that lights and the camera are properly synchronized. In our study, all the images are acquired using a Basler acA2440-75uc camera and an Edmund Optics 16mm F1.4 lens. The camera is placed at the ad-hoc hole presented in the center of the central light. In order to block out all the external environment light, the entire setup and the complex-object to be inspected were placed in a dark black box.
Given the described acquisition setup, the system can simultaneously acquire 12 images of the same object under varying illumination conditions (4 modalities, each with 3 exposures). A defect, depending on its type and the characteristics of the surface on which it appears, might be visible in all or only some of the captured images. For example, in the case shown in Fig. 3, the defect is visible in all the images except those captured with the central light at medium and high exposures. Note the significantly different representation of a single defect that each light configuration offers.
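The 12 illumination conditions are the Cartesian product of the four light configurations and the three activation lengths. A minimal Python sketch of that enumeration follows. The configuration labels (C, UD, LR, UDLR) follow Fig. 2, while the exposure labels and the function name are illustrative assumptions, not part of the described system.

```python
from itertools import product

# Light configurations as labelled in Fig. 2.
MODALITIES = ("C", "UD", "LR", "UDLR")
# Illustrative names for the three light activation lengths (assumption).
EXPOSURES = ("low", "medium", "high")


def illumination_conditions():
    """Return every (modality, exposure) pair captured for one region."""
    return list(product(MODALITIES, EXPOSURES))


conditions = illumination_conditions()
```

Each region of the object is therefore represented by one image per entry of `conditions`.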
Without predefined instructions on image choice, for each defective object, the annotators label the defect in only one of the images on which they can spot it, as shown by a green bounding-box in Fig. 3. Fig. 4 shows the normalized frequency of annotations for each illumination condition. We expand the single annotation on one image to all the 11 remaining images. If the existing defect on the object is not visible in any of the 12 images captured by the setup, the annotator indicates the non-visibility of the defect in the annotation tool. It is worth mentioning that the developed setup enabled us to visualize and correctly annotate over 99% of the defects in a freely selected collection of complex-objects.
The collected dataset consists of defective regions of complex-objects, where each region may contain more than one defect. For each region, 12 images with varying illumination conditions are collected, obtaining a total number of images. For our experiments, we split the dataset object-wise in training, validation, and test set with the ratio of 70%, 15%, and 15% respectively.
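An object-wise split assigns whole objects, rather than individual images, to a subset, so that all the images of one object end up together. The sketch below illustrates one possible way to do this in Python; the function name, the use of integer object identifiers, and the fixed seed are assumptions made for illustration only.

```python
import random


def split_object_wise(object_ids, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split unique object ids into train/val/test so no object spans two sets."""
    ids = sorted(set(object_ids))
    random.Random(seed).shuffle(ids)  # reproducible shuffle (seed is an assumption)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder goes to the test set
    }


splits = split_object_wise(range(100))
```

Every image of a region is then placed in the subset of the object it belongs to, which avoids leaking views of the same object between training and test data.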
Depending on the defect type and on the surface characteristics, the defect might be better visible in one or more than one image out of the 12 collected using the proposed setting. Given this, is selecting a conventional single illumination technique the most effective choice that the system provider can make? Can the system provider leverage the availability of the multi-modal data in the training phase to improve uni-modal testing performance? Can different light conditions be considered a natural data augmentation technique, or are the resulting images too correlated to bring any contribution during model training? Can inspection scenarios benefit from the availability of multi-modal data also in the evaluation phase? In the following paragraphs, we explain our proposed methodology for answering these questions.
V-A Study 1: Training and evaluation on one single modality
The most common situation when working with visual inspection systems consists of having the same illumination setup available in both training and evaluation phase, therefore it is fundamental to assess the best performing illumination modality. This scenario will be our baseline: only one illumination modality is available for training and evaluation. In this single modality scenario, we are interested in comparing the performances that can be obtained using each of different modalities, for better understanding the characteristics of our dataset and for exploring which light configuration may better help in solving our task.
Note that in the single modality scenario only one quarter of the collected data is used, since the images related to the other 3 modalities are discarded. Yet, in all the experiments, the selected light configuration includes all of its corresponding images taken under all the 3 exposures, unless stated differently.
V-B Study 2: Training on multiple modalities, evaluation on a single modality
As mentioned earlier, in some cases, visual inspection systems cannot be arbitrarily chosen or modified in evaluation phase. In this study, we aim to verify whether deploying a multi-modal inspection system only for acquiring images to be used for model training can lead to improved performances on the unmodified single modality evaluation setup.
In order to be comparable with the results of Study 1, we introduce images acquired using different illumination modalities keeping constant the number of images used during training. In other words, also in this experimental setup, only one quarter of the entire dataset is used. In this case, we choose two possible strategies to select dataset images to preserve:
Out of the 12 images available per each defective region, preserving 3 random images each from a different modality under one randomly selected exposure value only;
Preserving only one quarter of the defective regions in the dataset, but using all of their 12 images acquired with all the light configurations and exposures.
Comparing the performances obtained by training the model on these two datasets will give us an insight on the comparative effectiveness of having either more defective objects or more modalities during training phase, given any of the single modalities in evaluation phase.
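The two selection strategies can be sketched in Python as follows. This is only an illustration under stated assumptions: images are identified by hypothetical (region, modality, exposure) tuples, the exposure labels are invented, and in the first strategy a single exposure is drawn per region and shared by its 3 images, which is one possible reading of the description above.

```python
import random

MODALITIES = ("C", "UD", "LR", "UDLR")   # labels from Fig. 2
EXPOSURES = ("low", "medium", "high")    # illustrative names (assumption)


def keep_all_regions_few_images(region_ids, rng):
    """Strategy 1: every region, 3 images from 3 distinct modalities, one exposure."""
    kept = []
    for region in region_ids:
        exposure = rng.choice(EXPOSURES)  # one exposure per region (assumption)
        for modality in rng.sample(MODALITIES, 3):
            kept.append((region, modality, exposure))
    return kept


def keep_few_regions_all_images(region_ids, rng):
    """Strategy 2: one quarter of the regions, all 12 images of each."""
    regions = list(region_ids)
    chosen = rng.sample(regions, len(regions) // 4)
    return [(r, m, e) for r in chosen for m in MODALITIES for e in EXPOSURES]


rng = random.Random(0)
subset_1 = keep_all_regions_few_images(range(40), rng)
subset_2 = keep_few_regions_all_images(range(40), rng)
```

Both subsets contain the same number of images (one quarter of the full set), so any difference in performance can be attributed to how the images are distributed over regions and illumination conditions.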
V-C Study 3: Training on all the images and modalities, test on a single modality
In Study 2 we discarded three quarters of the collected images in order to compare the achieved results with the ones obtained in Study 1. Nevertheless, the proposed acquisition setup enables collecting 12 images per object with no additional effort required for acquiring or annotating them in comparison to a single modality illumination system.
The possibility of exploiting a bigger training set raises the expectation of better modeling the task to be solved. However, in the complex-object defect detection scenario, it is not a given that the additionally collected images in fact provide beneficial information for training a more effective model to be used in a single modality scenario. If they do, it means that the system is able to transfer the information collected from one light modality to a different modality, and that the system can better model the detection task even if it is only provided during training with modalities which are not available during evaluation. In this study, we aim to evaluate this hypothesis.
In comparison to Study 1 and Study 2, in Study 3 we are using the entire training set introduced in Sec. IV which is four times bigger, while the test set remains intact.
V-D Study 4: Training and evaluation on multiple modalities
After having analyzed the impact of having a multi-modal lighting system available in training phase only, in this Study our aim is to verify the effectiveness of having the same multi-modal lighting system also in evaluation phase.
It is important to highlight that the images of the same defective region collected with different light illuminations share the same annotations and should produce the same output. Combining each generated output is, therefore, essential and we expect it can positively impact the final algorithm performance, as it has been shown in other scenarios .
We propose the following fusing procedure. Let $\mathcal{I} = \{I_1, \dots, I_{12}\}$ be the set of 12 images of the same region collected under varying illumination conditions, let $\mathcal{B}$ be the set of the defective bounding-boxes detected in all images of $\mathcal{I}$, and let $\mathcal{S}$ be the set of the corresponding detection confidences given by the detection algorithm. Our proposal is to apply the Non-Maximal Suppression (NMS) algorithm over $\mathcal{B}$ and to replace the detections on every $I_i \in \mathcal{I}$ with the output of the NMS algorithm. Given the NMS Intersection-over-Union (IoU) threshold $N_t$, the NMS algorithm operates as written in Algorithm 1.
NMS operates in three steps: Firstly, it sorts all of the detected boxes based on their box confidence scores from high to low; secondly, it selects the box which has the highest box confidence score as the detection result; and finally, it discards other candidate boxes whose IoU value with the selected box is beyond the threshold. Within the remaining boxes, NMS repeats the above two steps until there is no remaining box in the candidate set .
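The late-fusion step described above can be sketched in plain Python as pooling the detections of all the images of a region and running greedy NMS on the pool. This is a minimal sketch, not the authors' implementation: the box format (x1, y1, x2, y2), the function names, and the default IoU threshold of 0.5 are assumptions, since the threshold value is not stated here.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_fusion_nms(detections_per_image, iou_threshold=0.5):
    """Pool (box, score) detections from all images of a region and apply NMS.

    detections_per_image: one list of (box, score) tuples per image.
    iou_threshold: assumed value; boxes overlapping a kept box above it are dropped.
    """
    pool = [det for dets in detections_per_image for det in dets]
    pool.sort(key=lambda det: det[1], reverse=True)  # step 1: sort by confidence
    kept = []
    while pool:
        best = pool.pop(0)                           # step 2: keep the top box
        kept.append(best)
        # step 3: discard candidates that overlap the kept box too much
        pool = [det for det in pool if iou(best[0], det[0]) <= iou_threshold]
    return kept


per_image = [
    [((0, 0, 10, 10), 0.9)],                           # e.g. detections on one image
    [((1, 1, 11, 11), 0.6), ((50, 50, 60, 60), 0.7)],  # detections on another image
]
fused = late_fusion_nms(per_image)
```

In this toy input the two overlapping boxes describe the same region, so only the more confident one survives, while the distant box found under a single lighting condition is retained. The fused list would then be assigned to every image of the region.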
In Study 4 we will compare the performances of the system when the model is trained on the entire multi-modal training set and evaluated on the entire test set, with and without applying the proposed late-fusion technique.
VI Experimental setup, results and discussion
In all the experiments discussed earlier in Sec. V for automatic defect detection, we used the YOLO-v3 end-to-end detection pipeline, given its fast inference time and its ability to detect small defects.
[Footnote 2] We would like to mention that a comparative study of detection algorithms is out of the scope of this paper.
The YOLO-v3 detector was originally trained on the COCO dataset; the weights of the network are then adapted to our task using a transfer learning approach that updates all the layers of the network. Training was done on an NVIDIA GeForce RTX 2080 Ti GPU, with learning-rate = 0.0001 and momentum = 0.9.
As mentioned in Sec. IV, the dataset is split into training, validation, and test sets. In the experiments where a subset of data is required (Study 1, 2 and 3), that subset is selected within training, validation, and test sets independently and the splits do not vary in the experiments belonging to the same Study, or shared among various Studies (for example, Test - C is common among Study 1, 2 and 3). This allows us to retain the comparability of the experiments from one Study to another. As in standard settings, the validation set is used to tune the parameters of the algorithm and the final results are reported on the test set. Each detection bounding-box proposed by the model is compared with the ground-truth and classified as:
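True Positive (TP): the detection has an IoU with the ground-truth that reaches the IoU threshold and it is therefore considered correct;
False Positive (FP): the detection has an IoU with the ground-truth that does not reach the IoU threshold and it is therefore considered wrong;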
False Negative (FN): the ground-truth annotation has not been detected.
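The classification of detections into TP, FP and FN can be sketched as below. Several details are assumptions made for illustration: the box format (x1, y1, x2, y2), the IoU threshold of 0.5, the "greater than or equal" comparison, and the greedy rule that each ground-truth box can be matched by at most one detection, taken in order of decreasing confidence.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify_detections(detections, ground_truths, iou_threshold=0.5):
    """Return (tp, fp, fn) for one image; detections are (box, score) tuples."""
    matched = set()  # indices of ground-truth boxes already explained
    tp = fp = 0
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)  # annotations left undetected
    return tp, fp, fn


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((1, 1, 11, 11), 0.9), ((100, 100, 110, 110), 0.8)]
counts = classify_detections(dets, gts)
```

Here the first detection overlaps a ground-truth box sufficiently, the second overlaps nothing, and one annotation remains undetected.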
We report the results of all experiments using the standard metrics of single-object (defect) detection: Precision, Recall, F1-score, and Average Precision (AP). Among these metrics, Precision, Recall, and consequently F1-score are reported after fixing the acceptance confidence threshold of the algorithm to a single value. Precision is defined as $P = \frac{TP}{TP + FP}$, Recall as $R = \frac{TP}{TP + FN}$, and F1-score as $F_1 = \frac{2 \cdot P \cdot R}{P + R}$. AP, on the other hand, summarizes the Precision-Recall curve as the weighted mean of the Precision achieved at different confidence thresholds, with the increase in Recall from the previous threshold used as the weight, and is calculated as $AP = \sum_n (R_n - R_{n-1}) P_n$, where $P_n$ and $R_n$ are the Precision and Recall at the $n$-th threshold.
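As a worked illustration of these metrics, the following Python sketch computes Precision, Recall and F1-score from TP, FP and FN counts, and AP as the Recall-weighted sum of Precision values. The function names are illustrative, and the convention that Recall before the first threshold is 0 is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_precision(precisions, recalls):
    """AP as the sum over thresholds of (R_n - R_{n-1}) * P_n.

    Inputs are ordered by increasing Recall, i.e. decreasing confidence
    threshold; the Recall before the first threshold is taken as 0 (assumption).
    """
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
ap = average_precision([1.0, 0.5], [0.5, 1.0])
```

With 8 true positives, 2 false positives and 8 false negatives, Precision is 0.8 and Recall is 0.5. In the two-threshold example, AP is 0.5 · 1.0 + 0.5 · 0.5 = 0.75.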
To compare the results in the next sections, we will mainly refer to AP, since AP, compared to F1-score, considers the relation between Precision and Recall more globally. In this section, results are reported with a fixed IoU threshold with the ground-truth across the experiments.
VI-A Study 1: Training and evaluation on one single modality
The results of the experiments discussed in Sec. V-A are given in Table I. The most effective configuration according to the AP is the one activating all the lateral lights to produce dark-field illumination from four directions. This configuration outperforms frontal light and dark-field illuminations in any of vertical and horizontal directions and it will be referred to as the baseline for the following studies.
|Train||Test||Precision (%)||Recall (%)||F1-score (%)||AP (%)|
|U D||U D||61.69||44.95||52.01||29.11|
|L R||L R||58.56||41.07||48.28||25.52|
|U D L R||U D L R||61.06||52.73||56.82||34.69|
VI-B Study 2: Training on multiple modalities, evaluation on a single modality
The results of the experiments discussed in Sec. V-B are reported in Table II. Each training set has been randomly generated 5 times for each experiment, and results are given in mean ± standard deviation format in the AP column. Precision, Recall, and F1-score values are given for the first trial only.
The results indicate that, given the same number of images in the training set, maximizing the heterogeneity in the lighting modalities is more effective than acquiring more samples of defective objects with a limited set of illumination modalities.
|Test||Precision (%)||Recall (%)||F1-score (%)||AP (%)|
|U D L R||66.34||36.61||47.18||26.87 ± 3.79|
Comparing the results of Study 2 with Study 1, it is noticeable that multi-modal training is beneficial for most of the single lighting modalities in evaluation and that the single-modal test performance is less dependent on the choice of the illumination modality if the algorithm is initially trained with multiple modalities.
VI-C Study 3: Training on all the images and modalities, test on a single modality
The results of Study 3 are listed in Table III. Comparing these results with ones of Study 2, using a bigger training set leads to a considerable performance boost (at least ). These results are a clear demonstration that acquiring more images using multiple light conditions is actually enriching the information provided to the model during training. Even in the case when 3 modalities out of 4 are not used in evaluation time, their availability during training makes the system able to better model the detection task to be solved, as it has been shown in other scenarios .
|Test||Precision (%)||Recall (%)||F1-score (%)||AP (%)|
|U D L R||72.11||70.37||71.23||52.38|
Eventually, it is worth noting that choosing any illumination modality to be used in production, after training the model with the multi-modal illumination system, would not bring significant variation in the detection performances.
VI-D Study 4: Training and evaluation on multiple modalities
|Train||Test||Precision (%)||Recall (%)||F1-score (%)||AP (%)|
|All Train||All Test||72.26||70.18||71.20||52.08|
The focus of the experiments until this point was given to the analysis of the effect of the presence of all or selected number of modalities in training while evaluation of the algorithms has been reported on single modalities. In Study 4, we aim to analyze whether it is possible to further improve the overall system performance having the availability of all the modalities also in evaluation phase. With this Study we can also assess the benefits which can be obtained with the deployment of our designed system in the operational scenario. The results of this study are reported in Table IV.
Comparing the results given in Table IV with the ones in Table III, the availability of all the modalities in the evaluation phase leads to performance improvements only if the detection results obtained from each single illumination modality are properly combined, using the late-fusion technique proposed in Sec. V-D. Fig. 5 shows the Precision and Recall values obtained at different detection confidence thresholds, with and without employing the late-fusion technique. It can be observed that applying late-fusion leads to a higher Area Under Curve (AUC), and thus to a higher AP. Fig. 6 shows three examples of successful detections of defects when employing late-fusion (on the right). The qualitatively better detections after applying late-fusion, with respect to the detections on single images, can be appreciated in all three cases.
On the other side, Fig. 7 shows five examples of failure cases even after the late-fusion technique in five defective images. In these cases, our observation is that the algorithm fails to detect a defect if it is not fairly visible in any of the images taken under any of the lighting conditions . Besides, false positive detections in some cases occur due to the presence of visually similar-to-defect artifacts on the images. This can be considered to confirm the importance of acquisition hardware setup design, and further, annotation process, for obtaining desirable results by the machine vision algorithms. In the cases where false positive detections are due to missing annotations, thus noise in the labels, the proposed method can be used to provide support for localization of defects to be fixed in the product revision departments in industries, or as an additional supervision method for further improvement of training procedure.
In this paper, we introduced our custom-designed acquisition setup for inspection of a complex-object, and discussed its suitability in visualizing a wide range of surface defects thanks to the proposed illumination setup which holds four standard illumination techniques comprised of diffused, dark-field, and front illumination in one place.
Further, we argued that deployment of the proposed setup might not be feasible in an inspection environment, thus we conducted four studies to explore the role of each of the illumination sources and whether it is possible to exploit the potential of the proposed setup when it is deployed only in the training phase. The conclusions from the studies can be summarized as follows. In the case of deployment of the same single illumination modality in both training and evaluation phases, the most effective one is found to be activating all the lateral lights, resembling dark-field illumination from four directions. However, given the same number of images in the training set but with more modalities, the evaluation results on any of the single modalities are less dependent on the type of modality used in the evaluation phase. Moreover, exploiting more samples in all the modalities in the training phase brings a large improvement when evaluated on single modalities, justifying the use of our proposed lighting setup at least for training purposes. The introduction of all the modalities in the evaluation phase, though, does not lead to any substantial change with respect to a single modality illumination, unless the proposed late-fusion technique is utilized, which is when the highest performance of the proposed pipeline is achieved.
We believe our proposed acquisition setup and pattern analysis of the illumination modalities can be a source of intuition for other researchers in the industrial inspection field for the automatic examination of objects with highly complex characteristics.
-  (2016) Multi-face tracking by extended bag-of-tracklets in egocentric photo-streams. Computer Vision and Image Understanding 149, pp. 146–156. Cited by: §V-D.
-  Unachievable region in precision-recall space and its effect on empirical evaluation. Proceedings of the International Conference on Machine Learning, pp. 349. Cited by: §VI.
-  (2016) Development of an optical inspection platform for surface defect detection in touch panel glass. International Journal of Optomechatronics 10 (2), pp. 63–72. Cited by: §II.
-  A self organizing map optimization based image recognition and processing model for bridge crack inspection. Automation in Construction 73, pp. 58–66. Cited by: §I.
-  (1982) Automated visual inspection: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), pp. 557–573. Cited by: §I.
-  Approaches for improvement of the x-ray image defect detection of automobile casting aluminum parts based on deep learning. NDT & E International 107, pp. 102144. Cited by: §II.
-  (2018) Modality distillation with multiple stream networks for action recognition. In Proceedings of the European Conference on Computer Vision, pp. 103–118. Cited by: §VI-C.
-  (2015) Microscopy illumination engineering using a low-cost liquid crystal display. Biomedical optics express 6 (2), pp. 574–579. Cited by: §II.
-  (1986) Visual inspection of sealing rings—a case study on lighting and visibility. Lighting Research & Technology 18 (2), pp. 98–101. Cited by: §VI-D.
-  (2016) Deformable patterned fabric defect detection with fisher criterion-based deep learning. IEEE Transactions on Automation Science and Engineering 14 (2), pp. 1256–1264. Cited by: §I.
-  (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §VI.
-  (2007) A practical guide to machine vision lighting. Midwest Sales and Support Manager, pp. 1–3. Cited by: §I.
-  (2017) Automatic defect recognition in x-ray testing using computer vision. In IEEE Winter Conference on Applications of Computer Vision, pp. 1026–1035. Cited by: §II.
-  (2018) Real-time product quality control system using optimized gabor filter bank. The International Journal of Advanced Manufacturing Technology 96 (1-4), pp. 11–19. Cited by: §II.
-  (2016) Ambiguous surface defect image classification of amoled displays in smartphones. IEEE Transactions on Industrial Informatics 12 (2), pp. 597–607. Cited by: §II.
-  (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §VI.
-  (2013) Automatic detection of dust and scratches in silver halide film using polarized dark-field illumination. In IEEE International Conference on Image Processing, pp. 2096–2100. Cited by: §II.
-  (2017) The role of visual inspection in the 21st century. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, Vol. 61, pp. 262–266. Cited by: §I.
-  (2018) Detection of rail surface defects based on cnn image recognition and classification. In International Conference on Advanced Communication Technology, pp. 45–51. Cited by: §I.
-  (1996) Choose the right lightning for inspection. Test and Measurement World 16, pp. 53–60. Cited by: §I.
-  (2013) Colour measurements by computer vision for food quality control–a review. Trends in Food Science & Technology 29 (1), pp. 5–20. Cited by: §I.