I Introduction
In the electronics manufacturing industry, most errors during PCB production are caused by solder joint defects; therefore, solder joint inspection (SJI) is an important procedure in electronic manufacturing. Accurate detection of solder errors reduces manufacturing costs and ensures production reliability. Solder joints are generally very small and take different shapes on PCBs. The main difficulty in visual inspection of solder joints is that there is generally only a minor visual difference between defective and defect-free (normal) solders [moganti1996automatic]. Even a small change in the shape of a solder joint can constitute a fatal defect in a PCB. Considering this, we propose approaching the problem as fine-grained image classification (FGIC), where inter-class variations are small and intra-class variations are large, the opposite of generic visual classification. Several examples of normal and defective solders are shown in Fig. 1(a). We compare the diversity of the solder joint features to the generic images of ImageNet [imagenet2009] in Fig. 1(b). For visualization, we sketch the top two principal components of the output features of the last fully connected layer of the deep Convolutional Neural Network (CNN) architecture VGG16 [simonyan2014very] for the solder dataset and ImageNet, and observe that the solder image features are far less diverse than those of ImageNet. Blue and green samples in Fig. 1(c) represent the features of normal and defective solder joints, respectively. Feature diversity across classes is similar, i.e., the classes show a low inter-class variance, whereas features within the same class are diverse, i.e., they present a high intra-class variance. These observations show that SJI is a fine-grained image classification problem.
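The principal-component sketch used for Fig. 1(b)-(c) can be reproduced with a few lines of NumPy; the snippet below is an illustrative sketch in which the toy `pts` array stands in for the extracted CNN features.

```python
import numpy as np

def top2_components(features: np.ndarray) -> np.ndarray:
    """Project feature vectors onto their top two principal components.

    features: (n_samples, n_dims) array, e.g. penultimate-layer CNN outputs.
    Returns an (n_samples, 2) array suitable for a 2-D scatter plot.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy example: collinear 3-D points collapse onto the first component.
pts = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
proj = top2_components(pts)
```

In the paper's setting, `features` would hold the last fully connected layer outputs of VGG16 for the solder and ImageNet samples.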

As opposed to generic image classification, samples that belong to different classes can be visually very similar in FGIC tasks. In this case, it is reasonable to penalize classifiers that are too confident in their predictions, which reduces the specificity of features and improves generalization by discouraging low-entropy output distributions [wei2021fine]. One way to achieve this is through entropy regularization. Entropy-regularization based methods such as label smoothing [szegedy2016rethinking] and maximum entropy learning [dubey2018maximum] penalize model predictions with high confidence. An alternative entropy-regularization technique is the skew Jensen-Shannon divergence (JS) [nielsen2010family, nielsen2011burbea]. In fact, maximum entropy learning and label smoothing are two extremes of JS [meister2020generalized]. In this study, we show that solder joints are fine-grained, and that their inspection can be treated as a FGIC problem. Then, we propose using JS as an entropy-regularization method that improves inspection accuracy over the other entropy-regularization methods. Additionally, we compare the proposed JS regularization to recent FGIC methods based on different approaches, such as employing segmentation techniques, attention mechanisms, transformer models, and specifically designed loss functions. We show that regularizing with JS achieves the highest F1-score and competitive accuracy levels for different models in fine-grained classification of solder joints.
Our contributions are as follows: (i) We propose using JS as an entropy regularizer in fine-grained classification tasks for the first time. (ii) We illustrate that SJI is a FGIC task. (iii) We show that entropy regularization with JS improves classification accuracy over the existing methods across different models on fine-grained solder joints. (iv) We demonstrate that regularizing entropy with JS focuses on more distinctive parts of the image, and this improves the classification accuracy.
We employ gradient-weighted class activation mapping (Grad-CAM) [selvaraju2017grad], an activation map visualization method, to see the effect of entropy regularization on the activation maps used to classify the solder joints. We find that the localized image regions used in classification are more reasonable for a model trained with entropy regularization and are less affected by background noise. Although the proposed regularization method is tested on the fine-grained solder joint dataset, it holds potential for other FGIC datasets as well.
II Related Work
Solder joint inspection is a long-standing problem in the literature [moganti1996automatic, abd2020review]. Proposed approaches can be divided into three groups: reference-based, traditional feature-based, and deep learning based. Reference-based methods compare against a defect-free (template) version of the defective PCB. This comparison can be made in several ways, including pixel-wise subtraction to find dissimilarities [west1984system, wu1996automated]; template matching via normalized cross-correlation to find similarities [annaby2019improved, crispin2007automated]; frequency domain analysis [tsai2018defect]; and template image modelling using feature manifolds [jiang2012color], Principal Component Analysis [cai2017ic], and gray value statistics [xie2009high]. The generalization of referential methods relies on the number of templates of the defective image, which restricts practical applications. In the traditional feature extraction based methods, color contour [capson1988tiered, wu2013classification], geometric [acciani2006application, wu2014inspection], wavelet [acciani2006application, mar2011design], and Gabor [mar2011design] features have been used, followed by a classifier. In order to extract these features, solder joints must first be segmented or localized on the PCBs. For segmentation, color transformation and thresholding [jiang2007machine, mar2011design], as well as color distribution properties under special illumination conditions [zeng2011algorithm, zeng2011automated], are used. Then, to extract features, the localized solder joints are divided into subregions [hongwei2011solder, wu2011feature], the selection of which can also be optimized as in [song2019smt]. After the introduction of CNNs, deep learning methods outperformed the traditional methods in solder inspection, despite requiring large amounts of data. Although defect-free (normal) solder joints are easy to obtain, defective solder joints are quite rare. To alleviate this problem, some studies modelled the normal solder data distribution with generative models [li2021ic, Ulger2021]. When enough data is available, Cai et al. [cai2018smt] obtained very high accuracy with CNNs in solder joint classification under special illumination conditions. Li et al. [li2020automatic] utilized an ensemble of Faster R-CNN and YOLO object detectors to both localize solder joints on the PCB and classify them. The YOLO detector is also employed on thermal PCB images to detect electronic component placement and solder errors in [jeon2022contactless]. A lightweight custom object detector is designed by fusing convolutional layers for defect localization in [wu2022pcbnet]. The two surveys [abd2020review, moganti1996automatic] also provide a comprehensive review of SJI methods.
Fine-grained image classification. In this study, we further analyse the solder joints from another perspective and show that they have low feature diversity and form fine-grained images. By regularizing the entropy during model training, we improve the classification accuracy. The main challenge in FGIC is that classes are visually quite similar while intra-class variation is large, in contrast to generic classification. In the literature, mainly (i) localization-classification subnetworks and (ii) end-to-end feature encoding are applied to this problem [wei2021fine]. Localization-classification subnetworks aim to localize image parts that are discriminative for FGIC. In [zhang2019learning], expert subnetworks that learn features with an attention mechanism are employed to make diverse predictions. In [ji2020attention], an attention convolutional binary neural tree architecture (ACNet) is introduced, with an attention transformer module in a tree structure to focus on different image regions. In [behera2021context], a context-aware attentional pooling module is used to capture subtle changes in images. In [rao2021counterfactual], causal inference is used as an attention mechanism. End-to-end feature encoding tries to learn a better representation to capture subtle visual differences between classes. This can be achieved by designing specific loss functions that make the model less confident about its predictions [szegedy2016rethinking, pereyra2017regularizing]. During model training, the cross-entropy between the model and target distribution is minimized, which corresponds to maximizing the likelihood of a label; however, this can cause overfitting. Several methods have been proposed to increase model generalization. Szegedy et al. [szegedy2016rethinking] propose label smoothing regularization, which penalizes the deviation of the predicted label distribution from the uniform distribution. Dubey et al. [dubey2018maximum] show that penalizing confident output distributions [pereyra2017regularizing] improves accuracy in FGIC. Mukhoti et al. [mukhoti2020calibrating] show that focal loss [lin2017focal] can increase the entropy of model predictions to prevent overconfidence. Meister et al. [meister2020generalized] derive an alternative entropy-regularization method from JS, called generalized entropy regularization, and show its success in language generation tasks. Designing loss functions is not limited to entropy regularization. For instance, the mutual channel loss in [chang2020devil] forces individual feature channels belonging to the same class to be discriminative. Recently, different approaches have also been proposed to tackle the FGIC task. Du et al. [du2020fine] used a jigsaw puzzle generator to encourage the network to learn at different visual granularities. Luo et al. proposed Cross-X learning [luo2019cross] to exploit relationships between features from different images and network layers. He et al. [he2022transfg] added a part selection module to Transformer models for FGIC tasks, which integrates attention weights from all layers to capture the most class-distinctive regions. In the results section, we compare our approach to both the entropy-regularization methods and these alternative methods.
The goal in regularizing the entropy is to obtain the most unbiased representation of the model, which is achieved by the probability distribution with maximum entropy [jaynes1957information]. Several methods have been proposed to increase the generalization of a classifier by maximizing entropy. The classifier is maximally confused when the probability distribution over the model predictions is uniform, i.e., when the classifier assigns the same probability to all classes. Three entropy-regularization methods are given below.
Focal loss [lin2017focal] adds a weighting term to the cross-entropy loss to put more weight on misclassified examples. Additionally, it encourages the entropy of the predicted probability distribution to be large. This leads to less confident predictions and better generalization. Focal loss is defined in Eq. 1, where $\gamma \geq 0$ is the focusing parameter and $\alpha_t$ is a scalar weighting factor. $y$ represents the ground-truth class, and $p_t$ is the probability assigned to the ground-truth class as given in Eq. 2.

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \quad (1)$$

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases} \quad (2)$$
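As a concrete reference, Eqs. 1-2 can be written as a small Python function. This is an illustrative sketch; the default hyperparameter values follow common practice rather than this paper's settings.

```python
import math

def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss of Eq. 1: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted probability of the positive class; y: ground-truth label (0/1).
    Defaults for gamma and alpha are illustrative, not from this paper.
    """
    p_t = p if y == 1 else 1.0 - p            # Eq. 2
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction is down-weighted relative to a hard one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```

With `gamma = 0` and `alpha = 1`, focal loss reduces to the standard cross-entropy.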
Label smoothing regularization [szegedy2016rethinking] replaces the target label distribution with a smoothed distribution to make the model less confident about its predictions. It encourages the model distribution $p_\theta$^{1} to be close to the uniform distribution $u$ to maximize entropy. This is achieved by minimizing the cross-entropy as given in Eq. 3, where $\epsilon \in [0, 1]$ is a smoothing term.

$$\mathcal{L}_{LS} = (1 - \epsilon)\, H(q, p_\theta) + \epsilon\, H(u, p_\theta) \quad (3)$$

^{1} The conditional probability distribution $p_\theta(y|x)$ is represented by $p_\theta$ for brevity.
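A minimal sketch of the smoothed cross-entropy in Eq. 3 (the list-based interface is illustrative):

```python
import math

def label_smoothing_ce(probs, y, eps=0.1):
    """Cross-entropy against the eps-smoothed target of Eq. 3.

    probs: predicted class probabilities (sum to 1); y: ground-truth index.
    eps = 0 recovers the standard cross-entropy.
    """
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        # Smoothed target: (1 - eps) on the true class, eps spread uniformly.
        target = (1.0 - eps) * (1.0 if i == y else 0.0) + eps / k
        loss -= target * math.log(p)
    return loss
```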
Maximum entropy learning [dubey2018maximum] minimizes the Kullback-Leibler (KL) divergence between the model distribution $p_\theta$ and the true label distribution $q$. A confidence penalty term is added to maximize the entropy of the model predictions and make them more uniform, as defined in Eq. 4, where $\beta$ controls the strength of the penalty.

$$\mathcal{L}_{ME} = H(q, p_\theta) - \beta\, \mathbb{H}(p_\theta) \quad (4)$$
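The confidence-penalty objective of Eq. 4 can be sketched as follows (function names are illustrative):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def max_entropy_loss(probs, y, beta=0.1):
    """Eq. 4: cross-entropy minus beta times the prediction entropy.

    Subtracting beta * H(p) rewards less confident (higher-entropy) outputs.
    """
    return -math.log(probs[y]) - beta * entropy(probs)
```

A higher-entropy prediction incurs a smaller penalized loss for the same cross-entropy term.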
III Methods
Our proposed pipeline includes segmentation and classification of individual solder joints on PCBs, as shown in Fig. 2. Solder joints are segmented with You Only Look At CoefficienTs (YOLACT) [bolya2019yolact], an instance segmentation method. Normal and defective solder joints are then classified with a CNN trained with entropy regularization.
III-1 Segmentation: YOLACT
YOLACT mainly employs a Feature Pyramid Network (FPN) backbone to extract features. The feature maps are used in two branches to produce prototype masks and bounding boxes with associated labels. The solder joints are segmented from high-resolution PCB images that are divided into tiles for easier processing. Initially, normal and defective solder joint samples are labelled on the PCB tiles with Labelme [labelme] for model training. Both normal and defective solder joints are segmented with YOLACT. Only imprecise segmentation results are corrected manually to create a solder dataset for FGIC. The proposed pipeline is given in Fig. 2. All the individual solder joints are obtained as a result of this segmentation.
III-2 Classification with Entropy Regularization: JS
In supervised classification, minimizing the cross-entropy between the model distribution and the distribution over the labels leads to overconfidence in assigning probabilities, which can cause overfitting [szegedy2015going]. To alleviate this problem, overconfidence is penalized by maximizing the entropy. One way to maximize entropy is to encourage the model distribution to be as close as possible to the uniform distribution, whose entropy is maximal. We propose using JS as an entropy-regularization method in fine-grained classification of solder joints. JS is obtained as the sum of weighted skewed Kullback-Leibler (KL) divergences, by introducing a skewness parameter $\alpha \in (0, 1)$ that determines the weights of the probability distributions $p$ and $q$, and scaling by $\frac{1}{\alpha(1-\alpha)}$, as given in Eq. 5. The derivation is available in Appendix B.

$$JS^{\alpha}(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \left[ (1-\alpha)\, KL(p \,\|\, m_\alpha) + \alpha\, KL(q \,\|\, m_\alpha) \right], \quad m_\alpha = (1-\alpha)\, p + \alpha\, q \quad (5)$$
The penalty term given in Eq. 6 is added to encourage high entropy by minimizing the divergence between the model distribution and the uniform distribution $u$.

$$\mathcal{R}(\theta) = \beta\, JS^{\alpha}(u \,\|\, p_\theta) \quad (6)$$
As a result, the objective function optimized to train the model is given in Eq. 7.

$$\mathcal{L}(\theta) = H(q, p_\theta) + \beta\, JS^{\alpha}(u \,\|\, p_\theta) \quad (7)$$
In fact, using JS as a regularizer at one extreme of the skewness parameter is equivalent to maximum entropy learning, and at the other extreme to label smoothing regularization. The derivation is available in [meister2020generalized].
In this study, we propose using JS as an entropy regularizer as given in Eq. 7. The skewness parameter $\alpha$ in JS allows us to set different entropy values for the model by assigning different weights to the uniform and model distributions. As the experiments in Sec. IV-A show, higher entropy does not necessarily correspond to higher classification accuracy; by tuning $\alpha$ and thereby testing different entropy values, we show that higher accuracy can be achieved for FGIC of solder joints across different models.
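For discrete distributions, Eqs. 5-7 can be sketched in pure Python as follows. This is an illustrative implementation, not the paper's training code, and the default values of `alpha` and `beta` are assumptions.

```python
import math

def kl(p, q):
    """KL divergence between discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_js(p, q, alpha):
    """Scaled alpha-skew Jensen-Shannon divergence of Eq. 5."""
    m = [(1 - alpha) * pi + alpha * qi for pi, qi in zip(p, q)]
    return ((1 - alpha) * kl(p, m) + alpha * kl(q, m)) / (alpha * (1 - alpha))

def js_regularized_loss(probs, y, alpha=0.5, beta=0.1):
    """Eq. 7: cross-entropy plus beta * JS^alpha(u || p_theta).

    The penalty pulls the prediction toward the uniform distribution u.
    """
    u = [1.0 / len(probs)] * len(probs)
    return -math.log(probs[y]) + beta * skew_js(u, probs, alpha)
```

Setting `beta = 0` recovers plain cross-entropy; increasing `beta` penalizes predictions far from uniform.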
IV Dataset and Experiments
Dataset: As a result of the segmentation in Section III-1, the solder dataset consists of normal and defective solder joints, whose labels are confirmed by experts. The solder joint errors mostly include solder bridge, excessive solder, and insufficient solder errors. Additionally, there are some cracked solder, void, shifted component, corrosion, and flux residue errors. A portion of the dataset is set aside randomly as the test set. The rest of the dataset is used for stratified five-fold cross-validation, which retains the class proportions in each fold, due to class imbalance. The segmented solder images vary in size, so we trained the models with several input resolutions; the smallest resolution decreased accuracy, and the largest required substantially more training time, hence all the samples were resized to an intermediate size. The data is normalized to have zero mean and unit variance prior to training.
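The stratified splitting step can be sketched with NumPy as below; this is an illustrative round-robin implementation, not the exact procedure used in the paper, and the toy label counts are assumptions.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions.

    labels: 1-D array of class labels. Returns a list of k index arrays.
    """
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal this class's shuffled samples round-robin across the folds.
        for i, sample in enumerate(idx):
            folds[i % k].append(sample)
    return [np.array(sorted(f)) for f in folds]

# Imbalanced toy labels: 50 normal (0) vs. 10 defective (1) samples.
labels = np.array([0] * 50 + [1] * 10)
folds = stratified_kfold(labels, k=5)
```

Each fold then holds the same normal-to-defective ratio as the full set, which is what stratification guarantees under class imbalance.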
Experimental settings: We experiment on different architectures, namely GoogleNet [szegedy2015going], VGG16 [simonyan2014very], ResNet18, and ResNet50 [he2016deep], with the same initial weights to evaluate the robustness of the proposed method in FGIC. All the models are trained until convergence. Root Mean Squared Propagation (RMSprop) with momentum and the Adam optimizer [kingma2014adam] are used interchangeably. Regularization is applied to mitigate overfitting. The final results are reported by the model that performed best on the validation set, with either minimum validation loss or maximum validation accuracy; this model is then evaluated on the test set. The entropy-regularization strength $\beta$ is set to the same value for both maximum entropy learning and JS for a fair comparison. For label smoothing and focal loss, hyperparameter values widely used in the literature are selected. The F1-score is used as an evaluation metric due to the imbalance in the dataset.
IV-A Experimental Results
We compare the proposed approach with (i) entropy-regularization based models in Sec. IV-A1, and (ii) models that employ segmentation techniques, attention mechanisms, transformer networks, and new loss functions in Sec. IV-A2.

TABLE II: Comparison with state-of-the-art FGIC approaches.

Method                       | Backbone | Accuracy | F1-score
ACNet [ji2020attention]      | VGG16    |          |
AP-CNN [ding2021ap]          | VGG16    |          |
CAP [behera2021context]      | VGG16    |          |
MC-Loss [chang2020devil]     | VGG16    |          |
DFL [wang2018learning]       | VGG16    |          |
Cross-X [luo2019cross]       | ResNet50 |          |
CAL [rao2021counterfactual]  | ResNet50 |          |
WS-DAN [hu2019see]           | ResNet50 |          |
PMG [du2020fine]             | ResNet50 |          |
MGE-CNN [zhang2019learning]  | ResNet50 |          |
TransFG [he2022transfg]      | ViT B/16 |          |
JS (ours)                    | VGG16    |          |
JS (ours)                    | ResNet50 |          |
IV-A1 Comparison with Entropy-Regularization Methods
The results of the entropy-regularization based models are given in Table I. Here, we compare JS regularization across different architectures to entropy-regularization based methods such as label smoothing, focal loss, and maximum entropy learning, as well as to the models trained without regularization. These methods are chosen since they are designed to maximize the entropy for FGIC through a new loss function, as opposed to changing the model architecture or employing alternative mechanisms. The highest F1-score is achieved with JS for VGG16, ResNet18, ResNet50, and GoogleNet. For all four architectures, an improvement in accuracy is achieved over the model trained without regularization. The results show that entropy-regularization based methods are very effective in classifying solder joints, especially JS, which outperformed the other regularization methods.
The effect of entropy regularization is visualized in Fig. 3. Note that there were many misclassifications in the vicinity of the normal samples and among the scattered defective samples, as shown in green. The green circles represent normal samples that are misclassified by the model trained without regularization (false negatives) but classified correctly with JS (true positives). The green crosses stand for defective solder samples that are correctly classified with JS (true negatives) but misclassified otherwise (false positives).
In order to interpret the effect of entropy regularization on model predictions, activation maps of each test image are visualized through Grad-CAM, using the implementation in [jacobgilpytorchcam]. The last convolutional layer of each model is visualized to see which part of the image is used in making the decision. Fig. 4 shows (a) test images, and activation maps on (b) the VGG16 model trained without any regularization, (c) DFL [wang2018learning], (d) PMG [du2020fine] on ResNet50, and (e) the model regularized with JS. The model regularized with JS yields more precise class-discriminative regions (solder joints/errors) compared to the other approaches. The shorted solder in the first and last rows of Fig. 4, the flux residues in the second row, and the passive electronic component solder joint in the third row are localized more accurately with JS. Additionally, it is not affected by background noise, such as the PCB background and non-soldered regions, as much as the other models.
IV-A2 Comparison with the Other Approaches
The comparison with the other state-of-the-art approaches is presented in Table II. We experimented not only with recent state-of-the-art methods based on designing FGIC-specific loss functions [chang2020devil], similar to our approach, but also with methods employing segmentation networks [wang2018learning, behera2021context], attention mechanisms [hu2019see, ji2020attention, ding2021ap, rao2021counterfactual], and other strategies [zhang2019learning, du2020fine, luo2019cross, he2022transfg]. All the models are trained until convergence from scratch, and tuned by changing their learning rate and optimizer. The best results on the test set at the end of training are reported. The proposed approach achieves competitive results on the solder joint dataset without changing the model architecture, using an attention mechanism, or employing segmentation techniques. The models are compared with respect to both accuracy and F1-score. Accuracy is calculated as the ratio of correct predictions to the number of test samples. For VGG16, JS improves over the closest method [wang2018learning] in terms of both accuracy and F1-score. For ResNet50, the proposed regularization with JS is competitive, closely following [rao2021counterfactual] in accuracy and F1-score, which comes at the expense of extra computational cost for its attention module.
We investigate the effect of the skewness parameter $\alpha$ on the JS for discrete binary distributions. JS and entropy values as a function of $\alpha$ for VGG16, ResNet18, ResNet50, and GoogleNet are shown in Fig. 5(a) and (b), respectively. By calculating the divergence as a function of $\alpha$, we observe that increasing $\alpha$ results in a lower divergence between the uniform distribution $u$ and the model distribution $p_\theta$. This monotonic relation shows that the model distribution becomes closer to the uniform distribution for higher $\alpha$ values. Accordingly, getting closer to the uniform distribution yields higher model entropy, as expected. The divergences of the models are calculated on the test set.
In line with this, one might expect the highest accuracy to be achieved with the model closest to the uniform distribution, i.e., with lower JS and higher entropy; however, this relation is not so simple. We observe that higher entropy does not always correspond to higher accuracy. We plotted the F1-score as a function of normalized entropy and did not observe a monotonic relation for the ResNet50, VGG16, ResNet18, and GoogleNet architectures, as shown in Fig. 6. This complicated relation between the skewness parameter $\alpha$ and model accuracy motivates searching over intermediate $\alpha$ values rather than testing only the extremes (label smoothing and maximum entropy regularization).
V Conclusion
In this paper, we show that normal and defective solders exhibit a low inter-class variance and a high intra-class variance, and that, compared to datasets such as ImageNet, they show little diversity. With this insight, we tackle SJI as a fine-grained image classification task and propose JS-based entropy regularization as a solution to FGIC of solder joints. We compare our proposed solution to the entropy-regularization based approaches as well as to approaches that employ segmentation, attention mechanisms, transformer models, and specifically designed loss functions for FGIC. We show that the proposed JS-based entropy regularization achieves the highest or competitive accuracy in SJI across different model architectures. Using activation maps, we show that better class-discriminative regions are obtained with entropy regularization, and this improves the classification accuracy.
Appendix
A Preliminaries
Entropy is a measure of the uncertainty of a random variable. The entropy of a discrete random variable $X$ is defined by Eq. 8, where $x$ ranges over the states of $X$ and $p(x)$ is its probability mass function.

$$\mathbb{H}(X) = -\sum_{x} p(x) \log p(x) \quad (8)$$
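Eq. 8 in code form, as a minimal sketch for discrete distributions:

```python
import math

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log p(x), in nats (Eq. 8)."""
    return -sum(px * math.log(px) for px in p if px > 0)
```

The uniform distribution attains the maximum entropy, the logarithm of the number of states.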
The cross-entropy between the model distribution $p_\theta$ and the true distribution of the labels $q$ is minimized to train a neural network in supervised classification, where $\theta$ denotes the model parameters. For brevity, the cross-entropy for binary classification is given in Eq. 9,

$$H(q, p_\theta) = -\log(p_t) \quad (9)$$

where $p_t$ is defined as

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$

$y$ represents the ground-truth class, and $p_t$ is the probability assigned to the ground-truth class.
The Kullback-Leibler (KL) divergence is a difference measure between two probability distributions $p$ and $q$, defined as in Eq. 10 for discrete random variables. The KL divergence is not symmetric, i.e., the forward and reverse KL divergences are not equal.

$$KL(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} \quad (10)$$

The Jensen-Shannon divergence (JSD) is a symmetric version of the KL divergence, as given in Eq. 11.

$$JSD(p \,\|\, q) = \frac{1}{2} KL\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2} KL\left(q \,\Big\|\, \frac{p+q}{2}\right) \quad (11)$$
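Eqs. 10-11 can be checked numerically with a short, illustrative sketch:

```python
import math

def kl_div(p, q):
    """Forward KL divergence KL(p || q) of Eq. 10."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence of Eq. 11: symmetrized KL via the midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

# Example distributions (illustrative values).
p, q = [0.9, 0.1], [0.5, 0.5]
```

Unlike KL, JSD is symmetric in its arguments and bounded above by log 2 (in nats).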
B Skew Jensen-Shannon Divergence
JS is obtained as the sum of weighted skewed KL divergences [lee2001effectiveness] of the probability distributions $p$ and $q$, as given in Eq. 12 [nielsen2020generalization], where $\alpha \in (0, 1)$ is a skewness parameter that determines the weights of the probability distributions and $m_\alpha = (1-\alpha)\, p + \alpha\, q$ is the skewed mixture.

$$JS_{*}^{\alpha}(p \,\|\, q) = (1-\alpha)\, KL(p \,\|\, m_\alpha) + \alpha\, KL(q \,\|\, m_\alpha) \quad (12)$$

$$= (1-\alpha) \sum_{x} p(x) \log \frac{p(x)}{m_\alpha(x)} + \alpha \sum_{x} q(x) \log \frac{q(x)}{m_\alpha(x)} \quad (13)$$

$$= -(1-\alpha)\, \mathbb{H}(p) - \alpha\, \mathbb{H}(q) - \sum_{x} m_\alpha(x) \log m_\alpha(x) \quad (14)$$

$$= \mathbb{H}(m_\alpha) - (1-\alpha)\, \mathbb{H}(p) - \alpha\, \mathbb{H}(q) \quad (15)$$

$JS_{*}^{\alpha}$ is scaled by $\frac{1}{\alpha(1-\alpha)}$ to guarantee continuity in the limits $\alpha \to 0$ and $\alpha \to 1$ [nielsen2011burbea, nielsen2015total]:

$$JS^{\alpha}(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \left[ (1-\alpha)\, KL(p \,\|\, m_\alpha) + \alpha\, KL(q \,\|\, m_\alpha) \right]$$