
Fine-grained Classification of Solder Joints with α-skew Jensen-Shannon Divergence

by   Furkan Ulger, et al.

Solder joint inspection (SJI) is a critical process in the production of printed circuit boards (PCB). Detection of solder errors during SJI is quite challenging, as solder joints are very small and can take various shapes. In this study, we first show that solders have low feature diversity, and that SJI can be carried out as a fine-grained image classification task that focuses on hard-to-distinguish object classes. To improve fine-grained classification accuracy, penalizing confident model predictions by maximizing entropy has been found useful in the literature. In line with this, we propose using the α-skew Jensen-Shannon divergence (α-JS) to penalize confidence in model predictions. We compare α-JS regularization with both existing entropy-regularization based methods and methods based on attention mechanisms, segmentation techniques, transformer models, and specific loss functions for fine-grained image classification tasks. We show that the proposed approach achieves the highest F1-score and competitive accuracy for different models in the fine-grained solder joint classification task. Finally, we visualize the activation maps and show that, with entropy-regularization, more precise class-discriminative regions are localized, which are also more resilient to noise. Code will be made available here upon acceptance.





I Introduction

In the electronics manufacturing industry, most errors during PCB production are caused by solder joint defects; therefore, inspection of solder joints is an important procedure. Accurate detection of solder errors reduces manufacturing costs and ensures production reliability. Solder joints are generally very small and form different shapes on PCBs. The main difficulty in visual inspection of solder joints is that there is usually only a minor visual difference between defective and defect-free (normal) solders [moganti1996automatic], yet even a small change in the form of a solder joint can pose a fatal error in PCBs. Considering this, we propose approaching the problem as fine-grained image classification (FGIC), where inter-class variations are small and intra-class variations are large, the opposite of generic visual classification. Several examples of normal and defective solders are shown in Fig. 1(a). We compare the diversity of the solder joint features to the generic images of ImageNet [imagenet2009] in Fig. 1(c). For visualization, we sketch the top two principal components of the output features of the last fully connected layer of the deep Convolutional Neural Network (CNN) architecture VGG16 for the solder dataset and ImageNet, and observe that the solder image features' diversity is much lower than that of ImageNet. Blue and green samples in Fig. 1(b) represent normal and defective solder joints' features, respectively. Feature diversity of the samples across the classes is similar, as they show a low inter-class variance, whereas the features are diverse within the same class, as they present a high intra-class variance. These observations show that SJI is a fine-grained image classification problem.
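The feature-diversity comparison above can be sketched with a few lines of NumPy; the feature matrices below are random stand-ins for the VGG16 penultimate-layer outputs, purely to illustrate the projection step:

```python
import numpy as np

def top2_principal_components(features):
    """Project feature vectors onto their top-2 principal components via SVD."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T  # shape: (n_samples, 2)

# Toy stand-ins for penultimate-layer features (the paper uses VGG16 outputs):
rng = np.random.default_rng(0)
solder_feats = rng.normal(0.0, 0.3, size=(200, 64))    # low diversity
imagenet_feats = rng.normal(0.0, 2.0, size=(200, 64))  # high diversity

solder_2d = top2_principal_components(solder_feats)
imagenet_2d = top2_principal_components(imagenet_feats)
spread_solder = solder_2d.var(axis=0).sum()
spread_imagenet = imagenet_2d.var(axis=0).sum()
```

Plotting the two 2-D point clouds on common axes reproduces the qualitative picture in Fig. 1(c): the low-diversity features occupy a much smaller region of the projected space.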

Fig. 1: (a) Samples from the solder joint dataset: normal solders are at the top, defective solders are at the bottom. (b) PCA projection of normal and defective solder joints’ features are given in blue and green respectively; showing that the inter-class variance among the normal and defective solder classes is low, but the intra-class variance is high. (c) PCA projection of the solder dataset on the last fully connected layer of VGG16 is shown in blue, ImageNet in red; showing that the solder image features’ diversity is much less than ImageNet.

As opposed to generic image classification, samples that belong to different classes can be visually very similar in FGIC tasks. In this case, it is reasonable to penalize classifiers that are too confident in their predictions, which reduces the specificity of features and improves generalization by discouraging low entropy [wei2021fine]. One way to achieve this is entropy-regularization. Entropy-regularization based methods such as label smoothing [szegedy2016rethinking] and maximum entropy learning [dubey2018maximum] penalize model predictions with high confidence. An alternative entropy-regularization technique is the α-JS [nielsen2010family, nielsen2011burbea]; in fact, maximum entropy learning and label smoothing are two extremes of α-JS [meister2020generalized].

In this study, we show that solder joints are actually fine-grained, and that their inspection can be treated as an FGIC problem. Then, we propose using α-JS as an entropy-regularization method that improves inspection accuracy over the other entropy-regularization methods. Additionally, we compare the proposed α-JS regularization to recent FGIC methods based on different approaches, such as employing segmentation techniques, attention mechanisms, transformer models, and designing specific loss functions. We show that regularizing with α-JS achieves the highest F1-score and competitive accuracy levels for different models in fine-grained classification of solder joints.

Our contributions are as follows: (i) We propose using α-JS as an entropy regularizer in fine-grained classification tasks for the first time. (ii) We illustrate that SJI is an FGIC task. (iii) We show that entropy regularization with α-JS improves classification accuracy over the existing methods across different models on fine-grained solder joints. (iv) We demonstrate that regularizing entropy with α-JS focuses on more distinctive parts in the image, which improves the classification accuracy.

We employ gradient-weighted class activation mapping (Grad-CAM) [selvaraju2017grad], an activation map visualization method, to see the effect of entropy-regularization on the activation maps used to classify the solder joints. We find that the localized image regions used in classification are more reasonable for a model trained with entropy-regularization and are less affected by background noise. Although the proposed regularization method is tested on the fine-grained solder joint dataset, it holds potential for other FGIC datasets as well.

II Related Work

Solder joint inspection is a long-standing problem in the literature [moganti1996automatic, abd2020review]. Proposed approaches can be divided into three groups: reference-based, traditional feature-based, and deep learning based. Reference-based methods compare the inspected PCB against a defect-free (template) version. This comparison can be made in several ways, including pixel-wise subtraction to find dissimilarities [west1984system, wu1996automated]; template matching via normalized cross-correlation to find similarities [annaby2019improved, crispin2007automated]; frequency domain analysis [tsai2018defect]; and template image modelling using feature manifolds [jiang2012color], Principal Component Analysis [cai2017ic], and gray value statistics [xie2009high]. The generalization of referential methods relies on the number of available templates, which restricts practical applications.

In traditional feature extraction based methods, color contour [capson1988tiered, wu2013classification], geometric [acciani2006application, wu2014inspection], wavelet [acciani2006application, mar2011design], and Gabor [mar2011design] features have been used, followed by a classifier. To extract these features, solder joints must first be segmented or localized on the PCB. For segmentation, color transformation and thresholding [jiang2007machine, mar2011design], color distribution properties [zeng2011algorithm], and Gaussian Mixture Models [zeng2011automated] under special illumination conditions have been used. Then, to extract features, the localized solder joints are divided into subregions [hongwei2011solder, wu2011feature], the selection of which can also be optimized as in [song2019smt].

After the introduction of CNNs, deep learning methods outperformed the traditional methods in solder inspection, despite requiring large amounts of data. Although defect-free (normal) solder joints are easy to obtain, defective solder joints are quite rare. To alleviate this problem, some studies modelled the normal solder data distribution with generative models [li2021ic, Ulger2021]. When enough data is available, Cai et al. [cai2018smt] obtained very high accuracy with CNNs in solder joint classification under special illumination conditions. Li et al. [li2020automatic] utilized an ensemble of Faster R-CNN and YOLO object detectors to both localize a solder joint on the PCB and classify it. A YOLO detector is also employed on thermal PCB images to detect electronic component placement and solder errors in [jeon2022contactless]. A lightweight custom object detector is designed by fusing convolutional layers for defect localization in [wu2022pcbnet]. The two surveys [abd2020review, moganti1996automatic] also provide a comprehensive review of SJI methods.

Fine-grained image classification. In this study, we further analyse the solder joints from another perspective, showing that they have low feature diversity and form fine-grained images; by regularizing the entropy in model training, we improve the classification accuracy. The main challenge in FGIC is that classes are visually quite similar while intra-class variation is large, in contrast to generic classification. In the literature, mainly (i) localization-classification subnetworks and (ii) end-to-end feature encoding are applied to this problem [wei2021fine]. Localization-classification subnetworks aim to localize image parts that are discriminative for FGIC. In [zhang2019learning], expert subnetworks that learn features with an attention mechanism are employed to make diverse predictions. In [ji2020attention], an attention convolutional binary neural tree architecture (ACNet) is introduced, with an attention transformer module in a tree structure to focus on different image regions. In [behera2021context], a context-aware attentional pooling module is used to capture subtle changes in images. In [rao2021counterfactual], causal inference is used as an attention mechanism. End-to-end feature encoding tries to learn a better representation to capture subtle visual differences between classes. This can be achieved by designing specific loss functions that make the model less confident about its predictions [szegedy2016rethinking, pereyra2017regularizing]. During model training, the cross-entropy between the model and target distributions is minimized, which corresponds to maximizing the likelihood of a label; however, this can cause over-fitting. Several methods have been proposed to increase model generalization. Szegedy et al. [szegedy2016rethinking] propose label smoothing regularization, which penalizes the deviation of the predicted label distribution from the uniform distribution. Dubey et al. [dubey2018maximum] show that penalizing confident output distributions [pereyra2017regularizing] improves accuracy in FGIC. Mukhoti et al. [mukhoti2020calibrating] show that focal loss [lin2017focal] can increase the entropy of model predictions to prevent overconfidence. Meister et al. [meister2020generalized] derived an alternative entropy-regularization method from α-JS, called generalized entropy-regularization, and showed its success in language generation tasks. Designing loss functions is not limited to entropy regularization; for instance, the mutual channel loss in [chang2020devil] forces individual feature channels belonging to the same class to be discriminative.

Recently, different approaches have also been proposed to tackle the FGIC task. Du et al. [du2020fine] used a jigsaw puzzle generator to encourage the network to learn at different visual granularities. Luo et al. proposed Cross-X learning [luo2019cross] to exploit relationships between features from different images and network layers. He et al. [he2022transfg] added a part selection module to Transformer models for FGIC tasks, which integrates attention weights from all layers to capture the most class-distinctive regions. In the results section, we compare our approach to both the entropy-regularization methods and these alternative methods.

The goal in regularizing the entropy is to obtain the most unbiased representation of the model, which is achieved by the probability distribution with maximum entropy. Several methods have been proposed to increase the generalization of a classifier by maximizing entropy. The classifier is maximally confused when the probability distribution over the model predictions is uniform, i.e., the classifier assigns the same probability to all classes. Three alternatives for entropy-regularization are given below.

Focal loss [lin2017focal] adds a weighting term to the cross-entropy loss to put more weight on misclassified examples. Additionally, it ensures that the entropy of the predicted probability distribution is large, which leads to less confident predictions and better generalization. Focal loss is defined in Eq. 1,

$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t), \quad (1)$

where $\gamma \geq 0$ is a scalar focusing parameter and $p_t$ is the probability assigned to the ground-truth class $y$, as given in Eq. 2:

$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise.} \end{cases} \quad (2)$
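As an illustration, the focal loss above can be sketched for binary predictions as follows (a minimal NumPy version of Eqs. 1 and 2, not the training code used in the paper; γ = 2 is a common choice, not necessarily the value used here):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss, Eq. 1: FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1.0 - p)  # Eq. 2: probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

p = np.array([0.9, 0.6, 0.2])  # predicted P(y = 1)
y = np.array([1, 1, 0])        # ground-truth labels
fl = focal_loss(p, y)
ce = -np.log(np.where(y == 1, p, 1.0 - p))  # plain cross-entropy for comparison
```

Because the modulating factor (1 - p_t)^γ is at most 1, the focal loss never exceeds the cross-entropy term-wise, and well-classified samples are down-weighted the most.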


Label smoothing regularization [szegedy2016rethinking] replaces the target label distribution with a smoothed distribution to make the model less confident about its predictions. It encourages the model distribution $p_\theta$ (the conditional distribution $p_\theta(y|x)$ is written $p_\theta$ for brevity) to be close to the uniform distribution $u$ to maximize entropy. This is achieved by minimizing the cross-entropy given in Eq. 3,

$\mathcal{L}_{\mathrm{LS}} = (1 - \epsilon)\, H(q, p_\theta) + \epsilon\, H(u, p_\theta), \quad (3)$

where $q$ is the true label distribution and $\epsilon \in [0, 1]$ is a smoothing term.
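A minimal sketch of label smoothing with one-hot targets; the ε value below is illustrative, not the one tuned in the paper:

```python
import numpy as np

def smooth_targets(y_onehot, eps=0.1):
    """Mix one-hot targets with the uniform distribution, as in Eq. 3."""
    k = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / k

def cross_entropy(targets, probs):
    return -np.sum(targets * np.log(probs), axis=-1)

y = np.array([[1.0, 0.0]])        # one-hot target, K = 2 classes
probs = np.array([[0.99, 0.01]])  # an overconfident prediction
loss_hard = cross_entropy(y, probs)
loss_smooth = cross_entropy(smooth_targets(y), probs)
```

The smoothed target still sums to 1, and the smoothed loss penalizes the overconfident prediction more heavily than the hard-label loss does.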


Maximum entropy learning [dubey2018maximum] minimizes the Kullback-Leibler (KL) divergence between the model distribution $p_\theta$ and the true label distribution $q$. A confidence penalty term is added to maximize the entropy of the model predictions and make them more uniform, as defined in Eq. 4,

$\mathcal{L}_{\mathrm{ME}} = H(q, p_\theta) - \beta\, H(p_\theta), \quad (4)$

where $\beta$ controls the strength of the penalty.
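Similarly, a sketch of the confidence-penalized objective in Eq. 4; the β value is a hypothetical placeholder:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy along the class axis."""
    return -np.sum(probs * np.log(probs), axis=-1)

def max_entropy_loss(targets, probs, beta=0.1):
    """Eq. 4: cross-entropy minus a confidence penalty beta * H(p_theta)."""
    ce = -np.sum(targets * np.log(probs), axis=-1)
    return ce - beta * entropy(probs)

targets = np.array([[1.0, 0.0]])
confident = np.array([[0.99, 0.01]])  # low entropy: barely rewarded
hedged = np.array([[0.7, 0.3]])       # higher entropy: penalty reduces the loss
loss_confident = max_entropy_loss(targets, confident)
loss_hedged = max_entropy_loss(targets, hedged)
```

The uniform distribution has the highest entropy, so the penalty term pulls predictions away from overconfidence.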


III Methods

Our proposed pipeline includes segmentation and classification of individual solder joints on PCBs, as shown in Fig. 2. Solder joints are segmented with You Only Look at Coefficients (YOLACT) [bolya2019yolact], an instance segmentation method. Normal and defective solder joints are then classified with a CNN trained with entropy regularization.

Fig. 2: Proposed model pipeline. Both defect-free (normal) and defective solder joints are segmented with YOLACT. Then, a CNN with entropy regularization is trained for solder joint classification.

III-1 Segmentation: YOLACT

YOLACT mainly employs a Feature Pyramid Network (FPN) backbone to extract features. The feature maps are used in two branches to produce prototype masks and bounding boxes with associated labels. The solder joints are segmented from high-resolution PCB images that are divided into tiles for easier processing. For model training, normal and defective solder joint samples are initially labelled on the PCB tiles by annotating through Labelme [labelme]. Both normal and defective solder joints are segmented with YOLACT, and only imprecise segmentation results are corrected manually to create a solder dataset for FGIC. The proposed pipeline is given in Fig. 2. All the individual solder joints are obtained as a result of this segmentation.

III-2 Classification with Entropy Regularization: α-JS

In supervised classification, the objective of minimizing the cross-entropy between the model distribution and the distribution over the labels leads to overconfidence in assigning probabilities, which can cause overfitting [szegedy2015going]. To alleviate this problem, overconfidence is penalized by maximizing the entropy. One way to maximize entropy is to encourage the model distribution to be as close as possible to the uniform distribution, where entropy is maximal. We propose using α-JS as an entropy-regularization method in fine-grained classification of solder joints. α-JS is the sum of skewed Kullback-Leibler (KL) divergences weighted by a skewness parameter α that balances the probability distributions $p$ and $q$, scaled by $\frac{1}{\alpha(1-\alpha)}$, as given in Eq. 5:

$\mathrm{JS}^{\alpha}(p \,\|\, q) = \frac{1}{\alpha(1-\alpha)} \big[ (1-\alpha)\, \mathrm{KL}(p \,\|\, m_\alpha) + \alpha\, \mathrm{KL}(q \,\|\, m_\alpha) \big], \quad m_\alpha = (1-\alpha)p + \alpha q. \quad (5)$

The derivation is available in Appendix B.


The penalty term given in Eq. 6 is added to encourage high entropy by minimizing the divergence between the model distribution and the uniform distribution $u$:

$\Omega(\theta) = \mathrm{JS}^{\alpha}(p_\theta \,\|\, u). \quad (6)$


As a result, the objective function optimized to train the model is given in Eq. 7,

$\mathcal{L}(\theta) = H(q, p_\theta) + \beta\, \mathrm{JS}^{\alpha}(p_\theta \,\|\, u), \quad (7)$

where $\beta$ is the regularization strength.


In fact, using α-JS as a regularizer with α → 1 is equivalent to maximum entropy learning, and with α → 0 to label smoothing regularization. The derivation is available in [meister2020generalized].
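These equivalences can be checked numerically. The sketch below implements the scaled α-JS of Eq. 5 against the uniform distribution and verifies that, for a fixed model distribution, it approaches KL(u ∥ p) (the label-smoothing extreme) as α → 0 and KL(p ∥ u) (the maximum-entropy extreme) as α → 1; the distribution p is a toy example, not a trained model's output:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence."""
    return float(np.sum(p * np.log(p / q)))

def alpha_js(p, q, alpha):
    """Scaled alpha-skew Jensen-Shannon divergence (Eq. 5)."""
    m = (1.0 - alpha) * p + alpha * q
    return ((1.0 - alpha) * kl(p, m) + alpha * kl(q, m)) / (alpha * (1.0 - alpha))

p = np.array([0.9, 0.1])  # a confident binary model distribution
u = np.array([0.5, 0.5])  # uniform distribution over K = 2 classes

near_zero = alpha_js(p, u, 1e-4)      # ≈ KL(u || p): label-smoothing extreme
near_one = alpha_js(p, u, 1 - 1e-4)   # ≈ KL(p || u): maximum-entropy extreme
```

Intermediate α values interpolate between the two penalties, which is exactly the knob the proposed method tunes.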

In this study, we propose using α-JS as an entropy regularizer, as given in Eq. 7. The skewness parameter α in α-JS allows us to set different entropy values for the model by giving different weights to the uniform and model distributions. As the experiments in Sec. IV-A show, higher entropy does not necessarily correspond to higher classification accuracy; by tuning α and thereby testing different entropy values, higher accuracy can be achieved for FGIC of solder joints across different models.

IV Dataset and Experiments

Dataset: As a result of the segmentation in Section III-1, the solder dataset contains both normal and defective solder joints. The labels are confirmed by experts. The solder joint errors mostly include solder bridge, excessive solder, and insufficient solder errors; additionally, there are some cracked solder, void, shifted component, corrosion, and flux residue errors. A portion of the dataset is set aside randomly as the test set. The rest of the dataset is used for stratified five-fold cross validation, which, due to class imbalance, retains the percentage of samples of each class in the folds. The segmented solder image sizes vary, so we trained the models with several different input sizes; small image sizes decreased accuracy and large image sizes required long training times, hence all the samples were resized to a fixed intermediate size. The data is normalized to zero mean and unit variance prior to training.
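The stratified split can be sketched as follows; the labels below are synthetic stand-ins with a 9:1 imbalance, not the actual solder dataset:

```python
import numpy as np

def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample to a fold while preserving per-class proportions."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Split this class's samples evenly across the folds.
        for fold, chunk in enumerate(np.array_split(idx, n_folds)):
            fold_of[chunk] = fold
    return fold_of

# Imbalanced synthetic labels: 90% normal (0), 10% defective (1).
labels = np.array([0] * 900 + [1] * 100)
folds = stratified_folds(labels)
ratios = [labels[folds == f].mean() for f in range(5)]
```

Each fold keeps roughly the same 10% defective ratio as the full set, so every validation fold sees both classes in realistic proportions.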

Experimental settings: We experiment with different architectures, GoogleNet [szegedy2015going], VGG16 [simonyan2014very], ResNet-18, and ResNet-50 [he2016deep], with the same initial weights to evaluate the robustness of the proposed method in FGIC. All models are trained until convergence with a fixed batch size. Root Mean Squared Propagation (RMSprop) with momentum and the Adam optimizer [kingma2014adam] are used interchangeably. ℓ2 regularization is applied to mitigate overfitting. The final results are reported by the model that performed best on the validation set, with either minimum validation loss or maximum validation accuracy; this model is then evaluated on the test set. The entropy-regularization strength term is set to the same value for both maximum entropy learning and α-JS for a fair comparison. For label smoothing, a fixed smoothing term ε is selected, and focal loss is experimented with the γ values widely used in the literature. F1-score is used as the evaluation metric due to the imbalance in the dataset.

IV-A Experimental Results

We compare the proposed approach with (i) entropy-regularization based models in Sec. IV-A1, and (ii) models that employ segmentation techniques, attention mechanisms, transformer networks, and new loss functions in Sec. IV-A2.
Regularization                            VGG16   ResNet18  ResNet50  GoogleNet
No regularization                         96.599  96.035    96.629    97.297
Focal loss [lin2017focal]                 96.903  96.847    96.833    97.321
Label smoothing (ε) [szegedy2015going]    97.285  96.018    98.206    96.916
Label smoothing (α → 0)                   97.118  96.429    97.978    97.297
Max. entropy (α → 1) [dubey2018maximum]   97.297  96.689    97.039    97.528
α-JS                                      96.145  97.309    97.788    97.297
α-JS                                      96.380  96.833    97.065    97.321
α-JS                                      96.889  97.987

TABLE I: F1-score (%) of different entropy-regularization methods on the solder test set across different model architectures. The highest accuracy for each model is shown in bold.
Method Backbone Accuracy F1-score
ACNet [ji2020attention] VGG16
AP-CNN [ding2021ap] VGG16
CAP [behera2021context] VGG16
MC-Loss [chang2020devil] VGG16
DFL [wang2018learning] VGG16
Cross-X [luo2019cross] ResNet50
CAL [rao2021counterfactual] ResNet50
WS-DAN [hu2019see] ResNet50
PMG [du2020fine] ResNet50
MGE-CNN [zhang2019learning] ResNet50
TransFG [he2022transfg] ViT B/16
TABLE II: Comparison with other FGIC approaches based on various techniques on the solder dataset. The highest accuracy and F1-score are given in bold, and the second best are underlined.

IV-A1 Comparison with Entropy-Regularization Methods

The results of the entropy-regularization based models are given in Table I. Here, we compare α-JS regularization across different architectures to entropy-regularization based methods such as label smoothing, focal loss, and maximum entropy learning, as well as to models trained without regularization. These methods are chosen because they maximize entropy for FGIC through a new loss function, as opposed to changing the model architecture or employing alternative mechanisms. The highest F1-score is achieved with α-JS for VGG16, ResNet18, ResNet50, and GoogleNet, and an improvement in accuracy over the model trained without regularization is obtained for each architecture. The results show that entropy-regularization based methods are very effective in classifying solder joints, especially α-JS, which outperformed the other regularization methods.

Fig. 3: Principal components of the test set on the output features of the average pooling layer of ResNet50. Normal and defective solder joints are shown in blue and red, respectively. The green points represent the normal and defective solder joints that are misclassified by the model trained without regularization but are correctly classified when α-JS is used as the entropy-regularization.

The effect of entropy-regularization is visualized in Fig. 3. Note that there were many misclassifications in the vicinity of the normal samples and among the scattered defective samples, as shown in green. The green circles represent the normal samples that are misclassified by the model trained without regularization (false negatives) but classified correctly with α-JS (true positives). The green crosses stand for the defective solder samples correctly classified (true negatives) with α-JS that were misclassified otherwise (false positives).

To interpret the effect of entropy-regularization on model predictions, activation maps of each test image are visualized through Grad-CAM, using the implementation in [jacobgilpytorchcam]. The last convolutional layer of each model is visualized to see which part of the image is used in making the decision. In Fig. 4, (a) test images, (b) activation maps of the VGG16 model trained without any regularization, (c) DFL [wang2018learning], (d) PMG [du2020fine] on ResNet50, and (e) α-JS are shown. The model regularized with α-JS yields more precise class-discriminative regions (solder joints/errors) compared to the other approaches. The shorted solder in the first and last rows of Fig. 4, the flux residues in the second row, and the passive electronic component solder joint in the third row are localized more accurately with α-JS. Additionally, it is less affected by background noise, such as the PCB background and non-soldered regions, than the other models.


Fig. 4: Localized class-discriminative regions by Grad-CAM. RGB test images (a). Activation maps on VGG16 trained without any regularization (b), DFL [wang2018learning] (c), PMG [du2020fine] on ResNet50 (d), and with α-JS (e). The heatmap shows which part of the image the model focuses on to make its decision; intensity increases from blue to red. The CAM results overlap the most with the soldered regions for the model regularized with α-JS.

IV-A2 Comparison with the Other Approaches

Comparison with the other state-of-the-art approaches is presented in Table II. We experimented not only with recent state-of-the-art methods based on designing FGIC-specific loss functions [chang2020devil], similar to our approach, but also with methods employing segmentation networks [wang2018learning, behera2021context], attention mechanisms [hu2019see, ji2020attention, ding2021ap, rao2021counterfactual], and other techniques [zhang2019learning, du2020fine, luo2019cross, he2022transfg]. All models are trained until convergence from scratch and tuned by changing their learning rate and optimizer. The best results are reported on the test set at the end of training. The proposed approach achieves competitive results on the solder joint dataset without changing the model architecture, using an attention mechanism, or employing segmentation techniques. The models are compared with respect to both accuracy and F1-score; accuracy is calculated as the ratio of correct predictions to the number of test samples. For VGG16, α-JS leads to improvements over the closest method [wang2018learning] in terms of both accuracy and F1-score. For ResNet50, the proposed α-JS regularization is competitive, closely following [rao2021counterfactual] in both accuracy and F1-score, where the latter comes at the expense of extra computational cost for its attention module.

We investigate the effect of the skewness parameter α on the α-JS for discrete binary distributions. α-JS and entropy values as a function of α for VGG16, ResNet18, ResNet50, and GoogleNet are shown in Fig. 5(a) and (b), respectively. By calculating the divergence as a function of α, we observe that increasing α results in a lower divergence between the uniform distribution and the model distribution. This monotonic relation shows that the model distribution becomes closer to the uniform distribution for higher α values; accordingly, getting closer to the uniform distribution yields higher model entropy, as expected. The divergences of the models are calculated on the test set.
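This monotonic behaviour is easy to reproduce on a toy distribution: sweeping α over the grid used in Fig. 5 for a fixed confident distribution against the uniform one, the scaled α-JS decreases as α grows (a numerical sketch with a hypothetical p, not the trained models' outputs):

```python
import numpy as np

def scaled_alpha_js(p, q, alpha):
    """Scaled alpha-skew Jensen-Shannon divergence between p and q."""
    m = (1.0 - alpha) * p + alpha * q
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return ((1.0 - alpha) * kl(p, m) + alpha * kl(q, m)) / (alpha * (1.0 - alpha))

p = np.array([0.9, 0.1])        # a confident binary model distribution
u = np.array([0.5, 0.5])        # uniform distribution
alphas = [0.1, 0.5, 0.75, 0.9]  # the grid reported in Fig. 5
divs = [scaled_alpha_js(p, u, a) for a in alphas]
```

For this fixed p, the divergence shrinks monotonically with α; in the paper the model distribution itself also changes with α, but the same trend is observed.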

Fig. 5: α-JS (a) and model entropy (b) as a function of α for the VGG16, ResNet18, ResNet50, and GoogleNet architectures. Model entropy and α-JS are calculated for α = [0.1, 0.5, 0.75, 0.9].
Fig. 6: F1-score as a function of normalized entropy for no regularization, focal loss, maximum entropy learning, α-JS, and label smoothing regularization with the VGG16, ResNet18, ResNet50, and GoogleNet models.

In line with this, one might expect the highest accuracy from the model closest to the uniform distribution, i.e., with lower α-JS and higher entropy; however, the relation is not that simple. We observe that higher entropy does not always correspond to higher accuracy. We plotted the F1-score as a function of normalized entropy and did not observe a monotonic relation for the ResNet50, VGG16, ResNet18, and GoogleNet architectures, as shown in Fig. 6. This complicated relation between the skewness parameter α and model accuracy motivates us to search over intermediate α values rather than testing only the extreme values (label smoothing and maximum entropy regularization).

V Conclusion

In this paper, we show that normal and defective solders exhibit low inter-class variance and high intra-class variance, and that, compared to datasets such as ImageNet, they show little feature diversity. With this information, we tackle SJI as a fine-grained image classification task and propose α-JS based entropy regularization as a solution to FGIC of solder joints. We compare our proposed solution to both the entropy-regularization based approaches and the approaches that employ segmentation, attention mechanisms, transformer models, and FGIC-specific loss functions. We show that the proposed α-JS based entropy regularization method achieves the highest F1-score and competitive accuracy in SJI across different model architectures. Using activation maps, we show that better class-discriminative regions are obtained with entropy-regularization, which improves the classification accuracy.


V-A Preliminaries


Entropy is a measure of the uncertainty of a random variable. The entropy of a discrete random variable $X$ is defined by Eq. 8,

$H(X) = -\sum_{i} p(x_i) \log p(x_i), \quad (8)$

where the $x_i$ are the states of $X$ and $p(x_i)$ is the probability of state $x_i$.
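Eq. 8 can be verified numerically: entropy is maximal for the uniform distribution, where it equals log K for K states:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (Eq. 8), in nats."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

h_uniform = entropy([0.25, 0.25, 0.25, 0.25])  # = log 4 for K = 4 states
h_peaked = entropy([0.97, 0.01, 0.01, 0.01])   # concentrated: low entropy
```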


Cross-entropy between the model distribution $p_\theta$ and the true distribution of the labels $q$ is minimized to train a neural network in supervised classification, where $\theta$ denotes the model parameters. For brevity, the cross-entropy for binary classification is given in Eq. 9,

$H(q, p_\theta) = -\big( y \log p + (1 - y) \log(1 - p) \big), \quad (9)$

where $y$ represents the ground-truth class and $p$ is the probability that the sample belongs to the positive class.

Kullback-Leibler (KL) divergence is a difference measure between two probability distributions $p$ and $q$, defined as in Eq. 10 for discrete random variables:

$\mathrm{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}. \quad (10)$

KL divergence is not symmetric, i.e., the forward and reverse KL divergences are not equal.


Jensen-Shannon divergence (JSD) is a symmetric version of the KL divergence, as given in Eq. 11:

$\mathrm{JSD}(p \,\|\, q) = \frac{1}{2}\,\mathrm{KL}\Big(p \,\Big\|\, \frac{p+q}{2}\Big) + \frac{1}{2}\,\mathrm{KL}\Big(q \,\Big\|\, \frac{p+q}{2}\Big). \quad (11)$
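The asymmetry of KL and the symmetry of JSD can be verified directly on a small example:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence, Eq. 10."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence, Eq. 11."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
```

Swapping the arguments changes the KL value but leaves the JSD unchanged, which is why the skewed-JS family is a convenient basis for interpolating between the two KL directions.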


V-B α-skew Jensen-Shannon Divergence

α-JS is obtained as the sum of weighted skewed KL divergences [lee2001effectiveness] of the probability distributions $p$ and $q$, as given in Eq. 12 [nielsen2020generalization], where $\alpha \in (0, 1)$ is a skewness parameter that determines the weights of the probability distributions:

$\mathrm{JS}^{\alpha}(p \,\|\, q) = (1-\alpha)\, \mathrm{KL}(p \,\|\, m_\alpha) + \alpha\, \mathrm{KL}(q \,\|\, m_\alpha), \quad m_\alpha = (1-\alpha)p + \alpha q. \quad (12)$

α-JS is scaled by $\frac{1}{\alpha(1-\alpha)}$ to guarantee continuity for α → {0, 1} [nielsen2011burbea, nielsen2015total].