# An Asymmetric Contrastive Loss for Handling Imbalanced Datasets

Contrastive learning is a representation learning method performed by contrasting a sample to other similar samples so that they are brought closely together, forming clusters in the feature space. The learning process is typically conducted using a two-stage training architecture, and it utilizes the contrastive loss (CL) for its feature learning. Contrastive learning has been shown to be quite successful in handling imbalanced datasets, in which some classes are overrepresented while some others are underrepresented. However, previous studies have not specifically modified CL for imbalanced datasets. In this work, we introduce an asymmetric version of CL, referred to as ACL, in order to directly address the problem of class imbalance. In addition, we propose the asymmetric focal contrastive loss (AFCL) as a further generalization of both ACL and focal contrastive loss (FCL). Results on the FMNIST and ISIC 2018 imbalanced datasets show that AFCL is capable of outperforming CL and FCL in terms of both weighted and unweighted classification accuracies. In the appendix, we provide a full axiomatic treatment on entropy, along with complete proofs.

## I Introduction

Class imbalance is a major obstacle that arises when certain classes in a dataset are overrepresented (referred to as majority classes) while others are underrepresented (referred to as minority classes). This can be problematic for a large number of classification models. A deep learning model such as a convolutional neural network (CNN) might not be able to properly learn from the minority classes. Consequently, the model would be less likely to correctly identify minority samples as they occur. This is especially crucial in medical imaging, since a model that cannot identify rare diseases would not be effective for diagnostic purposes. For example, the ISIC 2018 dataset [codella, tschandl] is an imbalanced medical dataset which consists of images of skin lesions that appear in various frequencies during screening.

To produce a less imbalanced dataset, it is possible to resample the dataset by either increasing the number of minority samples [bej, fajardo, karia, tripathi] or decreasing the number of majority samples [arefeen, dai, koziarski, rayhan]. Other methods to handle class imbalance include replacing the standard cross-entropy (CE) loss with a more suitable loss, such as the focal loss (FL). Lin et al. [lin_focal] modified the CE loss into FL so that minority classes can be prioritized. This is done by ensuring that the model focuses on samples that are harder to classify during model training. Recent studies have also unveiled the potential of contrastive learning as a way to combat imbalanced datasets [marrakchi, chen_supercon].

Contrastive learning is performed by contrasting a sample (called an anchor) to other similar samples (called positive samples) so that they are mapped closely together in the feature space. As a consequence, dissimilar samples (called negative samples) are pushed away from the anchor, forming clusters in the feature space based on similarity. In this research, contrastive learning is done using a two-stage training architecture, which utilizes the contrastive loss (CL) formulated by Khosla et al. [khosla]. This formulation of CL is supervised, and it can contrast the anchor to multiple positive samples belonging to the same class. This is unlike self-supervised contrastive learning [chen_simple, henaff, hjelm, tian], which contrasts the anchor to only one positive sample in the mini-batch.

In this work, we propose a modification of CL, referred to as the asymmetric contrastive loss (ACL). Unlike CL, the ACL is able to directly contrast the anchor to its negative samples so that they are pushed apart in the feature space. This becomes important when a rare sample has no other positive samples in the mini-batch. To our knowledge, this is the first study to modify CL directly in order to address the class imbalance problem. We also consider the asymmetric variant of the focal contrastive loss (FCL) [zhang_fcl], called the asymmetric focal contrastive loss (AFCL). Using FMNIST and ISIC 2018 as datasets, experiments are done to test the performance of both ACL and AFCL in binary classification tasks. It is observed that AFCL is superior to CL and FCL in multiple class-imbalance scenarios, provided that suitable hyperparameters are used. In addition, this work provides a streamlined survey of the literature related to entropy and loss functions.

## II Background on Entropy and Loss Functions

In this section, we provide a literature review on basic information theory and various loss functions.

### II-A Entropy, Information, and Divergence

Introduced by Shannon [shannon], entropy provides a measure of the amount of information contained in a random variable, usually in bits. The entropy of a random variable X is given by the formula

 H(X) = E_{P_X}[−log(P_X(X))]. (1)

Given two random variables X and Y, their joint entropy is the entropy of the joint random variable (X, Y):

 H(X,Y) = E_{P_{(X,Y)}}[−log(P_{(X,Y)}(X,Y))]. (2)

In addition, the conditional entropy is defined as

 H(Y∣X) = E_{P_{(Y,X)}}[−log(P_{Y∣X}(Y∣X))]. (3)

Conditional entropy is used to measure the average amount of information contained in Y when the value of X is given. Conditional entropy is bounded above by the original entropy; that is, H(Y∣X) ≤ H(Y), with equality if and only if X and Y are independent [ajjanagadde].
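As a quick numeric check of Eqs. (1)–(3) and of the bound H(Y∣X) ≤ H(Y), consider the following Python sketch; the joint distribution is a made-up example:

```python
import math

def entropy(probs):
    """H = -sum p * log2(p), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint distribution P(X, Y) over {a, b} x {0, 1}.
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}

# Marginals obtained by summing out the other variable.
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

h_x = entropy(p_x.values())       # H(X), Eq. (1)
h_y = entropy(p_y.values())       # H(Y)
h_xy = entropy(joint.values())    # H(X, Y), Eq. (2)
h_y_given_x = h_xy - h_x          # H(Y | X), via H(X, Y) = H(X) + H(Y | X)
```

Here H(Y∣X) = 0.5 bits is indeed below H(Y) ≈ 0.81 bits, since X and Y are not independent.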

The formulas for entropy, joint entropy, and conditional entropy can be derived via an axiomatic approach [gowers, khinchin]. The list of axioms is provided in Appendix A, whereas the derivation of the formula of entropy is provided in Appendix B.

The mutual information I(X;Y) is a measure of dependence between random variables X and Y [cover]. It provides the amount of information about one random variable provided by the other random variable, and it is defined by

 I(X;Y) = H(X) − H(X∣Y) = H(Y) − H(Y∣X). (4)

Mutual information is symmetric. In other words, I(X;Y) = I(Y;X). Mutual information is also nonnegative (I(X;Y) ≥ 0), and I(X;Y) = 0 if and only if X and Y are independent [ajjanagadde].

The dissimilarity between random variables X and X′ on the same space can be measured using the notion of KL-divergence:

 D_KL(X∥X′) = E_{P_X}[log(P_X(X)/P_{X′}(X))]. (5)

Similar to mutual information, KL-divergence is nonnegative (D_KL(X∥X′) ≥ 0), and D_KL(X∥X′) = 0 if and only if P_X = P_{X′} [ajjanagadde]. Unlike mutual information, KL-divergence is asymmetric, so D_KL(X∥X′) and D_KL(X′∥X) are not necessarily equal.
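A short sketch verifies the nonnegativity and asymmetry just described; the two distributions are chosen purely for illustration:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum p(x) * log2(p(x)/q(x)), Eq. (5), over a shared support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]   # illustrative distributions
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)   # positive, since p != q
d_qp = kl_divergence(q, p)   # generally differs from d_pq (asymmetry)
d_pp = kl_divergence(p, p)   # exactly zero
```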

### II-B Cross-Entropy and Focal Loss

Given random variables X and X̂ on the same space 𝒳, their cross-entropy is defined as [boudiaf]:

 H(X;X̂) = E_{P_X}[−log(P_{X̂}(X))]. (6)

Cross-entropy is the average number of bits needed to encode the true distribution P_X when its estimate P_{X̂} is provided [murphy]. A small value of H(X;X̂) implies that X̂ is a good estimate for X. Cross-entropy is connected to KL-divergence via the following identity:

 H(X;X̂) = H(X) + D_KL(X∥X̂). (7)

When P_X = P_{X̂}, the equality H(X;X̂) = H(X) holds.
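The identity in Eq. (7) can be checked numerically; the distributions below are illustrative:

```python
import math

def entropy(p):
    """H(X) = -sum p(x) * log2(p(x))."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(X; X_hat) = -sum p(x) * log2(q(x)), Eq. (6)."""
    return -sum(pi * math.log2(qi) for pi in [0] for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.3]   # true distribution
q = [0.5, 0.5]   # its estimate

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)   # Eq. (7): they must agree
```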

Now, the cross-entropy loss and focal loss are provided within the context of a binary classification task consisting of two classes labeled 0 and 1. Suppose that y ∈ {0, 1} denotes the ground-truth class and p denotes the estimated probability for the class labeled 1. The value of 1 − p is then the estimated probability for the class labeled 0. The cross-entropy (CE) loss is given by

 L_CE = −y log(p) − (1 − y) log(1 − p) = { −log(p) if y = 1; −log(1 − p) if y = 0 }.

If y = 1, then the loss is zero when p = 1. On the other hand, if y = 0, then the loss is zero when p = 0. In either case, the CE loss is minimized when the estimated probability of the true class is maximized, which is the desired property of a good classification model.

The focal loss (FL) [lin_focal] is a modification of the CE loss introduced to put more focus on hard-to-classify examples. It is given by the following formula:

 L_foc = −y(1 − p)^γ log(p) − (1 − y)p^γ log(1 − p). (8)

The parameter γ ≥ 0 in Eq. (8) is known as the focusing parameter. Choosing a larger value of γ pushes the model to focus on training from the misclassified examples. For instance, suppose that y = 1 and denote the estimated probability of the true class by p. The graph in Figure 1 shows that when p is sufficiently large, the FL is quite small. Hence, the model would be less concerned about learning from an example when p is already sufficiently large. FL is a useful choice when class imbalance exists as it can help the model focus on the less represented samples within the dataset.
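Eq. (8) is simple enough to sketch directly; the value γ = 2 below is illustrative, not a setting from the paper:

```python
import math

def focal_loss(y, p, gamma):
    """Binary focal loss, Eq. (8); gamma = 0 recovers the CE loss."""
    return -(y * (1 - p) ** gamma * math.log(p)
             + (1 - y) * p ** gamma * math.log(1 - p))

# An easy positive (p = 0.9) is down-weighted much more strongly
# than a hard positive (p = 0.1) once gamma > 0.
easy_fl = focal_loss(1, 0.9, gamma=2.0)
easy_ce = focal_loss(1, 0.9, gamma=0.0)
hard_fl = focal_loss(1, 0.1, gamma=2.0)
```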

### II-C Asymmetric Loss

For multi-label classification with K labels, let y_i ∈ {0, 1} be the ground truth for class i and p_i be its estimated probability obtained by the model. The aggregate classification loss is then

 L = ∑_{i=1}^{K} L_i, (9)

where

 L_i = −y_i L_i^+ − (1 − y_i) L_i^−. (10)

If FL is the chosen type of loss, L_i^+ and L_i^− are set as follows:

 L_i^+ = (1 − p_i)^γ log(p_i) and L_i^− = p_i^γ log(1 − p_i). (11)

In a typical multi-label dataset, the ground truth has value y_i = 0 for the majority of classes i. Consequently, the negative terms L_i^− dominate in the calculation of the aggregate loss L. Asymmetric loss (ASL) [ben-baruch] is a proposed solution to this problem. ASL emphasizes the contribution of the positive terms by modifying the losses of Eq. (11) to

 L_i^+ = (1 − p_i)^{γ+} log(p_i) (12)

and

 L_i^− = (p_i^{(m)})^{γ−} log(1 − p_i^{(m)}), (13)

where γ+, γ− ≥ 0 and m ≥ 0 are hyperparameters and p_i^{(m)} is the shifted probability of p_i obtained from the probability margin m via the formula

 p_i^{(m)} = max(p_i − m, 0). (14)

This shift helps decrease the contribution of L_i^−. Indeed, if p_i ≤ m, then p_i^{(m)} = 0 and hence L_i^− = 0.
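The probability shifting of Eqs. (12)–(14) can be sketched as follows; the hyperparameter values are illustrative only:

```python
import math

def asl_terms(p, gamma_pos=0.0, gamma_neg=4.0, m=0.2):
    """Positive and negative ASL terms, Eqs. (12)-(14), for one class."""
    p_m = max(p - m, 0.0)                            # shifted probability, Eq. (14)
    l_pos = (1 - p) ** gamma_pos * math.log(p)       # Eq. (12)
    l_neg = p_m ** gamma_neg * math.log(1 - p_m)     # Eq. (13)
    return l_pos, l_neg

# An easy negative with p <= m contributes nothing to the negative term,
# while a hard negative (large p) still produces a nonzero penalty.
_, l_neg_easy = asl_terms(0.15)
_, l_neg_hard = asl_terms(0.9)
```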

### II-D Contrastive Loss

Contrastive learning is a method to learn representations from data. A supervised approach to contrastive learning was introduced by Khosla et al. [khosla] to learn from a set of sample-label pairs (x_i, y_i) in a mini-batch of size n. The samples are fed through a feature encoder and a projection head in succession to obtain features z_1, …, z_n. The feature encoder extracts features from the samples, whereas the projection head projects the features into a lower dimension and applies L2-normalization so that each z_i lies in the unit hypersphere. In other words, ∥z_i∥_2 = 1.

A pair (z_i, z_j), where i ≠ j, is referred to as a positive pair if the features share the same class label (y_i = y_j), and it is a negative pair if the features have different class labels (y_i ≠ y_j). Contrastive learning aims to maximize the similarity between z_i and z_j whenever they form a positive pair and minimize their similarity whenever they form a negative pair. This similarity is measured with cosine similarity [murphy]:

 κ(z_i, z_j) = (z_i ⋅ z_j)/(∥z_i∥_2 ∥z_j∥_2) = z_i ⋅ z_j. (15)

From the above equation, we have −1 ≤ κ(z_i, z_j) ≤ 1. In addition, κ(z_i, z_j) = 1 when z_i = z_j, and κ(z_i, z_j) = 0 when z_i and z_j form a 90° angle.

Fixing z_i as the anchor, let A_i be the set of features other than z_i and let P_i ⊆ A_i be the set of z_j such that (z_i, z_j) is a positive pair. The predicted probability p_ij that z_i and z_j belong to the same class is obtained by applying the softmax function to the set of similarities between z_i and the elements of A_i:

 p_ij = exp(z_i ⋅ z_j / τ) / ∑_{z_k ∈ A_i} exp(z_i ⋅ z_k / τ), (16)

where τ > 0 is referred to as the temperature parameter. Since our goal is to maximize p_ij whenever z_j ∈ P_i, the contrastive loss which is to be minimized is formulated as

 L_con = −∑_{i=1}^{n} (1/|P_i|) ∑_{z_j ∈ P_i} log(p_ij). (17)
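The paper's implementation is in PyTorch; as a minimal, dependency-free sketch of Eqs. (16)–(17) over already L2-normalized features (the function names here are ours, not the paper's):

```python
import math

def dot(u, v):
    """Cosine similarity of L2-normalized vectors, Eq. (15)."""
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(features, labels, tau=0.1):
    """Supervised contrastive loss, Eq. (17)."""
    n, loss = len(features), 0.0
    for i, (z_i, y_i) in enumerate(zip(features, labels)):
        a_i = [j for j in range(n) if j != i]          # A_i
        p_i = [j for j in a_i if labels[j] == y_i]     # P_i
        if not p_i:
            continue  # an anchor with no positive pair is skipped by CL
        denom = sum(math.exp(dot(z_i, features[k]) / tau) for k in a_i)
        for j in p_i:
            p_ij = math.exp(dot(z_i, features[j]) / tau) / denom  # Eq. (16)
            loss -= math.log(p_ij) / len(p_i)
    return loss

# Same-class features that cluster together yield a much lower loss
# than same-class features pointing in different directions.
clustered = contrastive_loss([(1, 0), (1, 0), (0, 1)], [0, 0, 1])
scattered = contrastive_loss([(1, 0), (0, 1), (0, -1)], [0, 0, 1])
```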

Information-theoretical properties of L_con are given in [zhang_fcl], from which we provide a summary. Let X, Y, and Z denote the random variables of the samples, labels, and features, respectively. The following theorem states that L_con is positively proportional to H(Z∣Y) − H(Z) under the assumption that no class imbalance exists.

###### Theorem II.1 (Zhang et al. [zhang_fcl]).

Assuming that the features are L2-normalized and the dataset is balanced,

 L_con ∝ H(Z∣Y) − H(Z). (18)

Theorem II.1 implies that minimizing L_con is equivalent to minimizing the conditional entropy H(Z∣Y) and maximizing the feature entropy H(Z). Since I(Z;Y) = H(Z) − H(Z∣Y), minimizing L_con is equivalent to maximizing the mutual information between the features Z and the class labels Y. In other words, contrastive learning aims to extract the maximum amount of information from class labels and encode it in the form of features.

After the features are extracted, a classifier is assigned to convert Z into a prediction of the class label. The random variable of predicted class labels is denoted by Ŷ.

For the next theorem, the definition of conditional cross-entropy is given as follows:

 H(Y;Ŷ∣Z) = E_{P_{(Y,Z)}}[−log(P_{Ŷ∣Z}(Y∣Z))]. (19)

Conditional CE measures the average amount of information needed to encode the true distribution of Y using its estimate Ŷ, given the value of Z. A small value of H(Y;Ŷ∣Z) implies that Ŷ is a good estimate for Y, given Z.

###### Theorem II.2 (Zhang et al. [zhang_fcl]).

Assuming that the features are L2-normalized and the dataset is balanced,

 L_con ∝ inf H(Y;Ŷ∣Z) − H(Y), (20)

where the infimum is taken over all classifiers.

Theorem II.2 implies that minimizing L_con will minimize the infimum of the conditional cross-entropy taken over all classifiers. As a consequence, contrastive learning is able to encode features in Z such that the best classifier can produce a good estimate of Y given the information provided by the feature encoder.

The formula for L_con can be modified so as to resemble the focal loss, resulting in a loss function known as the focal contrastive loss (FCL) [zhang_fcl]:

 L_FC = −∑_{i=1}^{n} (1/|P_i|) ∑_{z_j ∈ P_i} (1 − p_ij) log(p_ij). (21)

## III Methodology

In this section, our proposed modification of the contrastive loss, called the asymmetric contrastive loss, is introduced. Also, the architecture of the model in which the contrastive losses are implemented is explained.

### III-A Asymmetric Contrastive Loss

In Eq. (17), the inside summation of the contrastive loss is evaluated over P_i. Consequently, according to Eq. (16), each anchor z_i is contrasted only with vectors z_j that belong to the same class. This does not present a problem when the mini-batch contains plenty of examples from each class. However, the calculated loss may not give each class a fair contribution when some classes are less represented in the mini-batch.

In Figure 2, a sampled mini-batch consists of several examples with the blue class label and a single example with the red class label. When the anchor z_i is the representation of the red sample, z_i does not directly contribute to the calculation of L_con since P_i is empty. In other words, z_i cannot be contrasted to any other sample in the mini-batch. This scenario is likely to happen when the dataset is imbalanced, and it motivates us to modify CL so that each anchor can also be contrasted with vectors z_j not belonging to the same class.

Let N_i = A_i ∖ P_i be the set of vectors z_j such that (z_i, z_j) is a negative pair. Motivated by the L_i^+ and L_i^− of Eq. (10), we define

 L_i^+ = (1/|P_i|) ∑_{z_j ∈ P_i} log(p_ij) (22)

and

 L_i^− = (1/|N_i|) ∑_{z_j ∈ N_i} log(1 − p_ij), (23)

where p_ij is given by Eq. (16). The loss L_i^+ contrasts z_i to the vectors in P_i, whereas L_i^− contrasts z_i to the vectors in N_i. The resulting asymmetric contrastive loss (ACL) is given by the formula

 L_AC = −∑_{i=1}^{n} (L_i^+ + η L_i^−), (24)

where η ≥ 0 is a fixed hyperparameter. If η = 0, then L_AC = L_con. Hence ACL is a generalization of CL.

When the batch size is set to a large number (over 100, for example), the value of each p_ij tends to be very small. This causes the magnitude of L_i^− to be much smaller than that of L_i^+. In order to balance their contributions to the total loss L_AC, a large value for η is usually chosen (between 60 and 300 in our experiments).
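A minimal, dependency-free sketch of Eqs. (22)–(24) (our own illustrative implementation, not the paper's PyTorch code) shows how a lone minority anchor still contributes through L_i^−:

```python
import math

def dot(u, v):
    """Cosine similarity of L2-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def acl_loss(features, labels, tau=0.1, eta=60.0):
    """Asymmetric contrastive loss, Eq. (24); eta = 0 recovers CL, Eq. (17)."""
    n, loss = len(features), 0.0
    for i, (z_i, y_i) in enumerate(zip(features, labels)):
        a_i = [j for j in range(n) if j != i]
        p_i = [j for j in a_i if labels[j] == y_i]   # positives P_i
        n_i = [j for j in a_i if labels[j] != y_i]   # negatives N_i
        denom = sum(math.exp(dot(z_i, features[k]) / tau) for k in a_i)
        prob = {j: math.exp(dot(z_i, features[j]) / tau) / denom for j in a_i}
        l_pos = sum(math.log(prob[j]) for j in p_i) / len(p_i) if p_i else 0.0      # Eq. (22)
        l_neg = sum(math.log(1 - prob[j]) for j in n_i) / len(n_i) if n_i else 0.0  # Eq. (23)
        loss -= l_pos + eta * l_neg
    return loss

# The lone minority anchor (last feature) is ignored by CL (eta = 0)
# but is pushed away from its negatives when eta > 0.
features = [(1, 0), (1, 0), (0.6, 0.8)]   # L2-normalized, illustrative
labels = [0, 0, 1]
cl_like = acl_loss(features, labels, eta=0.0)
acl = acl_loss(features, labels, eta=60.0)
```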

### III-B Asymmetric Focal Contrastive Loss

Following the formulation of L_FC in Eq. (21), L_i^+ can be modified to have the following formula:

 L_i^+ = (1/|P_i|) ∑_{z_j ∈ P_i} (1 − p_ij)^γ log(p_ij). (25)

Using this loss, the asymmetric focal contrastive loss (AFCL) is then given by

 L_AFC = −∑_{i=1}^{n} (L_i^+ + η L_i^−), (26)

where L_i^− is as in Eq. (23). We do not modify L_i^− by adding the multiplicative term p_ij^γ since p_ij is usually too small and would make L_i^− vanish if the term were added.

We have L_AFC = L_FC when η = 0 and γ = 1. Thus, AFCL generalizes FCL. Unlike FCL, we add the hyperparameter η to the loss function so as to provide it with some flexibility.
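The focal positive term of Eq. (25) can be isolated in a few lines; the γ values below are illustrative:

```python
import math

def afcl_pos_term(p_ijs, gamma):
    """L_i^+ of AFCL, Eq. (25); gamma = 0 recovers ACL's Eq. (22)."""
    return sum((1 - p) ** gamma * math.log(p) for p in p_ijs) / len(p_ijs)

# A confident positive pair (p_ij close to 1) is down-weighted as gamma
# grows, so the loss is dominated by hard positive pairs.
focal = afcl_pos_term([0.9], gamma=2.0)
plain = afcl_pos_term([0.9], gamma=0.0)
```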

### III-C Model Architecture

This section explains the inner workings of the classification model used for the implementation of the contrastive losses. The architecture of the model is taken from [marrakchi, chen_supercon]. The training strategy for the model, as shown in Figure 3, comprises two stages: the feature learning stage and the fine-tuning stage.

In the first stage, each mini-batch is fed through a feature encoder. We consider either ResNet-18 or ResNet-50 [he_resnet] for the architecture of the feature encoder. The output of the feature encoder is projected by the projection head to generate a vector z of length 128. If ResNet-18 is used for the feature encoder, then the projection head consists of two layers of length 512 and 128. If ResNet-50 is used, then the two layers are of length 2048 and 128. Afterwards, z is L2-normalized and the model parameters are updated using some version of the contrastive loss (CL, FCL, ACL, or AFCL).

After the first stage is complete, the feature encoder is frozen and the projection head is removed. In its place, we have a one-layer classification head which generates the estimated probability that the training sample belongs to a certain class. The parameters of the classification head are updated using either the FL or CE loss. The final classification model is the feature encoder trained during the first stage, together with the classification head trained during the second stage. Since the classification head is a significantly smaller architecture than the feature encoder, training is mostly focused on the first stage. As a consequence, we typically need a larger number of epochs for the feature learning stage compared to the fine-tuning stage.

## IV Experiments

The datasets and settings of our experiments are outlined in this section. We provide and discuss the results of the experiments on the FMNIST and ISIC 2018 datasets. The PyTorch implementation is available on GitHub.

### IV-A Datasets

In our experiments, the training strategy outlined in Subsection III-C is applied to two imbalanced datasets. The first is a modified version of the Fashion-MNIST (FMNIST) dataset [xiao_fmnist], and the second is the International Skin Imaging Collaboration (ISIC) 2018 medical dataset [codella, tschandl].

The FMNIST dataset consists of low-resolution (28×28 pixels), grayscale images of ten classes of clothing. In this study, we take only two classes to form a binary classification task: the T-shirt and shirt classes. The samples are taken such that the proportion between the T-shirt and shirt images can be imbalanced depending on the scenario. On the other hand, the ISIC 2018 dataset consists of high-resolution, RGB images of seven classes of skin lesions. Following FMNIST, we use only two classes for the experiments: the melanoma and dermatofibroma classes. Illustrations of the sample images of both datasets are provided in Figure 4.

FMNIST is chosen since, although simple, it is a benchmark dataset for testing deep learning models for computer vision. On the other hand, ISIC 2018 is chosen since it is a domain-appropriate imbalanced dataset for our model. We first apply the model (using AFCL as the loss function) to the more lightweight FMNIST dataset under various class-imbalance scenarios. This is conducted to check the appropriate values of the η and γ parameters of AFCL under different imbalance conditions. Afterwards, the model is applied to the ISIC 2018 dataset using the optimal parameter values obtained during the FMNIST experiments.

### IV-B Experimental Details

The experiments are conducted using an NVIDIA Tesla P100-PCIE GPU allocated by the Google Colaboratory Pro platform. The models and loss functions are implemented using PyTorch. To process the FMNIST dataset, we use the simpler ResNet-18 architecture as the feature encoder; to process the ISIC 2018 dataset, we use the deeper ResNet-50. The learning rate and batch size are fixed across both datasets, and the classification head is trained for fewer epochs than the feature encoder. The encoder and the classification head are both trained using the Adam optimizer. Finally, the temperature parameter τ of the contrastive loss is set to its default value.

The evaluation metrics utilized in the experiments are the (weighted) accuracy and the unweighted accuracy (UWA), both of which can be calculated from the number of true positives (TP), true negatives (TN), false negatives (FN), and false positives (FP) using the formulas

 Accuracy = (TP + TN)/(TP + TN + FN + FP) (27)

and

 UWA = (1/2)(TP/(TP + FN) + TN/(TN + FP)), (28)

respectively. Unlike accuracy, UWA provides the average of the individual class accuracies regardless of the number of samples in the test set of each class. UWA is an appropriate metric when the dataset is significantly imbalanced [fahad].
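Both metrics are easy to compute from a confusion matrix; the counts below illustrate a degenerate majority-only classifier on a 90:10 split:

```python
def accuracy(tp, tn, fp, fn):
    """Weighted accuracy, Eq. (27)."""
    return (tp + tn) / (tp + tn + fn + fp)

def uwa(tp, tn, fp, fn):
    """Unweighted accuracy, Eq. (28): the mean of the per-class recalls."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

# Predicting every sample as the majority (positive) class on a 90:10 split
# yields a high accuracy but an uninformative 50% UWA.
acc = accuracy(tp=90, tn=0, fp=10, fn=0)
unweighted = uwa(tp=90, tn=0, fp=10, fn=0)
```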

For heavily imbalanced datasets, a high accuracy and low UWA may mean that the model is biased towards classifying samples as part of the majority class. This indicates that the model does not properly learn from the minority samples. In contrast, a lower accuracy with high UWA indicates that the model takes significant risks to classify some samples as part of the minority class. Our aim is to construct a model that maximizes both metrics simultaneously; that is, a model that can learn unbiasedly from both the majority and minority samples with minimal misclassification error.

### IV-C Experiments using FMNIST

The data used in the FMNIST experiment comprise 1000 images classified as either a T-shirt or a shirt. The dataset is split 70/30 for model training and testing. The images are augmented using random rotations and random flips. We deploy 11 class-imbalance scenarios on the dataset, each of which controls the proportion between the T-shirt class and the shirt class. For example, if the proportion is 60:40, then 600 T-shirt images and 400 shirt images are sampled to form the experimental dataset. Our proportions range from 50:50 up to 98:2.

During the first stage, the ResNet-18 encoder is trained using the AFCL. Afterwards, the classification head is trained using the CE loss during the second stage. As AFCL contains the two parameters η and γ, our goal is to tune each of these parameters independently, keeping the other parameter fixed. First, η is tuned while γ is held fixed, followed by the tuning of γ while η is held fixed. Each experiment is done four times in total. The average accuracy and UWA of these four runs are provided in Tables I (for the tuning of η) and II (for the tuning of γ).

For the tuning of η, six values of η are experimented on. When η = 0, the loss function reduces to the ordinary CL. As observed in Table I, the optimal value of η tends to be larger when the dataset is moderately imbalanced. As the scenario goes from 60:40 to 90:10, the value of η that maximizes accuracy increases. In general, this indicates that the L_i^− term of the ACL becomes more essential to the overall loss as the dataset gets more imbalanced, confirming the reasoning contained in Subsection III-A.

As seen in Table II, we experiment on several values of γ, where choosing γ = 0 means that we are using CL. Although the overall pattern of the optimal γ is less apparent than that of the previous experiment, some insights can still be obtained. When the scenario is between 70:30 and 90:10, the focusing parameter γ is optimally chosen when it is larger than zero. This is in direct contrast to when the proportion is perfectly balanced (50:50), where γ = 0 is the optimal parameter. This suggests that a larger value of γ should be considered when class imbalance is significantly present within the dataset.

### IV-D Experiments using ISIC 2018

From the ISIC 2018 dataset, a total of 1113 melanoma images and 115 dermatofibroma images are combined to create the experimental dataset. As with the previous experiment, the dataset is split 70/30 for training and testing. The images are resized to a fixed resolution. The ResNet-50 encoder is trained using one of the available contrastive losses, which include CL and FCL as baselines and ACL and AFCL as the proposed loss functions. The classification head is trained using FL as the loss function with a fixed focusing parameter.

The proportion between the melanoma class and the dermatofibroma class in the experimental dataset is close to 90:10. Using the results from Tables I and II as a heuristic for determining the optimal parameter values, we select the values of η and γ accordingly. It is worth mentioning that even though one of the selected values of η produces the best accuracy in the FMNIST experiment, the UWA of the resulting model is quite poor. However, we decide to include this value in this experiment for completeness.

The results of this experiment are given in Table III. As in the previous section, each experiment is conducted four times, so the table lists the average accuracy and UWA of these four runs for each contrastive loss tested. Each run, which includes both model training and testing, is completed in roughly 80 minutes using our computational setup.

From Table III, CL and ACL perform the worst in terms of UWA and accuracy, respectively. However, ACL gives the best UWA among all losses. This may indicate that ACL encourages the model to take the risky approach of classifying some samples as part of the minority class at the expense of accuracy. Overall, AFCL with the chosen values of η and γ emerges as the best loss in this experiment, producing the best accuracy and the second-best UWA behind ACL. This leads us to conclude that AFCL, with suitably chosen hyperparameters, is superior to the vanilla CL and FCL.

## V Conclusion and Future Work

In this work, we introduced asymmetric versions of both the contrastive loss (CL) and the focal contrastive loss (FCL), referred to as ACL and AFCL, respectively. These asymmetric variants of the contrastive loss were proposed to provide more focus on the minority class. The experimental model used was a two-stage architecture consisting of a feature learning stage and a classifier fine-tuning stage. This model was applied to the FMNIST and ISIC 2018 imbalanced datasets using various contrastive losses. Our results show that AFCL was able to outperform CL and FCL in terms of both weighted and unweighted accuracies. On the ISIC 2018 binary classification task, AFCL, with suitably chosen values of η and γ as hyperparameters, achieved an accuracy of 93.75% and an unweighted accuracy of 74.62%. This is in contrast to FCL, which achieved 93.07% and 74.34% on these two metrics, respectively.

The experiments of this research were conducted using datasets consisting of approximately 1000 total images. In the future, the experimental model may be applied to larger-scale datasets in order to test its scalability. In addition, other models based on ACL and AFCL can also be developed for specific datasets, preferably within the realm of multiclass classification.

## A Axioms for Entropy

In his landmark paper, Shannon [shannon] introduced the notion of the entropy of a random variable X. Entropy measures the amount of information contained in X, usually in bits. For example, a fair coin toss contains one bit of information; the bit 1 can represent heads, whereas the bit 0 can represent tails. On the other hand, an unfair coin toss whose coin always lands on heads gives no meaningful information. Hence, the trial can be conveyed using zero bits.

This section aims to construct the theory of entropy via an axiomatic approach. First, a collection of axioms, known as the Shannon–Khinchin axioms [khinchin], is employed to give the desired properties of the function H. Then, it is shown that the usual formula for H(X) follows uniquely from these axioms. The presentation of the axioms in this section follows a set of notes provided by Gowers [gowers].

Suppose that X and Y are discrete random variables taking values in finite spaces 𝒳 and 𝒴, respectively. Let p_x = P(X = x) and p_{x,y} = P(X = x, Y = y) for x ∈ 𝒳 and y ∈ 𝒴. The first axiom is motivated using the coin toss example. Since a fair coin toss is expected to contain one bit of information, the following axiom is obtained.

###### Axiom 0 (Normalization).

If |𝒳| = 2 and X has a uniform distribution, then H(X) = 1.

Also, H(X) should depend only on the probability distribution of X. Consequently, if X′ is another random variable that has a distribution identical to that of X, then H(X′) = H(X).

###### Axiom 1 (Invariance).

H(X) depends only on the probability distribution of X, and not on any other factor.

Going back to the coin toss example, we would like to ensure that a coin toss contains the most information when it is fair. In general, the following axiom is assumed.

###### Axiom 2 (Maximality).

Assuming |𝒳| is fixed, H(X) is maximized when X is uniform.

In addition, the value of H(X) should not increase when impossible samples are added to 𝒳.

###### Axiom 3 (Extensibility).

If 𝒳 ⊆ 𝒳′ and X′ is a random variable on 𝒳′ with P(X′ = x) = P(X = x) for every x ∈ 𝒳 (and thus P(X′ = x) = 0 for every x ∈ 𝒳′ ∖ 𝒳), then H(X′) = H(X).

To state the next axiom, two notions on entropy are first introduced. The joint entropy H(X, Y) is simply the entropy of the joint random variable (X, Y), and the conditional entropy is defined as

 H(Y∣X) = ∑_{x ∈ 𝒳} p_x H(Y∣X = x). (29)

Conditional entropy measures the average amount of information contained in Y given the value of X.

###### Axiom 4 (Additivity).

H(X, Y) = H(X) + H(Y∣X).

If X and Y are independent, then H(Y∣X) = H(Y). Therefore, H(X, Y) = H(X) + H(Y) in that case. In general, if X_1, …, X_m are independent, then H(X_1, …, X_m) = H(X_1) + ⋯ + H(X_m).

Suppose that 𝒳 = {x_1, …, x_n} with p_i = P(X = x_i). Since H(X) only depends on the distribution of X by Axiom 1, the function H can instead be seen as a function of the tuple (p_1, …, p_n). The next axiom states that H is continuous on the space

 S = {(p_1, …, p_n) ∈ [0,1]^n ∣ p_1 + ⋯ + p_n = 1}. (30)

###### Axiom 5 (Continuity).

H is continuous with respect to all probabilities p_1, …, p_n.

From Axioms 0–5, the formula for H(X) is uniquely determined, as shown in the following theorem.

###### Theorem A.1.

Let H be a function defined for any discrete random variable X taking values in a finite set 𝒳. This function satisfies Axioms 0–5 if and only if

 H(X) = −∑_{x ∈ 𝒳} p_x log(p_x), (31)

where the logarithm is to the base 2 and we set 0 log(0) = 0.

The proof of Theorem A.1 is provided in Appendix B. Looking back at the coin toss example, Figure 5 illustrates the graph of H(X) when X is a coin toss that lands on heads with probability p and on tails with probability 1 − p. Entropy is maximized when the coin is fair, and it decreases in a continuous manner to zero as the coin becomes less fair.

The formula for entropy in Eq. (31) can be expressed in the form of an expectation:

 H(X) = E_{P_X}[−log(P_X(X))]. (32)

Likewise, joint entropy and conditional entropy can be expressed as

 H(X,Y) = E_{P_{(X,Y)}}[−log(P_{(X,Y)}(X,Y))] (33)

and

 H(Y∣X) = ∑_{x ∈ 𝒳} p_x H(Y∣X = x)
  = −∑_{x ∈ 𝒳} ( p_x ∑_{y ∈ 𝒴} P_{Y∣X}(y∣x) log(P_{Y∣X}(y∣x)) )
  = −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_x P_{Y∣X}(y∣x) log(P_{Y∣X}(y∣x))
  = −∑_{(y,x)} P_{(Y,X)}(y,x) log(P_{Y∣X}(y∣x))
  = E_{P_{(Y,X)}}[−log(P_{Y∣X}(Y∣X))].

## B Proof of Theorem A.1

The arguments used in this proof are adapted from [gowers, khinchin]. We first verify one direction of Theorem A.1.

###### Lemma B.1.

The formula for H given in Eq. (31) satisfies Axioms 0–5.

###### Proof.

It is straightforward to show that the normalization, invariance, extensibility, and continuity axioms hold, so we focus on proving the maximality and additivity axioms.

For maximality, we utilize Jensen's inequality [ajjanagadde] applied to the concave function log. This inequality takes the form

 E[log(Y)] ≤ log(E[Y]). (34)

For any random variable X,

 H(X) = E[log(1/P(X))]
  ≤ log(E[1/P(X)])  (by Eq. (34), with Y = 1/P(X))
  = log(∑_{x ∈ 𝒳} p_x ⋅ (1/p_x))
  = log(|𝒳|).

Since log(|𝒳|) is the entropy of a uniform random variable on 𝒳, the entropy is maximized when X is uniform.

For additivity, we need to prove that H(X,Y) = H(X) + H(Y∣X). Writing p_{x,y} = p_x P_{Y∣X}(y∣x), we have

 H(X,Y) = −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(p_{x,y})
  = −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(p_x P_{Y∣X}(y∣x))
  = −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} (log(p_x) + log(P_{Y∣X}(y∣x)))
  = −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(p_x) − ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(P_{Y∣X}(y∣x)).

We can obtain

 −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(p_x) = −∑_{x ∈ 𝒳} p_x log(p_x) = H(X)

and

 −∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p_{x,y} log(P_{Y∣X}(y∣x)) = −∑_{x ∈ 𝒳} ( p_x ∑_{y ∈ 𝒴} P_{Y∣X}(y∣x) log(P_{Y∣X}(y∣x)) )
  = ∑_{x ∈ 𝒳} p_x H(Y∣X = x)
  = H(Y∣X).

Therefore, H(X,Y) = H(X) + H(Y∣X). ∎

To ease the notation, we can assume that 𝒳 = {1, …, n} and write H(p_1, …, p_n) in place of H(X) by the invariance axiom. For brevity, L(n) is defined as the entropy of a uniform random variable X with |𝒳| = n. In other words,

 L(n) = H(1/n, …, 1/n). (35)

###### Lemma B.2.

The following properties hold for the function L:

1. L is non-decreasing.

2. L(n^m) = mL(n).

3. L(2^k) = k.

4. L(n) = log(n).

###### Proof.

1. For every natural number n,

 L(n) = H(1/n, …, 1/n, 0)  (by the extensibility axiom)
  ≤ H(1/(n+1), …, 1/(n+1))  (by the maximality axiom)
  = L(n+1).

Since n is arbitrary, this proves that L is a non-decreasing function.

2. Let X be a uniform random variable on 𝒳 with |𝒳| = n. Then H(X) = L(n). Now let X_1, …, X_m be i.i.d. random variables with distribution identical to that of X. Since the joint variable (X_1, …, X_m) is uniform over its n^m possible values, we obtain

 L(n^m) = H(X_1, …, X_m) = ∑_{i=1}^{m} H(X_i) = mH(X) = mL(n).

3. By the normalization axiom, L(2) = 1. Therefore, L(2^k) = kL(2) = k.

4. We aim to prove that |L(n) − log(n)| ≤ 1/m for every m. Fix n and m. Now let k be the unique integer such that the inequality

 2^k ≤ n^m ≤ 2^{k+1} (36)

holds. Applying the non-decreasing function L on (36), we obtain

 k ≤ mL(n) ≤ k + 1. (37)

Applying the non-decreasing function log on (36), we also obtain

 k ≤ m log(n) ≤ k + 1. (38)

Both (37) and (38) imply that

 |mL(n) − m log(n)| ≤ (k + 1) − k = 1. (39)

We can obtain |L(n) − log(n)| ≤ 1/m by dividing both sides of (39) by m. As a consequence, L(n) = log(n). ∎

The fourth property of Lemma B.2 implies that Theorem A.1 holds for uniform random variables X. We are now ready to complete the proof of Theorem A.1 for an arbitrary random variable X.

###### Proof of Theorem A.1.

One half of Theorem A.1 is proved in Lemma B.1. It remains to prove that a function H which satisfies Axioms 0–5 necessarily equals the formula of Eq. (31). Without loss of generality, we can assume that the probabilities p_1, …, p_n are all rational. Indeed, since the rationals are dense in the reals, the theorem would still hold for real values by the continuity axiom (Axiom 5).

Let p_i = g_i/g for i = 1, …, n, where each g_i is a positive integer and g = g_1 + ⋯ + g_n. Define a random variable Y dependent on X such that |𝒴| = g and 𝒴 is partitioned into n disjoint groups 𝒴_1, …, 𝒴_n containing g_1, …, g_n values, respectively. If it is given that X = i, where 1 ≤ i ≤ n, then all the values in 𝒴_i have the same probability 1/g_i, and the values from the other groups have probability zero. It follows that

 H(Y∣X) = ∑_{i=1}^{n} p_i H(Y∣X = i)
  = ∑_{i=1}^{n} p_i L(g_i)
  = ∑_{i=1}^{n} p_i log(g ⋅ p_i)
  = ∑_{i=1}^{n} p_i (log(g) + log(p_i))
  = log(g) + ∑_{i=1}^{n} p_i log(p_i).

In addition, Y and the joint variable (X, Y) are identically distributed, since X is completely determined by Y. The joint variable thus has a total of g possible values, and each value has the same probability 1/g of occurring. By the additivity axiom,

 H(X) = H(X,Y) − H(Y∣X)
  = L(g) − log(g) − ∑_{i=1}^{n} p_i log(p_i)
  = −∑_{i=1}^{n} p_i log(p_i).

This proves that H(X) = −∑_{i=1}^{n} p_i log(p_i) for rational probabilities p_i. The full statement of the theorem follows from continuity, as explained at the beginning of the proof. ∎