Isolating effects of age with fair representation learning when assessing dementia

07/19/2018 ∙ by Zining Zhu, et al. ∙ University of Toronto

Dementia, one of the most prevalent conditions among the elderly population, can be detected using linguistic features extracted from narrative transcripts. However, these linguistic features are affected in similar but subtly different ways by the normal aging process. It has been hard for machine learning classifiers to isolate the effects of confounding factors (e.g., age). We show that deep neural network (DNN) classifiers can infer ages from linguistic features. They could therefore base classifications on age-related bias, which leads to unfairness across age groups. In this paper, we address this problem with fair representation learning. We build neural network classifiers that learn low-dimensional representations reflecting the impacts of dementia but containing no age-related information. To evaluate these classifiers, we specify a model-agnostic score $\Delta_{eo}^{(N)}$ measuring how classifier results are disentangled from age. Our best models outperform baseline DNN classifiers in disentanglement, while compromising accuracy by as little as 2.56% and 2.25% on DementiaBank and the Famous People dataset, respectively.





One in three seniors die of Alzheimer’s and other types of dementia in the United States (Association, 2018). Although its causes are not yet fully understood, dementia impacts people’s cognitive abilities in a detectable manner. This includes different syntactic distributions in narrative descriptions (Roark et al., 2007), more pausing (Singh et al., 2001), higher levels of difficulty in recalling stories (Lunsford & Heeman, 2015), and impaired memory generally (Lehr et al., 2012). Fortunately, linguistic features can be used to train classifiers to detect various cognitive impairments. For example, Fraser et al. (2013) detected primary progressive aphasia with up to 100% accuracy, and classified subtypes of primary progressive aphasia with up to 79% accuracy on a set of 40 participants using lexical-syntactic and acoustic features. Fraser et al. (2015) classified dementia from control participants with 82% accuracy on narrative speech.

However, dementia is not the only factor causing such detectable changes in linguistic features of speech. Aging also impairs cognitive abilities (Harada et al., 2013), but in subtly different ways from dementia. For example, aging inhibits fluid cognitive abilities (e.g., cognitive processing speed) much more than the consolidated abilities (e.g., those related to cumulative skills and memories) (Deary et al., 2009). In other words, the detected changes in linguistic features, including more pauses and decreased short-term memory, could be attributable to the normal aging process rather than to dementia. Unfortunately, due to the high correlation between dementia and aging, it can be difficult to tell whether symptoms are caused by dementia or by aging (Murman, 2015). Age is therefore a confounding factor in detecting dementia.

The effects of confounding factors are hard for traditional machine learning algorithms to isolate, and this is largely due to sampling biases in the data. For example, some algorithms predict higher risk of criminal recidivism for people with darker skin colors (Julia et al., 2016), others identify images of smiling Asians as blinking (Lee, 2009), and GloVe word embeddings can project European-American names significantly closer to words like ‘pleasant’ than African-American names (Caliskan et al., 2017). It is preferable for classifiers to make decisions without relying too heavily on demographic factors, and therefore to isolate the effects of confounding factors. However, as we will show in Experiments, traditional neural network classifiers rely on age to infer dementia; this can lead to false positives and false negatives that are especially important to avoid in the medical domain. Graphically, if both age $A$ and dementia $D$ cause changes in a feature $X$, the result is a v-structure $A \rightarrow X \leftarrow D$ (Koller & Friedman, 2009), which is activated upon observing $X$. In other words, the confounder $A$ affects $P(D \mid X)$ if we train the classifier in traditional ways, which is to collect data points $\{(X, D)^{(i)}\}$ and to learn an inference model approximating the affected $P(D \mid X)$.

Traditionally, there are several ways to eliminate the effects of a confounding factor such as age $A$ on the relation between features $X$ and dementia $D$.

Controlling for $A$ gives a posterior distribution $P(D \mid X, A)$. This is unfortunately unrealistic for small, imbalanced clinical datasets, in which sparsity may require stratification. However, the stratified distributions can be far from a meaningful representation of the real world (as we will show, e.g., in Figure 2). Moreover, a discrepancy in the sizes of age groups can skew the age prior $P(A)$, which would seriously inhibit the generalizability of a classifier.

Conducting a randomized control trial (RCT) on $X$ removes all causal paths leading “towards” the variable $X$, which gives a de-confounded dataset $P(D \mid do(X))$ according to the notation in Pearl (2009). However, RCTs on $X$ are even less practical because simultaneously controlling multiple features produces an exponential number of scenarios, and doing this for more than 400 features requires far more data points than any available dataset.

Pre-adjusting $X$ according to a pre-trained model $X = f(A)$ per feature could also approximately generate the dataset $P(D \mid do(X))$. However, such a model should consider participant differences, otherwise interpolating using a fixed age $A$ would give exactly the same features for everybody. The participant differences, however, are best characterized via $D$, which are the values we want to predict.

To overcome the various problems with these methods, we let our classifiers be aware of cognitive impairments while actively filtering out any information related to aging. This is a fair representation learning framework that protects age as a “sensitive attribute”.

Fair representation learning frameworks can be used to train classifiers to consider subjects with different sensitive attributes equally. A sensitive attribute (or “protected attribute”) can be race, age, or another variable whose impact should be ignored. In the framework proposed by Zemel et al. (2013), classifiers were penalized for the differences in classification probabilities among different demographic groups. After training, the classifiers produced better demographic similarities while compromising only a little overall accuracy. To push the fair representation learning idea further, adversarial training can be incorporated.

Goodfellow et al. (2014) introduced generative adversarial networks, in which a generator and a discriminator are iteratively optimized against each other. Incorporating adversarial training, Madras et al. (2018) proposed a framework to learn a latent representation of data in order to limit its adversary’s ability to classify based on the sensitive attributes.

However, these approaches to fair representation learning only handle binary attributes. For example, Madras et al. (2018) binarized age. To apply fair representation learning to the detection of cognitive impairments, we want to represent age on a continuous scale (with some granularity if necessary). We formulate a fairness metric for evaluating the ability of a classifier to isolate a continuous-valued attribute. We also propose four models that compress high-dimensional feature vectors into low-dimensional representations which conceal age from an adversary. We show empirically that our models achieve better fairness metrics than baseline deep neural network classifiers, while compromising accuracies by as little as 2.56% and 2.25% on our two empirical datasets (DementiaBank and the Famous People dataset), respectively.

Measuring disentanglement

There are many measures of entanglement between classifier outcomes and specific variables. We briefly review some relevant metrics, and then propose ours.

Traditional metrics


Correlation (Pearson, Spearman, etc.) is often used to compare classification outputs with component input features. To the extent that these variables are stochastic, several information theoretic measures could be applied, including Kullback-Leibler divergence and Jensen-Shannon divergence. These can be useful to depict characteristics of two distributions when no further information about the available data is given.

Mutual information can depict the extent of entanglement of two random variables. If we treat age ($A$) and dementia ($D$) as two random variables, then adopting the approach of Kwak & Choi (2002) gives an estimate of $I(A; D)$. However, given the small size of clinical datasets, it can be challenging to obtain precise estimates.

An alternative approach is to assume that these variables fit into some probabilistic model. For example, we might assume that the age variable $A$, the dementia indicator variable $D$, and the multi-dimensional linguistic feature $X$ fit into some a priori model (e.g., the v-structure mentioned above, $A \rightarrow X \leftarrow D$). The mutual information between $A$ and $D$ is then:

$$I(A; D) = H(A) + H(D) - H(A, D)$$

where the entropy of age $H(A)$ and of cognitive impairment $H(D)$ remain constant with respect to the input data $X$, and $H(A, D)$ depends on the marginalized probability $p(A, D) = \int p(A, D \mid X)\, p(X)\, dX$. However, this marginalized probability is difficult to approximate well, because (1) the accuracy of the term $p(A \mid X)$ relies on the ability of our model to infer age from features, and (2) it is hard to decide on a good prior distribution on linguistic features $p(X)$. Moreover, we want to make the model agnostic to age, which would make the mutual information meaningless in the ‘ideal’ case.

In our frameworks, we do not assume specific graphical models that correlate confounds and outcomes, and we propose more explainable metrics than the traditional statistical ones.

Fairness metrics

The literature on fair representation learning offers several metrics for evaluating the extent of bias in classifiers. Generally, the fairer the classifier is, the less entangled the results are with respect to some protected features.

Demographic parity. Zemel et al. (2013) stated that the fairest scenario is reached when the composition of the classifier outcome for the protected group is equal to that of the whole population. While generally useful, this does not apply to our scenario, in which there really are more elderly people suffering from cognitive impairments than younger people (see Figure 2).

Cross-entropy loss. Edwards & Storkey (2016) used, as a measure of fairness, the binary classification loss of an adversary that tried to predict sensitive data from latent representations. This measure can only apply to models containing an adversary component, not to traditional classifiers. Moreover, this loss also depends on the ability of the adversary network. For example, a high value of this loss could indicate confusing representations (so sensitive information is protected well), but it could also indicate a weak adversary.

Equalized odds. Hardt et al. (2016) proposed a method in which false positive rates should be equal across groups in the ideal case. Madras et al. (2018) defined fairness distance as the absolute difference in false positive rates between two groups, plus that of the false negative rates:

$$\Delta_{eo} = \left| p_{FP_0} - p_{FP_1} \right| + \left| p_{FN_0} - p_{FN_1} \right|$$

where $p_{FP_a}$ and $p_{FN_a}$ correspond to the false positive rate and false negative rate, respectively, with sensitive attribute $a$ ($a \in \{0, 1\}$).

Our metric

We propose an extension of the metric used by Madras et al. (2018) to continuous sensitive attributes, suitable for evaluating an arbitrary two-class classifier.

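First, the age scale is divided into groups so that each group contains multiple participants with positive diagnoses and multiple participants with negative diagnoses. Let $a$ be the age group each participant is in.

Then, we aim for the expected false positive (FP) rates of the classifier to be as constant as possible across age groups. This applies likewise to the false negative (FN) rates. To measure their variability, we use the sum of their absolute differences from the mean:

$$\Delta_{eo}^{(N)} = \sum_{a} \left( \left| p_{FP_a} - \bar{p}_{FP} \right| + \left| p_{FN_a} - \bar{p}_{FN} \right| \right)$$

where $p_{FP_a}$ and $p_{FN_a}$ are the false positive and false negative rates in age group $a$, $N$ is the number of age groups, and $\bar{x}$ represents the mean of variable $x$.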
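As a minimal sketch of how such a score could be computed from predictions, the NumPy functions below assume binary labels (1 = dementia), binary predictions, and one integer age-group index per participant. The names `delta_eo` and `age_groups`, and the quantile-based grouping, are illustrative assumptions and not the authors' implementation.

```python
import numpy as np

def delta_eo(y_true, y_pred, groups):
    """Sum over age groups of |FP_a - mean FP| + |FN_a - mean FN|.

    Assumes every group contains both positive and negative examples.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    fp_rates, fn_rates = [], []
    for g in np.unique(groups):
        in_g = groups == g
        neg = in_g & (y_true == 0)
        pos = in_g & (y_true == 1)
        # FP rate: fraction of true negatives predicted positive.
        fp_rates.append(np.mean(y_pred[neg] == 1))
        # FN rate: fraction of true positives predicted negative.
        fn_rates.append(np.mean(y_pred[pos] == 0))
    fp = np.array(fp_rates)
    fn = np.array(fn_rates)
    return float(np.sum(np.abs(fp - fp.mean())) + np.sum(np.abs(fn - fn.mean())))

def age_groups(ages, n_groups):
    """Hypothetical helper: split ages into n_groups quantile bins (0-indexed)."""
    ages = np.asarray(ages, dtype=float)
    edges = np.quantile(ages, np.linspace(0, 1, n_groups + 1)[1:-1])
    return np.digitize(ages, edges)
```

Quantile bins are only one way to form groups; any grouping works as long as each group keeps enough positive and negative participants for the per-group rates to be meaningful.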

Analysis of metric

Special cases

To illustrate the nature of our metric, we apply it to several special cases, i.e.:

  1. When there is only one age group, our fairness metric has its best possible value: $\Delta_{eo}^{(1)} = 0$.

  2. When there are only two age groups, our metric equals that of Madras et al. (2018).

  3. In the extreme case where there are as many age groups as there are sample points (assuming there are no two people with identical ages but with different diagnoses), our metric becomes less informative, because the empirical expected false positive rate of each group is either $0$ or $1$. This is a limitation of our metric, and is the reason that we limit the number of age groups to accommodate the size of the training dataset.
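The second special case can be checked by hand. Writing the false positive rates of the two groups as $p_{FP_0}$ and $p_{FP_1}$ (symbols assumed here for illustration), the sum of absolute differences from the mean collapses to a single absolute difference:

```latex
\bar{p}_{FP} = \tfrac{1}{2}\left(p_{FP_0} + p_{FP_1}\right), \qquad
\sum_{a \in \{0,1\}} \left| p_{FP_a} - \bar{p}_{FP} \right|
  = 2 \cdot \tfrac{1}{2}\left| p_{FP_0} - p_{FP_1} \right|
  = \left| p_{FP_0} - p_{FP_1} \right|
```

The same identity holds for the false negative rates, so with two groups the score is the absolute FP difference plus the absolute FN difference.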


Figure 1: Model structures. Each colored arrow denotes a neural network. The common components are interpreters $I$, adversary $A$, and classifier $C$. In age-indep-autoencoder and age-indep-entropy, a reconstructor $R$ tries to reconstruct input data from the hidden representation. In age-indep-consensus-nets, a discriminator $D$ tells apart from which modality the representation originates.


Our metric is bounded. The lower bound, $0$, is reached when all false positive rates are equal and when all false negative rates are equal across age groups. Letting $N$ be the number of age groups, an upper bound for $\Delta_{eo}^{(N)}$ is $N$ for any better-than-trivial binary classifier. The detailed proof is included in the Appendix.
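One way to see why a bound of this kind holds is sketched below. It assumes the score sums, over $N$ age groups, the absolute deviations of per-group FP and FN rates from their means. It is an illustrative argument and not the Appendix proof.

```latex
\text{Let } r_1, \dots, r_N \in [0, 1] \text{ be per-group rates with mean } \bar{r}.
\text{ The map } (r_1, \dots, r_N) \mapsto \sum_{a=1}^{N} |r_a - \bar{r}|
\text{ is convex, so its maximum over } [0,1]^N \text{ is attained at a vertex.}

\text{With } k \text{ rates equal to } 1 \text{ and } N - k \text{ equal to } 0:\quad
\sum_{a=1}^{N} |r_a - \bar{r}|
  = k\left(1 - \tfrac{k}{N}\right) + (N - k)\,\tfrac{k}{N}
  = \tfrac{2k(N - k)}{N} \;\le\; \tfrac{N}{2}.
```

Applying this once to the FP rates and once to the FN rates gives a total of at most $N$.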


Our fairness metric illustrates disentanglement. A higher $\Delta_{eo}^{(N)}$ corresponds to a higher variation of incorrect predictions by the classifier across different age groups. Therefore, a lower value of $\Delta_{eo}^{(N)}$ is desired for classifiers that isolate the effects of age to a better extent. Throughout this paper, we use the terms ‘fairness’, ‘disentanglement’, and ‘isolation’ interchangeably.

Design choices

We explain a few design choices here, namely linearity and indirect optimization.

Linearity. We encourage $\Delta_{eo}^{(N)}$ to be as linear as possible, for explainability of the fairness score itself. This rules out scores consisting of higher-order terms of FP / FN rates.

Indirect optimization. We avoid directly optimizing the fairness score $\Delta_{eo}^{(N)}$. The reasons are twofold. On one hand, although $\Delta_{eo}^{(N)}$ is correlated with the disentanglement between age and classification, it is based on FP / FN rates and hence bears their limitations: FP / FN rates do not capture all aspects of classifiers. Instead of making the representations beneficial for $\Delta_{eo}^{(N)}$, we encourage the hidden representations to be age-agnostic (we explain how to set up age-agnostic models in the following section). On the other hand, FP / FN rates are not differentiable.


In this section, we describe four different ways of building representation learning models, which we call age-indep-simple, age-indep-autoencoder, age-indep-consensus-net, and age-indep-entropy.


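The simplest model, age-indep-simple, consists of an interpreter network $I(\cdot)$ to compress high-dimensional input data, $x$, to low-dimensional representations:

$$z = I(x)$$

An adversary $A(\cdot)$ tries to predict the exact age $a$ from the representation:

$$\hat{a} = A(z)$$

A classifier $C(\cdot)$ estimates the probability of the label (diagnosis) $y$ based on the representation:

$$\hat{y} = C(z)$$

Algorithm 1 Training age-indep-simple
Initialize I, A, C
for step := 1 to N do                      (N is a hyper-parameter)
    for minibatch x in training data do
        z = I(x), â = A(z), ŷ = C(z)
        Calculate L_c, L_a
        update I, C to minimize L_c − L_a  (backprop gradients)
        for k := 1 to K do                 (K is a hyper-parameter)
            update A to minimize L_a       (backprop gradients)

For optimization, we set up two losses: the classification negative log likelihood loss $L_c$ and the adversarial (L2) loss $L_a$, where:

$$L_c = -\mathbb{E}_x\left[\log p(y \mid z)\right], \qquad L_a = \mathbb{E}_x\left[(a - \hat{a})^2\right]$$

Here $p(y \mid z)$ is the probability that $\hat{y}$ assigns to the true label. We want to train the adversary to minimize the L2 loss, to train the interpreter to maximize it, and to train the classifier (and interpreter) to minimize the classification loss. Overall,

$$\min_{A} L_a, \qquad \min_{C} L_c, \qquad \min_{I} \left( L_c - L_a \right)$$

The training steps are taken iteratively, as in previous work (Goodfellow et al., 2014).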
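The alternating updates described above can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the authors' code: the data are random stand-ins, the layer sizes, learning rates, and values of `N_STEPS` and `K` are arbitrary, the helper `mlp` is hypothetical, and a single full batch replaces the minibatch loop for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mlp(n_in, n_hidden, n_out):
    """Hypothetical helper: a small fully connected network with one ReLU layer."""
    return nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_out))

# Toy stand-in data: 64 samples, 20 features, a binary label, and an age.
x = torch.randn(64, 20)
y = torch.randint(0, 2, (64,))
age = 60 + 10 * torch.randn(64, 1)
age = (age - age.mean()) / age.std()  # standardized so the L2 loss is well scaled

interpreter = mlp(20, 16, 4)   # x -> z
adversary = mlp(4, 8, 1)       # z -> predicted age
classifier = mlp(4, 8, 2)      # z -> class logits

opt_ic = torch.optim.Adam(
    list(interpreter.parameters()) + list(classifier.parameters()), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
nll = nn.CrossEntropyLoss()    # negative log likelihood over softmax outputs
l2 = nn.MSELoss()

N_STEPS, K = 5, 3              # hyper-parameters, values chosen arbitrarily
history = []
for step in range(N_STEPS):
    # Interpreter and classifier: minimize classification loss, maximize adversary loss.
    z = interpreter(x)
    loss_c = nll(classifier(z), y)
    loss_a = l2(adversary(z), age)
    opt_ic.zero_grad()
    (loss_c - loss_a).backward()
    opt_ic.step()
    # Adversary: K steps minimizing the L2 age loss on a detached representation.
    for _ in range(K):
        z = interpreter(x).detach()
        loss_a = l2(adversary(z), age)
        opt_a.zero_grad()
        loss_a.backward()
        opt_a.step()
    history.append((loss_c.item(), loss_a.item()))
```

Detaching `z` in the inner loop keeps the adversary's updates from changing the interpreter, and calling `opt_a.zero_grad()` before each adversary step discards the gradients that the interpreter update left on the adversary's parameters.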


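The age-indep-autoencoder structure is similar to that of Madras et al. (2018), and can be seen as an extension of the age-indep-simple structure. As in age-indep-simple, there is an interpreter $I(\cdot)$, an adversary $A(\cdot)$, and a classifier $C(\cdot)$ network. The difference is that there is a reconstructor network $R(\cdot)$ that attempts to recover the input data $x$ from the hidden representation $z = I(x)$:

$$\hat{x} = R(z)$$

The loss functions are set up as:

$$L_c = -\mathbb{E}_x\left[\log p(y \mid z)\right], \qquad L_a = \mathbb{E}_x\left[(a - \hat{a})^2\right], \qquad L_r = \mathbb{E}_x\left[\lVert x - \hat{x} \rVert^2\right]$$

where $y$ is the diagnosis label, $a$ is the age, and $\hat{a} = A(z)$ is the adversary's prediction. Overall, we want to train both the interpreter and the reconstructor to minimize the reconstruction loss term $L_r$, in addition to all targets mentioned for the age-indep-simple network.

The detailed algorithm is similar to Algorithm 1 and is in the Appendix.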


The age-indep-consensus-net model is another extension of the age-indep-simple structure, borrowing an idea from consensus networks (Zhu et al., 2018a), i.e., that agreements between multiple modalities can result in representations that are beneficial for classification. By examining the performance of age-indep-consensus-net, we would like to see whether agreement between multiple modalities of data can be trained to be disentangled from age.

As in the age-indep-simple structure, there is an adversary $A(\cdot)$ and a classifier $C(\cdot)$. The interpreter, however, is replaced with several interpreters $I_1(\cdot), \dots, I_M(\cdot)$, each compressing a subset of the input data (a “modality”) $x_m$ into a low-dimensional representation $z_m = I_m(x_m)$. The key idea of age-indep-consensus-net models is that these representations are encouraged to be indistinguishable. For simplicity, we randomly divide the input features into three modalities ($M = 3$) with equal ($\pm 1$) numbers of features. A discriminator $D(\cdot)$ tries to identify the modality from which the representation comes:

$$\hat{m} = D(z_m)$$

The loss functions are set up as:

$$L_d = -\mathbb{E}_{x, m}\left[\log p(m \mid z_m)\right]$$

together with $L_c$, the classification negative log likelihood loss, and $L_a$, the adversarial (L2) loss, both as in age-indep-simple. Here $L_d$ is the classification loss of the discriminator.

Overall, we want to iteratively optimize the networks:

$$\min_{D} L_d, \qquad \min_{A} L_a, \qquad \min_{C} L_c, \qquad \min_{I_1, \dots, I_M} \left( L_c - L_a - L_d \right)$$

The detailed algorithm is in the Appendix. Note that we do not combine the consensus network with the reconstructor because they do not work well with each other empirically. In one of the experiments by Zhu et al. (2018b), each interpreter is paired with a reconstructor and the performance decreases dramatically. The reconstructor encourages hidden representations to retain the fidelity of data, while the consensus network urges hidden representations to keep only the information common among modalities, which prevents the reconstructor and the consensus mechanism from functioning together.


The fourth model we apply to fair representation learning is motivated by categorical GANs (Springenberg, 2016), where information theoretic metrics characterizing the confidences of predictions can be optimized. This motivates an additional loss function term; i.e., we want to encourage the interpreter to increase the uncertainty (i.e., to maximize the entropy) of the adversary's predictions, while letting the adversary become more confident in predicting ages from representations.

Age-indep-entropy models have the same network structures as age-indep-autoencoder, except that instead of predicting the exact age, the adversary network outputs the probability of the sample age $a$ being larger than the mean $\bar{a}$:

$$\hat{p} = A(z) = p(a > \bar{a} \mid z)$$

This enables us to define the empirical entropy $H[\hat{p}] = -\hat{p} \log \hat{p} - (1 - \hat{p}) \log (1 - \hat{p})$, which describes the uncertainty of predicting age.

Formally, the loss functions are set up as follows:

$$L_a = \mathbb{E}_x\left[ -b \log \hat{p} - (1 - b) \log (1 - \hat{p}) \right] + \lambda\, \mathbb{E}_x\left[ H[\hat{p}] \right]$$

with the classification and reconstruction losses unchanged from age-indep-autoencoder, where $b = 1$ if $a > \bar{a}$ and $b = 0$ otherwise, and $\lambda$ is a hyper-parameter. For comparison, we also include two variants, namely the age-indep-entropy (binary) and age-indep-entropy (Honly) variants, each keeping only one of the two terms in $L_a$. In our experiments, we show that these two terms in $L_a$ are better applied together.

Overall, the training procedure is the same as that of age-indep-autoencoder, and the algorithm pseudocode is in the Appendix.
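To make the two adversary terms concrete, the following NumPy sketch computes a binary cross-entropy term on whether age exceeds the mean, plus a weighted entropy term. It assumes the adversary emits a probability per sample; the function names and the weight `lam` are illustrative assumptions, and the exact form used by the authors may differ.

```python
import numpy as np

def binary_entropy(p_hat, eps=1e-12):
    """Entropy (in nats) of Bernoulli predictions p_hat, elementwise."""
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def adversary_loss(p_hat, age, lam=1.0, eps=1e-12):
    """Binary cross-entropy on (age > mean age) plus lam * mean entropy."""
    age = np.asarray(age, dtype=float)
    target = (age > age.mean()).astype(float)
    p = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return float(np.mean(bce) + lam * np.mean(binary_entropy(p, eps)))
```

Setting `lam=0` keeps only the binary term, which corresponds in spirit to the (binary) variant; keeping only the entropy term corresponds to the (Honly) variant.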


All models are implemented in PyTorch (Paszke et al., 2017) and optimized with Adam (Kingma & Ba, 2014) using a preset initial learning rate and L2 weight decay. For simplicity, we use fully connected networks with ReLU activations (Nair & Hinton, 2010) and batch normalization (Ioffe & Szegedy, 2015) before output layers, for all interpreter, adversary, classifier, and discriminator networks. Our frameworks can be applied to other types of networks in the future.




DementiaBank is the largest available public dataset for assessing cognitive impairments using speech, containing 473 narrative picture descriptions from subjects aged between 45 and 90 (Becker et al., 1994). In each sample, a participant talks about what is happening in a clinically validated picture. There is no time limit in each session, but the average description lasts about a minute. 79 samples are excluded due to missing age information. In the remaining data samples, 182 are labeled ‘control’, and 213 are labeled ‘dementia’. All participants have Mini-Mental State Examination (MMSE) scores (Folstein et al., 1975) between 1 and 30. (A higher MMSE score corresponds to a healthier estimated cognitive ability: scores 24 to 30 typically indicate a healthy state, 18-23 usually indicate mild cognitive impairment (MCI), and scores below 17 indicate dementia or another type of cognitive impairment. To formulate a binary classification task, we label all of MCI and dementia as ‘dementia’.) Of all data samples containing age information, the mean age is 68.26 and the standard deviation is 9.00.

Famous People

The Famous People dataset (Balagopalan et al., 2018) contains 252 transcripts from 17 people (8 with dementia including Gene Wilder, Ronald Reagan and Glen Campbell, and 9 healthy controls including Michael Bloomberg, Woody Allen, and Tara VanDerveer), collected and transcribed from publicly available speech data (e.g., press conferences, interviews, debates, talk shows). Seven data samples are discarded due to missing age information. Among the remaining samples, there are 121 labeled as control and 124 as impaired. Note that the data samples were gathered across a wide range of ages (mean 59.25, standard deviation 13.60). For those people diagnosed with dementia, there are data samples gathered both before and after the diagnosis, all of which are labeled as ‘dementia’. The Famous People dataset permits early detection several years before diagnosis, which is a more challenging classification task than DementiaBank.

Older participants in both DementiaBank (Figure 2(a)) and the Famous People dataset (Figure 2(b)) are more likely to have cognitive impairments.

(a) Histogram plot for DementiaBank
(b) Histogram plot for Famous People Dataset
Figure 2: Expository histogram plots for the ages of people in the impaired and control groups.

Preprocessing and feature extraction

We extract 413 linguistic features from the narrative descriptions and their transcripts. These features were previously identified as the most useful for this task (Roark et al., 2007; Fraser et al., 2015; Lunsford & Heeman, 2015; Hernández-Domínguez et al., 2018). Each feature is z-score normalized. Relevant features include:

Acoustic: mean, variance, skewness, and kurtosis of the first 42 cepstral coefficients.

Speech fluency: pause-word ratio, utterance length, number and lengths of filled/unfilled pauses.

Lexical: cosine similarity between pairs of utterances, word lengths, lexical richness (moving-average type-token ratio, Brunet’s index, and Honoré’s statistics (Guinn & Habash, 2012)).

Part-of-speech: number of occurrences of part-of-speech tags, tagged by SpaCy.

Syntactic and semantic: occurrences of context-free grammar phrase types, parsed by Stanford CoreNLP (Manning et al., 2014), and Yngve depth statistics (Yngve, 1960).
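As an illustration of the lexical richness features named in the list above, the sketch below uses the commonly cited definitions of the moving-average type-token ratio, Brunet's index, and Honoré's statistic. The window size, the whitespace tokenization, and the function names are assumptions made for this example; the feature extraction pipeline used in the paper may differ in these details.

```python
import math
from collections import Counter

def mattr(tokens, window=20):
    """Moving-average type-token ratio over sliding windows of `window` tokens."""
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)

def brunet_index(tokens):
    """Brunet's W = N ** (V ** -0.165); lower values indicate richer vocabulary."""
    n, v = len(tokens), len(set(tokens))
    return n ** (v ** -0.165)

def honore_statistic(tokens):
    """Honoré's R = 100 * ln(N) / (1 - V1 / V), where V1 counts words used once."""
    counts = Counter(tokens)
    n, v = len(tokens), len(counts)
    v1 = sum(1 for c in counts.values() if c == 1)
    if v1 == v:
        return float("inf")  # every word is unique, so R is unbounded
    return 100.0 * math.log(n) / (1.0 - v1 / v)

# Toy utterance, tokenized on whitespace for illustration only.
tokens = "the boy is on the stool and the girl is laughing".split()
features = (mattr(tokens), brunet_index(tokens), honore_statistic(tokens))
```

Here $N$ is the number of tokens, $V$ the number of distinct word types, and $V_1$ the number of types that occur exactly once.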

Linguistic features can predict age

As part of expository data analysis, we show that these linguistic features contain information indicating age. Simple fully connected neural networks can predict age, as measured by mean absolute error in years, on both DementiaBank (hidden layer sizes 64, 32, 8; 5-fold cross validation) and the Famous People dataset (hidden layer sizes 32, 20, 2; 5-fold cross validation). This indicates that even simple neural networks are able to infer information about age from linguistic features. Neural classifiers can therefore also easily rely on age, given the utility of age in downstream tasks.

Evaluating classical classifiers and disentanglement methods

We first set up benchmarks for our classifiers. We evaluate several traditional classifiers with our fairness metrics ($\Delta_{eo}^{(2)}$ and $\Delta_{eo}^{(5)}$, corresponding to dividing ages into 2 and 5 groups, respectively). The results are listed in Table 1. All accuracy and fairness results in this paper are based on 5-fold cross validation, where no speaker occurs in both train and test data. A DNN is used as the baseline because (1) all our models are based on neural networks, and (2) DNN classifiers have had the best (or statistically indistinguishable from the best) accuracy on the DementiaBank and Famous People datasets.


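Classifier | DementiaBank Accuracy | DementiaBank $\Delta_{eo}^{(2)}$ | DementiaBank $\Delta_{eo}^{(5)}$ | Famous People Accuracy | Famous People $\Delta_{eo}^{(2)}$ | Famous People $\Delta_{eo}^{(5)}$
DNN | .78 ± .05 | 0.13 ± 0.12 | 0.94 ± 0.23 | .59 ± .05 | 0.30 ± 0.19 | 1.56 ± 0.60
SVM | .77 ± .05 | 0.17 ± 0.13 | 0.93 ± 0.29 | .60 ± .04 | 0.23 ± 0.19 | 1.28 ± 0.29
Random Forest | .74 ± .03 | 0.19 ± 0.14 | 1.07 ± 0.36 | .56 ± .06 | 0.33 ± 0.26 | 1.35 ± 0.42
Adaboost | .78 ± .07 | 0.14 ± 0.11 | 0.96 ± 0.22 | .54 ± .04 | 0.23 ± 0.14 | 1.36 ± 0.57
Table 1: Accuracy and fairness ($\Delta_{eo}^{(2)}$ and $\Delta_{eo}^{(5)}$) of several traditional classifiers, all using raw features. DNN is the baseline used to benchmark our neural network based representation learning models.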
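The evaluation protocol behind these results keeps every speaker entirely in either the training or the test portion of each fold. A minimal standard-library sketch of such a split is shown below; the function name `speaker_kfold`, the round-robin assignment of speakers to folds, and the seed are assumptions for illustration, not the authors' procedure.

```python
import random
from collections import defaultdict

def speaker_kfold(speaker_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) so that no speaker is split across the two."""
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    # Assign whole speakers to folds in round-robin order.
    folds = [speakers[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        test_idx = [i for s in held_out for i in by_speaker[s]]
        train_idx = [i for s in speakers if s not in held for i in by_speaker[s]]
        yield train_idx, test_idx
```

Splitting by speaker matters most for the Famous People data, where 17 people contribute many transcripts each and a sample-level split would let a classifier recognize individuals instead of impairment.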


Model | DementiaBank Accuracy | DementiaBank $\Delta_{eo}^{(2)}$ | DementiaBank $\Delta_{eo}^{(5)}$ | Famous People Accuracy | Famous People $\Delta_{eo}^{(2)}$ | Famous People $\Delta_{eo}^{(5)}$
DNN baseline | .78 ± .05 | 0.13 ± 0.12 | 0.94 ± 0.23 | .59 ± .05 | 0.30 ± 0.19 | 1.56 ± 0.60
*-simple | .75 ± .00 | 0.08 ± 0.01 | 0.80 ± 0.08 | .57 ± .05 | 0.24 ± 1.90 | 1.47 ± 0.57
*-autoencoder | .76 ± .01 | 0.11 ± 0.00 | 0.88 ± 0.24 | .55 ± .07 | 0.21 ± 0.16 | 1.28 ± 0.31
*-consensus-nets | .72 ± .00 | 0.11 ± 0.01 | 0.83 ± 0.24 | .58 ± .05 | 0.25 ± 0.16 | 1.43 ± 0.41
*-entropy | .75 ± .00 | 0.15 ± 0.01 | 0.88 ± 0.24 | .58 ± .06 | 0.23 ± 0.16 | 1.35 ± 0.44
*-entropy (binary) | .72 ± .00 | 0.12 ± 0.01 | 1.10 ± 0.37 | .55 ± .07 | 0.26 ± 1.53 | 1.41 ± 0.40
*-entropy (Honly) | .74 ± .00 | 0.17 ± 0.02 | 1.27 ± 0.54 | .53 ± .06 | 0.20 ± 0.16 | 1.39 ± 0.49
Table 2: Evaluation results of our representation learning models. The “age-indep” prefix is replaced with “*” in model names. age-indep-simple and age-indep-autoencoder have better disentanglement scores, while the other two models could have better accuracy.

Performance and discussion

We evaluate the performances of our four proposed neural networks against the DNN baseline. As an additional ablation study, two variants of age-indep-entropy are also evaluated. Table 2 shows classification accuracies and fairness metrics, and the DNN baseline for comparison. Several observations emerge, as discussed below.


The fair representation learning models compromise accuracy, in comparison to DNN baselines. This confirms that part of the classification power of DNNs comes from relying on age. On DementiaBank, the age-indep-autoencoder reduces accuracy the least (only 2.56% in comparison to the DNN baseline). On the Famous People data, age-indep-consensus and age-indep-entropy models compromise accuracies by only 2.25% and 2.75% respectively, which are not statistically different from the DNN baseline (38-DoF one-tailed t-tests).

In comparison to DNN baselines, our fair representation learning models improve disentanglement/fairness, and the improvements are mostly significant when measured by the two-group scores $\Delta_{eo}^{(2)}$. On DementiaBank, the improvements on $\Delta_{eo}^{(2)}$ are significant for age-indep-simple and age-indep-entropy and marginally significant for age-indep-autoencoder and age-indep-consensus-net, whereas the differences on $\Delta_{eo}^{(5)}$ are less significant ($p$ values of 0.05, 0.31, 0.44, and 0.16); all of these are 38-DoF one-tailed t-tests. Also, the five-group scores $\Delta_{eo}^{(5)}$ are less stable for both datasets, and the scores on the Famous People dataset have higher variances than those on DementiaBank. The explanation is as follows. DementiaBank has about 400 data samples. In 5-fold cross validation, each of the five age groups has only 16 samples during evaluation. The Famous People data contains only about 250 samples, which further increases the variance. When the number of groups, $N$ in $\Delta_{eo}^{(N)}$, is kept small (e.g., 100 samples per label per group, as in DementiaBank $\Delta_{eo}^{(2)}$), the fairness metrics are stable.

Side notes

The model age-indep-entropy is best used with a loss function containing both the binary classification term and the uncertainty minimization term. As shown in Table 2, although their fairness metrics are similar to those of age-indep-entropy (none of the differences, for either variant on either dataset, is significant on 38-DoF one-tailed t-tests), the two variants with only one term could have lower accuracy than age-indep-entropy.

In general, age-indep-simple and age-indep-autoencoder achieve the best fairness metrics. Notably, the better of the two surpasses traditional classifiers in both $\Delta_{eo}^{(2)}$ and $\Delta_{eo}^{(5)}$.


Here, we identify the problem of entangling age in the detection of cognitive impairments. After explaining this problem with causality diagrams, we formulate it into a fair representation learning task, and propose a fairness score to measure the extent of disentanglement. We put forward four fair representation learning models that learn low-dimensional representations of data samples containing as little age information as possible. Our best model improves upon the DNN baseline in our fairness metrics, while compromising as little accuracy as 2.56% (on DementiaBank) and 2.25% (on the Famous People dataset).


  • Association (2018) Alzheimer’s Association. Alzheimer’s disease facts and figures. Alzheimer’s & dementia, 2018.
  • Balagopalan et al. (2018) Aparna Balagopalan, Jekaterina Novikova, and Frank Rudzicz. Early prediction of Alzheimer’s disease from spontaneous speech. Submitted to AAAI, 2018.
  • Becker et al. (1994) James T Becker, François Boller, Oscar L Lopez, Judith Saxton, and Karen L McGonigle. The natural history of Alzheimer’s disease: description of study cohort and accuracy of diagnosis. Archives of Neurology, 51(6):585–594, 1994.
  • Caliskan et al. (2017) Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • Deary et al. (2009) Ian J Deary, Janie Corley, Alan J Gow, Sarah E Harris, Lorna M Houlihan, Riccardo E Marioni, Lars Penke, Snorri B Rafnsson, and John M Starr. Age-associated cognitive decline. British medical bulletin, 92(1):135–152, 2009.
  • Edwards & Storkey (2016) Harrison Edwards and Amos Storkey. Censoring representations with an adversary. In ICLR, 2016.
  • Folstein et al. (1975) Marshal F Folstein, Susan E Folstein, and Paul R McHugh. “Mini-mental state”: a practical method for grading the cognitive state of patients for the clinician. Journal of psychiatric research, 12(3):189–198, 1975.
  • Fraser et al. (2015) Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. Linguistic features identify Alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease, 49:407–422, 2015.
  • Fraser et al. (2013) Katie Fraser, Frank Rudzicz, and Elizabeth Rochon. Using text and acoustic features to diagnose progressive aphasia and its subtypes. In Proc. Interspeech, pp. 2177–2181, Lyon France, aug 2013.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
  • Guinn & Habash (2012) Curry I Guinn and Anthony Habash. Language analysis of speakers with dementia of the Alzheimer’s type. In AAAI Fall Symposium: Artificial Intelligence for Gerontechnology, pp. 8–13. Menlo Park, CA, 2012.
  • Harada et al. (2013) Caroline N Harada, Marissa C Natelson Love, and Kristen L Triebel. Normal cognitive aging. In Clinics in geriatric medicine, volume 29, pp. 737–752. Elsevier, 2013.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, Nati Srebro, et al. Equality of opportunity in supervised learning. In NIPS, pp. 3315–3323, 2016.
  • Hernández-Domínguez et al. (2018) Laura Hernández-Domínguez, Sylvie Ratté, Gerardo Sierra-Martínez, and Andrés Roche-Bergua. Computer-based evaluation of AD and MCI patients during a picture description task. In Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring. Elsevier, 2018.
  • Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456, 2015.
  • Julia et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine Bias. ProPublica, 2016.
  • Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • Koller & Friedman (2009) Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Kwak & Choi (2002) Nojun Kwak and Chong Ho Choi. Input feature selection by mutual information based on Parzen window. In IEEE Trans. Patt. Anal. Mach. Intell., volume 24, pp. 1667–1671, 2002.
  • Lee (2009) Odella Lee. Camera misses the mark on racial sensitivity. Gizmodo, 2009.
  • Lehr et al. (2012) Maider Lehr, Emily Prud’hommeaux, Izhak Shafran, and Brian Roark. Fully automated neuropsychological assessment for detecting mild cognitive impairment. In Proc. Interspeech, pp. 1039–1042, 2012.
  • Lunsford & Heeman (2015) Rebecca Lunsford and Peter A Heeman. Using linguistic indicators of difficulty to identify mild cognitive impairment. In Proc. Interspeech, pp. 658–662, 2015.
  • Madras et al. (2018) David Madras, Elliot Creager, Toniann Pitassi, and Richard Zemel. Learning adversarially fair and transferable representations. In ICML, pp. 3381–3390, 2018.
  • Manning et al. (2014) Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J Bethard, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60, 2014.
  • Murman (2015) Daniel L Murman. The impact of age on cognition. In Seminars in hearing, volume 36, pp. 111. Thieme Medical Publishers, 2015.
  • Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pp. 807–814, 2010.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
  • Roark et al. (2007) Brian Roark, Margaret Mitchell, and Kristy Hollingshead. Syntactic complexity measures for detecting mild cognitive impairment. In Workshop on BioNLP 2007, pp. 1–8. Association for Computational Linguistics, 2007.
  • Singh et al. (2001) Sameer Singh, Romola S. Bucks, and Joanne M. Cuerden. Evaluation of an objective technique for analysing temporal variables in DAT spontaneous speech. In Aphasiology, volume 15, pp. 571–583. Routledge, 2001.
  • Springenberg (2016) Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2016.
  • Yngve (1960) Victor H Yngve. A model and an hypothesis for language structure. Proceedings of the American philosophical society, 104(5):444–466, 1960.
  • Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning Fair Representations. In ICML, pp. 325–333, Atlanta, Georgia, USA, 2013.
  • Zhu et al. (2018a) Zining Zhu, Jekaterina Novikova, and Frank Rudzicz. Detecting cognitive impairments by agreeing on interpretations on linguistic features. arXiv:1808.06570, 2018a.
  • Zhu et al. (2018b) Zining Zhu, Jekaterina Novikova, and Frank Rudzicz. Semi-supervised classification by reaching consensus among modalities. arXiv:1805.09366, 2018b.

Appendix 1: Proof of Upper Bound of $\Delta_{eo}^{(N)}$

In this section, we detail the steps leading to an upper bound for the metric $\Delta_{eo}^{(N)}$.

Proposition The expectation of all false positive and false negative rates are bounded by .

This gives an upper bound to our metric $\Delta_{eo}^{(N)}$. If the classifier is not trivial, there is a tighter upper bound.

Definition A trivial binary classifier always predicts the majority class.

Lemma The expected error rate of a trivial binary classifier is no more than 0.5.

Proof of Lemma Let $p$ ($0 \le p \le 1$) denote the proportion of positive samples in the dataset. Table 3 shows the possible values of the error rates. A trivial classifier predicts the majority class, so its error rate is $\min(p, 1 - p) \le 0.5$, regardless of whether the dataset has balanced classes.

Trivial prediction              0          1
False positive rate (FP)        0          $1 - p$
False negative rate (FN)        $p$        0
Error rate (FP + FN)            $p$        $1 - p$
Table 3: Error statistics of a trivial binary classifier for each class it could always predict.
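The lemma can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `trivial_error_rates` and `majority_class_error` and the argument `pos_fraction` are hypothetical, and the rates are assumed to be measured as fractions of the whole dataset, so that FP + FN is the overall error rate.

```python
def trivial_error_rates(pos_fraction):
    """Error statistics of the two constant classifiers (hypothetical helper).

    `pos_fraction` is the fraction of positive samples in the dataset.
    Rates are fractions of the whole dataset (an assumption), so FP + FN
    equals the overall error rate.
    """
    if not 0.0 <= pos_fraction <= 1.0:
        raise ValueError("pos_fraction must lie in [0, 1]")
    return {
        # Always predicting 0: no false positives, every positive is missed.
        0: {"fp": 0.0, "fn": pos_fraction, "error": pos_fraction},
        # Always predicting 1: no false negatives, every negative is flagged.
        1: {"fp": 1.0 - pos_fraction, "fn": 0.0, "error": 1.0 - pos_fraction},
    }


def majority_class_error(pos_fraction):
    """Error rate of the trivial classifier, which predicts the majority class."""
    rates = trivial_error_rates(pos_fraction)
    majority = 1 if pos_fraction >= 0.5 else 0
    return rates[majority]["error"]


if __name__ == "__main__":
    for frac in (0.1, 0.5, 0.8):
        print(frac, majority_class_error(frac))
```

Sweeping `pos_fraction` over [0, 1] never produces an error rate above 0.5, with the maximum reached when the classes are balanced.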


Our score is upper bounded by for any non-trivial binary classifier:

Proof of Theorem

For each of the age groups:

Summing over the age groups gives our upper bound for non-trivial classifiers.
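To make the "per group, then sum" structure of the argument concrete, here is a small sketch. It is an illustration under stated assumptions, not the paper's definition of the score: the per-group term used here (absolute deviation of each group's FP and FN rate from the across-group mean) is assumed, the function names are hypothetical, and rates are taken as fractions of each group's samples so that FP + FN is the group's error rate.

```python
def group_rates(y_true, y_pred):
    """FP and FN rates of one group, as fractions of that group's samples."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("need two non-empty label lists of equal length")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1) / n
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0) / n
    return fp, fn


def disparity_score(groups):
    """Assumed form of an across-group disparity score (illustration only).

    `groups` is a list of (y_true, y_pred) pairs, one pair per age group.
    The score sums, over groups, the absolute deviation of the group's FP
    and FN rates from the across-group mean rates.
    """
    rates = [group_rates(t, p) for t, p in groups]
    mean_fp = sum(fp for fp, _ in rates) / len(rates)
    mean_fn = sum(fn for _, fn in rates) / len(rates)
    return sum(abs(fp - mean_fp) + abs(fn - mean_fn) for fp, fn in rates)


if __name__ == "__main__":
    young = ([0, 0, 1, 1], [0, 0, 1, 1])   # no errors in this group
    old = ([0, 0, 1, 1], [1, 0, 1, 1])     # one false positive in this group
    print(disparity_score([young, old]))
```

Under this assumed form, a classifier whose errors are spread identically across groups scores zero, and the score grows as the groups' error profiles diverge.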

Appendix 2: Algorithms for our models

The following are pseudo-code algorithms for our remaining three models: age-indep-AutoEncoder, age-indep-ConsensusNetworks, and age-indep-Entropy.

Algorithm 2 Training age-indep-AutoEncoder
Initialize , , ,
for step := 1 to N do    ▷ N is a hyper-parameter
    for minibatch in training data do
        , ,
        ▷ Reconstructing the original feature vector
        Calculate , ,
        ▷ backprop gradients
        for k := 1 to K do    ▷ K is a hyper-parameter
            ▷ backprop gradients
Algorithm 3 Training age-indep-consensus-net
Each data point is split into M modalities
Initialize , ,
for step := 1 to N do    ▷ N is a hyper-parameter
    for minibatch in training data do
        for m := 1 to M do
            ▷ interpretation
            ▷ predict modality
            ▷ predict age group
        Calculate , ,
        ▷ backprop gradients
        for k := 1 to  do    ▷  is a hyper-parameter
            ▷ optimize modality discriminator
        for k := 1 to  do    ▷  is a hyper-parameter
            ▷ optimize adversary
Algorithm 4 Training age-indep-Entropy
Initialize , , ,
for step := 1 to N do    ▷ N is a hyper-parameter
    for minibatch in training data do
        , , ,
        Calculate ,
        Calculate
        ▷ backprop gradients
        for k := 1 to K do    ▷ K is a hyper-parameter
            ▷ backprop gradients
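The three listings share one control flow: for each of N steps and each minibatch, one update with backpropagated gradients is followed by one or more inner loops of repeated updates. The sketch below captures only that loop structure in plain Python. The function name `train_alternating` and its callback arguments are hypothetical, and treating the first update as the "main" one and the inner loops as discriminator or adversary updates is an assumption drawn from the comments in the consensus-network listing. No actual networks or losses are modelled.

```python
def train_alternating(num_steps, minibatches, main_update, inner_updates):
    """Run the shared training control flow and count the updates performed.

    num_steps     -- number of outer steps (N in the listings)
    minibatches   -- a re-iterable collection of minibatches, e.g. a list
    main_update   -- callable(batch): forward pass, losses, one backprop step
    inner_updates -- list of (callable(batch), repeats) pairs, e.g. one pair
                     for an adversary, or two for a discriminator and an
                     adversary (hypothetical roles)
    """
    calls = {"main": 0, "inner": [0] * len(inner_updates)}
    for _ in range(num_steps):                      # outer loop over steps
        for batch in minibatches:                   # loop over training data
            main_update(batch)                      # single main update
            calls["main"] += 1
            for idx, (update_fn, repeats) in enumerate(inner_updates):
                for _ in range(repeats):            # repeated inner updates
                    update_fn(batch)
                    calls["inner"][idx] += 1
    return calls


if __name__ == "__main__":
    log = []
    counts = train_alternating(
        num_steps=2,
        minibatches=["batch_a", "batch_b"],
        main_update=lambda b: log.append(("main", b)),
        inner_updates=[(lambda b: log.append(("adversary", b)), 3)],
    )
    print(counts)
```

In a real implementation each callback would wrap an optimizer step for the corresponding network. Keeping the loop separate from the updates makes it easy to vary the number of inner repetitions, which the listings mark as a hyper-parameter.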