1 Introduction
The identification of vector autoregressive processes with hidden components from time series of observations is a central problem across Machine Learning, Statistics, and Forecasting
[1]. This problem is also known as proper learning of linear dynamical systems (LDS) in System Identification [2]. As a rather general approach to timeseries analysis, it has applications ranging from learning populationgrowth models in actuarial science and mathematical biology [3] to functional analysis in neuroscience [4]. Indeed, one encounters either partially observable processes [5] or questions of causality [6] that can be tied to proper learning of LDS [7] in almost any application domain.A discretetime model of a linear dynamical system [1]
suggests that the random variable
capturing the observed component (output, observations, measurements) evolves over time according to:(1)  
(2) 
where is the hidden component (state) and and are compatible system matrices. Random variables
capture normallydistributed process noise and observation noise, with zero means and covariance matrices
and , respectively. In this setting, proper learning refers to identifying the quadruple given the observations of . This also allows for the estimation of subsequent observations, in the socalled “predictionerror” approach to improper learning [2].We consider a generalisation of the proper learning of LDS, where:

There are a number of individuals within a population. The population is partitioned into a set of subgroups .

For each subgroup , there is a set of trajectories of observations available and each trajectory has observations for periods , possibly of varying cardinality .

Each subgroup is associated with a LDS, . For all , , the trajectory , for , is hence generated by precisely one LDS .
Note that for notations, the superscripts denote the trajectories and subgroups while subscripts indicates the periods.
In this setting, underrepresentation bias [8, cf. Section 2.2]
, where the trajectories of observations from one (“disadvantaged”) subgroup are underrepresented in the training data, harms both accuracy of the classifier overall and fairness in the sense of varying accuracy across the subgroups. This is particularly important, if the problem is constrained to be subgroupblind, i.e., constrained to consider only a single LDS as a model. This is the case, when the use of attributes distinguishing each subgroup can be regarded as discriminatory (e.g., gender, race, cf.
[9]). Notice that such antidiscrimination measures are increasingly stipulated by the legal systems, e.g., within product or insurance pricing, where the sex of the applicant cannot be used, despite being known.A natural notion of fairness in subgroupblind learning of LDS involves estimating the system matrices or forecasting the next output of a single LDS that captures the overall behaviour across all subgroups, while taking into account the varying amounts of training data for the individual subgroups. To formalise this, suppose that we learn one LDS
from the multiple trajectories and we define a loss function that measures the loss of accuracy for a certain observation
, for , , when adopting the forecast for the overall population. For , , , we have(3) 
Let . We know that is feasible only when . Note that since each trajectory is of varying length, it is possible that at certain triple , there is no observation and , are infeasible.
We propose two novel objective to address the underrepresentation bias:

Subgroup Fairness. The objective seeks to equalise, across all subgroups, the sum of losses for the subgroup. Considering the number of trajectories in each subgroup and the number of observations across the trajectories may differ, we include as weights in the objective:
(4) 
Instantaneous Fairness. The objective seeks to equalise the instantaneous loss, by minimising the maximum of the losses across all subgroups and all times:
(5)
1.1 Contributions
Overall, our contributions are the following:

We introduce two new notions of fairness into forecasting.

We cast proper and improper learning of a linear dynamical system with fairness considerations as a noncommutative polynomial optimisation problem (NCPOP).

We prove convergence of an algorithm based on the convergent hierarchy of semidefinite programming (SDP) relaxations.

We study the numerical methods for solving the resulting NCPOP and extracting its optimiser.
This presents an algorithmic approach to addressing the underrepresentation bias studied by Blum et al. [8] and presents a step forward within the fairness in forecasting studied recently by [9, 10, 11], as outlined in the excellent survey of [12]. It follows much work on fairness in classification, e.g., [13, 14, 15, 16, 12, 17]. It is complemented by several recent studies involving dynamics and fairness [18, 19, 20], albeit not involving learning of dynamics. It relies crucially on tools developed in noncommutative polynomial optimisation [21, 22, 23, 24, 25] and noncommutative algebra [26, 27, 28, 29], which have not seen much use in Statistics and Machine Learning, yet.
2 Motivation
Insurance pricing
Let us consider two motivating examples. One important application arises in Actuarial Science. In the European Union, a directive (implementing the principle of equal treatment between men and women in the access to and supply of goods and services), bars insurers from using gender as a factor in justifying differences in individuals’ premiums. In contrast, insurers in many other territories classify insureds by gender, because females and males have different behavior patterns, which affects insurance payments. Take the annuitybenefit scheme for example. It is a wellknown fact that females have a longer life expectancy than males. The insurer will hence pay more to a female insured over the course of her lifetime, compared to a male insured, on averag [30]. Because of the directive, a unisex mortality table needs to be used. As a result, male insureds receive less benefits, while paying the same premium in total as the female subgroup [30]. Consequently, male insureds might leave the annuitybenefit scheme (known as the adverse selection), which makes the unisex mortality table more challenging to use in the estimation of the life expectancy of the “unisex” population, where female insureds become the advantaged subgroup.
To be more specific, consider a simple actuarial pricing model of annuity insurance. Insureds enter an annuitybenefit scheme at time and each insured can receive 1 euro in the end of each year for at most 10 years on the condition that it is still alive. Let denotes how many insureds left in the scheme in the end of the year. Suppose there are insureds in the beginning and the pricing interest rate is . The formula of calculating the pure premium is in (6), thus summing up the present values of payment in each year and then divided by the number of insureds in the beginning.
(6) 
The most important quality is derived from estimating insureds’ life expectancy. Suppose the insureds can be divided into female subgroup and male subgroup and each subgroup only have one trajectory: for female subgroup, for male subgroup for , where the superscript is dropped. The two trajectories indicate how many female and male insureds are alive in the end of the year respectively. Both trajectories can be regarded as linear dynamic systems. We have
(7)  
(8)  
(9) 
where and are measurement noises while and are system matrices for female LDS and male LDS respectively. Note that these are state processes, without any observation process: the number of survivals can be precisely observed. To satisfy the directive, one needs to consider a unisex model:
(10) 
where and and pertain to the unisex insureds LDS . Subsequently, the loss functions for female (f) and male (m) subgroups are:
(11)  
(12) 
Since the trajectories and have the same length and there is only one trajectory in each subgroup, we have
(13) 
(14) 
Personalised pricing
Another application arises in personalised pricing (PP). The extent of personalised pricing is growing, as the amounts of data available to pricing strategies increase. Suitable data include user locations, IP address, web visits, past purchases and additional information volunteered by customers [31]. There are concerns that the practice may hurt overall trust, as it did in the wellknown case [31] of a consumer, who found out that Amazon was selling products to regular consumers at higher prices, and that deleting the cookies on the computer could cause the inflated prices to drop.Furthermore, the practice can also violate antidiscrimination law [31] in many jurisdictions. For instance in the United States, the Federal Trade Commission enforces the Equal Credit Opportunity Act (ECOA), which bars offering prices for credit from utilising certain protected consumer characteristics such as race, colour, religion, national origin, sex, marital status, age, or the receipt of public assistance. This risk would force many entities offering financial products to set the same price for the subgroups, regardless of the significant differences in their willingness to pay.
Let us consider an idealised example of PP: Consider a soap retailer, whose customers contain female and male subgroups. Each gender has a specific dynamic system modelling its willing to pay (“demand price” of each subgroup), while the retailer should set a “unisex” price. As in the discussion of insurance pricing, we consider subgroups female, male and use superscripts to distinguish the related quantities. Unlike in insurance pricing, the demand price of each customer is regarded as a single trajectory. More importantly, since customers might start buying the soap, quit buying the soap, or move to other substitutes at different time points, those trajectories of demand prices are assumed to be of varying lengths. For example, a customer starts to buy the soap at time but decides to buy hand wash instead from time .
Let us assume there are female customers and customers in the overall time window . Let denote the estimated demand price at time of the customer in subgroup . These evolve as:
(15)  
(16)  
(17)  
(18) 
The unisex model for demand price considers the unisex state , the unisex system matrices , and unisex noises :
(19)  
(20) 
For and , we can consider
(21) 
(22) 
We also refer to [32] for further work on protecting customers’ interests in personalised pricing via fairness considerations.
3 Our models
We assume that the underlying LDS of each subgroup all have the form of (1)(2), while only one subgroupblind LDS can be learned and used for prediction. The following model in (23)(24) can be used to describe the subgroupblind state evolution directly.
(23)  
(24) 
for , where represents the estimated subgroupblind state and is the trajectory predicted by the subgroupblind LDS .
The objectives (4) and (5), subject to (23)(24), yield two operatorvalued optimisation problems. Their inputs are , i.e., the observations of multiple trajectories and the multiplier . The operatorvalued decision variables include operators proper , vectors , and scalars and z. Notice that ranges over , except for , where . The auxiliary scalar variable is used to reformulate "“ in the objective (4) or (5). Since the observation noise is assumed to be a sample of meanzero normallydistributed random variable, we add the sum of squares of to the objective with a multiplier , seeking a solution with close to zero. (See the Supplementary Material.) Overall, the subgroupfair and instantfair formulations read:
Oz + λ∑_t≥1 ν_t^2SubgroupFair z≥1I(s) ∑_i ∈I^(s) 1T(i,s) ∑_t∈T^(i,s) loss^(i,s)(f_t) , s∈S m_t= G m_t1+ω_t, t∈T^+ f_t= F’ m_t+ν_t, t∈T^+
Oz + λ∑_t≥1 ν_t^2InstantFair z≥loss^(i,s)(f_t), t∈T^(i,s),i∈I^(s),s∈S m_t= G m_t1+ω_t, t∈T^+ f_t= F’ m_t+ν_t, t∈T^+
For comparison, we use a traditional formulation that focuses on minimising the overall loss:
O ∑_s ∈S ∑_i ∈I^(s) ∑_t∈T^(i,s) loss^(i,s)(f_t) + λ∑_t≥1 ν_t^2Unfair m_t= G m_t1+ω_t, t∈T^+ f_t= F’ m_t+ν_t, t∈T^+
To state our main result, we need a technical assumption related to the stability of the LDS, which suggests that the operatorvalued decision variables (and hence estimates of states and observations) remain bounded. Let us define the quadratic module, following [21]. Let be the set of polynomials determining the constraints. The positivity domain of are tuples of bounded operators on a Hilbert space making all positive semidefinite. The quadratic module is the set of where and are polynomials from the same ring. As in [21], we assume:
Assumption 1 (Archimedean).
Quadratic module of (3) is Archimedean, i.e., there exists a real constant such that .
Theorem 2.
For any observable linear system , for any length of a time window, and any error , under Assumption 1, there is a convex optimisation problem from whose solution one can extract the best possible estimate of system matrices of a system based on the observations, with fairness subgroupfair considerations (3), up to an error of at most in Frobenius norm. Further, with suitably modified assumptions, the result holds also for the instantfair considerations (3).
4 Numerical illustrations
4.1 Generation of biased training data
To illustrate the impact of our models on data with varying degrees of underrepresentation bias, we consider a method for generating data resembling the motivating applications of Section 2, with varying degrees of the bias. Suppose there are two subgroups, one advantaged subgroup and one disadvantaged subgroup, advantaged, disadvantaged with trajectories and in each subgroup. Underrepresentation bias can enter training set in the following ways:

Observations are sampled from corresponding LDS . Thus each .

Discard some trajectories in , if necessary, such that .

Let
denote the probability that an observation from subgroup
stays in the training data and . Discard more observations of than those of so that . If is fixed at 1, the degree of underrepresentation bias can be controlled by simply adjusting .
The last two steps discard more observations of the disadvantaged subgroup in the biased training data, so that the advantaged subgroup becomes overrepresented. Note that for small sample size, it is necessary to make sure there is at least one observation in each subgroup at each period.
For example, consider that both subgroups have the same system matrices:
while the covariance matrices
are sampled randomly from a uniform distribution over
and , respectively. Set the time window to be across trajectories in the advantaged subgroup and in disadvantaged one, i.e., , and . Then the bias is introduced according to the biased training data generalisation process described above, with random .Figure 2 shows the forecasts in 30 experiments on this example. For each experiment, the same set of observations , , is reused and the trajectories of advantaged and disadvantaged subgroups are denoted by dotted lines and dashed lines, respectively. However, the observations that are discarded vary across the experiments. Thus, a new biased training set is generated in each experiment, albeit based on the same “ground set” of observations. The three models (3)(3) are applied in each experiment with of 5 and 1, respectively, as chosen by iterating over integers 1 to 10. The mean of forecast
and its standard deviation are displayed as the solid curves with error bands.
4.2 Effects of underrepresentation bias on accuracy
Figure 2 suggests how the degree of bias affects accuracy with and without considering fairness. With the number of trajectories in both subgroups set to 2, i.e. and , we vary the degree of bias within . To measure the effect of the degree on accuracy, we introduce the normalised root mean squared error (nrmse) fitness value for each subgroup:
(25) 
for and . Higher indicates lower accuracy for the subgroup, i.e., the predicted trajectory of subgroupblind is further away from the subgroup.
For the training data, the same set of observations , , in Figure 2 is reused but . Thus one trajectory in advantaged subgroup is discarded. Then, the biased training data generalisation process in Section 4.1 is applied in each experiment with and the values for selecting from to at the step of . At each value of , three models (3)(3) are run with new biased training data and the experiment is repeated for times. Hence, the quartiles of for each subgroup shown as boxes in Figure 2.
One could expect that nrmse fitness values of advantaged subgroup in Figure 2 to be generally lower than those of the disadvantaged subgroup (), leaving a gap. Those gaps narrow down as increases, simply because more observations of disadvantaged subgroup remain in the training data. Compared the to “Unfair”, models with fairness constraints, i.e., “SubgroupFair” and “InstantFair”, show narrower gaps and higher fairness between two subgroups. More surprisingly, when decreases as gets close to , "SubgroupFair" model still can keep the at almost the same level, indicating a rise in overall accuracy. This is in contrast with results [13, 33] from classification.
4.3 Runtime
Notice that minimising a multivariate polynomial in matricial variables (3)(3) over a set defined by a finite intersection of polynomial inequalities in the same variables is nontrivial, but there exists the globally convergent NavascuésPironioAcín (NPA) hierarchy [34] of semidefinite programming (SDP) relaxations as explained in the Supplementary Material, and its sparsityexploiting variant (TSSOS) as pioneered by Wang et al. [24, 25], which can be applied to such noncommutative polynomial optimisation problems. The SDP of a given order in the respective hierarchy can be constructed using ncpol2sdpa^{1}^{1}1https://github.com/peterwittek/ncpol2sdpa of Wittek [22] or the tssos^{2}^{2}2https://github.com/wangjie212/TSSOS of Wang et al. [24, 25] and solved by sdpa of Yamashita et al. [35]
. Our implementation is available in the Supplementary Material for review purposes and will be opensourced upon acceptance.
In Figure 4, we illustrate the runtime and size of the relaxations as a function of the length of the time window with the same data set as above (i.e., Figure 2). The grey curve displays the number of variables in the firstorder SDP relaxation of "SubgroupFair" and "InstantFair" models against the length of time window. The deeppink and cornflowerblue curves show the runtime of the firstorder SDP relaxation of NPA and the secondorder SDP relaxation of TSSOS hierarchy, respectively, on a laptop equipped by Intel Core i7 8550U at 1.80 Ghz. The results of "SubgroupFair" and "InstantFair" models are presented by solid and dashed curves, respectively. Since each experiment is repeated for three times, the mean and mean
1 standard deviation of runtime are presented by curves with shaded error bands. It is clear that the runtime of TSSOS exhibits a modest growth with the length of time window, while that of the plainvanilla NPA hierarchy surges as can be expected, given that the number of SDP variables is equivalent to that of relaxation variables or the entries in the moment matrix (
, as defined in the Supplementary Material, cf. Eq. 28).4.4 Experiments with COMPAS recidivism scores
Finally, we wish to suggest the broader applicability of the two notions of subgroup fairness and instantaneous fairness. We use the wellknown benchmark dataset [36] of estimates of the likelihood of recidivism made by the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), as used by courts in the United States. Broadly speaking, the defendants’ risk scores (the higher the worse) are negatively correlated with the amount of time before defendants’ recidivism. However, the correlation is different between black and white defendants’ COMPAS scores.
We consider all defendants (N = 21) within the age range of 2545, male, with two or less prior crime counts, labelled as belonging to either AfricanAmerican or Caucasian ethnicity. The defendants are partitioned into two subgroups, by ethnicity. In each subgroup, defendants are divided by the type of their reoffending (M1 and M2). The COMPAS scores of subsamples are shown in Figure 4 by dots, where warm and cold tones denote AfricanAmerican and Caucasian subgroups respectively. The trajectory shown by same colour is obtained from dots in corresponding subsample. The SubgroupFair outcome is presented in cyan. In Figure 4, the coralcoloured curve for the COMPAS dataset suggests that the runtime remains modest. While the COMPAS dataset calls for classification, rather than forecasting, our notion also seems to be applicable.
5 Conclusions
Overall, the two natural notions of fairness (subgroup fairness and instantaneous fairness), which we have introduced, contribute towards the fairness in forecasting and proper learning of linear dynamical systems. We have presented globally convergent methods for the estimation considering the two notions of fairness using hierarchies of convexifications of noncommutative polynomial optimisation problems, whose runtime is independent of the hidden state.
Acknowledgements
Quan’s and Bob’s work has been supported by the Science Foundation Ireland under Grant 16/IA/4610. Jakub acknowledges funding from RCI (reg. no. CZ.02.1.01/0.0/ 0.0/16_019/0000765) supported by the European Union.
References
 [1] M. West and J. Harrison, Bayesian Forecasting and Dynamic Models (2nd ed.). Berlin, Heidelberg: SpringerVerlag, 1997.
 [2] L. Ljung, System Identification: Theory for the User. Pearson Education, 1998.
 [3] P. H. Leslie, “On the use of matrices in certain population mathematics,” Biometrika, vol. 33, no. 3, pp. 183–212, 11 1945.
 [4] M. Besserve, B. Schölkopf, N. K. Logothetis, and S. Panzeri, “Causal relationships between frequency bands of extracellular signals in visual cortex revealed by an information theoretic analysis,” Journal of computational neuroscience, vol. 29, no. 3, pp. 547–566, 2010.
 [5] K. J. Åström, “Optimal control of markov processes with incomplete state information,” Journal of Mathematical Analysis and Applications, vol. 10, no. 1, pp. 174 – 205, 1965.
 [6] J. Pearl, Causality. Cambridge university press, 2009.
 [7] P. Geiger, K. Zhang, B. Schoelkopf, M. Gong, and D. Janzing, “Causal inference by identification of vector autoregressive processes with hidden components,” in International Conference on Machine Learning, 2015, pp. 1917–1925.
 [8] A. Blum and K. Stangl, “Recovering from biased data: Can fairness constraints improve accuracy?” arXiv preprint arXiv:1912.01094, 2019.
 [9] P. Gajane and M. Pechenizkiy, “On formalizing fairness in prediction with machine learning,” arXiv preprint arXiv:1710.03184, 2017.
 [10] A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,” Big data, vol. 5, no. 2, pp. 153–163, 2017.
 [11] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem, “On the fairness of disentangled representations,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 14 611–14 624.
 [12] A. Chouldechova and A. Roth, “A snapshot of the frontiers of fairness in machine learning,” Commun. ACM, vol. 63, no. 5, p. 82–89, Apr. 2020.
 [13] I. Zliobaite, “On the relation between accuracy and fairness in binary classification,” in The 2nd workshop on Fairness, Accountability, and Transparency in Machine Learning (FATML) at ICML’15, 2015.

[14]
M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in
Advances in neural information processing systems, 2016, pp. 3315–3323.  [15] N. Kilbertus, M. R. Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf, “Avoiding discrimination through causal reasoning,” in Advances in Neural Information Processing Systems, 2017, pp. 656–666.
 [16] M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Advances in Neural Information Processing Systems, 2017, pp. 4066–4076.

[17]
S. Aghaei, M. J. Azizi, and P. Vayanos, “Learning optimal and fair decision trees for nondiscriminative decisionmaking,” in
Proceedings of the AAAI Conference on Artificial Intelligence
, vol. 33, 2019, pp. 1418–1426.  [18] H. Mouzannar, M. I. Ohannessian, and N. Srebro, “From fair decision making to social equality,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019, pp. 359–368.

[19]
B. Paaßen, A. Bunge, C. Hainke, L. Sindelar, and M. Vogelsang, “Dynamic
fairnessbreaking vicious cycles in automatic decision making,” in
Proceedings of the 27th European Symposium on Artificial Neural Networks (ESANN 2019)
, 2019.  [20] C. Jung, S. Kannan, C. Lee, M. M. Pai, A. Roth, and R. Vohra, “Fair prediction with endogenous behavior,” arXiv preprint arXiv:2002.07147, 2020.
 [21] S. Pironio, M. Navascués, and A. Acín, “Convergent relaxations of polynomial optimization problems with noncommuting variables,” SIAM Journal on Optimization, vol. 20, no. 5, pp. 2157–2180, 2010.
 [22] P. Wittek, “Algorithm 950: Ncpol2sdpa—sparse semidefinite programming relaxations for polynomial optimization problems of noncommuting variables,” ACM Transactions on Mathematical Software (TOMS), vol. 41, no. 3, pp. 1–12, 2015.
 [23] I. Klep, J. Povh, and J. Volcic, “Minimizer extraction in polynomial optimization is robust,” SIAM Journal on Optimization, vol. 28, no. 4, pp. 3177–3207, 2018.
 [24] J. Wang, V. Magron, and J.B. Lasserre, “Tssos: A momentsos hierarchy that exploits term sparsity,” arXiv preprint arXiv:1912.08899, 2019.
 [25] ——, “Chordaltssos: a momentsos hierarchy that exploits term sparsity with chordal extension,” arXiv preprint arXiv:2003.03210, 2020.
 [26] I. Gelfand and M. Neumark, “On the imbedding of normed rings into the ring of operators in Hilbert space,” Rec. Math. [Mat. Sbornik] N.S., vol. 12, no. 2, pp. 197–217, 1943.
 [27] I. E. Segal, “Irreducible representations of operator algebras,” Bulletin of the American Mathematical Society, vol. 53, no. 2, pp. 73–88, 1947.
 [28] S. McCullough, “Factorization of operatorvalued polynomials in several noncommuting variables,” Linear Algebra and its Applications, vol. 326, no. 13, pp. 193–203, 2001.
 [29] J. W. Helton, ““Positive” noncommutative polynomials are sums of squares,” Annals of Mathematics, vol. 156, no. 2, pp. 675–694, 2002.
 [30] Y. Thiery and C. Van Schoubroeck, “Fairness and equality in insurance classification,” The Geneva Papers on Risk and InsuranceIssues and Practice, vol. 31, no. 2, pp. 190–211, 2006.
 [31] OECD, “Personalised pricing in the digital era,” in the joint meeting between the Competition Committee and the Committee on Consumer Policy, 2018.
 [32] R. Dong, E. Miehling, and C. Langbort, “Protecting consumers against personalized pricing: A stopping time approach,” arXiv preprint arXiv:2002.05346, 2020.
 [33] S. Dutta, D. Wei, H. Yueksel, P.Y. Chen, S. Liu, and K. R. Varshney, “An informationtheoretic perspective on the relationship between fairness and accuracy,” The 37th International Conference on Machine Learning (ICML 2020), 2020, arXiv preprint arXiv:1910.07870.
 [34] S. Pironio, M. Navascués, and A. Acin, “Convergent relaxations of polynomial optimization problems with noncommuting variables,” SIAM Journal on Optimization, vol. 20, no. 5, pp. 2157–2180, 2010.
 [35] M. Yamashita, K. Fujisawa, and M. Kojima, “Implementation and evaluation of sdpa 6.0 (semidefinite programming algorithm 6.0),” Optimization Methods and Software, vol. 18, no. 4, pp. 491–505, 2003.
 [36] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, May, vol. 23, p. 2016, 2016.
 [37] S. Burgdorf, I. Klep, and J. Povh, Optimization of polynomials in noncommuting variables. Springer, 2016.
 [38] A. K. Tangirala, Principles of system identification: theory and practice. Crc Press, 2014.
 [39] K.J. Åström and B. Torsten, “Numerical identification of linear dynamic systems from normal operating records,” IFAC Proceedings Volumes, vol. 2, no. 2, pp. 96–111, 1965, 2nd IFAC Symposium on the Theory of SelfAdaptive Control Systems, Teddington, UK, September 1417, 1965.
 [40] O. Anava, E. Hazan, S. Mannor, and O. Shamir, “Online learning for time series prediction,” in COLT 2013  The 26th Annual Conference on Learning Theory, June 1214, 2013, Princeton University, NJ, USA, 2013.
 [41] C. Liu, S. C. H. Hoi, P. Zhao, and J. Sun, “Online arima algorithms for time series prediction,” ser. AAAI’16, 2016.
 [42] M. Kozdoba, J. Marecek, T. Tchrakian, and S. Mannor, “Online learning of linear dynamical systems: Exponential forgetting in kalman filters,” in The ThirtyThird AAAI Conference on Artificial Intelligence (AAAI19), 2019, arXiv preprint arXiv:1809.05870.
 [43] A. Tsiamis and G. Pappas, “Online learning of the kalman filter with logarithmic regret,” arXiv preprint arXiv:2002.05141, 2020.
 [44] T. Katayama, Subspace methods for system identification. Springer Science & Business Media, 2006.
 [45] P. Van Overschee and B. De Moor, Subspace identification for linear systems. Theory, implementation, applications. Incl. 1 disk, 01 1996, vol. xiv.
 [46] A. Tsiamis, N. Matni, and G. J. Pappas, “Sample complexity of Kalman filtering for unknown systems,” arXiv preprint arXiv:1912.12309, 2019.
 [47] A. Tsiamis and G. J. Pappas, “Finite sample analysis of stochastic system identification,” arXiv preprint arXiv:1903.09122, 2019.
 [48] E. Hazan, K. Singh, and C. Zhang, “Learning linear dynamical systems via spectral filtering,” in Advances in Neural Information Processing Systems, 2017, pp. 6702–6712.
 [49] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang, “Spectral filtering for general linear dynamical systems,” in Advances in Neural Information Processing Systems, 2018, pp. 4634–4643.
 [50] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
 [51] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference On Learning Theory, 2018, pp. 439–473.
 [52] M. Simchowitz, R. Boczar, and B. Recht, “Learning linear dynamical systems with semiparametric least squares,” arXiv preprint arXiv:1902.00768, 2019.
 [53] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 5610–5618.
 [54] V. V. Vazirani, Approximation algorithms. Springer Science & Business Media, 2013.
 [55] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala, “The price of fair pca: One extra dimension,” in Advances in Neural Information Processing Systems, 2018, pp. 10 976–10 987.
 [56] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Conference on fairness, accountability and transparency, 2018, pp. 77–91.
 [57] T. Bolukbasi, K.W. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” in Advances in neural information processing systems, 2016, pp. 4349–4357.
 [58] S. SharifiMalvajerdi, M. Kearns, and A. Roth, “Average individual fairness: Algorithms, generalization and experiments,” in Advances in Neural Information Processing Systems, 2019, pp. 8240–8249.
 [59] U. Tantipongpipat, S. Samadi, M. Singh, J. H. Morgenstern, and S. Vempala, “Multicriteria dimensionality reduction with applications to fairness,” in Advances in Neural Information Processing Systems, 2019, pp. 15 135–15 145.
 [60] N. I. Akhiezer and M. Krein, Some questions in the theory of moments. American Mathematical Society, 1962, vol. 2.
 [61] D. Henrion and J.B. Lasserre, “Detecting global optimality and extracting solutions in gloptipoly,” in Positive polynomials in control. Springer, 2005, pp. 293–310.
 [62] J. Dixmier, Les C*algèbres et leurs représentations. Paris, France: GauthierVillars, 1969, English translation: C*algebras (NorthHolland, 1982).
6 Background
In this paper, we would like to consider the case of multiple variants of the LDS and conduct proper learning of the LDS in a way of fairness using the technologies of noncommutative polynomial optimisation. In Section 6.1, we set our work in the context of system identification and control theory. In Section 6.2, we introduce the concept of fairness, which can be used to deal with multiple variants of the LDS. In Section 6.3, we provide a brief overview of noncommutative polynomial optimisation, pioneered by [21] and nicely surveyed by [37], which is our key technical tool.
6.1 Related Work in System Identification and Control
Research within System Identification variously appears in venues associated with Control Theory, Statistics, and Machine learning. We refer to [2] and [38] for excellent overviews of the long history of research in the field, going back at least to [39]. In this section, we focus on pointers to key more recent publications. In improper learning of LDS, a considerable progress has been made in the analysis of predictions for the expectation of the next measurement using autoregressive (AR) processes. In [40], first guarantees were presented for autoregressive movingaverage (ARMA) processes. In [41], these results were extended to a subset of autoregressive integrated moving average (ARIMA) processes. [42] have shown that up to an arbitrarily small error given in advance, AR() will perform as well as anyKalman filter on any bounded sequence. This has been extended by [43] to Kalman filtering with logarithmic regret. Another stream of work within improper learning focuses on subspace methods [44, 45] and spectral methods. [46, 47] presented the presentbest guarantees for traditional subspace methods. Within spectral methods, [48] and [49]
have considered learning LDS with input, employing certain eigenvaluedecay estimates of Hankel matrices in the analyses of an autoregressive process in a dimension increasing over time. We stress that none of these approaches to improper learning are “predictionerror”: They do
not estimate the system matrices.In proper learning of LDS, many stateoftheart approaches consider the leastsquares method, despite complications encountered in unstable systems [50]. [51]
have provided nontrivial guarantees for the ordinary leastsquares (OLS) estimator in the case of stable
and there being no hidden component, i.e., being an identity and . Surprisingly, they have also shown that more unstable linear systems are easier to estimate than less unstable ones, in some sense. [52] extended the results to allow for a certain prefiltering procedure. [53] extended the results to cover stable, marginally stable, and explosive regimes.Our work could be seen as a continuation of the least squares method to processes with hidden components, with guarantees of global convergence. In Computer Science, our work could be seen as an approximation scheme [54], as it allows for error for any .
6.2 Fairness
In machine learning, the training set might have biased representations of its subgroups, even when sampled with equal weight [55]. Algorithms focusing on maximising the overall accuracy might cause different distribution of errors in different subgroups.
In concerns of the uneven distribution of error over subgroups, fairness was introduced to the field of machines learning. According to a clear summary in [12], the definition of fairness can be derived from a statistical notion and an individual notion. The statistical definition of fairness is to request a classifier’s statistic, such as false positive or false negative rates be equalized across the subgroups so that the error caused by the algorithm be proportionately spread across subgroups [12]
. The statistical definition has a natural connection with Principal Component Analysis (PCA). Introduced in
[55], the FairPCA problem aims to minimize the maximum construction loss of different subgroups when looking for a lower dimensional representation. To solve the FairPCA problem, [58] design an oracleefficient algorithm while [59] propose an algorithms based on extremepoint solutions of semidefinite programs. The individual definition is discussed less on account of its requirement of making significant assumptions even through it has strong individual level semantics that one’s risk of being harmed by the error of the classifier are no higher than they are for anyone else [58].We can introduce fairness to learning of LDS when dealing with multiple variants of the LDS. When estimating the next observation, one might be given several trajectories of observations from unknown variants of the LDS. In this case, fairness asks to find a suitable model that treats each LDS equally.
6.3 NonCommutative Polynomial Optimisation
In learning of the LDS, the key technical tool of this paper is noncommutative polynomial optimisation (NCPOP), first introduced by [21]. Here, we provide a brief summary of their results, and refer to [37] for a booklength introduction. NCPOP is an operatorvalued optimisation problem with a standard form in Problem 6.3:
(H,X,ϕ) ⟨ϕ,p(X)ϕ⟩P:p*= q_i(X)≽0, i=1,…,m ⟨ϕ,ϕ⟩= 1,
where is a bounded operator on a Hilbert space . The normalised vector , i.e., is also defined on with inner product equals to . and are polynomials and denotes that the operator is positive semidefinite. Polynomials and of degrees and , respectively, can be written as:
(26) 
where . Following [60], we can define the moments on field or , with a feasible solution of problem (6.3):
(27) 
for all and . Given a degree , the moments whose degrees are less or equal to form a sequence of . With a finite set of moments of degree , we can define a corresponding order moment matrix :
(28) 
for any and a localising matrix :
(29)  
for any , where . The upper bounds of and are lower than the that of moment matrix because is only defined on while .
If is feasible, one can utilize the Sums of Squares theorem of [29] and [28] to derive semidefinite programming (SDP) relaxations. In particular, we can obtain a order SDP relaxation of the noncommutative polynomial optimization problem (6.3) by choosing a degree that satisfies the condition of . The SDP relaxation of order , which we denote , has the form:
y=(y_ω)_ω≤2k ∑_ω≤d p_ω y_ωR_k:p^k= M_k(X)≽0 M_kd_i(q_i X)≽0, i=1,…,m ⟨ϕ,ϕ⟩= 1,
Let us define the quadratic module, following [21]. Let be the set of polynomials determining the constraints. The positivity domain of are tuples of bounded operators on a Hilbert space making all positive semidefinite. The quadratic module is the set of where and are polynomials from the same ring. As in [21], we assume:
Assumption 3 (Archimedean).
Quadratic module of (6.3) is Archimedean, i.e., there exists a real constant such that , where are defined to be .
If the Archimedean assumption is satisfied, [21] have shown that for a finite . We can use the socalled rankloop condition of [21] to detect global optimality. Once detected, it is possible to extract the global optimum from the optimal solution of problem , by Gram decomposition; cf. Theorem 2 in [21]. Simpler procedures for the extraction have been considered, cf. [61], but remain less well understood.
7 Proof of Theorem 2
Putting the elements together, we an prove the Theorem 2:
Proof.
First, we need to show the existence of a sequence of convex optimisation problems, whose objective function approaches the optimum of the noncommutative polynomial optimisation problem. [21] shows that, indeed, there are natural semidefinite programming problems, which satisfy this property. In particular, the existence and convergence of the sequence is shown by Theorem 1 of [21], which requires Assumption 1. Second, we need to show that the extraction of the minimiser from the SDP relaxation of order in the series is possible. There, one utilises the Gelfand–Naimark–Segal (GNS) construction [26, 27], as explained in Section 2.2 of [23]. ∎
8 Details of Experiments with COMPAS recidivism scores
For the experiment with COMPAS recidivism scores, we use the “COMPAS” dataset from [36]. The dataset include defendants’ gender, race, age, charge degree, COMPAS recidivism scores, twoyear recidivism label as well as information on prior incidents. The twoyear recidivism label denotes whether a person got rearrested within two years (label 1) or not (label 0). If the twoyear recidivism label is , there is also information of the time between a person received the COMPAS score and got rearrested, and the recharge degree.
We choose defendants that are with recidivism label 1, AfricanAmerican or Caucasian, within the age range of 2545, male and with prior crime counts less than 2, with charge degree M and recharge degree M1 or M2. The sample size is 119. We plot their COMPAS recidivism scores against the days before they got rearrested, which are shown by the dots in Figure 4. We distinguish the ethnicity of the defendants by warmtone and coldtone colours. Further, defendants with recharge degree M1 are displayed with darker colours than those with recharge degree M2. Note that each colour denotes one subsample.
Every days are regarded as one period. Then, we try to extract trajectories from COMPAS scores of each subsample. For AfricanAmerican defendants with recharge degree M1, we check if anyone reoffend within days. If there is one, its COMPAS score is recorded as the observation of first period of the trajectory of "Black Defendants M1"; if there are more than one, their average are recorded as the observation. Then we check the following periods, up to periods. Also, the same procedure can be applied to other three subsamples. In the end, we get four trajectories of each subsample, which are shown by the curves in Figure 4. With the four trajectories, we can apply the SubgroupFair and the results is shown by the cyan curve.