More than ever before we have access to data sets throughout almost all disciplines of science and engineering. Fueling our economies and shaping our society, data is therefore considered the oil of the 21st century. At the same time, AI algorithms become increasingly powerful to transform large amounts of raw data into valuable information. Consequently, AI development and data availability are deeply intertwined and they have a common characteristic: both are growing exponentially Wu et al. (2021). While this wealth of information is opening the doors for extraordinary opportunities, we cannot ignore the downside of this development: AI research on a large scale has undesirable, negative side effects on economic, social, and environmental sustainability. To illustrate this, we consider the following example. The authors of Strubell et al. (2019)
analyze the energy required for training popular off-the-shelf natural language processing (NLP) models. They found considerable CO$_2$ emissions for training and cloud compute costs of up to 3 million dollars. As a comparison, a person could fly more than 300 times between Amsterdam and New York to emit the same amount of CO$_2$ Airlines.
One of the reasons for unsustainable development in AI research is unfortunately anchored in what the AI community defines as better. The current drive, and almost singular focus, is to present a model that beats the current accuracy benchmark. To obtain state-of-the-art results, the number of model parameters and hyperparameters is increased, as well as the size of the training data. The result is an exponentially growing demand for compute used to train AI models Schwartz et al. (2020). This is not only unsustainable from an environmental and economic point of view, but also from a social one: it has a large carbon footprint, it is expensive, and it excludes researchers with fewer resources.
The problems associated with the unsustainable development in AI have led to a growing awareness in the AI community. Several researchers call for redirecting the focus of AI research by implementing efficiency as an additional benchmark to assess algorithmic progress Schwartz et al. (2020); Strubell et al. (2020); Tamburrini (2022). In fact, efficiency has always been the primary criterion to measure algorithmic progress in computer science Cormen et al. (2022); Knuth (1973, 1976)
. Inspired by this, a similar approach can be adopted for AI algorithms. For this purpose, different metrics have been proposed in the literature, e.g. estimating the carbon footprint, reporting the energy consumption, or stating the number of floating-point operations (FLOPS) Henderson et al. (2020); Lacoste et al. (2019); Lannelongue et al. (2021). In this context, new vocabulary has been suggested. It distinguishes between AI that solely focuses on accuracy versus AI that considers efficiency and accuracy as equal criteria: Red AI versus Green AI Schwartz et al. (2020).
While metrics pave the way for Green AI, only reporting on efficiency metrics is not enough. More importantly, the AI community needs to develop and adopt methods that have the potential to reduce compute significantly. We would like to contribute to this development by presenting a very promising method: tensor networks (TNs), also called tensor decompositions. Being a tool from multilinear algebra, TNs have become increasingly popular in the AI community (the number of papers at NeurIPS, AISTATS and ICML with ‘tensor’, ‘Tucker’, ‘CP’, ‘PARAFAC’, ‘block term’, or ‘decomposition’ in the title has increased over the last 12 years: in 2010 there was only 1 such paper, in 2015 there were 11, and in 2021 there were 28)
. TNs can approximate data in a compressed format while preserving the essence of the information. In this way, they oftentimes lift the curse of dimensionality, while having the potential to achieve competitive accuracy. This makes TNs exactly the type of tool to apply to large-scale and/or high-dimensional problems: they allow for an efficient and thus sustainable way of representing and handling big data Cichocki (2014). Even though the popularity of TNs is increasing in the AI community, they are not yet widespread. Various promising results regarding accuracy and efficiency, e.g. He et al. (2018); Izmailov et al. (2018); Richter et al. (2021); Wesel and Batselier (2021), show their potential. With this paper, we would like to give AI researchers the impulse to adopt TNs in their work.
To that end, we provide an economic, social, and environmental analysis of the possibilities and challenges of Green AI as well as the potential of TNs for Green AI. Understanding the impact of TNs on Green AI is not trivial. To the best of the authors’ knowledge, this is the first paper that highlights TNs from a Green AI perspective. In this paper, we answer the following three research questions. Why should better AI algorithms be defined by both accuracy and efficiency? How can TNs make AI algorithms more efficient? How are TNs an appropriate tool to contribute towards Green AI? To answer the three research questions, the paper is organized as follows: In section 2, we describe how an exponentially growing demand for compute has a negative impact on sustainability and AI progress. In addition, we discuss different criteria suggested in literature for an efficiency metric. In section 3, we introduce commonly used TNs and their compression potential. In section 4, we show how an efficiency gain can be achieved by applying TNs in a kernel machine and deep learning setting. In section 5, we draw conclusions.
2 Context and related literature
AI models are considered better when they achieve greater accuracy (or similar measures) than previously reported. To incentivize competition and thus accelerate innovation, results are made public on leaderboards based on accuracy metrics. Beating state-of-the-art algorithms is often achieved by at least one of the following three aspects: bigger data sets, a more complex model (which is highly correlated with the number of parameters), or more extensive hyperparameter experiments Schwartz et al. (2020). This results in an exponentially growing demand for compute, requiring an unsustainable amount of hardware, energy, and computational time.
2.1 The negative impact of growing compute on sustainability and AI progress
Sustainability is based on three fundamental, intersecting dimensions: economic, social, and environmental Purvis et al. (2019). An exponentially growing amount of compute in AI compromises all three of them, as explained in the following.
Because large computations can have a large price tag Strubell et al. (2019, 2020), they negatively impact both the economic and social dimensions of sustainability. Concerning the economic dimension, linear gains in accuracy contrast with an exponentially growing amount of compute Schwartz et al. (2020). Thus, diminishing returns contrast with increasing costs associated with e.g. cloud computing and hardware (such as CPUs, GPUs, and TPUs). In the social dimension, large costs can create barriers for researchers with fewer resources to participate in computationally expensive AI development. Beyond the financial aspect, being dependent on external cloud compute providers can be problematic, too. For example, in applications where privacy- or security-relevant data is handled, computations may need to be performed on-site. Therefore, state-of-the-art computations with sensitive data may not be possible if there are not enough financial resources available. Concerning the environmental aspect, using a lot of resources for AI research causes a meaningful carbon footprint. The emissions are attributable to operational and embodied emissions, associated with e.g. cloud computing and hardware, respectively. In case operational emissions are decreased by relying on electricity with a low carbon intensity, embodied emissions make up the largest share Gupta et al. (2021). In fact, the impact of AI on the environmental dimension of sustainability demands special attention. The Paris Agreement, ratified by 192 countries and the European Union United Nations , states that it is a goal to limit global warming to well below 2 °C, preferably 1.5 °C, compared to pre-industrial levels. Reaching this goal entails an immediate, rapid, and deep emission reduction across all sectors IPCC (2022). In addition, net-zero emissions need to be achieved around 2050 IPCC (2018). To stay on such a pathway, emissions need to be reduced by 7.6% each year between 2020 and 2030 UNEP (2019). Even though AI can contribute to the mitigation of climate change Huntingford et al.
(2019); Rolnick et al. (2022), AI can also have adverse side effects IPCC (2022): the demand for compute impacts CO$_2$ emissions directly Gupta et al. (2021); Guyon et al. (2017). Improving algorithmic efficiency can have a positive impact on both operational and embodied emissions Gupta et al. (2021).
Beyond sustainability, an exponentially growing demand for compute can cause an additional problem. When the demand for compute can no longer be met, it will have a negative impact on progress and innovation in AI. Up until now, progress and innovation in AI strongly rely on the increase of available computational resources Thompson et al. (2020). According to forecasts, Moore's Law, the observation that the number of transistors on a microchip doubles every 2 years, will reach its limits Leiserson et al. (2020). Furthermore, its 2-year doubling period has already been overtaken by the need for compute, which is doubling every 3.4 months Amodei and Hernandez , with a total increase of 300,000 times between 2012 and 2018 Amodei and Hernandez . Consequently, the growing need for compute will ultimately limit AI progress and innovation, especially in computationally expensive fields Strubell et al. (2019); Thompson et al. (2020). Another aspect is that an exponentially growing demand for compute can compromise reproducibility. In the AI community, there is already a growing awareness of the importance of reproducibility, see e.g. the ML Reproducibility Challenge Papers with Code .
The problems mentioned above can be tackled by reducing compute. Consequently, AI researchers need to redirect their focus toward Green AI and redefine what better means. Redefining better entails mainly two aspects. First, accuracy and efficiency need to be considered as equally important when measuring progress in AI. Second, an analysis of the trade-off between performance and the computational resources used needs to be included in AI research. AI algorithms with better efficiency will therefore have a positive impact on both sustainability and AI progress Schwartz et al. (2020); Strubell et al. (2019). In addition, reporting on efficiency metrics has other benefits: it will not only raise awareness but also incentivize progress in efficiency Henderson et al. (2020); Tamburrini (2022).
2.2 Efficiency metrics suggested in literature
So far, not much attention has been paid to efficiency in major AI venues, as most papers report only on an accuracy metric Schwartz et al. (2020). In addition, it is not straightforward to report on efficiency, and there is no established way to do so. This is why we discuss several evaluation criteria for efficiency that are considered in the literature.
An appealing criterion to measure efficiency is CO$_2$ equivalent (CO$_2$e) emissions, since they are directly related to climate change. Disadvantages include that they are difficult to measure and are influenced by factors that do not account for algorithmic optimization, e.g. the hardware, the carbon intensity of the used electricity, and the time and location of the compute Schwartz et al. (2020); Strubell et al. (2020). An alternative criterion, which is independent of time and location, is to state the electricity usage in kWh. It can be quantified because GPUs commonly report the amount of consumed electricity. However, it is still hardware-dependent. There are several websites and packages available to compute the CO$_2$e and electricity usage of algorithms Anthony et al. (2020); Henderson et al. (2020); Lacoste et al. (2019); Lannelongue et al. (2021); Lottick et al. (2019); Schmidt et al. (2021). A hardware-independent criterion is the number of floating-point operations (FLOPS) required for training the model. On the one hand, they are computed analytically and facilitate a fair comparison between algorithms Hernandez and Brown (2020); Schwartz et al. (2020). On the other hand, they do not account e.g. for optimized memory access or the memory used by the model Henderson et al. (2020); Schwartz et al. (2020). Next to giving the absolute number of FLOPS, it is possible to additionally report e.g. a FLOPS-based learning curve Hernandez and Brown (2020). There are packages available to compute FLOPS (e.g. https://github.com/sovrasov/flops-counter.pytorch, https://github.com/Swall0w/torchstat, https://github.com/Lyken17/pytorch-OpCounter, https://github.com/triscale-innov/GFlops.jl). A criterion that is commonly used in computer science is the big-$\mathcal{O}$ notation Knuth (1976). This notation is used to report the storage complexity and the computational cost of an algorithm.
For AI practitioners, the big-$\mathcal{O}$ notation can be impractical because the run time is often strongly influenced by application-specific termination criteria Hernandez and Brown (2020). However, it has benefits in a theoretical analysis: sections of an algorithm and required storage complexities can be easily compared. In the tensor community, using the big-$\mathcal{O}$ notation is common, see e.g. Cichocki et al. (2016) and references therein. The elapsed runtime is easy to measure, but it is highly dependent on the used hardware and on other jobs running in parallel Schwartz et al. (2020). The number of parameters (learnable or total) used by the model is a common efficiency measure. It is agnostic to hardware and takes the memory needed by the model into account. Nonetheless, different model architectures can lead to a different amount of work for the same number of parameters Schwartz et al. (2020).
We suggest reporting on multiple efficiency criteria, as well as stating the specifications of the used hardware. Combining multiple criteria can support making a link between an efficiency gain and algorithmic innovation. In addition, we agree with the suggestion by Hernandez and Brown (2020); Strubell et al. (2019) to report a cost-benefit analysis, relating gains or losses in accuracy with gains or losses in efficiency.
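As a minimal sketch of this suggestion, the snippet below reports runtime statistics together with a parameter count and a hardware note for a toy least-squares "training" run. The function `report_efficiency` and the toy problem are our own illustrative constructions; the CO$_2$e and energy estimators cited above could be added alongside.

```python
import time

import numpy as np

def report_efficiency(train_fn, n_params, hardware_note):
    """Run a training function and report several efficiency criteria at once.

    Reporting runtime *and* parameter count together (plus the hardware used)
    makes it easier to attribute an efficiency gain to the algorithm rather
    than to the machine it ran on.
    """
    runs = []
    for _ in range(3):  # mean and spread over repeated runs
        t0 = time.perf_counter()
        train_fn()
        runs.append(time.perf_counter() - t0)
    return {
        "runtime_mean_s": float(np.mean(runs)),
        "runtime_std_s": float(np.std(runs)),
        "n_parameters": n_params,
        "hardware": hardware_note,
    }

# Toy example: "training" is a least-squares solve with 200 parameters.
rng = np.random.default_rng(0)
A, y = rng.standard_normal((2_000, 200)), rng.standard_normal(2_000)
stats = report_efficiency(lambda: np.linalg.lstsq(A, y, rcond=None),
                          n_params=200, hardware_note="laptop CPU")
print(stats["n_parameters"], round(stats["runtime_mean_s"], 4))
```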
3 Tensor networks basics
As mentioned earlier, a performance boost of an AI model can be obtained by an increase in model parameters, hyperparameters, and the training data size. The resulting large-scale or high-dimensional data objects can be handled by TNs. The successful application of TNs to AI has been extensively explored in the literature: the book Cichocki et al. (2017) describes topics including tensor regression, tensor-variate Gaussian processes, and support tensor machines. The survey Ji et al. (2019) treats supervised and unsupervised classification, among other things. The overview paper Sidiropoulos et al. (2017) treats e.g. collaborative filtering, mixture and topic modeling, and in Signoretto et al. (2011) a framework for kernel machines with TNs is discussed.
3.1 From tensors to tensor networks
Multidimensional arrays, also called tensors, are a generalization of matrices to higher dimensions. In many applications, instead of tensors, large vectors or matrices arise. By rearranging their entries, however, vectors and matrices can also be transformed into tensors. In this procedure, called tensorization Khoromskij (2011); Oseledets (2010), the row and column sizes are first factorized into multiple factors and the matrix is then reshaped into a tensor. A small tensorization example is illustrated in Figure 1, where a matrix is transformed into a 6-dimensional tensor. The left side of Figure 1 shows the rearrangement of the matrix elements into the tensor. In addition, the two x-shapes showcase how the bookkeeping of the entries works. To simplify the depiction of higher-dimensional objects, it is possible to use the diagram notation shown on the right side of Figure 1. In this notation, a matrix or a tensor is depicted as a node with as many edges sticking out as its number of dimensions. The reason why a tensorized representation is useful is that it enables the use of TNs. In the following subsection, we introduce commonly used TNs; reviews can e.g. be found in Cichocki et al. (2016); Kolda and Bader (2009).
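In code, tensorization is essentially a reshape. The sketch below assumes an 8 × 8 matrix whose row and column sizes are each factorized as 2 · 2 · 2, giving a 6-dimensional tensor; the exact ordering of the resulting axes is a convention, and the figure's bookkeeping may interleave the row and column factors differently.

```python
import numpy as np

# Tensorization: reshape a matrix into a higher-dimensional tensor by
# factorizing its row and column sizes (here 8 = 2 * 2 * 2 for both).
M = np.arange(64).reshape(8, 8)

# First three axes index the binary digits of the row index,
# last three axes index the binary digits of the column index.
T = M.reshape(2, 2, 2, 2, 2, 2)

assert T.ndim == 6
assert T.size == M.size  # no information lost, entries only rearranged

# Bookkeeping: entry M[i, j] sits at the multi-index given by the
# binary digits of i and j, e.g. i = 5 = (1,0,1), j = 3 = (0,1,1).
assert M[5, 3] == T[1, 0, 1, 0, 1, 1]
```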
3.2 Commonly used tensor networks
A tensor can be expressed as a function of simpler tensors that form a TN, also called a tensor decomposition. The idea of a TN has its origins in the generalization of the singular value decomposition (SVD) to higher dimensions. In many applications, TNs can represent data in a compressed format with a marginal loss of information, because of correlations present in the data Cichocki et al. (2017, 2016).
In the literature, the most commonly used TNs include the CANDECOMP/PARAFAC (CP) Carroll and Chang (1970); Harshman (1970), the Tucker Tucker (1966), and the tensor train (TT) decomposition Oseledets (2011). Without loss of generality, in this subsection we treat a three-dimensional tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times K}$ to introduce the TNs mentioned above. The CP decomposition Carroll and Chang (1970); Harshman (1970) of $\mathcal{X}$ consists of a set of factor matrices $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$, and a weight vector $\boldsymbol{\lambda} \in \mathbb{R}^{R}$. Elementwise, $\mathcal{X}$ can be computed from
$$x_{ijk} = \sum_{r=1}^{R} \lambda_r \, a_{ir} b_{jr} c_{kr}. \qquad (1)$$
The scalars $a_{ir}$, $b_{jr}$, $c_{kr}$ are the entries of the three factor matrices, $\lambda_r$ is the $r$-th entry of $\boldsymbol{\lambda}$, and $R$ denotes the rank of the decomposition. The Tucker decomposition Tucker (1966) of $\mathcal{X}$ consists of a 3-way tensor $\mathcal{G} \in \mathbb{R}^{P \times Q \times R}$, called the core tensor, and a set of factor matrices $\mathbf{A} \in \mathbb{R}^{I \times P}$, $\mathbf{B} \in \mathbb{R}^{J \times Q}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$. Elementwise, $\mathcal{X}$ can be computed from
$$x_{ijk} = \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr} \, a_{ip} b_{jq} c_{kr}. \qquad (2)$$
The scalars $a_{ip}$, $b_{jq}$, $c_{kr}$ are the entries of the three factor matrices, $g_{pqr}$ is the $(p,q,r)$-th entry of $\mathcal{G}$, and $P$, $Q$, $R$ denote the ranks of the decomposition. The tensor train decomposition Oseledets (2011) of $\mathcal{X}$ consists of a set of three-way tensors $\mathcal{G}^{(1)} \in \mathbb{R}^{R_0 \times I \times R_1}$, $\mathcal{G}^{(2)} \in \mathbb{R}^{R_1 \times J \times R_2}$, $\mathcal{G}^{(3)} \in \mathbb{R}^{R_2 \times K \times R_3}$, called TT-cores. Elementwise, $\mathcal{X}$ can be computed from
$$x_{ijk} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} g^{(1)}_{1,i,r_1} \, g^{(2)}_{r_1,j,r_2} \, g^{(3)}_{r_2,k,1}, \qquad (3)$$
where $R_1$, $R_2$ denote the ranks of the TT-cores and by definition $R_0 = R_3 = 1$. Figure 2 shows a graphical depiction of the CP, Tucker, and TT decompositions for a three-dimensional tensor. Connected edges are indices that are summed over, and the number of free edges corresponds to the dimensionality of the tensor.
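The TT construction can be made concrete with a few lines of NumPy; the sizes below are arbitrary illustrative choices. Contracting the three TT-cores over their shared rank indices recovers the full tensor, and comparing element counts shows the compression.

```python
import numpy as np

# Reconstruct a 3-way tensor from its TT-cores (ranks R1 = R2 = 2).
I, J, K, R1, R2 = 4, 5, 6, 2, 2
rng = np.random.default_rng(0)

G1 = rng.standard_normal((1, I, R1))   # first TT-core  (R0 = 1)
G2 = rng.standard_normal((R1, J, R2))  # middle TT-core
G3 = rng.standard_normal((R2, K, 1))   # last TT-core   (R3 = 1)

# Sum over the connected rank indices; the free indices i, j, k remain.
X = np.einsum('aib,bjc,ckd->ijk', G1, G2, G3)

assert X.shape == (I, J, K)

# Storage: TT-cores versus the full tensor.
tt_params = G1.size + G2.size + G3.size
print(tt_params, I * J * K)  # 40 entries in the cores vs 120 in the tensor
```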
3.3 Compression potential of TNs
When a tensor is approximately represented by a TN, the ranks determine the accuracy of the approximation, and the TN is called low-rank. These ranks are then hyperparameters. The low-rank property is very powerful for mainly two reasons. First, it can enable a compressed representation of the data with marginal loss of information Cichocki et al. (2017). Second, low-rank TNs have the capability to reduce a storage complexity that is exponential in the number of dimensions $D$ to one that is linear in $D$. With this property, they can alleviate the curse of dimensionality, which is the exponential growth of the number of elements with the number of dimensions $D$. For example, considering the CP decomposition of a given $D$-dimensional tensor with $I^D$ elements, the number of elements of the rank-$R$ decomposition is $DIR + R$, thus linear in $D$. When a TN is low-rank, the compression can be significant. In this paper, we call the compression from exponential in $D$ to linear in $D$ a logarithmic compression.
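A quick back-of-the-envelope check of this claim, with illustrative sizes $I = 10$, $D = 8$, $R = 5$:

```python
# Storage of a D-dimensional tensor with I entries per dimension versus its
# rank-R CP decomposition: I**D entries shrink to D*I*R factor-matrix
# entries plus R weights.
I, D, R = 10, 8, 5

full_entries = I ** D       # exponential in D
cp_entries = D * I * R + R  # linear in D

print(full_entries, cp_entries)  # 100_000_000 vs 405
```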
4 Applications of TNs to AI
To show how TNs can be integrated into AI algorithms, consider the following two common learning problems: unsupervised learning and supervised learning. One example of unsupervised learning is data compression, where a $D$-dimensional data tensor $\mathcal{X}$ can be approximately represented in a compressed format as a low-rank TN $\hat{\mathcal{X}}$. It can be formulated as
$$\min_{\hat{\mathcal{X}}} \; \| \mathcal{X} - \hat{\mathcal{X}} \|_F^2, \quad \text{s.t. } \hat{\mathcal{X}} \text{ is a low-rank TN}. \qquad (4)$$
Given a set of input/output pairs $\{\mathbf{x}_n\}_{n=1}^{N}$ and $\{y_n\}_{n=1}^{N}$, a supervised learning problem can be described as minimizing the empirical loss $L$ together with a regularization term $\Omega$,
$$\min_{\mathbf{w}} \; \sum_{n=1}^{N} L\big(f(\mathbf{x}_n, \mathbf{w}), y_n\big) + \Omega(\mathbf{w}), \qquad (5)$$
where $f$ is a nonlinear function parameterized by the weights $\mathbf{w}$. To apply TNs, $f$ is parametrized with low-rank TNs. In this section, we describe two models of $f$
parametrized by low-rank TNs: a kernel machine and a neural network. For these two examples, we quantify the efficiency gains obtained by logarithmic compression.
The fundamental insights taken from these specific examples illustrate the principle of how TNs make AI algorithms more efficient. Beyond that, the generalization of the examples enables the integration of TNs into a variety of AI applications.
4.1 TNs in kernel machines
Kernel machines, such as Gaussian processes (GPs) Rasmussen and Williams (2006)
and support vector machines (SVMs) Hammer and Gersmann (2003), can be universal function approximators. While they have shown equivalent or superior performance compared to neural networks Garriga-Alonso et al. (2018); Lee et al. (2017); Novak et al. (2018), they scale poorly for high-dimensional or large-scale problems.
We show how TNs can be integrated into the supervised learning problem described in (5), considering a kernel machine setting. By making four specific model choices, we present a low-rank TN primal kernel ridge regression problem (T-KRR) based on Wesel and Batselier (2021). Although we describe a specific problem characterized by our model choices, the same approach can be implemented with other model choices, resulting in a broad variety of problems where low-rank TNs are applicable.
Given an input/output pair $\mathbf{x}_n \in \mathbb{R}^{D}$ and $y_n \in \mathbb{R}$, a kernel machine is given by
$$y_n = \langle \mathbf{w}, \boldsymbol{\phi}(\mathbf{x}_n) \rangle + e_n, \qquad (6)$$
where $\boldsymbol{\phi}(\cdot)$ is a feature map, $\mathbf{w}$ is a weight vector, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $e_n$ is the error. In our example, the first model choice is to represent $\boldsymbol{\phi}(\mathbf{x}_n)$ as a Kronecker product of $D$ regressors, computed from a chosen number $M$ of basis functions. In this context, the following basis functions have been explored in the TN literature: polynomial basis functions are treated in Batselier and Wong (2017); Blondel et al. (2016); Rendle (2010), pure-power polynomials in Novikov et al. (2017) and Chen et al. (2017), and B-splines in Karagoz and Batselier (2020). The use of trigonometric basis functions is described in Stoudenmire and Schwab (2016), and Fourier features in Kargas and Sidiropoulos (2021); Wahls et al. (2014); Wesel and Batselier (2021).
For the $n$-th input vector, $\boldsymbol{\phi}(\mathbf{x}_n)$ is given by
$$\boldsymbol{\phi}(\mathbf{x}_n) = \boldsymbol{\phi}(x_{n1}) \otimes \boldsymbol{\phi}(x_{n2}) \otimes \cdots \otimes \boldsymbol{\phi}(x_{nD}) \in \mathbb{R}^{M^D}, \qquad (7)$$
where $x_{nd}$ is the $(n,d)$-th entry of a matrix $\mathbf{X} \in \mathbb{R}^{N \times D}$, whose rows are the input vectors $\mathbf{x}_n^{\top}$. For all $N$ samples, (6) can be rewritten as
$$\mathbf{y} = \boldsymbol{\Phi} \mathbf{w} + \mathbf{e}, \quad \boldsymbol{\Phi} \in \mathbb{R}^{N \times M^D}. \qquad (8)$$
The second and third model choices are to use the quadratic loss function and Tikhonov regularization, respectively. This results in a least squares problem for kernel ridge regression given by
$$\min_{\mathbf{w}} \; \| \mathbf{y} - \boldsymbol{\Phi} \mathbf{w} \|_2^2 + \lambda \| \mathbf{w} \|_2^2, \qquad (9)$$
with $\lambda$ as a regularization parameter. Solving (9) explicitly would require constructing and inverting the matrix $\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi} + \lambda \mathbf{I} \in \mathbb{R}^{M^D \times M^D}$, which has a complexity of $\mathcal{O}(N M^{2D} + M^{3D})$. Making this problem scalable calls for a more efficient approach. Applying TNs, the least squares problem (9) can be solved with neither constructing nor inverting this matrix explicitly. The key idea is to replace $\boldsymbol{\Phi}$ and $\mathbf{w}$ with TNs. Concerning $\boldsymbol{\Phi}$, a tensorized representation needs to be found. Each row of $\boldsymbol{\Phi}$ can be represented as a Kronecker product as shown in (7). Conveniently, a tensorization of such a row corresponds to a series of outer products, denoted by $\circ$, given by
$$\boldsymbol{\phi}(x_{n1}) \circ \boldsymbol{\phi}(x_{n2}) \circ \cdots \circ \boldsymbol{\phi}(x_{nD}). \qquad (10)$$
However, creating $\boldsymbol{\Phi}$ alone does not solve the scalability problem, since it still requires a storage complexity of $\mathcal{O}(N M^D)$. Instead of constructing $\boldsymbol{\Phi}$ explicitly, it is reinterpreted as a TN with the factors of (10) as the TN components. The outer product corresponds to a rank-1 connection between the components. Considering all $N$ samples, only $D$ matrices of size $N \times M$ need to be stored. This results in a storage complexity of $\mathcal{O}(DNM)$ for the TN representation, which is significantly smaller than $\mathcal{O}(N M^D)$ for $\boldsymbol{\Phi}$. The reduction in complexity is a logarithmic compression, because it reduces the storage from exponential in $D$ to linear in $D$.
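The rank-1 structure can be exploited directly: an inner product between a Kronecker-product feature vector and a rank-1 weight tensor (a CP term with $R = 1$) factors into $D$ small inner products, so the $M^D$-dimensional vectors never need to be formed. A sketch with illustrative sizes:

```python
import numpy as np

# D feature vectors of length M per sample; their Kronecker product is the
# M**D-dimensional row of Phi, which we never want to form explicitly.
D, M = 10, 4
rng = np.random.default_rng(0)
feats = [rng.standard_normal(M) for _ in range(D)]  # stored: only D*M numbers

# Rank-1 weight tensor: w = w1 (x) w2 (x) ... (x) wD (a CP term with R = 1).
ws = [rng.standard_normal(M) for _ in range(D)]

# Implicit inner product: <phi, w> = prod_d <phi_d, w_d>, cost O(D*M).
implicit = np.prod([f @ w for f, w in zip(feats, ws)])

# Explicit check, with cost and storage O(M**D) = 4**10 here.
phi, w = feats[0], ws[0]
for d in range(1, D):
    phi = np.kron(phi, feats[d])
    w = np.kron(w, ws[d])

assert phi.size == M ** D
assert np.isclose(phi @ w, implicit)
```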
The fourth model choice is to impose a low-rank structure onto $\mathbf{w}$ by parametrizing it as a TN. To do this, $\mathbf{w}$ first needs to have a tensorized representation, given by a $D$-dimensional tensor $\mathcal{W} \in \mathbb{R}^{M \times \cdots \times M}$. Second, the components of a low-rank TN representing $\mathcal{W}$ need to be found. In this context, different TNs can be used, e.g. the CP, Tucker, or TT decomposition introduced in section 3.2. The minimization problem can now be written as
$$\min_{\mathcal{W}} \; \| \mathbf{y} - \boldsymbol{\Phi} \, \mathrm{vec}(\mathcal{W}) \|_2^2 + \lambda \| \mathrm{vec}(\mathcal{W}) \|_2^2, \quad \text{s.t. } \mathcal{W} \text{ is a low-rank TN}. \qquad (11)$$
To find the TN components of $\mathcal{W}$, a variety of optimization solvers can be applied. The alternating linear scheme (ALS), for example, is an iterative scheme that finds the TN components in an alternating fashion. The ALS for the TT decomposition, e.g., is treated in Batselier et al. (2017); Stoudenmire and Schwab (2016). When using the CP decomposition, one iteration of the ALS has a complexity that is linear in both $N$ and $D$ rather than exponential in $D$. In this context, only a few iterations are necessary for convergence, as shown in Wesel and Batselier (2021). This is a significant efficiency gain compared to the exponential complexity of solving (9) directly. TNs are therefore a promising tool for Green AI.
To illustrate how Green AI can be put into practice, we present exemplifying experiments. All experiments (code at https://github.com/TUDelft-DeTAIL/Towards-Green-AI-with-TNs) were performed with Matlab R2020b on a Dell Inc. Latitude 7410 laptop with 16 GB of RAM and an Intel Core i7-10610U CPU running at 1.80 GHz. In our experiment, we use the T-KRR implementation (https://github.com/fwesel/T-KRR) available under MIT license Wesel and Batselier (2021), which uses deterministic Fourier features and a CP decomposition to solve (11). The experiments are performed on the airline data set. We compare the T-KRR algorithm to the direct solution of (9) in terms of runtime. We report the runtime and the number of parameters for an increasing number of basis functions in Table 1
. The runtime is given in terms of a mean and standard deviation over three runs. Computations that failed due to memory errors are denoted as not applicable (NA). We consider both a subset of the data and the whole data set. In both cases, the same fraction of the data is used for training. As reported in Table 1, solving (9) directly is not possible for larger numbers of basis functions, while the T-KRR can handle all cases. The number of parameters and the runtime grow exponentially for (9), while both grow linearly for the T-KRR. On the whole data set, the T-KRR algorithm requires only a small amount of electricity and causes correspondingly small CO$_2$e emissions, comparable to driving a car for a short distance. The numbers were computed with Lannelongue et al. (2021).
To discuss the trade-off between accuracy and efficiency from a Green AI perspective, we consider the accuracy, reported hardware, and runtime of three state-of-the-art methods for regression on the full airline data set: the Hilbert-GP Solin and Särkkä (2020), stochastic variational inference for GPs (SVIGP) Hensman et al. (2013), and the T-KRR. The Hilbert-GP attains its predictive mean squared error (MSE) on a MacBook Pro, the SVIGP achieves a lower MSE on a cluster, and the T-KRR attains the lowest MSE on a Dell laptop. Having the two highest accuracies, SVIGP and T-KRR enable all possible interactions between the selected basis functions by using multiplicative kernels. In contrast, the Hilbert-GP, using an additive kernel structure, achieves the best efficiency. The additive kernel structure limits the interaction between the basis functions at a cost in accuracy. Comparing the two models with multiplicative kernels, the T-KRR is more efficient than the SVIGP. This efficiency gain was achieved by exploiting the logarithmic compression potential of TNs. In addition, the T-KRR attains the highest accuracy on the full airline data set among all three methods.
4.2 TNs in deep learning
Deep learning (DL) has achieved state-of-the-art performance in many fields, such as computer vision (CV) He et al. (2016); Krizhevsky et al. (2012) and NLP Brown et al. (2020); Devlin et al. (2019). The success, however, comes at a cost: models in DL are large and require a lot of compute Strubell et al. (2019, 2020). Neural networks (NNs) have been made more efficient with TNs for a variety of application fields, including CV Jaderberg et al. (2014); Kim et al. (2016); Lebedev et al. (2015) and NLP Hrinchuk et al. (2020); Ma et al. (2019). This section provides an example of how a TN can make a NN layer more efficient and, more generally, how the set of efficiency metrics proposed for Green AI is reported for TNs in DL.
We consider solving the learning problem of (5), where $f$ is parametrized by a NN. TNs can be introduced into NN layer(s) to make NNs more efficient. To this end, we work out the details for a fully connected (FC) layer compressed with a TT decomposition for supervised learning, following the methodology of Novikov et al. (2015). The main operation in an FC layer is a matrix-vector product between the weights $\mathbf{W}$ and the input $\mathbf{x}$. To simplify notation, we omit the bias term. Before a TN can be applied, the weight matrix and input vector need to be tensorized into $\mathcal{W}$ and $\mathcal{X}$, respectively. Without loss of generality and for notational simplicity, we set all mode sizes equal to $I$. The key idea is to replace $\mathcal{W}$ with a TT decomposition by applying the $D$-dimensional form of (3), where each TT-core carries one output mode and one input mode. Without loss of generality, the ranks are set equal to $R$. The entries of the matrix-vector product in tensor format can then be computed as
$$\mathcal{Y}(i_1,\dots,i_D) = \sum_{j_1,\dots,j_D} \mathcal{G}^{(1)}[i_1, j_1] \, \mathcal{G}^{(2)}[i_2, j_2] \cdots \mathcal{G}^{(D)}[i_D, j_D] \; \mathcal{X}(j_1,\dots,j_D), \qquad (12)$$
where each $\mathcal{G}^{(d)}[i_d, j_d]$ is an $R_{d-1} \times R_d$ slice of the $d$-th TT-core, with $R_0 = R_D = 1$.
Introducing TTs in FC layers reduces the computational complexity of the forward pass from $\mathcal{O}(I^{2D})$ for the dense matrix-vector product to $\mathcal{O}(D R^2 I^{D+1})$ Novikov et al. (2015). The complexity of one backward pass is reduced from $\mathcal{O}(I^{2D})$ to $\mathcal{O}(D^2 R^4 I^{D+1})$. The memory complexity is reduced from $\mathcal{O}(I^{2D})$ to $\mathcal{O}(R I^{D})$ for the forward pass and to $\mathcal{O}(R^3 I^{D})$ for the backward pass.
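The TT-layer forward pass can be sketched in NumPy as follows, with illustrative sizes $D = 3$, $I = 4$, $R = 2$; the dense weight matrix is built here only to verify the result, and would never be formed in practice.

```python
import numpy as np

# Forward pass of a fully connected layer whose weight matrix is stored as a
# TT-matrix with D = 3 cores, equal mode sizes I and equal ranks R.
D, I, R = 3, 4, 2
rng = np.random.default_rng(0)

# Core d has shape (R_{d-1}, I_out, I_in, R_d), with R_0 = R_3 = 1.
G1 = rng.standard_normal((1, I, I, R))
G2 = rng.standard_normal((R, I, I, R))
G3 = rng.standard_normal((R, I, I, 1))

x = rng.standard_normal(I ** D)
X = x.reshape(I, I, I)  # tensorized input

# TT forward pass: contract the cores with the input, never forming W.
Y = np.einsum('aibq,qjcr,rkds,bcd->ijk', G1, G2, G3, X)

# Reference: build the dense weight matrix explicitly and compare.
W = np.einsum('aibq,qjcr,rkds->ijkbcd', G1, G2, G3).reshape(I ** D, I ** D)
assert np.allclose(Y.ravel(), W @ x)

# Parameter counts: dense I**(2D) versus the sum of TT-core sizes.
print(W.size, G1.size + G2.size + G3.size)  # 4096 vs 128
```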
Generally, TNs can be introduced into NNs by two methodologies: compressing a pretrained NN, or training a NN in its compressed form. In this context, a NN that is (partly) represented by TNs is referred to as a factorized NN. The first methodology takes a pretrained network, which is commonly available Abadi et al. (2015); Paszke et al. (2019), and decomposes its layers using (4) to minimize the approximation error on the pretrained weights. It is then common to fine-tune the factorized network to recover performance lost in the decomposition step Denton et al. (2014); Kim et al. (2016); Lebedev et al. (2015). The second methodology randomly initializes a factorized network and trains it in factorized form, which can make training more efficient due to the reduced number of parameters Ye et al. (2018). During fine-tuning in the first methodology or training in the second methodology, back-propagation is based on (5), where the weights are in TN form. The example of the TT layer works for both methodologies. Besides the TT Cheng et al. (2021); Efthymiou et al. (2019); Garipov et al. (2016); Novikov et al. (2015); Tjandra et al. (2017); Wu et al. (2020); Yang et al. (2017); Yu et al. (2019), other decompositions, e.g. Tucker Calvi et al. (2020); Chu and Lee (2021); Kim et al. (2016) and CP Astrid and Lee (2017); Chen et al. (2020); Denton et al. (2014); Jaderberg et al. (2014); Kossaifi et al. (2020); Lebedev et al. (2015); Mamalet and Garcia (2012); Rigamonti et al. (2013), have been proposed for layers such as FC, convolutional, recurrent, and attention layers.
Finally, we illustrate how TNs in DL contribute towards Green AI by reporting efficiency metrics, as shown for CP in convolutional layers Kim et al. (2016) and for recurrent layers with another TN called Block-Term decomposition Ye et al. (2018)
. These works compare their TN-based NN to a baseline not involving TNs. For Green AI, both the total training time and the inference efficiency are relevant. Making training more efficient can be achieved by computing one epoch faster and/or needing fewer epochs. Runtime and electricity usage are often reported for parsing one observation through the model. In Kim et al. (2016), a reduction in runtime and electricity usage by factors of up to 2.33 and 4.26, respectively, is reported. This comes at the cost of a drop in top-5 accuracy of up to 1.70%. Reducing the number of epochs results in faster training: the same validation accuracy as the baseline model is achieved with 14 times fewer epochs in Ye et al. (2018). In addition, Kim et al. (2016) reports a reduction in the number of parameters of up to 7.40 times, and Ye et al. (2018) shows a substantially reduced number of parameters together with a 15.6% increase in accuracy. Reducing the number of parameters also reduces the number of FLOPS: in Kim et al. (2016), a FLOPS reduction of up to 4.93 times is shown. In summary, these examples show the potential for efficiency gains by parametrizing NNs with TNs.
In this paper, we addressed three research questions. First, we answered why both accuracy and efficiency should define a better AI algorithm: taking both into account has the potential to incentivize the development of efficient algorithms and to reduce compute, which in turn has a positive impact on both sustainability and AI progress. Second, we answered why TNs can make AI algorithms more efficient. We mathematically showed efficiency gains by means of a kernel machine example and a deep learning example, and demonstrated that the key to the efficiency of TNs is their logarithmic compression potential. Third, we answered why TNs are an appropriate tool to contribute toward Green AI. We demonstrated with an exemplifying experiment that TNs improve efficiency, and we reported on multiple efficiency metrics in that context. In addition, we discussed the trade-off between efficiency and accuracy reported for examples in kernel methods and deep learning, from which we concluded that TNs can achieve competitive accuracy.
As an overall conclusion, by needing less hardware and less computational time, TNs have the potential to positively impact all three dimensions of sustainability as well as AI progress. First, TNs have the potential to reduce costs, which makes them attractive for industrial applications and more economically sustainable. Second, needing less hardware and computational time reduces the dependency on external computing services. In that way, the barriers to participation in AI research are lowered, and thus TNs have the potential to enhance inclusivity. In addition, by enabling sensitive data to be handled on-site, TNs can increase privacy protection. Third, TNs have the potential to reduce both the embodied and the operational emissions of AI algorithms, so a broader implementation of TNs can have a large impact on the environmental dimension of sustainability. Beyond that, when Moore's Law reaches its limit, algorithmic innovation will have to rely on improving efficiency. In that context, TNs can play an important role in AI progress.
The potential of TNs for Green AI can be limited by several factors. First, the additional hyperparameter tuning required to find appropriate ranks can reduce the efficiency gains achieved with TNs. Second, TNs rely on the assumption that correlations are present in the data; without such structure, little compression is possible. Third, the positive impact on environmental sustainability can be reduced because higher efficiency can lead to increased usage of AI, a phenomenon known as the rebound effect. The rebound effect is not inherently negative, because increased efficiency can contribute to more inclusivity Sorrell et al. (2018). Lower barriers to participation in AI can, for example, include researchers who use AI for climate change mitigation.
Looking forward, more research on TNs is required. Open questions include theoretical aspects, such as the further integration of TNs into probabilistic frameworks and how the parametrization of NNs affects the learning convergence rate. More practical aspects include the wider adoption of open-source libraries and packages for TNs, as well as making TNs part of fundamental AI education.
Our vision is that TNs will find broad application in AI across diverse sectors and contribute to more sustainable AI research. We encourage researchers to integrate TNs into their research, address the open questions, and thereby push AI progress forward. This paper is a first milestone in elaborating on TNs from a Green AI perspective. We showed how TNs can make AI algorithms better.
-  (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Cited by: §4.2.
-  CO2 emission and compensation price per destination. Note: https://www.klm.com/travel/nl_en/images/KLMdestinationsCO2data_jan2018_tcm542-658529.pdf Accessed: 2022-02-03. Cited by: §1.
-  AI and Compute. Note: https://openai.com/blog/ai-and-compute/ Accessed: Feb. 2, 2022. Cited by: §2.1.
-  (2020) Carbontracker: tracking and predicting the carbon footprint of training deep learning models. Note: ICML Workshop on Challenges in Deploying and Monitoring Machine Learning Systems. arXiv:2007.03051. Cited by: §2.2.
-  (2017) CP-decomposition with tensor power method for convolutional neural networks compression. In IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 115–118. Note: ISSN 2375-9356. Cited by: §4.2.
-  (2017) Tensor network alternating linear scheme for MIMO Volterra system identification. Automatica 84, pp. 26–35. Cited by: §4.1.
-  (2017) A constructive arbitrary-degree Kronecker product decomposition of tensors. Numerical Linear Algebra with Applications 24 (5), pp. e2097. Cited by: §4.1.
-  (2016) Higher-order factorization machines. Advances in Neural Information Processing Systems 29. Cited by: §4.1.
-  (2020) Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901. Cited by: §4.2.
-  (2020-01-06) Compression and interpretability of deep neural networks via Tucker tensor layer: from first principles to tensor valued back-propagation. arXiv:1903.06133 [cs, eess]. Cited by: §4.2.
-  (1970) Analysis of individual differences in multidimensional scaling via an n-way generalization of "Eckart-Young" decomposition. Psychometrika 35 (3), pp. 283–319. Cited by: §3.2.
-  (2020-08-02) Tensor low-rank reconstruction for semantic segmentation. arXiv:2008.00490 [cs]. Cited by: §4.2.
-  (2018) Parallelized tensor train learning of polynomial classifiers. IEEE Transactions on Neural Networks and Learning Systems 29 (10), pp. 4621–4632. Cited by: §4.1.
-  (2021) S3-net: A fast and lightweight video scene understanding network by single-shot segmentation. In IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3329–3337. Cited by: §4.2.
-  (2021-12-07) Low-rank tensor decomposition for compression of convolutional neural networks using funnel regularization. arXiv:2112.03690 [cs]. Cited by: §4.2.
-  (2016) Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions. Foundations and Trends® in Machine Learning 9 (4-5), pp. 249–429. Cited by: §2.2, §3.1, §3.2.
-  (2015) Tensor decompositions for signal processing applications: from two-way to multiway component analysis. IEEE Signal Processing Magazine 32 (2), pp. 145–163. Cited by: Figure 1.
-  (2017) Tensor networks for dimensionality reduction and large-scale optimizations: Part 2 applications and future perspectives. Foundations and Trends® in Machine Learning 9 (6), pp. 431–673. Cited by: §3.2, §3.3, §3.
-  (2014) Era of big data processing: A new approach via tensor networks and tensor decompositions. arXiv preprint arXiv:1403.2048. Cited by: §1.
-  (2022) Introduction to algorithms. MIT press. Cited by: §1.
-  (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, pp. 9. Cited by: §4.2.
-  (2019-05-24) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 [cs]. Note: version 2. Cited by: §4.2.
-  (2019-06-07) TensorNetwork for machine learning. arXiv:1906.06329 [cond-mat, physics:physics, stat]. Cited by: §4.2.
-  (2016-11-10) Ultimate tensorization: Compressing convolutional and FC layers alike. arXiv:1611.03214 [cs]. Cited by: §4.2.
-  (2018) Deep convolutional networks as shallow Gaussian processes. In International Conference on Learning Representations, Cited by: §4.1.
-  (2021) Chasing carbon: The elusive environmental footprint of computing. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 854–867. Cited by: §2.1.
-  (2017) How much energy can green HPC cloud users save?. In 2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 416–420. Cited by: §2.1.
-  (2003) A note on the universal approximation capability of support vector machines. Neural Processing Letters 17 (1), pp. 43–53. Cited by: §4.1.
-  (1970) Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis. UCLA Working Papers in Phonetics 16 (10), pp. 1–84. Cited by: §3.2.
-  (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.2.
-  (2018) Boosted sparse and low-rank tensor regression. Advances in Neural Information Processing Systems 31. Cited by: §1.
-  (2020) Towards the systematic reporting of the energy and carbon footprints of machine learning. Journal of Machine Learning Research 21 (248), pp. 1–43. Cited by: §1, §2.1, §2.2.
-  (2013) Gaussian processes for big data. arXiv preprint arXiv:1309.6835. Cited by: §4.1.
-  (2020) Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305. Cited by: §2.2, §2.2.
-  (2020-11) Tensorized embedding layers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4847–4860. Cited by: §4.2.
-  (2019) Machine learning and artificial intelligence to aid climate change research and preparedness. Environmental Research Letters 14 (12), pp. 124007. Cited by: §2.1.
-  (2018) Global warming of 1.5°C. An IPCC special report on the impacts of global warming of 1.5°C above pre-industrial levels and related global greenhouse gas emission pathways, in the context of strengthening the global response to the threat of climate change, sustainable development, and efforts to eradicate poverty. World Meteorological Organization. Cited by: §2.1.
-  (2022) Climate change 2022: Mitigation of climate change. contribution of working group III to the sixth assessment report of the intergovernmental panel on climate change. Cambridge University Press. Cited by: §2.1.
-  (2018) Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition. In International Conference on Artificial Intelligence and Statistics, pp. 726–735. Cited by: §1.
-  (2014-05-15) Speeding up convolutional neural networks with low rank expansions. arXiv:1405.3866. Cited by: §4.2, §4.2.
-  (2019) A survey on tensor techniques and applications in machine learning. IEEE Access 7, pp. 162950–162990. Cited by: §3.
-  (2020) Nonlinear system identification with regularized tensor network b-splines. Automatica 122, pp. 109300. Cited by: §4.1.
-  (2021) Supervised learning and canonical decomposition of multivariate functions. IEEE Transactions on Signal Processing 69, pp. 1097–1107. Cited by: §4.1.
-  (2011) O(d log N)-quantics approximation of N-d tensors in high-dimensional numerical modeling. Constructive Approximation 34 (2), pp. 257–280. Cited by: §3.1.
-  (2016-02-24) Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR. Cited by: §4.2, §4.2, §4.2.
-  (1973) Fundamental algorithms. Cited by: §1.
-  (1976) Big omicron and big omega and big theta. ACM Sigact News 8 (2), pp. 18–24. Cited by: §1, §2.2.
-  (2009) Tensor decompositions and applications. SIAM Review 51 (3), pp. 455–500. Cited by: §3.1.
-  (2020) Factorized higher-order CNNs with an application to spatio-temporal emotion estimation. In Computer Vision and Pattern Recognition, pp. 6060–6069. Cited by: §4.2.
-  (2012) ImageNet classification with deep convolutional neural networks. In Neural Information Processing Systems. Cited by: §4.2.
-  (2019) Quantifying the carbon emissions of machine learning. Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019. Cited by: §1, §2.2.
-  (2021) Green algorithms: Quantifying the carbon footprint of computation. Advanced Science. Cited by: §1, §2.2, §4.1.
-  (2015-04-24) Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations. Cited by: §4.2, §4.2.
-  (2017) Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165. Cited by: §4.1.
-  (2020) There’s plenty of room at the top: what will drive computer performance after Moore’s law?. Science 368 (6495). Cited by: §2.1.
-  (2019) Energy usage reports: Environmental awareness as part of algorithmic accountability. Workshop on Tackling Climate Change with Machine Learning at NeurIPS 2019. Note: arXiv:1911.08354 Cited by: §2.2.
-  (2019) A tensorized transformer for language modeling. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §4.2.
-  (2012) Simplifying ConvNets for fast learning. In Artificial Neural Networks and Machine Learning – ICANN 2012, pp. 58–65. Cited by: §4.2.
-  The Paris Agreement. Note: https://www.un.org/en/climatechange/paris-agreement Accessed: May 17, 2022. Cited by: §2.1.
-  (2018) Bayesian deep convolutional networks with many channels are Gaussian processes. In International Conference on Learning Representations, Cited by: §4.1.
-  (2015) Tensorizing neural networks. In Neural Information Processing Systems, Cited by: §4.2, §4.2.
-  (2017-12-08) Exponential machines. In ICLR. Cited by: §4.1.
-  (2010) Approximation of 2^d × 2^d matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications 31 (4), pp. 2130–2145. Cited by: §3.1.
-  (2011-01) Tensor-train decomposition. SIAM J. Sci. Comput. 33 (5), pp. 2295–2317. Cited by: §3.2.
-  (2019) PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Cited by: §4.2.
-  (2019) Three pillars of sustainability: In search of conceptual origins. Sustainability science 14 (3), pp. 681–695. Cited by: §2.1.
-  (2006) Gaussian Processes for Machine Learning. The MIT Press. Cited by: §4.1.
-  (2010) Factorization machines. In 2010 IEEE International conference on data mining, pp. 995–1000. Cited by: §4.1.
-  (2021) Solving high-dimensional parabolic PDEs using the tensor train format. In International Conference on Machine Learning, pp. 8998–9009. Cited by: §1.
-  (2013) Learning separable filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2754–2761. Cited by: §4.2.
-  (2022) Tackling climate change with machine learning. ACM Computing Surveys (CSUR) 55 (2), pp. 1–96. Cited by: §2.1.
-  (2021) CodeCarbon: Estimate and Track Carbon Emissions from Machine Learning Computing. Zenodo. Note: https://github.com/mlco2/codecarbon Cited by: §2.2.
-  (2020) Green AI. Communications of the ACM 63 (12), pp. 54–63. Cited by: §1, §1, §2.1, §2.1, §2.2, §2.2, §2.
-  (2017) Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing 65 (13), pp. 3551–3582. Cited by: §3.
-  (2011) A kernel-based framework to tensorial data analysis. In International Conference on Artificial Neural Networks, Vol. 24, pp. 861–874. Cited by: §3.
-  (2020) Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing 30 (2), pp. 419–446. Cited by: §4.1.
-  (2018) Energy sufficiency and rebound effects. Prepared for ECEEE's Energy Sufficiency Project. Cited by: §5.
-  (2016) Supervised learning with quantum-inspired tensor networks. arXiv:1605.05775. Cited by: §4.1, §4.1.
-  (2019) Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243. Cited by: §1, §2.1, §2.1, §2.1, §2.2, §4.2.
-  (2020) Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 13693–13696. Cited by: §1, §2.1, §2.2, §4.2.
-  (2022) The AI carbon footprint and responsibilities of AI scientists. Philosophies 7 (1), pp. 4. Cited by: §1, §2.1.
-  (2020) The computational limits of deep learning. arXiv preprint arXiv:2007.05558. Cited by: §2.1.
-  (2017) Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 4451–4458. Note: ISSN 2161-4407. Cited by: §4.2.
-  (1966) Some Mathematical Notes on Three-Mode Factor Analysis. Psychometrika 31 (3), pp. 279–311. Cited by: §3.2.
-  (2019) Emissions gap report 2019. UN Environment Programme. Cited by: §2.1.
-  (2014) Learning multidimensional fourier series with tensor trains. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 394–398. Cited by: §4.1.
-  (2021) Large-scale learning with fourier features and tensor decompositions. Advances in Neural Information Processing Systems 34. Cited by: §1, §4.1, §4.1, §4.1, §4.1, Table 1.
-  ML Reproducibility Challenge 2021. Note: https://paperswithcode.com/rc2021 Accessed: May 18, 2022. Cited by: §2.1.
-  (2020-12) Hybrid tensor decomposition in neural network compression. Neural Networks 132, pp. 309–320. Cited by: §4.2.
-  (2021) Sustainable AI: environmental implications, challenges and opportunities. arXiv:2111.00364. Cited by: §1.
-  (2017-07-06) Tensor-train recurrent neural networks for video classification. arXiv:1707.01786 [cs]. Cited by: §4.2.
-  (2018) Learning compact recurrent neural networks with block-term tensor decomposition. In Computer Vision and Pattern Recognition, pp. 10. Cited by: §4.2, §4.2.
-  (2019-08-23) Long-term forecasting using higher order tensor RNNs. JMLR. Cited by: §4.2.