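and machine learning [3, 4]. This problem is often referred to as matrix completion (MC). Let $M \in \mathbb{R}^{n \times m}$ be a matrix with a limited number of observed entries defined by a set of indices $\Omega$ such that $(i,j) \in \Omega$ if $M_{ij}$ has been observed. Then, recovering $M$ from the observed entries can be formulated as an optimization problem of the form:
$$\min_{X} \; \| P_{\Omega}(X) - P_{\Omega}(M) \|_F^2, \tag{1}$$
with $X$ denoting the complete matrix, $P_{\Omega}$ an operator that indexes the entries defined in $\Omega$, and $\|\cdot\|_F$ the Frobenius norm, that is, $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.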
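The objective described above only compares entries at observed positions. A minimal sketch in plain Python follows, assuming the squared form of the Frobenius norm; the function name `masked_frobenius_sq` and the toy matrices are hypothetical and used only for illustration.

```python
def masked_frobenius_sq(X, M, omega):
    """Sum of squared differences between X and M over the observed indices."""
    return sum((X[i][j] - M[i][j]) ** 2 for (i, j) in omega)


# Toy 2x3 matrix; None marks an unobserved entry.
M = [[5.0, None, 3.0],
     [None, 4.0, None]]

# Index set of observed entries.
omega = [(i, j) for i, row in enumerate(M)
         for j, value in enumerate(row) if value is not None]

# A candidate complete matrix; only its entries at observed positions are penalized.
X = [[4.0, 2.0, 3.0],
     [1.0, 4.5, 2.0]]

loss = masked_frobenius_sq(X, M, omega)
```

Entries of the candidate matrix at unobserved positions do not affect the loss, which is why additional structure (e.g., low rank) is needed to make the problem well posed.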
Several studies have focused on the problem of reconstructing the matrix $M$ from its observed entries, most of them assuming that $M$ is a low-rank matrix. As the rank minimization problem is intractable, nuclear norm minimization is often used as a convex relaxation [5, 6]. However, the major drawback of nuclear norm minimization algorithms is their high computational cost. Less computationally demanding matrix factorization methods [7, 8] approximate the unknown matrix by a product of two factors, $M \approx U V^{\top}$, with $U \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{m \times d}$ and $d \ll \min(n, m)$, and have been employed in large-scale MC problems such as recommender systems.
Recently, matrix completion has been addressed by several deep-network-based models [3, 4, 9, 10, 11] that yield state-of-the-art results in a variety of benchmark datasets. Having the ability to provide powerful data representations, deep neural networks have achieved great successes in many problems, from acoustic modeling [12] and compressed sensing [13] to image classification [14] and social media analysis [15]. Among deep-network-based MC models, autoencoder-based methods have received a lot of attention due to their superior performance and a direct connection to the matrix factorization principle [16, 17, 18, 19, 20]. Despite their remarkable performance, these models suffer from overfitting, especially when dealing with highly sparse matrices. To overcome this problem, more recent works either employ data-independent regularization techniques such as weight decay and dropout [16, 17, 18, 20, 21], or incorporate side information to mitigate the lack of available information [17, 18, 19]. The efficiency of the former approach is limited under high sparsity settings [16, 17, 20], while the latter is not directly applicable when side information is unavailable or difficult to collect.
In this paper, we focus on matrix completion without side information under settings with highly scarce observations, and propose a data-dependent regularization technique to mitigate the overfitting of autoencoder-based MC models. While data-independent regularization approaches focus on training a model such that it is difficult to fit to random error or noise, data-dependent techniques rely on the idea that the data of interest lie close to a manifold and learn attributes that are present in the data [22, 23]. In particular, we combine row-based and column-based autoencoders into a hybrid model and constrain the latent representations produced by the two models by a manifold learning objective as in . This constraint can be considered as a data-dependent regularization. The resulting model follows the multi-task learning paradigm [24, 25, 26], where the main task is the reconstruction and the auxiliary task is the manifold learning. Experimental results on various real datasets show that our model effectively mitigates overfitting and consistently improves the performance of autoencoder-based models. The proposed approach is complementary to data-independent regularization, and the two techniques can be combined in an efficient way .
II-A Autoencoder-based Matrix Completion
The first autoencoder-based MC model, coined AutoRec [16], comes in two versions, namely, the row-based and the column-based AutoRec. For brevity, we only focus on the formulation of the row-based AutoRec; the column-based model can be defined in a similar way. The row-based AutoRec model operates on a row of a partially observed matrix, projects it into a low-dimensional latent (hidden) space and then reconstructs it in the original space. Let $M \in \mathbb{R}^{n \times m}$ be an incomplete matrix with rows $r_i \in \mathbb{R}^{m}$, $i = 1, \dots, n$, and columns $c_j \in \mathbb{R}^{n}$, $j = 1, \dots, m$. On the encoder side, the model takes as input a partially observed row vector $r_i$ and transforms it to a latent representation vector $h_i$ through a series of hidden neural network layers. On the decoder side, a dense reconstruction $\hat{r}_i$ of $r_i$ is generated from $h_i$ through another set of hidden layers to predict the missing values. Denote by $f(\cdot\,; \theta_e)$ and $g(\cdot\,; \theta_d)$ the non-linear functions corresponding to the encoder and the decoder, respectively, with $\theta_e$, $\theta_d$ the corresponding vectors containing the free parameters of the model. Then, the intermediate representation of $r_i$ is given by $h_i = f(r_i; \theta_e)$, while the dense reconstruction is obtained as $\hat{r}_i = g(h_i; \theta_d)$. The objective function used to train the row-based AutoRec involves a reconstruction loss of the form:
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left( \hat{r}_{ij} - r_{ij} \right)^2, \tag{2}$$
and a regularization term; typically, $\ell_2$ regularization is applied to the model parameters. In the above equation, $r_{ij}$ denotes the $j$-th element of the $i$-th row and $|\Omega|$ the cardinality of the set $\Omega$ of observed indices. Figure 1 illustrates the general architecture of a row-based AutoRec.
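To make the encoder, the decoder and the masked reconstruction loss concrete, the following is a minimal sketch of a row-based autoencoder with a single hidden layer, written in plain Python. It is not the reference AutoRec implementation: the zero-filling of unobserved inputs, the sigmoid hidden activation, the linear output layer, and all function names and sizes are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    """Logistic activation used for the hidden layer (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W v + b for a weight matrix W (list of rows) and a bias vector b."""
    return [sum(w * x for w, x in zip(row, v)) + bk for row, bk in zip(W, b)]


def autorec_forward(row, W_enc, b_enc, W_dec, b_dec):
    """Encode a partially observed row and decode a dense reconstruction."""
    # Assumption: unobserved entries (None) are fed to the encoder as zeros.
    filled = [0.0 if v is None else v for v in row]
    h = [sigmoid(a) for a in affine(W_enc, b_enc, filled)]
    r_hat = affine(W_dec, b_dec, h)  # linear output layer (assumption)
    return h, r_hat


def masked_squared_error(row, r_hat):
    """Mean squared error over the observed entries of one row."""
    observed = [j for j, v in enumerate(row) if v is not None]
    return sum((r_hat[j] - row[j]) ** 2 for j in observed) / len(observed)


# Toy example: a row with m = 4 entries (two observed) and d = 2 hidden units.
rng = random.Random(0)
m, d = 4, 2
W_enc = [[rng.uniform(-0.1, 0.1) for _ in range(m)] for _ in range(d)]
W_dec = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(m)]
b_enc, b_dec = [0.0] * d, [0.0] * m

row = [5.0, None, 3.0, None]
h, r_hat = autorec_forward(row, W_enc, b_enc, W_dec, b_dec)
loss = masked_squared_error(row, r_hat)
```

The reconstruction `r_hat` is dense, so its values at unobserved positions serve as predictions, while the loss is computed only on observed positions.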
II-B Multi-task Learning
Multi-task learning refers to training a model designed for a specific task using information extracted by other tasks that are related to the main task [27, 24, 25]. The use of knowledge gained from the solution of one or more auxiliary tasks is an inductive transfer mechanism, which can improve a model by introducing inductive bias. The inductive bias provided by the auxiliary tasks causes the model to prefer hypotheses that explain more than one task. This approach often leads to solutions that generalize better for the main task. Multi-task learning can therefore be considered as a regularization technique that reduces the risk of overfitting. When designing a multi-task learning model, we seek auxiliary tasks that are similar to the main task. Different criteria for considering two tasks as similar have been reported in the literature [24, 28, 29].
Multi-task learning in deep neural networks typically involves learning tasks in parallel while using a shared representation. A common approach when designing a multi-task deep network is to share hidden layers between all tasks and keep task-specific outputs separate. The idea is that what is learned for each task can help other tasks be learned better.
III Proposed Approach
Considering that overfitting is a major drawback of existing AutoRec models, the MC model proposed in this paper aims to address this problem by introducing a regularization strategy that follows the multi-task learning principle. The proposed model is a two-branch autoencoder performing one main task and one auxiliary task. The main task outputs row and column predictions of the unknown matrix. The auxiliary task uses the latent representations of the main task to predict the value of a single matrix entry. Concretely, the additional output that corresponds to the auxiliary task imposes a similarity constraint on the hidden representations produced by the row and column autoencoders. The proposed constraint acts as a data-dependent regularization on the autoencoder models.
III-A Model Architecture
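The architecture of the proposed model is illustrated in Fig. 2. The model consists of two branches. The row branch takes the $i$-th row vector $r_i$ as input, predicts the missing values and produces the dense reconstruction $\hat{r}_i$ as output. Assuming a deep autoencoder with $L$ encoding and $L$ decoding layers, the hidden representation $h_i^{r}$ of the $i$-th row is obtained according to:
$$h_i^{r} = a^{(L)}, \qquad a^{(l)} = \sigma\!\left( W_e^{(l)} a^{(l-1)} + b_e^{(l)} \right), \quad l = 1, \dots, L, \tag{3}$$
with $a^{(0)} = r_i$, while for the decoding part it holds
$$\hat{r}_i = z^{(L)}, \qquad z^{(l)} = \sigma\!\left( W_d^{(l)} z^{(l-1)} + b_d^{(l)} \right), \quad l = 1, \dots, L, \tag{4}$$
with $z^{(0)} = h_i^{r}$. In the above equations, $W_e^{(l)}$, $W_d^{(l)}$ are the weights used in the $l$-th encoding and the $l$-th decoding layer, respectively, $b_e^{(l)}$, $b_d^{(l)}$ the corresponding biases, and $\sigma$ the activation function.
Similarly, the column branch outputs the dense reconstruction $\hat{c}_j$ from the input incomplete column vector $c_j$. Without any additional constraint on these two branches, this model can be seen as an ensemble of two AutoRec models, working independently.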
To enforce that the two branches work together, we propose to use the corresponding latent representations of rows and columns to perform a related task, namely, to realize a matrix factorization model. The basic assumption of matrix factorization models is that there exists an unknown representation of rows and columns in a latent space such that the entry $M_{ij}$ of a matrix can be modelled as the inner product of the latent representations of the $i$-th row and the $j$-th column in that space [7, 8]. This is also the dominant idea of an existing deep matrix factorization model. Following this approach, we enforce the hidden representations of the row and column autoencoders to be close under the cosine similarity if the corresponding matrix entry is of high value. Specifically, we define a loss term associated to the observed entry $M_{ij}$ as follows:
$$\ell_{\mathrm{rep}}(i,j) = \left( \cos\!\left( h_i^{r}, h_j^{c} \right) - \tilde{M}_{ij} \right)^2, \tag{5}$$
where $(i,j) \in \Omega$, $\cos(\cdot,\cdot)$ is the cosine similarity function, and $h_i^{r}$, $h_j^{c}$ the hidden representations of the row and column autoencoders, respectively. We employ the cosine similarity rather than the dot product as a similarity metric, as we empirically found that it enables learning a latent space of much higher dimensions. Since the cosine similarity between two vectors lies in $[-1, 1]$, assuming that $M_{ij} \in [\alpha, \beta]$, all entries in the original matrix are scaled during training according to $\tilde{M}_{ij} = (M_{ij} - \mu)/\tau$ with $\mu = (\alpha + \beta)/2$, $\tau = (\beta - \alpha)/2$.
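The representation loss can be illustrated with a short sketch. The code below assumes a squared difference between the cosine similarity of the two hidden vectors and the rating after an affine rescaling of its range to $[-1, 1]$; both the squared form and the exact scaling map are assumptions, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, 1e-12)


def scale_rating(value, low, high):
    """Affinely map a rating from [low, high] to [-1, 1] (assumed scaling)."""
    return (2.0 * value - low - high) / (high - low)


def representation_loss(h_row, h_col, value, low, high):
    """Squared gap between cosine similarity and the scaled rating (assumed form)."""
    target = scale_rating(value, low, high)
    return (cosine_similarity(h_row, h_col) - target) ** 2


# Two nearly aligned hidden vectors: a high rating is cheap, a low rating is costly.
h_row = [0.2, 0.9, 0.1]
h_col = [0.1, 0.8, 0.3]
loss_high = representation_loss(h_row, h_col, 5.0, 1.0, 5.0)
loss_low = representation_loss(h_row, h_col, 1.0, 1.0, 5.0)
```

With ratings in the range 1 to 5, the highest rating maps to a target similarity of 1 and the lowest to -1, so aligned hidden vectors are penalized only when the observed rating is low.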
Employing (5) to train the proposed model is equivalent to applying a manifold learning objective on the intermediate outputs of the row and column autoencoders. The proposed model can then be thought of as performing two tasks simultaneously, that is, learning to reconstruct and manifold learning. The latter is the auxiliary task, playing the role of improving the main reconstruction task.
III-B Objective Function
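While in single-task learning we optimize one loss function, in multi-task learning the objective function is a combination of loss terms. In the proposed model, the loss associated to each of the known entries in $\Omega$ consists of three terms, namely, the reconstruction loss for the row autoencoder $\ell_{r}(i,j) = (\hat{r}_{ij} - r_{ij})^2$, the reconstruction loss for the column autoencoder $\ell_{c}(i,j) = (\hat{c}_{ji} - c_{ji})^2$, and the representation loss $\ell_{\mathrm{rep}}(i,j)$ defined in (5), where $r_{ij}$ denotes the $j$-th element of the $i$-th row, and $c_{ji}$ the $i$-th element of the $j$-th column. Averaging over all training samples in $\Omega$, we obtain the objective function:
$$\mathcal{L} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \left[ \lambda_{r}\, \ell_{r}(i,j) + \lambda_{c}\, \ell_{c}(i,j) + \lambda_{\mathrm{rep}}\, \ell_{\mathrm{rep}}(i,j) \right], \tag{6}$$
with $\lambda_{r}$, $\lambda_{c}$, $\lambda_{\mathrm{rep}}$ appropriate weights.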
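The combination of the three loss terms can be sketched as a weighted average over the observed entries. The function name, the argument names and the default weights below are hypothetical and are not the values used in the experiments.

```python
def multitask_objective(per_entry_losses, w_row=1.0, w_col=1.0, w_rep=0.5):
    """Weighted sum of the three loss terms, averaged over observed entries.

    per_entry_losses: list of (row_loss, col_loss, rep_loss) tuples,
    one tuple per observed entry.
    """
    total = sum(w_row * l_row + w_col * l_col + w_rep * l_rep
                for l_row, l_col, l_rep in per_entry_losses)
    return total / len(per_entry_losses)


# Two observed entries with made-up loss values.
per_entry_losses = [(0.4, 0.2, 0.1), (0.0, 0.6, 0.3)]

objective = multitask_objective(per_entry_losses)
# Setting the representation weight to zero leaves only the two reconstruction terms.
objective_no_rep = multitask_objective(per_entry_losses, w_rep=0.0)
```

With a zero representation weight the two branches no longer interact through the loss, which corresponds to training two independent autoencoders.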
III-C Comparison with Existing Work
The model presented in [19] is based on a neural network component that incorporates side information, coined additional stacked denoising autoencoder (aSDAE). Specifically, aSDAE is a two-branch neural network that takes a pair of (corrupted) inputs, namely, a row (column) vector and the side information, and produces a pair of reconstructed outputs. Employing two branches that realize a row and a column aSDAE, the model performs matrix factorization as a main task, using the latent representations provided by the row and column aSDAEs. Concretely, the objective of that work concerns deep matrix factorization with side information. Different from [19], our work focuses on settings in which the available data is highly sparse and side information is unavailable. Our main objective is to compensate for the scarcity of data by providing an efficient regularization technique for autoencoder MC models that can address overfitting. In our setting, the row and column reconstruction is the main task and the matrix factorization is an auxiliary task. Our approach is simpler, with fewer branches compared to [19] and fewer hyperparameters to be tuned.
Other regularization methods aiming to improve the generalizability of AutoRec have been reported in [17, 18, 20]. The model presented in [17, 18] uses a denoising autoencoder and employs side information to augment the training data and help the model learn better. The authors of [20] extend AutoRec with a very deep architecture and heavily regularize its parameters by employing a dropout regularizer with a very high dropping ratio, while proposing a dense re-feeding technique to train the model. Nevertheless, the performance of these models is reduced in case of high scarcity in the training data or lack of side information. Similar to Dropout [21], denoising autoencoders average the realizations of a given ensemble and try to make it difficult for the model to fit random error. On the other hand, our model learns attributes that are present in the data rather than preventing the model from learning non-attributes. It should be noted that existing techniques are complementary to the proposed approach and their combination could lead to further performance improvement.
IV Experimental Results
We carry out experiments involving two benchmark datasets, namely, the MovieLens100K and the MovieLens1M [30], containing approximately one hundred thousand and one million observations, respectively. The datasets contain users’ movie ratings summarized in two matrices where rows correspond to users and columns to movies. Only a small fraction of ratings is observed in each dataset. We randomly split these ratings into three sets, used for training, validation and evaluation, respectively. We evaluate the performance of the proposed model in terms of regularization quality and reconstruction accuracy. On each dataset, we run the models on five different random splits of the data and report the average RMSE and MAE values, calculated as follows: $\mathrm{RMSE} = \sqrt{\sum_{(i,j) \in \Omega_{\mathrm{eval}}} (\hat{M}_{ij} - M_{ij})^2 / |\Omega_{\mathrm{eval}}|}$, and $\mathrm{MAE} = \sum_{(i,j) \in \Omega_{\mathrm{eval}}} |\hat{M}_{ij} - M_{ij}| / |\Omega_{\mathrm{eval}}|$, with $\hat{M}_{ij}$ the predicted value of the entry $M_{ij}$, $\Omega_{\mathrm{eval}}$ the set of indices corresponding to entries available for evaluation and $|\Omega_{\mathrm{eval}}|$ its cardinality.
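The two evaluation metrics can be computed as in the following sketch, which uses the standard definitions of RMSE and MAE restricted to the evaluation indices. The function names and the toy matrices are hypothetical.

```python
import math


def rmse(pred, target, eval_idx):
    """Root mean squared error over the evaluation indices."""
    sq = sum((pred[i][j] - target[i][j]) ** 2 for i, j in eval_idx)
    return math.sqrt(sq / len(eval_idx))


def mae(pred, target, eval_idx):
    """Mean absolute error over the evaluation indices."""
    ab = sum(abs(pred[i][j] - target[i][j]) for i, j in eval_idx)
    return ab / len(eval_idx)


# Toy example: two held-out ratings; None marks entries not used for evaluation.
target = [[4.0, None],
          [None, 1.0]]
pred = [[3.5, 2.0],
        [4.0, 2.0]]
eval_idx = [(0, 0), (1, 1)]

rmse_value = rmse(pred, target, eval_idx)
mae_value = mae(pred, target, eval_idx)
```

Because the squared error weights large deviations more heavily, RMSE is never smaller than MAE on the same set of entries.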
IV-A Hyperparameter Settings
Following the original AutoRec model [16], we employ only one hidden layer for each branch of our model, while the numbers of units in the input and output layers are set according to the sizes of the matrices. We train our model using the Adam optimizer [31] with mini-batches, and an initial learning rate that decays by a constant factor every 30 epochs. The number of training epochs is fixed. The $\ell_2$ regularization weight is set separately for the MovieLens100K and the MovieLens1M datasets.
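The step-wise learning-rate decay mentioned above can be sketched as follows. The initial learning rate and the decay factor used here are hypothetical placeholders and should not be read as the settings of the experiments; only the 30-epoch step comes from the text.

```python
def step_decay_lr(initial_lr, decay_factor, epoch, step=30):
    """Learning rate after multiplying by decay_factor once every `step` epochs."""
    return initial_lr * decay_factor ** (epoch // step)


# Hypothetical values, for illustration only.
initial_lr = 1e-3
decay_factor = 0.5

schedule = [step_decay_lr(initial_lr, decay_factor, epoch) for epoch in (0, 29, 30, 65)]
```

The rate stays constant within each block of 30 epochs and drops by the chosen factor at epochs 30, 60, and so on.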
We search for the best weights for the loss terms of the objective function in Sec. III-B using a separate random split of the two datasets. As the roles of the row and column branches in our model are equivalent, we keep the weights of the two reconstruction losses fixed at a common value, and only vary the weight of the representation loss. The results for different values of this weight on the two datasets are shown in Table I. The value that delivers the highest reconstruction accuracy differs between the MovieLens100K and the MovieLens1M datasets. The obtained values confirm our conjecture that the role of regularization is critical when the number of available samples is small, that is, when we train our model on the MovieLens100K dataset.
IV-B Regularization Performance
As the proposed approach can be seen as a data-dependent regularizer, we carry out experiments on the MovieLens100K dataset to evaluate its regularization performance. Figure 3 illustrates the training and validation reconstruction losses of the proposed model while training proceeds. We train the model with and without $\ell_2$ regularization, for a zero and a nonzero weight of the representation loss. When this weight is zero, the model reduces to a row-based and a column-based AutoRec working independently, and the proposed regularizer is not applied.
As can be seen in Fig. 3(a), without any regularization, both models heavily overfit; however, when the proposed approach is applied (nonzero representation-loss weight), the generalization error (i.e., the gap between the training and the validation loss) becomes lower. The generalization errors decrease further when the $\ell_2$ regularization is employed (Fig. 3(b)). Combining both the $\ell_2$ regularization and the proposed approach, we obtain the best results in terms of generalization error. This confirms that different regularization techniques can be complementary to each other.
We note that for a highly tuned problem like matrix completion, a small improvement in the validation loss that improves the reconstruction accuracy is significant (see also Sec. IV-C).
IV-C Reconstruction Accuracy
We carry out experiments to compare the proposed model to other neural-network-based MC models in terms of reconstruction quality. In this experiment, we use the tuned values of the loss weights (see Sec. IV-A) while all other hyperparameter settings are the same. Table II presents a comparison of the proposed model against the AutoRec model [16] and the matrix factorization model of Nguyen et al. We do not include a comparison with the more recent autoencoder-based models [17, 18], as they do not improve over AutoRec when the training data is small, which is the focus of this work.
|Nguyen et al.|
|AutoRec [16] (row-based)|
|AutoRec [16] (column-based)|
As can be seen, the proposed model consistently improves over the corresponding AutoRec models (row-based and column-based). It should be noted that on the MovieLens datasets, the column-based AutoRec usually performs better than the row-based AutoRec [16], as the number of ratings per item is usually higher than per user. The improvements are more significant on the MovieLens100K dataset, which has far less training data than the MovieLens1M dataset. On the MovieLens100K dataset, the column branch of the proposed model outperforms the model of Nguyen et al., which generalizes much better than the AutoRec model. On the MovieLens1M dataset, the column branch of our model delivers the best results, followed by the column-based AutoRec and the model of Nguyen et al.
V Conclusion
We propose a data-dependent regularizer for autoencoder-based matrix completion models. Our approach relies on the multi-task learning principle, performing prediction as the main task and manifold learning as an auxiliary task. The latter acts as a regularizer, improving the generalizability on the main task. Experimental results on two real-world datasets show that the proposed approach effectively reduces overfitting for both row and column autoencoder-based models, especially when the amount of training data is small, and consistently outperforms state-of-the-art models in terms of reconstruction accuracy.
References
[1] F. Cao, M. Cai, and Y. Tan, “Image interpolation via low-rank matrix completion and recovery,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1261–1270, 2015.
[2] H. Ji, C. Liu, Z. Shen, and Y. Xu, “Robust video denoising using low rank matrix completion,”
[3] Y. Zheng, B. Tang, W. Ding, and H. Zhou, “A neural autoregressive approach to collaborative filtering,” in Int. Conf. Machine Learning (ICML), 2016, pp. 764–773.
[4] F. Monti, M. M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” arXiv:1704.06803, 2017.
[5] E. J. Candès and B. Recht, “Exact matrix completion via convex optimization,” Found. Comput. Math., vol. 9, no. 6, p. 717, 2009.
[6] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[7] R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in Adv. Neural Inf. Process. Syst. (NIPS), 2007, pp. 1257–1264.
[8] Y. Koren, R. Bell, and C. Volinsky, “Matrix Factorization Techniques for Recommender Systems,” IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
[9] M. Volkovs, G. Yu, and T. Poutanen, “Dropoutnet: Addressing cold start in recommender systems,” in Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 4957–4966.
[10] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Extendable neural matrix completion,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) [Available: arXiv:1805.04912], 2018.
[11] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Learning discrete matrix factorization models,” IEEE Signal Process. Lett., vol. 25, no. 5, pp. 720–724, 2018.
[12] A. r. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 14–22, 2012.
[13] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, “Deep learning sparse ternary projections for compressed sensing of images,” in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2017, pp. 1125–1129.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[15] T. H. Do, D. M. Nguyen, E. Tsiligianni, B. Cornelis, and N. Deligiannis, “Twitter user geolocation using deep multiview learning,” in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) [Available: arXiv:1805.04612], 2018.
[16] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “AutoRec: Autoencoders Meet Collaborative Filtering,” in Int. Conf. World Wide Web (WWW), 2015, pp. 111–112.
[17] F. Strub, R. Gaudel, and J. Mary, “Hybrid recommender system based on autoencoders,” in The 1st Workshop on Deep Learning for Recommender Systems (DLRS), 2016, pp. 11–16.
[18] F. Strub, J. Mary, and R. Gaudel, “Hybrid collaborative filtering with neural networks,” arXiv:1603.00806, 2016.
[19] X. Dong, L. Yu, Z. Wu, Y. Sun, L. Yuan, and F. Zhang, “A hybrid collaborative filtering model with deep structure for recommender systems,” in AAAI Conf. Artificial Intelligence, 2017, pp. 1309–1315.
[20] O. Kuchaiev and B. Ginsburg, “Training deep autoencoders for collaborative filtering,” arXiv:1708.01715, 2017.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[22] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
[23] J. Huang, Q. Qiu, R. Calderbank, and G. Sapiro, “GraphConnect: A regularization framework for neural networks,” arXiv:1512.06757, 2015.
[24] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. 41–75, Jul. 1997.
[25] T. Evgeniou, C. A. Micchelli, and M. Pontil, “Learning multiple tasks with kernel methods,” J. Mach. Learn. Res., vol. 6, pp. 615–637, 2005.
[26] S. Ruder, “An overview of multi-task learning in deep neural networks,” arXiv:1706.05098, 2017.
[27] R. Caruana, “Multitask Learning: A Knowledge-Based Source of Inductive Bias,” in Int. Conf. Machine Learning. Morgan Kaufmann, 1993, pp. 41–48.
[28] J. Baxter, “A model of inductive bias learning,” Journal of Artificial Intelligence Research, vol. 12, no. 1, pp. 149–198, Mar. 2000.
[29] S. Ben-David and R. Schuller, “Exploiting task relatedness for multiple task learning,” in Learning Theory and Kernel Machines, 2003, pp. 567–580.
[30] F. M. Harper and J. A. Konstan, “The MovieLens Datasets: History and Context,” ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, 2015.
[31] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980, 2014.