I. Introduction
Recovering the unknown entries of a partially observed matrix is an important problem in signal processing [1, 2]
and machine learning
[3, 4]. This problem is often referred to as matrix completion (MC). Let $M \in \mathbb{R}^{m \times n}$ be a matrix with a limited number of observed entries defined by a set of indices $\Omega$ such that $(i,j) \in \Omega$ if $M_{ij}$ has been observed. Then, recovering $M$ from the observed entries can be formulated as an optimization problem of the form:

$$\min_{X} \; \|\mathcal{P}_\Omega(X) - \mathcal{P}_\Omega(M)\|_F^2, \qquad (1)$$

with $X$ denoting the complete matrix, $\mathcal{P}_\Omega$ an operator that indexes the entries defined in $\Omega$, and $\|\cdot\|_F$ the Frobenius norm, that is, $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.
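The objective in (1) can be sketched numerically. The snippet below is a minimal illustration (function and variable names are our own, not from the paper): the boolean mask plays the role of the index set $\Omega$, and the loss is the squared Frobenius norm restricted to observed entries.

```python
import numpy as np

def masked_frobenius_loss(X, M, mask):
    """Squared Frobenius norm of X - M restricted to observed entries.

    mask is a boolean array playing the role of the index set Omega:
    mask[i, j] is True iff M[i, j] has been observed.
    """
    diff = (X - M) * mask          # P_Omega(X) - P_Omega(M)
    return float(np.sum(diff ** 2))  # ||.||_F^2 over observed entries
```

A candidate completion that agrees with $M$ on all observed entries attains zero loss, regardless of its values on the unobserved ones.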
Several studies have focused on the problem of reconstructing the unknown matrix from its observed entries, most of them assuming that it is a low-rank matrix. As the rank minimization problem is intractable, nuclear norm minimization is often used as a convex relaxation [5, 6]. However, the major drawback of nuclear norm minimization algorithms is their high computational cost. Less computationally demanding matrix factorization methods [7, 8] approximate the unknown $m \times n$ matrix $M$ by a product of two factors, $M \approx UV^\top$, with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ and $r \ll \min(m, n)$, and have been employed in large-scale MC problems such as recommender systems [8].
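The factorization principle behind [7, 8] can be illustrated with a minimal sketch (our own illustrative code, not the cited algorithms): fit the two factors by gradient descent on the squared error over observed entries only.

```python
import numpy as np

def factorize(M, mask, r=2, lr=0.01, epochs=5000, seed=0):
    """Fit M ~ U @ V.T on observed entries by gradient descent.

    A bare-bones sketch of the matrix factorization principle;
    practical systems [7, 8] add regularization, bias terms, and
    more sophisticated optimizers.
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = 0.1 * rng.standard_normal((m, r))
    V = 0.1 * rng.standard_normal((n, r))
    for _ in range(epochs):
        E = (U @ V.T - M) * mask          # error on observed entries only
        U, V = U - lr * E @ V, V - lr * E.T @ U
    return U, V
```

On a fully observed rank-1 matrix, the product $UV^\top$ recovers the matrix to high accuracy; with a sparse mask, the unobserved entries are filled in by the low-rank structure.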
Recently, matrix completion has been addressed by several deep-network-based models [3, 4, 9, 10, 11] that yield state-of-the-art results on a variety of benchmark datasets. Having the ability to provide powerful data representations, deep neural networks have achieved great success in many problems, from acoustic modeling [12] and compressed sensing [13] to image classification [14] and social media analysis [15]. Among deep-network-based MC models, autoencoder-based methods have received a lot of attention due to their superior performance and a direct connection to the matrix factorization principle [16, 17, 18, 19, 20]. Despite their remarkable performance, these models suffer from overfitting, especially when dealing with matrices that are highly sparse. To overcome this problem, more recent works either employ data-independent regularization techniques such as weight decay and dropout [16, 17, 18, 20, 21], or incorporate side information to mitigate the lack of available information [17, 18, 19]. The efficiency of the former approach is limited under high-sparsity settings [16, 17, 20], while the latter is not directly applicable when side information is unavailable or difficult to collect.
In this paper, we focus on matrix completion without side information under settings with highly scarce observations, and propose a data-dependent regularization technique to mitigate the overfitting of autoencoder-based MC models. While data-independent regularization approaches focus on training a model such that it is difficult to fit random error or noise, data-dependent techniques rely on the idea that the data of interest lie close to a manifold and learn attributes that are present in the data [22, 23]. In particular, we combine row-based and column-based autoencoders into a hybrid model and constrain the latent representations produced by the two models with a manifold learning objective as in [10]. This constraint can be considered a data-dependent regularization. The resulting model follows the multi-task learning paradigm [24, 25, 26], where the main task is reconstruction and the auxiliary task is manifold learning. Experimental results on various real datasets show that our model effectively mitigates overfitting and consistently improves the performance of autoencoder-based models. The proposed approach is complementary to data-independent regularization, and the two techniques can be combined in an efficient way [23].
II. Background
II-A. Autoencoder-based Matrix Completion
The first autoencoder-based MC model, coined AutoRec [16], comes in two versions, namely, the row-based and the column-based AutoRec. For brevity, we only focus on the formulation of the row-based AutoRec; the column-based model can be defined in a similar way. The row-based AutoRec model operates on a row of a partially observed matrix, projects it into a low-dimensional latent (hidden) space and then reconstructs it in the output space. Let $M \in \mathbb{R}^{m \times n}$ be an incomplete matrix with rows $r_i$, $i = 1, \dots, m$, and columns $c_j$, $j = 1, \dots, n$. On the encoder side, the model takes as input a partially observed row vector $r_i \in \mathbb{R}^n$ and transforms it to a latent representation vector $z_i$ through a series of hidden neural network layers. On the decoder side, a dense reconstruction $\hat{r}_i$ of $r_i$ is generated from $z_i$ through another set of hidden layers to predict the missing values. Denote by $f(\cdot\,; \theta_f)$ and $g(\cdot\,; \theta_g)$ the nonlinear functions corresponding to the encoder and the decoder, respectively, with $\theta_f$, $\theta_g$ the corresponding vectors containing the free parameters of the model. Then, the intermediate representation of $r_i$ is given by $z_i = f(r_i; \theta_f)$, while the dense reconstruction is obtained as $\hat{r}_i = g(z_i; \theta_g)$. The objective function used to train the row-based AutoRec involves a reconstruction loss of the form:

$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{m} \frac{1}{|\Omega_i|} \sum_{j \in \Omega_i} (r_{ij} - \hat{r}_{ij})^2, \qquad (2)$$

and a regularization term; typically, $\ell_2$ regularization is applied to the model parameters. In the above equation, $r_{ij}$ denotes the $j$th element of the $i$th row, $\Omega_i$ the set of observed indices in the $i$th row, and $|\Omega_i|$ the cardinality of $\Omega_i$. Figure 1 illustrates the general architecture of a row-based AutoRec.
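The row-based formulation above can be sketched as follows. This is an illustrative single-hidden-layer implementation under our own naming and design choices (sigmoid encoder, identity decoder activation); the training loop with backpropagation and $\ell_2$ weight decay is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RowAutoRec:
    """Single-hidden-layer row autoencoder in the spirit of AutoRec [16].

    Layer sizes and activations here are illustrative assumptions;
    the backprop training loop and L2 weight decay are omitted.
    """
    def __init__(self, n_cols, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = 0.1 * rng.standard_normal((n_hidden, n_cols))
        self.b_enc = np.zeros(n_hidden)
        self.W_dec = 0.1 * rng.standard_normal((n_cols, n_hidden))
        self.b_dec = np.zeros(n_cols)

    def forward(self, row):
        z = sigmoid(self.W_enc @ row + self.b_enc)   # latent representation
        return self.W_dec @ z + self.b_dec           # dense reconstruction

    def masked_loss(self, row, observed):
        """Reconstruction loss of Eq. (2) for one row, averaged over observed entries."""
        r_hat = self.forward(row)
        return float(np.mean((row[observed] - r_hat[observed]) ** 2))
```

Note that the loss is evaluated only on observed entries, while the forward pass produces a dense output, which is exactly what allows the model to predict missing values.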
II-B. Multi-task Learning
Multi-task learning refers to training a model designed for a specific task using information extracted from other tasks that are related to the main task [27, 24, 25]. The use of knowledge gained from the solution of one or more auxiliary tasks is an inductive transfer mechanism, which can improve a model by introducing inductive bias. The inductive bias provided by the auxiliary tasks causes the model to prefer hypotheses that explain more than one task. This approach often leads to solutions that generalize better for the main task [26]. Multi-task learning can therefore be considered a regularization technique that reduces the risk of overfitting. When designing a multi-task learning model, we seek auxiliary tasks that are similar to the main task. Different criteria for considering two tasks similar have been reported in the literature [24, 28, 29].
Multi-task learning in deep neural networks typically involves learning tasks in parallel while using a shared representation. A common approach when designing a multi-task deep network is to share hidden layers between all tasks and use separate task-specific output layers. The idea is that what is learned for each task can help the other tasks be learned better [26].
III. Proposed Approach
Considering that overfitting is a major drawback of existing AutoRec models, the MC model proposed in this paper addresses this problem by introducing a regularization strategy that follows the multi-task learning principle. The proposed model is a two-branch autoencoder performing one main task and one auxiliary task. The main task outputs row and column predictions of the unknown matrix. The auxiliary task uses the latent representations of the main task to predict the value of a single matrix entry. Concretely, the additional output that corresponds to the auxiliary task imposes a similarity constraint on the hidden representations produced by the row and column autoencoders. The proposed constraint acts as a data-dependent regularization of the autoencoder models.
III-A. Model Architecture
The architecture of the proposed model is illustrated in Fig. 2. The model consists of two branches. The row branch takes the $i$th row vector $r_i$ as input, predicts the missing values and produces the dense reconstruction $\hat{r}_i$ as output. Assuming a deep autoencoder with $K$ encoding and $K$ decoding layers, the hidden representation $z_i$ of the $i$th row is obtained according to:

$$h_i^{(k)} = \sigma\big(W^{(k)} h_i^{(k-1)} + b^{(k)}\big), \quad k = 1, \dots, K, \qquad (3)$$

with $h_i^{(0)} = r_i$ and $z_i = h_i^{(K)}$, while for the decoding part it holds

$$\hat{h}_i^{(k)} = \sigma\big(\hat{W}^{(k)} \hat{h}_i^{(k-1)} + \hat{b}^{(k)}\big), \quad k = 1, \dots, K, \qquad (4)$$

with $\hat{h}_i^{(0)} = z_i$ and $\hat{r}_i = \hat{h}_i^{(K)}$. In the above equations, $W^{(k)}$, $\hat{W}^{(k)}$ are the weights used in the $k$th encoding and the $k$th decoding layer, respectively, $b^{(k)}$, $\hat{b}^{(k)}$ the corresponding biases, and $\sigma(\cdot)$ the activation function.
Similarly, the column branch outputs the dense reconstruction $\hat{c}_j$ from the input incomplete column vector $c_j$. Without any additional constraint on these two branches, the model can be seen as an ensemble of two AutoRec models working independently.
To enforce that the two branches work together, we propose to use the corresponding latent representations of rows and columns to perform a related task, namely, to realize a matrix factorization model. The basic assumption of matrix factorization models is that there exists an unknown representation of rows and columns in a latent space such that the $(i,j)$ entry of a matrix can be modelled as the inner product of the latent representations of the $i$th row and the $j$th column in that space [7, 8]. This was the dominant idea of the deep matrix factorization model presented in [10]. Following this approach, we enforce the hidden representations of the row and column autoencoders to be close under the cosine similarity if the corresponding matrix entry is of high value. Specifically, we define a loss term associated to the $(i,j)$ observed entry as follows:

$$\mathcal{L}_{\mathrm{rep}}(i,j) = \big(\bar{M}_{ij} - \cos(z_i^{r}, z_j^{c})\big)^2, \qquad (5)$$

where $\cos(u, v) = \frac{u^\top v}{\|u\|\,\|v\|}$ is the cosine similarity function, and $z_i^{r}$, $z_j^{c}$ the hidden representations of the row and column autoencoders, respectively. We employ the cosine similarity rather than the dot product as a similarity metric, as we empirically found that it enables learning a latent space of much higher dimension. Since the cosine similarity between two vectors lies in $[-1, 1]$, assuming that $M_{ij} \geq 0$, all entries in the original matrix are scaled during training according to $\bar{M}_{ij} = M_{ij}/\mu$ with $\mu = \max_{i,j} M_{ij}$.
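The representation loss of (5) can be sketched for a single observed entry as follows. The function names and the max-rating scaling are illustrative assumptions on our part, not the paper's exact implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two latent vectors; lies in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def representation_loss(z_row, z_col, m_ij, scale):
    """Manifold-learning loss of Eq. (5) for one observed entry.

    The entry m_ij is scaled into the range of the cosine similarity
    before comparison; dividing by the maximum rating (`scale`) is
    one plausible choice, assumed here for illustration.
    """
    return (m_ij / scale - cosine(z_row, z_col)) ** 2
```

For identical latent vectors the cosine similarity is 1, so a maximal entry (after scaling) incurs zero loss, which is exactly the behavior the constraint is meant to encourage.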
Employing (5) to train the proposed model is equivalent to applying a manifold learning objective to the intermediate outputs of the row and column autoencoders. The proposed model can then be thought of as performing two tasks simultaneously, that is, learning to reconstruct and manifold learning. The latter is the auxiliary task, whose role is to improve the main reconstruction task.
III-B. Objective Function
While in single-task learning we optimize one loss function, in multi-task learning the objective function is a combination of loss terms. In the proposed model, the loss associated to each of the known entries $(i,j) \in \Omega$ consists of three terms, namely, the reconstruction loss for the row autoencoder $(r_{ij} - \hat{r}_{ij})^2$, the reconstruction loss for the column autoencoder $(c_{ij} - \hat{c}_{ij})^2$, and the representation loss $\mathcal{L}_{\mathrm{rep}}(i,j)$ defined in (5), where $r_{ij}$ denotes the $j$th element of the $i$th row, and $c_{ij}$ the $i$th element of the $j$th column. Averaging over all training samples in $\Omega$, we obtain the objective function:

$$\mathcal{L} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \Big[ \lambda_r (r_{ij} - \hat{r}_{ij})^2 + \lambda_c (c_{ij} - \hat{c}_{ij})^2 + \lambda_{\mathrm{rep}} \mathcal{L}_{\mathrm{rep}}(i,j) \Big], \qquad (6)$$

with $\lambda_r$, $\lambda_c$, $\lambda_{\mathrm{rep}}$ appropriate weights.
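The weighted combination of (6) is straightforward to sketch. In this illustrative snippet (our own naming; the default weights are placeholders, not the tuned values), each observed entry contributes a row reconstruction error, a column reconstruction error, and a representation-loss value.

```python
def multitask_loss(entries, w_row=1.0, w_col=1.0, w_rep=0.1):
    """Weighted multi-task objective of Eq. (6), averaged over observed entries.

    `entries` holds one tuple per observed (i, j):
    (row reconstruction error, column reconstruction error,
     representation loss of Eq. (5), which is already squared).
    The default weights are illustrative placeholders.
    """
    total = 0.0
    for row_err, col_err, rep_loss in entries:
        total += w_row * row_err**2 + w_col * col_err**2 + w_rep * rep_loss
    return total / len(entries)
```

Setting the representation weight to zero recovers the plain two-branch reconstruction objective, i.e., two independent AutoRec models.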
III-C. Comparison with Existing Work
Multi-task learning has also been employed for MC in [19]. The model presented in [19] is based on a neural network component that incorporates side information, coined additional stacked denoising autoencoder (aSDAE). Specifically, the aSDAE is a two-branch neural network that takes a pair of (corrupted) inputs, namely, a row (column) vector and the side information, and produces a pair of reconstructed outputs. Employing two branches that realize a row and a column aSDAE, the model performs matrix factorization as its main task, using the latent representations provided by the row and column aSDAEs. Concretely, the objective of that work is deep matrix factorization with side information. Different from [19], our work focuses on settings in which the available data is highly sparse and side information is unavailable. Our main objective is to compensate for the scarcity of data by providing an efficient regularization technique for autoencoder MC models that addresses overfitting. In our setting, the row and column reconstruction is the main task and the matrix factorization is an auxiliary task. Our approach is simpler, with fewer branches and fewer hyperparameters to tune compared to [19].
Other regularization methods aiming to improve the generalizability of AutoRec have been reported in [17, 18, 20]. The models presented in [17, 18] use a denoising autoencoder and employ side information to augment the training data and help the model learn better. The authors of [20] extend AutoRec with a very deep architecture and heavily regularize its parameters by employing a dropout regularizer with a very high dropping ratio, while proposing a dense re-feeding technique to train the model. Nevertheless, the performance of these models is reduced when the training data is highly scarce or side information is lacking. Similar to Dropout [21], denoising autoencoders average the realizations of a given ensemble and make it difficult for the model to fit random error. On the other hand, our model learns attributes that are present in the data rather than preventing the model from learning non-attributes. It should be noted that existing techniques are complementary to the proposed approach, and their combination could lead to further performance improvement.
IV. Experimental Results
We carry out experiments involving two benchmark datasets, namely, MovieLens100K and MovieLens1M [30], containing approximately one hundred thousand and one million observations, respectively. The datasets contain users' movie ratings summarized in two matrices where rows correspond to users and columns to movies. Only a small fraction of the ratings is observed in each dataset. We randomly split these ratings into training, validation and evaluation sets. We evaluate the performance of the proposed model in terms of regularization quality and reconstruction accuracy. On each dataset, we run the models on five different random splits of the data and report the average RMSE and MAE values, calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|\Gamma|} \sum_{(i,j) \in \Gamma} (M_{ij} - \hat{M}_{ij})^2}, \qquad \mathrm{MAE} = \frac{1}{|\Gamma|} \sum_{(i,j) \in \Gamma} |M_{ij} - \hat{M}_{ij}|,$$

with $\Gamma$ the set of indices corresponding to entries available for evaluation and $|\Gamma|$ its cardinality.
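The two evaluation metrics can be computed as below; this is a standard implementation sketch (names are ours), where the boolean mask marks the evaluation entries of the set denoted above.

```python
import numpy as np

def rmse_mae(M_true, M_pred, eval_mask):
    """RMSE and MAE over the entries marked for evaluation."""
    err = (M_true - M_pred)[eval_mask]
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae
```

Both metrics average over the held-out entries only; RMSE penalizes large individual errors more heavily than MAE.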
IV-A. Hyperparameter Settings
Following the original AutoRec model [16], we employ only one hidden layer for each branch of our model, while the numbers of units in the input and output layers are set according to the sizes of the matrices. We train our model using the Adam optimizer [31], with mini-batches and an initial learning rate that decays by a fixed factor every 30 epochs. The number of training epochs is fixed. The $\ell_2$ regularization weight is set separately for the MovieLens100K and the MovieLens1M datasets. We search for the best weights for the loss terms in (6) using a separate random split of the two datasets. As the roles of the row and column branches in our model are equivalent, we keep the two reconstruction weights fixed and equal, and only vary the representation-loss weight. The results for different values of this weight on the two datasets are shown in Table I. Higher reconstruction accuracy is delivered with a larger value of this weight on the MovieLens100K dataset than on the MovieLens1M dataset. The obtained values confirm our conjecture that the role of regularization is critical when the number of available samples is small, that is, when we train our model on the MovieLens100K dataset.
Table I. Reconstruction accuracy of the row-based and column-based branches on the MovieLens100K and MovieLens1M datasets for different values of the representation-loss weight.
IV-B. Regularization Performance
As the proposed approach can be seen as a data-dependent regularizer, we carry out experiments on the MovieLens100K dataset to evaluate its regularization performance. Figure 3 illustrates the training and validation reconstruction losses of the proposed model as training proceeds. We train the model with and without $\ell_2$ regularization, and with the representation loss enabled or disabled. When the representation-loss weight is set to zero, the model reduces to a row-based and a column-based AutoRec working independently, and the proposed regularizer is not applied.

As can be seen in Fig. 3(a), without any regularization, both models heavily overfit; however, when the proposed approach is applied, the generalization error (i.e., the gap between the training and the validation loss) becomes lower. The generalization errors decrease further when the $\ell_2$ regularization is employed (Fig. 3(b)). Combining both the $\ell_2$ regularization and the proposed approach, we obtain the best results in terms of generalization error. This confirms that different regularization techniques can be complementary to each other.

We note that for a highly tuned problem like matrix completion, a small improvement in the validation loss that translates into better reconstruction accuracy is significant (see also Sec. IV-C).
IV-C. Reconstruction Accuracy
We carry out experiments to compare the proposed model with other neural-network-based MC models in terms of reconstruction quality. In this experiment, we use the tuned weights of the loss terms (see Sec. IV-A), while all other hyperparameter settings remain the same. Table II presents a comparison of the proposed model against the AutoRec model [16] and the matrix factorization model proposed in [10]. We do not include a comparison with the more recent autoencoder-based models [17, 18], as they do not improve over AutoRec when the training data is small, which is the focus of this work.
Table II. RMSE and MAE on the MovieLens100K and MovieLens1M datasets for the model of Nguyen et al. [10], the row-based and column-based AutoRec [16], and the proposed row-based and column-based models.
As can be seen, the proposed model consistently improves over the corresponding AutoRec models (row-based and column-based). It should be noted that on the MovieLens datasets, the column-based AutoRec usually performs better than the row-based AutoRec [16], as the number of ratings per item is usually higher than the number per user. The improvements are more significant on the MovieLens100K dataset, which has far less training data than the MovieLens1M dataset. On the MovieLens100K dataset, the column branch of the proposed model outperforms [10], which generalizes much better than the AutoRec model. On the MovieLens1M dataset, the column branch of our model delivers the best results, followed by the column-based AutoRec and [10].
V. Conclusion
We propose a data-dependent regularizer for autoencoder-based matrix completion models. Our approach relies on the multi-task learning principle, performing prediction as the main task and manifold learning as an auxiliary task. The latter acts as a regularizer, improving generalization on the main task. Experimental results on two real-world datasets show that the proposed approach effectively reduces overfitting for both row- and column-autoencoder-based models, especially when the amount of training data is small, and consistently outperforms state-of-the-art models in terms of reconstruction accuracy.
References

[1] F. Cao, M. Cai, and Y. Tan, "Image interpolation via low-rank matrix completion and recovery," IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 8, pp. 1261–1270, 2015.
[2] H. Ji, C. Liu, Z. Shen, and Y. Xu, "Robust video denoising using low rank matrix completion," in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1791–1798.
[3] Y. Zheng, B. Tang, W. Ding, and H. Zhou, "A neural autoregressive approach to collaborative filtering," in Int. Conf. Machine Learning (ICML), 2016, pp. 764–773.
[4] F. Monti, M. M. Bronstein, and X. Bresson, "Geometric matrix completion with recurrent multi-graph neural networks," arXiv:1704.06803, 2017.
[5] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Found. Comput. Math., vol. 9, no. 6, p. 717, 2009.
[6] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[7] R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," in Adv. Neural Inf. Process. Syst. (NIPS), 2007, pp. 1257–1264.
[8] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
[9] M. Volkovs, G. Yu, and T. Poutanen, "DropoutNet: Addressing cold start in recommender systems," in Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 4957–4966.
[10] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, "Extendable neural matrix completion," in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) [Available: arXiv:1805.04912], 2018.
[11] ——, "Learning discrete matrix factorization models," IEEE Signal Process. Lett., vol. 25, no. 5, pp. 720–724, 2018.
[12] A.-r. Mohamed, G. E. Dahl, and G. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 14–22, 2012.
[13] D. M. Nguyen, E. Tsiligianni, and N. Deligiannis, "Deep learning sparse ternary projections for compressed sensing of images," in IEEE Global Conf. Signal and Information Processing (GlobalSIP), 2017, pp. 1125–1129.
[14] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[15] T. H. Do, D. M. Nguyen, E. Tsiligianni, B. Cornelis, and N. Deligiannis, "Twitter user geolocation using deep multiview learning," in IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) [Available: arXiv:1805.04612], 2018.
[16] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, "AutoRec: Autoencoders meet collaborative filtering," in Int. Conf. World Wide Web (WWW), 2015, pp. 111–112.
[17] F. Strub, R. Gaudel, and J. Mary, "Hybrid recommender system based on autoencoders," in 1st Workshop on Deep Learning for Recommender Systems (DLRS), 2016, pp. 11–16.
[18] F. Strub, J. Mary, and R. Gaudel, "Hybrid collaborative filtering with neural networks," arXiv:1603.00806, 2016.
[19] X. Dong, L. Yu, Z. Wu, Y. Sun, L. Yuan, and F. Zhang, "A hybrid collaborative filtering model with deep structure for recommender systems," in AAAI Conf. Artificial Intelligence, 2017, pp. 1309–1315.
[20] O. Kuchaiev and B. Ginsburg, "Training deep autoencoders for collaborative filtering," arXiv:1708.01715, 2017.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[22] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006.
[23] J. Huang, Q. Qiu, R. Calderbank, and G. Sapiro, "GraphConnect: A regularization framework for neural networks," arXiv:1512.06757, 2015.
[24] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, Jul. 1997.
[25] T. Evgeniou, C. A. Micchelli, and M. Pontil, "Learning multiple tasks with kernel methods," J. Mach. Learn. Res., vol. 6, pp. 615–637, 2005.
[26] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv:1706.05098, 2017.
[27] R. Caruana, "Multitask learning: A knowledge-based source of inductive bias," in Int. Conf. Machine Learning (ICML), 1993, pp. 41–48.
[28] J. Baxter, "A model of inductive bias learning," J. Artificial Intelligence Research, vol. 12, no. 1, pp. 149–198, Mar. 2000.
[29] S. Ben-David and R. Schuller, "Exploiting task relatedness for multiple task learning," in Learning Theory and Kernel Machines, 2003, pp. 567–580.
[30] F. M. Harper and J. A. Konstan, "The MovieLens datasets: History and context," ACM Trans. Interact. Intell. Syst., vol. 5, no. 4, pp. 19:1–19:19, 2015.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.