This paper studies a stylized, yet natural, learning-to-rank problem and points out the critical incorrectness of a widely used nearest neighbor algorithm. In this problem, let be the set of agents (users) and be the set of alternatives. Each agent or alternative is associated with a latent feature vector , where . The utility of to is determined by , where is a bivariate function. We observe a (partial) ranking of agent (for all ) over the alternatives. The distribution of ranking is determined by the alternatives’ utilities, . When is larger, is more likely to rank higher in .
The nearest neighbor problem. For an and a parameter , we aim to design an efficient algorithm that finds (almost) all ’s such that . The nearest neighbor problem for alternatives can also be defined similarly.
This fundamental machine learning problem is embedded in many critical operations. For example, recommender systems use partial ranking information (partial observations of the agents’ rankings) to estimate agents’ preferences over unranked alternatives, product designers estimate the demand curve of a new product based on consumers’ past choices (Berry et al., 1995), security firms estimate terrorists’ preferences based on their past behavior, and political firms estimate campaign options based on voters’ preferences Liu (2009).
A widely-used algorithm produces incorrect results. The most widely studied and deployed algorithm Liu (2009); Katz-Samuels and Scott (2017) uses Kendall-Tau (KT) distance (see Section 2) as the metric and uses k-nearest neighbors (kNN) to identify similar agents: for any given , it finds all such that the KT distances between and are minimized. We will refer to this algorithm as KT-kNN.
In this paper, we show that under a natural and widely applied preference model, the KT distance-based kNN for agents is provably incorrect even when the sample size grows to infinity.
Novel (correct) algorithms. First, we design a new algorithm that correctly identifies similar agents based on . We introduce a set of new features, denoted by , so that enables us to identify similar agents. A salient property of is that it relies on the rankings of other agents, which we will refer to as “global information”. This property is in sharp contrast to most existing practices of feature engineering in learning-to-rank algorithms Liu (2009).
Second, we design another new algorithm for identifying similar alternatives. We find that construction of alternative features can be done using local information, making identifying similar alternatives significantly easier.
Agent-wise or alternative-wise similarities. Finding similar alternatives (items) is easier than finding similar agents (users) in collaborative filtering Sarwar et al. (2001): in practice, recommender systems based on “item-similarities” are usually more effective. One explanation is the “missing data problem”. Because there are often more users than items, the intersection between items ranked by two arbitrary users is often small, and measuring user similarities is usually unreliable.
Our result provides a new explanation of the performance discrepancies: under the Plackett-Luce model, finding similar agents is fundamentally more difficult than finding similar alternatives.
Finding neighbors implies learning to rank. We focus on the problem of identifying nearest neighbors in this work. Our approach can be naturally extended to infer ’s preferences over unranked alternatives by aggregating rankings from the neighbors using methods developed in the literature, such as in Conitzer et al. (2006); Alon (2006); Ailon (2007); Kenyon-Mathieu and Schudy (2007).
Nondeterministic preferences. We assume that agent ranks alternative according to her perceived utility , where is a radial basis function (RBF) Scholkopf and Smola (2001) (i.e., the value of depends only on ) and is random noise.
Practical implications. We focus on a conceptual and theoretical investigation of the learning-to-rank problem with nondeterministic preferences. Although we point out the harm from using KT-kNN, it may not be the root cause of problems in a practical system based on KT-kNN. To diagnose a ranking algorithm (specifically, whether our theoretical results are relevant), one should first check whether our model is suitable for the dataset at hand.
Our model. Let be the set of agents (or users) and be the set of alternatives (or items).
Utility functions. Agent ’s utility on alternative is determined by a utility function . Throughout this paper, we use , where is the norm. Most results developed in this paper can be generalized to many radial-basis functions Scholkopf and Smola (2001).
Observation and rankings. We observe the ranking of each user in the decreasing order of perceived utility of the alternatives . When follows a Gumbel distribution, then the nondeterministic preferences model is also known as the Plackett-Luce model Plackett (1975); Luce (1977). Let be a random permutation of , and we have
Distributions of and . We further assume that and are i.i.d. generated from fixed but unknown distributions and , respectively. Let the cdf (respectively, pdf) of and be and (respectively, and ). For exposition purposes, we make simplifying assumptions that (i) , and (ii) and are on and “near uniform” (i.e., and are bounded by a constant ). These assumptions are widely used in latent space models and can be relaxed via more careful analysis; see Abraham et al. (2013) and references therein.
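The generative model described above can be sketched in a few lines. This is only an illustration under stated assumptions: the latent positions are drawn uniformly from [0, 1], the RBF utility is taken to be exp(-|x - y|), and the names `rbf_utility`, `gumbel_noise`, and `sample_ranking` are hypothetical. The exact utility function and interval used in the paper are not reproduced here.

```python
import math
import random

def rbf_utility(x, y):
    # Assumed RBF form: the value depends only on the distance |x - y|.
    return math.exp(-abs(x - y))

def gumbel_noise(rng):
    # If E ~ Exponential(1), then -log(E) follows a standard Gumbel distribution.
    e = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
    return -math.log(e)

def sample_ranking(agent_x, alt_ys, rng):
    """Return alternative indices in decreasing order of perceived utility."""
    perceived = [rbf_utility(agent_x, y) + gumbel_noise(rng) for y in alt_ys]
    return sorted(range(len(alt_ys)), key=lambda j: -perceived[j])

rng = random.Random(0)
# Assumption: latent positions are i.i.d. uniform on [0, 1].
alternatives = [rng.random() for _ in range(5)]
agent = rng.random()
ranking = sample_ranking(agent, alternatives, rng)
```

Because the noise is Gumbel, the rankings sampled this way follow a Plackett-Luce model whose weights are the exponentiated utilities.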
Our problem. Given an agent , we say is an -nearest neighbor set for if
For all such that , .
For all such that , .
For any such that , we do not require any performance guarantee (i.e., whether ).
Similarly, we can define -nearest neighbor set for alternatives. In other words, all ’s that are within away from should be included in , and all ’s that are more than away from should not be in . Therefore, our goal is to design efficient algorithms to compute -nearest neighbor sets with high probability, where.
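The three conditions in the definition can be read as a check with an inner radius, an outer radius, and a gray zone in between. The sketch below is a hedged illustration: `inner_radius` and `outer_radius` are hypothetical stand-ins for the two thresholds in the definition, and whether each boundary is inclusive is an assumption.

```python
def is_valid_neighbor_set(candidate, positions, i, inner_radius, outer_radius):
    """Check a candidate neighbor set for agent i against latent positions.

    Agents within inner_radius must be included, agents farther than
    outer_radius must be excluded, and agents in between are unconstrained.
    """
    for j, x_j in enumerate(positions):
        if j == i:
            continue
        distance = abs(positions[i] - x_j)
        if distance <= inner_radius and j not in candidate:
            return False
        if distance > outer_radius and j in candidate:
            return False
    return True

positions = [0.50, 0.52, 0.58, 0.90]
ok_tight = is_valid_neighbor_set({1}, positions, 0, 0.05, 0.10)
ok_gray = is_valid_neighbor_set({1, 2}, positions, 0, 0.05, 0.10)
bad_far = is_valid_neighbor_set({1, 3}, positions, 0, 0.05, 0.10)
bad_missing = is_valid_neighbor_set({2}, positions, 0, 0.05, 0.10)
```

In this toy example agent 2 sits in the gray zone, so both `{1}` and `{1, 2}` are acceptable outputs.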
Partial observations and forecasts. All results presented in this paper can be generalized to the partial ranking scenario, where each only consists of a subset of linear size. Furthermore, a natural problem in this scenario is to infer an agent’s preferences over unranked alternatives. We note that an -nearest neighbor set for can be used to infer its rankings over the entire via existing techniques Conitzer et al. (2006); Alon (2006); Ailon (2007). Therefore, our problem is strictly harder than the preference estimation problem.
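As a rough illustration of how a neighbor set can be turned into a forecast, the sketch below aggregates the neighbors' rankings with a Borda count. Borda is a deliberately simple stand-in chosen for brevity; it is not one of the Kemeny-style aggregation methods cited above, and the function name `borda_aggregate` is hypothetical. Rankings here are ordered lists (best first), possibly over different subsets.

```python
from collections import defaultdict

def borda_aggregate(neighbor_rankings):
    """Aggregate ordered lists (best first) into one ranking by Borda count."""
    scores = defaultdict(int)
    for ranking in neighbor_rankings:
        for position, alternative in enumerate(ranking):
            # The top item of a list of length L earns L - 1 points.
            scores[alternative] += len(ranking) - 1 - position
    # Higher score first; ties broken by alternative name for determinism.
    return sorted(scores, key=lambda a: (-scores[a], a))

neighbor_rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
predicted = borda_aggregate(neighbor_rankings)
```

Here the scores are a = 5, b = 3, c = 1, so the aggregated ranking places a first.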
Comparison to the KS model by Katz-Samuels and Scott (2017). In the KS model, agent ’s ranking is deterministic, i.e., iff , whereas our model allows noise to be added to the observations, which is a more standard practice in learning to rank.
2.1 Kendall-tau distance and prior algorithms
Let and be two rankings over and let denote the rank of the -th alternative. The Kendall-tau distance is
where is an indicator function that sets to one if and only if its argument is true. The Normalized Kendall-tau distance between and is .
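A minimal sketch of the two distances follows, using rank vectors as in the text (entry j is the rank of the j-th alternative). The Kendall-tau distance counts discordant pairs. The normalization by the total number of pairs, m(m-1)/2, is the usual convention for full rankings and is an assumption here, since the paper's exact normalizer is not shown.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Number of alternative pairs that the two rank vectors order differently."""
    return sum(
        1
        for j, k in combinations(range(len(rank_a)), 2)
        if (rank_a[j] - rank_a[k]) * (rank_b[j] - rank_b[k]) < 0
    )

def normalized_kendall_tau(rank_a, rank_b):
    """Kendall-tau distance divided by the number of pairs (assumed normalizer)."""
    m = len(rank_a)
    return kendall_tau(rank_a, rank_b) / (m * (m - 1) / 2)

one_swap = kendall_tau([0, 1, 2, 3], [1, 0, 2, 3])
reversed_nk = normalized_kendall_tau([0, 1, 2, 3], [3, 2, 1, 0])
```

Swapping one adjacent pair gives distance 1, while a fully reversed ranking gives normalized distance 1.0.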
Nearest-neighbor algorithms. See Algorithm 1. We shall refer to the algorithm as KT-kNN. This algorithm uses KT-distance as the distance metric and runs a kNN algorithm on top of it.
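Since Algorithm 1 is not reproduced here, the following is only a sketch of what a KT-distance-based kNN might look like: rank every other agent by normalized Kendall-tau distance to the target agent and keep the k closest. The function names and the tie-breaking rule (lower index first) are assumptions.

```python
from itertools import combinations

def nk_distance(rank_a, rank_b):
    """Normalized Kendall-tau distance between two rank vectors."""
    m = len(rank_a)
    discordant = sum(
        1
        for j, k in combinations(range(m), 2)
        if (rank_a[j] - rank_a[k]) * (rank_b[j] - rank_b[k]) < 0
    )
    return discordant / (m * (m - 1) / 2)

def kt_knn(rankings, i, k):
    """Return the k agents whose rankings are closest to agent i's in KT distance."""
    others = [j for j in range(len(rankings)) if j != i]
    # sorted() is stable, so ties are broken by agent index.
    return sorted(others, key=lambda j: nk_distance(rankings[i], rankings[j]))[:k]

rankings = [[0, 1, 2, 3], [0, 1, 3, 2], [3, 2, 1, 0], [1, 0, 2, 3]]
neighbors = kt_knn(rankings, 0, 2)
```

Note that the procedure looks only at the two rankings being compared, which is the "local information" property discussed later in the paper.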
3 Incorrectness of KT-kNN under Nondeterministic Preferences
This section explains why KT-kNN is incorrect under the Plackett-Luce model. Let be the ground-truth ranking of agent , i.e., the -th element in is the -th largest value of the set .
Intuition behind KT-kNN. Previously, KT-kNN was considered correct because of two intuitions: (1) if and are close, then and are also close, and (2) if and are close, their “realizations” and will also be close. Therefore, when minimizes for large and , it also minimizes .
Intuition (1) is theoretically grounded (see Katz-Samuels and Scott (2017)). The key problematic part is that for nondeterministic users, and do not have a monotone relationship. That is, an increase in does not necessarily imply an increase in , and vice versa.
Let , , and . Consider the following two optimization problems:
We can see that the structures of these two optimization problems are very different. For (3), the optimal solution set is . But for (4), the optimal solution set is . The key difference is that itself would be an optimal solution to (3), but it is not an optimal solution to (4).
Interpreting the result. We need to solve (3) to find nearest neighbors, but the objective of KT-kNN is closer to (4). Specifically, consider a scenario with only two alternatives but is sufficiently large. The above example shows that is far away from . Because is sufficiently large, we have . The right side of the approximation resembles the nearest-neighbor approximation (i.e., ) because converges to its expectation for large . Therefore, KT-kNN solves (4), which is different from solving (3).
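The gap between the two objectives can be illustrated with a small calculation for two alternatives. This is an illustrative example, not the paper's own: the utilities 1.0 and 0.0 are made up, and the helper names are hypothetical. Under Plackett-Luce, an agent with utilities u1 and u2 ranks the first alternative on top with probability e^{u1} / (e^{u1} + e^{u2}). If two agents draw their rankings independently with top-probabilities p and q, their expected KT distance is p(1 - q) + q(1 - p).

```python
import math

def prob_prefers_first(u1, u2):
    """P(alternative 1 is ranked above alternative 2) under Plackett-Luce."""
    return math.exp(u1) / (math.exp(u1) + math.exp(u2))

def expected_kt(p, q):
    """Expected KT distance between two independent two-alternative rankings."""
    return p * (1 - q) + q * (1 - p)

p = prob_prefers_first(1.0, 0.0)   # illustrative utilities, roughly 0.73
self_match = expected_kt(p, p)     # an agent with identical preferences
extreme = expected_kt(p, 1.0)      # an agent who always ranks alternative 1 first
```

For any p above 1/2, `extreme` equals 1 - p while `self_match` equals 2p(1 - p), so the agent with identical latent preferences is not the minimizer of the expected KT distance.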
This observation can be used to build a more general negative theorem, which implies that KT-kNN cannot output any -nearest neighbor set with high probability, because with probability the output of KT-kNN is away.
Proof: Since is near-uniform on , we know . Then, we prove the following two claims indicating for all :
Those two claims above also indicate that KT-kNN cannot output any -nearest neighbor set with probability. Here, we focus on the most difficult case in the first claim above to highlight our new analytical techniques (see Appendix A for the full proof). Specifically, below we show
Because is a continuous function of , we use to characterize the minimal point . Specifically, we shall show that for all , which means the function is minimized when .
We next analyze in the following events respectively:
: when .
: when and .
: when and .
: when and .
According to the case-by-case analysis shown in Lemma A.1 and noting that ,
Equality sign holds if and only if . Therefore, is minimized at for . Appendix A completes the proof for using similar techniques.
4 Nearest-neighbor with global information
We propose a novel (and correct) kNN algorithm based on a new set of features for all that can be used by nearest-neighbor algorithms. Each feature needs to use global information of all the rankings.
Features based on all-pair normalized Kendall-tau distance. We associate each agent with a feature , where is constructed as below:
First, we group to pairs so that the -th pair consists of . We then let:
Our features are
It follows that ’s are independent and for all . Then we define the distance function between agent and agent as
The new kNN (-). Let , (i.e., ). Our algorithm, hereafter -, returns the set . See Algorithm 2.
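Because the feature and distance formulas are not reproduced above, the sketch below is only one plausible reading of the construction, with every name hypothetical. It assumes that alternatives are grouped into disjoint pairs (0,1), (2,3), and so on, that the feature of agent i is its vector of pairwise disagreement rates against every agent, and that two agents are compared by the mean absolute difference of their features over all remaining agents. Neighbors are then selected by a threshold.

```python
def pair_disagreement(rank_a, rank_b):
    """Fraction of disjoint alternative pairs on which two rank vectors disagree.

    Disjoint pairs involve disjoint noise terms, so the per-pair indicators
    are independent given the latent positions.
    """
    pairs = [(2 * k, 2 * k + 1) for k in range(len(rank_a) // 2)]
    disagree = sum(
        1 for j, k in pairs
        if (rank_a[j] < rank_a[k]) != (rank_b[j] < rank_b[k])
    )
    return disagree / len(pairs)

def global_features(rankings):
    """Feature of agent i: its disagreement rate against every agent (global info)."""
    n = len(rankings)
    return [[pair_disagreement(rankings[i], rankings[l]) for l in range(n)]
            for i in range(n)]

def feature_distance(features, i, j):
    """Mean absolute feature difference over all agents other than i and j."""
    others = [l for l in range(len(features)) if l not in (i, j)]
    return sum(abs(features[i][l] - features[j][l]) for l in others) / len(others)

def global_neighbors(rankings, i, threshold):
    """Agents whose global features lie within `threshold` of agent i's."""
    features = global_features(rankings)
    return [j for j in range(len(rankings))
            if j != i and feature_distance(features, i, j) <= threshold]

rankings = [[0, 1, 2, 3], [0, 1, 2, 3], [1, 0, 3, 2], [0, 1, 3, 2]]
close_to_zero = global_neighbors(rankings, 0, 0.1)
```

The design point this is meant to convey is that two agents are judged similar when they relate to the rest of the population in the same way, rather than when their own two rankings happen to agree.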
Global vs. local information. Algorithm KT-kNN uses only local information to construct features (i.e., the feature of depends only on ), whereas our new algorithm needs to use all ranking information to construct a feature . We note that relying on local information is unlikely to be sufficient to construct high-quality features. Instead, we need to use a slower procedure that takes advantage of all of the local information available to construct . Earlier works in network analysis (see Li et al. (2017) and references therein) developed similar techniques for classifying nodes.
Theorem 4.1. Using the notations above, for all , that are near-uniform on and , let . There exist positive constants , and such that is an -nearest neighbor set of with probability at least .
Proof: We first show that there exist constants and such that
All the analysis below is conditioned on knowing and (i.e., means ). W.l.o.g., assume . We use techniques similar to those developed in Theorem 3.1. Specifically, let . We have
Let us define three events:
: when or .
: when and .
: when or .
We compute conditioned on the three events (i.e., for ).
Event : One can see that by using a symmetric argument.
Event . We have
Event . We have
follows from combining all results for .
Next, we show the tail bound of , where decays exponentially in . Observing that ’s are independent in any , we have according to the standard Chernoff bound. Combining the tail bound above with (7), we know there exist constants , and such that
We now interpret (8) in the context of -nearest neighbor set. We analyze the part first. For any agent , we have,
We also note that there are at most agents in . By applying the union bound to all agents in , we obtain the conclusion of .
Letting , we get . Then, Theorem 4.1 follows by applying union bound to ’s and ’s conclusions.
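For reference, one standard Chernoff-type tail bound for averages of independent bounded variables is Hoeffding's inequality. The exact form used in the proof above is not shown, so the statement below is given only as a reminder of the kind of bound involved; the symbols Z_k, m, and t are assumptions introduced for this illustration. For independent random variables Z_1, ..., Z_m taking values in [0, 1] and any t > 0:

```latex
\Pr\left[\,\left|\frac{1}{m}\sum_{k=1}^{m} Z_k
  - \mathbb{E}\left[\frac{1}{m}\sum_{k=1}^{m} Z_k\right]\right| \ge t\,\right]
\;\le\; 2\exp\left(-2 m t^{2}\right)
```

The bound decays exponentially in the number of independent terms, which is what allows a union bound over all agents to go through.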
5 Nearest-Neighbor algorithm for alternatives
This section designs an algorithm for finding -nearest neighbor set for an alternative . While global information is needed for finding -nearest neighbor for agents, we need only local information for alternatives. For exposition purposes, we focus on uniform and .
Additional notations. Define . Here, represents an agent and represents its ranking over all alternatives. Intuitively, is if and is otherwise. Next, define . Note that the terms in the summation are i.i.d. random variables, each of which has the same distribution as .
Our algorithm and its intuition. Our goal is to find an -nearest neighbor set of . To determine whether and are close, we shall check : when , with probability exactly that , which implies . When and are far away, then it is unlikely that (there is a catch; see below). As is the mean of copies of independent , it will drift away from .
A “bug” due to symmetry. One issue with the above argument is that large does not always imply . For any , when , we have:
One can check that for any . Therefore, for .
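The pairwise statistic described above can be sketched as follows. The code assumes a 0/1 indicator (so that two nearby alternatives give a value near 1/2) and rank vectors in which a smaller entry means a better rank. The names `fraction_preferring` and `candidate_set` and the `tolerance` parameter are hypothetical.

```python
def fraction_preferring(rankings, j, k):
    """Fraction of agents that rank alternative j above alternative k."""
    return sum(1 for r in rankings if r[j] < r[k]) / len(rankings)

def candidate_set(rankings, j, tolerance):
    """Alternatives whose statistic against j is within `tolerance` of 1/2."""
    m = len(rankings[0])
    return [k for k in range(m)
            if k != j and abs(fraction_preferring(rankings, j, k) - 0.5) <= tolerance]

# Four agents, three alternatives; entry r[a] is the rank of alternative a.
rankings = [[0, 1, 2], [1, 0, 2], [0, 1, 2], [1, 0, 2]]
candidates = candidate_set(rankings, 0, 0.1)
```

As the symmetry discussion points out, a value near 1/2 is necessary but not sufficient for closeness, so a set built this way can also contain alternatives that are far from the target.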
A two-step algorithm. Let be a suitable parameter and . We design a two-step algorithm to circumvent the symmetric bug:
Step 1. Construction of candidate set: We let . All neighbors of are in .
Step 2. Filtering: We design a procedure that can determine whether is close to or to for all . We then use this procedure to filter out all the alternatives in that are not close to .
Details of step 1 and 2 will be given below. The performance of our algorithm is characterized by the following proposition.
Using the above notations, let be an arbitrary alternative. There exists an efficient algorithm that constructs an -nearest neighbor set for any . Here, and are two suitably chosen constants.
Step 1: Construction of candidate set. Let .
Let (for all ) and . Then there exist constants and such that
Step 2: Filtering out unwanted alternatives. Now we have a candidate set such that for any , is either close to or . Next, we describe an algorithm that eliminates the elements in that are not close to . We now formally describe the problem.
The Split-cluster problem. Let be a set such that for any , either or , where . Our goal is to find all such that .
Our split-cluster algorithm is shown in Algorithm 3, with analysis and remarks given in the appendix.
When , Algorithm 3 returns all such that .
6 Numeric validation
This section presents results of experiments based on synthetic data to validate our theoretical results. We randomly generated 1200 agents and 6000 alternatives according to . Then we introduce a new agent and reveal its partial ranking to the system. Our goal is to predict . We examine three algorithms: (i) KT-kNN, (ii) our new global-information kNN from Section 4, (iii) Ground-truth (i.e., directly using the nearest neighbors of an agent in latent space). The ground-truth algorithm cannot be implemented in practice and only serves as an optimal bound for any kNN-based algorithm. We consider ( is the number of neighbors to keep). See Figure 1. One can see that KT-kNN consistently performs poorly, whereas the performance of our new algorithm is very close to the lower bound.
Figure 1(c) shows experiments for high-dimensional latent spaces () under the same setting as 1D except . We see that KT-kNN consistently performs worse than our new algorithm, whose performance is very close to the ground truth.
7 Additional related work
Nonparametric learning in practice. Our model is sometimes considered a non-parametric model. Nonparametric preference learning methods are widely applied in practice, but little is known about their theoretical guarantees. Our work is related to the recent line of work in preference completion McNee et al. (2006); Liu and Yang (2008b); Cremonesi et al. (2010); Wang et al. (2012, 2014); Huang et al. (2015); Cheng et al. (2017); Katz-Samuels and Scott (2017). Some of the most recent algorithms (e.g., Wang et al. (2014); Huang et al. (2015)) have impressive performance in practice, but have no theoretical explanations justifying their successes.
Non-ranking observations. There is a rich literature (e.g., see Herlocker et al. (1999); Liu and Yang (2008a); Bobadilla et al. (2013); Lee et al. (2016) and references therein) on learning information about based on partial observations. For example, in the classical collaborative filtering problem, one has noisy observations of (e.g., the observation is for some white noise). These results are not comparable to ours. Other work Kleinberg and Sandler (2003, 2008) assumes an observation model related to ours: an alternative is more likely to be used/evaluated by an agent if is high.
(or its expectation) is low rank. This matrix is full rank in all the utility functions and models considered in our program. Furthermore, their loss functions are not in terms of rank correlations (the most natural choice).
Parametric inference. Parametric preference learning has been extensively studied in machine learning, especially learning to rank Cheng et al. (2010); Mollica and Tardella (2016); Negahban et al. (2017); Azari Soufiani et al. (2012, 2014, 2013b, 2013a); Maystre and Grossglauser (2015); Khetan and Oh (2016); Hughes et al. (2015); Zhao et al. (2016). These methods often assume the existence of a parametric model, usually a Random Utility Model or Mallows’ model.
8 Concluding remarks
This paper introduced a natural learning-to-rank model, and showed that under this model a widely-used KT-distance based kNN algorithm fails to find similar agents (users). To fix the problem, we introduced a new set of features for agents that relies on the rankings of other agents (i.e., relying on “global information”). We also designed an algorithm for finding similar alternatives that uses only local information. The two algorithmic results show that the “item-similarity” problem is fundamentally different from the “user-similarity” problem.
Generalization. We made two assumptions in our analysis: (i) we observe each agent’s full ranking over ; and (ii) and are in 1-dimensional space. Relaxing assumption (i) is straightforward because we need only develop specialized tail bounds for (discussed in Section 4). Relaxing assumption (ii), however, is challenging because our analysis heavily relies on symmetric properties over the 1-dimensional lines, many of which break in high-dimensional space. We note that in practice, the improvement of predictive power using high-dimensional models is usually incremental Li et al. (2017).
Limitation. RBF utilities are not universally applicable in all recommender systems (e.g., in some circumstances, “cosine similarities” are more suitable utility functions). This paper’s major contribution is the theoretical investigation of a fundamental learning-to-rank problem. It remains future work to apply our results to understand their impact on practical recommender systems.
- Abraham et al.  Ittai Abraham, Shiri Chechik, David Kempe, and Aleksandrs Slivkins. Low-distortion inference of latent similarities from a multiplex social network. In Proceedings of the Twenty-fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’13, pages 1853–1883, Philadelphia, PA, USA, 2013. Society for Industrial and Applied Mathematics.
- Ailon  Nir Ailon. Aggregation of partial rankings, p-ratings and top-m lists. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007.
- Alon  Noga Alon. Ranking tournaments. SIAM Journal of Discrete Mathematics, 20:137–142, 2006.
- Azari Soufiani et al.  Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Random utility theory for social choice. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 126–134, Lake Tahoe, NV, USA, 2012.
- Azari Soufiani et al. [2013a] Hossein Azari Soufiani, William Chen, David C. Parkes, and Lirong Xia. Generalized method-of-moments for rank aggregation. In Proceedings of Advances in Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 2013a.
- Azari Soufiani et al. [2013b] Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Preference Elicitation For General Random Utility Models. In Proceedings of Uncertainty in Artificial Intelligence (UAI), Bellevue, Washington, USA, 2013b.
- Azari Soufiani et al.  Hossein Azari Soufiani, David C. Parkes, and Lirong Xia. Computing Parametric Ranking Models via Rank-Breaking. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014.
- Berry et al.  Steven Berry, James Levinsohn, and Ariel Pakes. Automobile prices in market equilibrium. Econometrica, 63(4):841–890, 1995.
- Bobadilla et al.  J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. Recommender systems survey. Knowledge-Based Systems, 46:109–132, 2013.
- Cheng et al.  Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. Learning to recommend accurate and diverse items. In Proceedings of the 26th International Conference on World Wide Web, WWW ’17, pages 183–192, 2017. ISBN 978-1-4503-4913-0.
- Cheng et al.  Weiwei Cheng, Krzysztof J. Dembczynski, and Eyke Hüllermeier. Label ranking methods based on the plackett-luce model. Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 215–222, 2010.
- Conitzer et al.  Vincent Conitzer, Andrew Davenport, and Jayant Kalagnanam. Improved bounds for computing Kemeny rankings. In Proceedings of the National Conference on Artificial Intelligence (AAAI), pages 620–626, Boston, MA, USA, 2006.
- Cremonesi et al.  Paolo Cremonesi, Yehuda Koren, and Roberto Turrin. Performance of recommender algorithms on top-n recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, pages 39–46. ACM, 2010. ISBN 978-1-60558-906-0.
- Gunasekar et al.  Suriya Gunasekar, Oluwasanmi O. Koyejo, and Joydeep Ghosh. Preference Completion from Partial Rankings. In Advances in Neural Information Processing Systems, 2016.
- Herlocker et al.  Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 230–237, 1999.
- Huang et al.  Shanshan Huang, Shuaiqiang Wang, Tie-Yan Liu, Jun Ma, Zhumin Chen, and Jari Veijalainen. Listwise collaborative filtering. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 343–352. ACM, 2015.
- Hughes et al.  David Hughes, Kevin Hwang, and Lirong Xia. Computing Optimal Bayesian Decisions for Rank Aggregation via MCMC Sampling. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 385–394, 2015.
- Katz-Samuels and Scott  Julian Katz-Samuels and Clayton Scott. Nonparametric preference completion. CoRR, abs/1705.08621, 2017. URL http://arxiv.org/abs/1705.08621.
- Kenyon-Mathieu and Schudy  Claire Kenyon-Mathieu and Warren Schudy. How to Rank with Few Errors: A PTAS for Weighted Feedback Arc Set on. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, pages 95–103, San Diego, California, USA, 2007.
- Khetan and Oh  Ashish Khetan and Sewoong Oh. Data-driven rank breaking for efficient rank aggregation. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, 2016.
- Kleinberg and Sandler  Jon Kleinberg and Mark Sandler. Convergent algorithms for collaborative filtering. In Proceedings of the 4th ACM conference on Electronic commerce, pages 1–10, 2003.
- Kleinberg and Sandler  Jon Kleinberg and Mark Sandler. Using mixture models for collaborative filtering. Journal of Computer and System Sciences, 74(1):49–69, 2008.
- Lee et al.  Christina E. Lee, Yihua Li, Devavrat Shah, and Dogyoon Song. Blind Regression: Nonparametric Regression for Latent Variable Models via Collaborative Filtering. In Advances in Neural Information Processing Systems, 2016.
- Li et al.  Cheng Li, Felix MF Wong, Zhenming Liu, and Varun Kanade. From which world is your graph. In Advances in Neural Information Processing Systems, pages 1468–1478, 2017.
- Liu and Yang [2008a] Nathan N. Liu and Qiang Yang. EigenRank: a ranking-oriented approach to collaborative filtering. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 83–90, 2008a.
- Liu and Yang [2008b] Nathan N. Liu and Qiang Yang. Eigenrank: A ranking-oriented approach to collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, pages 83–90, 2008b. ISBN 978-1-60558-164-4.
- Liu  Tie-Yan Liu. Learning to rank for information retrieval. Found. Trends Inf. Retr., 3(3):225–331, March 2009. ISSN 1554-0669.
- Luce  R. Duncan Luce. The choice axiom after twenty years. Journal of Mathematical Psychology, 15(3):215–233, 1977.
- Maystre and Grossglauser  Lucas Maystre and Matthias Grossglauser. Fast and accurate inference of Plackett-Luce models. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 172–180, 2015.
- McNee et al.  Sean M. McNee, John Riedl, and Joseph A. Konstan. Being accurate is not enough: How accuracy metrics have hurt recommender systems. In CHI ’06 Extended Abstracts on Human Factors in Computing Systems, CHI EA ’06, pages 1097–1101, New York, NY, USA, 2006. ACM. ISBN 1-59593-298-4.
- Mollica and Tardella  Cristina Mollica and Luca Tardella. Bayesian Plackett–Luce mixture models for partially ranked data. Psychometrika, pages 1–17, 2016.
- Negahban et al.  Sahand Negahban, Sewoong Oh, and Devavrat Shah. Rank centrality: Ranking from pairwise comparisons. Operations Research, 65(1):266–287, 2017.
- Park et al.  Dohyung Park, Joe Neeman, Jin Zhang, Sujay Sanghavi, and Inderjit S. Dhillon. Preference Completion: Large-scale Collaborative Ranking from Pairwise Comparisons. In Proceedings of the 32nd International Conference on Machine Learning, pages 1907–1916, 2015.
- Plackett  Robin L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193–202, 1975.
- Sarwar et al.  Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pages 285–295. ACM, 2001.
- Scholkopf and Smola  Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
- Wang et al.  Shuaiqiang Wang, Jiankai Sun, Byron J. Gao, and Jun Ma. Adapting vector space model to ranking-based collaborative filtering. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 1487–1491, 2012. ISBN 978-1-4503-1156-4.
- Wang et al.  Shuaiqiang Wang, Jiankai Sun, Byron J. Gao, and Jun Ma. Vsrank: A novel framework for ranking-based collaborative filtering. ACM Trans. Intell. Syst. Technol., 5(3):51:1–51:24, July 2014. ISSN 2157-6904.
- Zhao et al.  Zhibing Zhao, Peter Piech, and Lirong Xia. Learning Mixtures of Plackett-Luce Models. In Proceedings of the 33rd International Conference on Machine Learning (ICML-16), 2016.
Appendix A Missing analysis for KT-kNN
This section presents the missing analysis in Section 3. We have the following three major lemmas.
Let be a uniform distribution on , , and . Let be an arbitrary agent and be the ranking of the agent (which is a random variable conditioned on ). We have:
Proof: Because is a continuous function of , we use to characterize the minimal point . Specifically, we shall show that for all , which means the function is minimized when .
We next calculate . Let and . When the context is clear, we can write and instead. We have