Quantitative evaluation of a phenomenon consists of measuring one or several variables associated with it while controlling other intervening variables that alter the measurement but are not linked to the phenomenon, i.e., noise. That unavoidable situation produces differences between what is intended to be measured and what is actually measured. The accuracy of a particular measurement method depends significantly on whether the variables used are effectively aimed at the target and on the robustness of the method against unwanted or unavoidable factors.
Evaluation for educational purposes also obeys that principle. That is, the tools used to measure a particular skill or knowledge (for example, an exam or written test) sometimes point to a “moving target” and are usually affected by external factors. For instance, tests for second-language proficiency assessment can deviate from their intrinsic objective if they only take into account what is taught in the teaching curriculum and disregard diverse cultural and linguistic backgrounds. They are also affected by factors such as the artificial preparation (teaching to the test) of the individuals being evaluated [11, 17], the stamina needed to answer a long test, the ability of discrimination, and the handling of a particular set of keywords, among others. Consequently, many tests of linguistic competence actually measure (noisily) many factors that may or may not be related to the test takers’ actual linguistic competence.
The current information era and the rise of social networks provide new approaches for quantitative evaluation based on the principle of the “wisdom of the crowd”. Consider the case of StackOverflow (https://stackoverflow.com/), a social network where computer programmers ask questions that are collaboratively answered by the online community. Traditionally, a programmer’s degree of technical competence is determined by written, oral or automated tests, which suffer from many of the aforementioned problems of the language-proficiency domain. A recent study showed that the reputation gained from social interactions on StackOverflow is an accurate predictor of the programming skills of the users of that social network.
Even more recently, a new collaborative social network for language practice, named Yask (https://www.Yask.ai), has been gaining popularity, recognition, and an increasing number of active users. Yask has a similar structure to StackOverflow, which opens the research perspective of measuring the written proficiency of its users based on their interactions and votes in the social network. Methods for measuring user importance or reputation in a social network are based on the analysis of the structure of the social graph. A well-known method for that purpose is the Page Rank algorithm. Our method, called Proficiency Rank, extends Page Rank by integrating positive, negative, implicit, and explicit signals from the social graph. In this work, we focus on determining whether Proficiency Rank is an appropriate variable for measuring the language competence of a group of users in a social network like Yask. This approach could produce a new method for language-proficiency evaluation that is intrinsically free from many of the issues associated with the classic approaches, while creating new challenges to overcome.
II-A Language Proficiency Assessment
Language proficiency in a second or foreign language (L2) comprises the ability to do “something” with the language but also knowledge about it. This proficiency can be understood as pragmatic knowledge to face real-life communicative situations. After Hymes, two parallel approaches were suggested. The first one takes into consideration the sociolinguistic and discourse abilities needed to communicate appropriately. The second one is based on performance as a result of the interplay among linguistic, mental and social competence and their mutual dependence on context. However, in the educational domain there are other assumptions about proficiency, divided mainly into two groups according to the linguistic aim, namely: Basic Interpersonal Communicative Skills (BICS), or everyday interaction skills, and Cognitive Academic Language Proficiency (CALP), or academic and schooling knowledge communication. In terms of language testing, the discussion about proficiency focuses on its innate nature as unitary or divisible. The unitary vision supposes the existence of an indivisible underlying structure. The divisible proficiency theory holds that proficiency can be divided into subcategories, e.g., writing, speaking, listening and reading.
Over the last two decades, a “multidimensional conceptualization of language proficiency” has pointed out the existence of a set of different communicative skills and strategies. The Common European Framework of Reference for Languages (CEFR), based on a divisible language proficiency model, describes six ascending levels of proficiency: A1, A2, B1, B2, C1, and C2. The A labels correspond to the Elementary level, the B labels to Intermediate and the C labels to Advanced. CEFR describes proficiency from overall skills and abilities down to particular and less important aspects of human communication. CEFR has become a mandatory tool for teaching materials, curricula, and assessment since it was proposed in 2001. Language proficiency exams are divided into separate sections such as reading, listening, speaking, and writing. Each section assesses a particular skill needed for communication, expression and language understanding. The language targeted by these exams is expected to cover everyday contexts.
II-B Automatically Assessing Writing Proficiency
This task consists of “predicting at which language learning stage a text can be produced or understood by an L2 learner” according to a scale such as the CEFR. Several linguistic features are extracted from the texts to produce quantitative proficiency predictions. Pilán et al. proposed a method that combines 61 linguistic features, divided into five groups (length-based, lexical, morphological, syntactic and semantic features), with a Support Vector Machine (SVM) to classify essays written by L2 learners of Swedish into CEFR levels. Their datasets were error-prone essays written by learners and error-free texts (coursebooks) written by experts, both manually labeled with CEFR levels. The best approach obtained an of and a of , which is the weighted combination of L2 coursebook texts and of Swedish L2 learners’ essays. Lexical features, which measure the proportion of tokens per CEFR level in the texts, were the most predictive.
Tack et al. collected a corpus of English short answers to questions, based on the CEFR levels, and implemented an approach using a soft-voting classifier that integrates a panel of five traditional models: a Gaussian Naive Bayes classifier, a CART decision tree, a kNN classifier, a one-vs.-rest (OvR) logistic regressor and an OvR polynomial LibSVM Support Vector Machine. They used individual features grouped into 18 different families, among which are lexical features, syntactic features, discursive features, and a number of psycholinguistic norms. The best approach obtained an of and an adjacent accuracy of . Sentence and word length, lexical features and information about the age of acquisition of words had a strong positive correlation with the assessed CEFR level. Also, the system did not have any particular difficulty in correctly predicting the lowest CEFR levels.
Yannakoudakis et al. proposed a method consisting of a corpus of English texts with CEFR scores assigned by a human expert, five feature types (character sequences, part-of-speech (PoS) sequences, hybrid word and PoS sequences, phrase structure rules, and errors and error rate) and a classification algorithm. The best approach obtained a Pearson of , a Spearman of and a of , with of standard error of. This model used the test set (consisting of 260 texts) and the PoS feature.
“My Tailor is rich!” was a machine learning level-prediction competition held in conjunction with CAp2018 (http://cap2018.litislab.fr/competition-en.html), whose task was to predict the English level, according to the six reference levels of the CEFR, by analyzing written texts of between 20 and 300 words and a set of characteristics calculated from these texts. The Organizing Committee provided fifty-nine feature variables, mainly shallow features based on the state of the art in stylometry and language readability. The participating systems produced predictions with very high accuracy. However, a further analysis of the results revealed that the texts contained lexical features that made the classification trivial for some systems. For instance, texts labeled with the C1 level were prompted by the instruction “Write a movie review”. Therefore, the simple identification of words such as “movie”, “film”, “actor”, etc., was an accurate predictor of the level. The authors of that analysis concluded that different data are needed and more research is necessary to tackle the problem.
II-C Collaborative Social Networks
A collaborative social network is an arrangement of individuals with common or compatible goals who collaborate with each other to achieve those goals. A particular case of this social structure is online communities where members make requests that are addressed voluntarily by the community. Members of these communities interact with each other by posting a request, answering a request, or voting in favor of or against any post. Figure 1 illustrates this type of social network. An example of these social networks is StackOverflow (SO), which gathers the community of computer programmers around questions and problems from that domain. There, users can make requests about computer programming, which receive voluntary answers from other users. Any answer (and also any question) receives votes from other users expressing their approval or disapproval. As a consequence of this dynamic, in the short term the best answers are identified, and in the medium term the skills of the users are revealed. In 2018, SO had more than 9 million users and 16 million answered questions (https://stackoverflow.com/company, https://sostats.github.io/). Recently, SO evolved to SO Jobs, an initiative that aims to match employers and SO users based on the reputation of the users in SO. That new approach for evaluating people’s skills based on social media constitutes an interesting alternative to traditional methods such as exams, tests, and interviews.
Yask is another collaborative social network with a structure similar to that of SO but oriented to language learning. In Yask, users contribute their expertise in their native languages to requests made by language learners. Figure 2a shows a screenshot of Yask’s home screen on a mobile device, illustrating the types of requests supported. The requests are automatically sent to possible contributors, who can vote (see Figure 2b) or respond to the request by posting another answer. When a user is asked for a vote, he or she has no information about any previous votes of other users. When an answer reaches a certain number of votes, it is considered solved and no more votes are requested from the community. The limit of votes per answer is not explicitly established, but answers with more than 15 votes are rare.
II-D Ranking Nodes’ Importance in a Graph
A graph is a model consisting of a set of nodes interconnected by arcs or edges. A graph can be directed, where each arc has a source and a destination node (arrows), or undirected, where the arcs simply connect two nodes bidirectionally. In addition, arcs may be labeled with numeric weights, which yields a weighted graph. Figure 3 illustrates a weighted directed graph along with its representation as an adjacency matrix. Graphs are useful for modeling many problems. For instance, directed graphs can be used for modeling scientific journals (nodes) interconnected by citations (arcs) [25, 19] and the World Wide Web (web pages) interconnected by hyperlinks, among others.
Several methods have been proposed for ranking the importance of the nodes in a graph. These methods are based on the recursive hypothesis that the importance of a node depends on the importance of the nodes that have arcs towards that node. This hypothesis is equivalent to the probability of a random walker visiting any node in the graph. Page Rank is probably the most popular method for obtaining such importance. Let $n$ be the number of nodes in a directed graph, $A$ the $n \times n$ adjacency matrix, and $r^{(t)}$ the vector of node importance at iteration $t$. The Page Rank algorithm requires the columns of $A$ and the vector $r^{(t)}$ to be probability distributions, i.e., to sum to 1. The initial node importances $r^{(0)}$ are assigned randomly. The recursive relation of Page Rank is defined as (the operator $\cdot$ is the dot product or matrix multiplication):
$$r^{(t+1)} = \hat{A} \cdot r^{(t)} \qquad (1)$$
where the entries of the matrix $\hat{A}$ are:
$$\hat{a}_{ij} = \beta\, a_{ij} + \frac{1-\beta}{n} \qquad (2)$$
Here $a_{ij}$ are the entries of $A$ and $\beta$ is a “damping” factor, which allows the random walker to escape from “sink” nodes, i.e., nodes only having inbound arcs. The value of $\beta$ is set to 0.85 for the World Wide Web problem, but it needs to be determined for other applications. After several iterations of applying eq. 1, $r^{(t+1)} \approx r^{(t)}$, so the algorithm converges.
Applying Page Rank to the graph shown in Figure 3 yields the following ranks: 0.254 for A, 0.137 for B and C, 0.207 for D, and 0.265 for E. This is the probability distribution of a random walker visiting each node. Thus, E is the most “popular” node because it receives two incoming arcs. However, D, which also receives two arcs, ranks third after A, which inherits E’s popularity. Finally, B and C rank last because they share the inherited popularity of A.
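As an illustration, the Page Rank recursion can be sketched in a few lines of plain Python. This is a minimal sketch, not the implementation used in the experiments: the function name `page_rank`, the tolerance `tol`, the uniform (rather than random) starting vector, the uniform treatment of nodes without outgoing arcs, and the three-node toy graph are all assumptions made for the example, and the toy graph is not the graph of Figure 3.

```python
def page_rank(adj, beta=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for Page Rank on a column-oriented adjacency matrix.

    adj[i][j] holds the weight of the arc from node j to node i.
    """
    n = len(adj)
    # Normalize each column to sum to 1; a column with no outgoing arcs is
    # replaced by a uniform column (an assumption of this sketch).
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    norm = [[adj[i][j] / col_sums[j] if col_sums[j] > 0 else 1.0 / n
             for j in range(n)] for i in range(n)]
    # "Hat" matrix: damped links plus a uniform teleport term.
    hat = [[beta * norm[i][j] + (1.0 - beta) / n for j in range(n)]
           for i in range(n)]
    rank = [1.0 / n] * n  # uniform start (assumption; the text uses a random start)
    for _ in range(max_iter):
        new_rank = [sum(hat[i][j] * rank[j] for j in range(n)) for i in range(n)]
        # Stop when two consecutive rank vectors are practically identical.
        if sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


# Hypothetical toy graph with arcs 0 -> 1, 0 -> 2, 1 -> 2 and 2 -> 0.
toy = [
    [0, 0, 1],  # arcs into node 0 (from node 2)
    [1, 0, 0],  # arcs into node 1 (from node 0)
    [1, 1, 0],  # arcs into node 2 (from nodes 0 and 1)
]
ranks = page_rank(toy)
print([round(r, 3) for r in ranks])
```

In this toy graph node 2 receives arcs from both other nodes, so it ends up with the highest rank, while the ranks still sum to 1.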
III Method: Proficiency Rank
Our method for Proficiency Rank consists of building two adjacency matrices, one for positive votes and another for negative votes. Thus, each time an answer posted by a user $i$ receives a vote from a user $j$, we draw a directed edge from node $j$ to node $i$ either in the graph of positive votes or in the graph of negative votes. Next, we apply the Page Rank algorithm separately to each one of the graphs/matrices. Once the two Page Rank runs converge, their results are linearly combined using a parameter $\alpha$ that controls the weights of the positive and negative signals. The ranks obtained from the graph built with positive votes should increase with language proficiency. Conversely, the ranks obtained from the graph built from negative votes are inversely related to proficiency. Therefore, the linear combination of both rankings must be made with a negative sign. Thus, the Proficiency Rank vector $p$, containing the rank values for each user in the network, is defined as:
$$\begin{aligned} p &= \alpha\, r_{+}^{(T)} - (1-\alpha)\, r_{-}^{(T)} \\ r_{+}^{(t+1)} &= \hat{A}_{+} \cdot r_{+}^{(t)} \\ r_{-}^{(t+1)} &= \hat{A}_{-} \cdot r_{-}^{(t)} \end{aligned} \qquad (3)$$
In this definition, the first line shows the linear combination of the rankings from the positive and negative signals. The next two equations are the Page Rank runs over the positive and negative adjacency matrices, $A_{+}$ and $A_{-}$. The value of $T$ corresponds to the iteration where the Page Rank algorithm converges.
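The combination of the two Page Rank runs can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `page_rank`, `proficiency_rank` and `alpha`, the fixed number of iterations, the uniform handling of users who cast no votes, and the vote counts are all hypothetical.

```python
def page_rank(adj, beta=0.85, iters=200):
    """Fixed-iteration Page Rank; adj[i][j] = votes from user j to user i."""
    n = len(adj)
    sums = [sum(row[j] for row in adj) for j in range(n)]
    # Column-normalize, damp, and add the uniform teleport term.
    # Users who cast no votes get a uniform column (assumption).
    hat = [[beta * (adj[i][j] / sums[j] if sums[j] else 1.0 / n) + (1 - beta) / n
            for j in range(n)] for i in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [sum(hat[i][j] * rank[j] for j in range(n)) for i in range(n)]
    return rank


def proficiency_rank(pos, neg, alpha=0.5, beta=0.85):
    """Combine the Page Rank of the positive graph with that of the negative one."""
    r_pos = page_rank(pos, beta)
    r_neg = page_rank(neg, beta)
    # Ranks from negative votes enter with a minus sign, so users who attract
    # many negative votes are pushed down.
    return [alpha * p - (1 - alpha) * q for p, q in zip(r_pos, r_neg)]


# Hypothetical votes among 3 users; entry [i][j] counts votes from j to i.
pos_votes = [[0, 3, 2], [1, 0, 0], [0, 0, 0]]  # user 0 is mostly approved
neg_votes = [[0, 0, 0], [0, 0, 0], [2, 1, 0]]  # user 2 is mostly disapproved
scores = proficiency_rank(pos_votes, neg_votes)
print([round(s, 3) for s in scores])
```

With these made-up votes, the user who only receives negative votes obtains the lowest (negative) score.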
The use of Page Rank has a limitation: the ranking can be determined only for users who post answers, that is, users having incoming votes. To the remaining users, who only vote, the Page Rank algorithm assigns an equal minimum ranking value. Generally, the users who only vote significantly outnumber those who post answers, so rankings can be obtained only for a small subset of users. To overcome this issue, we extracted “implicit votes” from the set of voters of a particular answer. For instance, in Figure 1, users C and E voted in opposite ways on the answer posted by D. Then, aside from the explicit votes of C and E toward D, it can be considered that users C and E mutually oppose each other, producing two implicit negative votes between them. We call these votes Implicit Opposition Votes ($IOV$). We denote by $IOV^{+}$ the implicit positive vote from C to E, and by $IOV^{-}$ the implicit negative vote from E to C. Similarly, users A and C agree positively in their votes, as E and F agree negatively on the same answer. These agreements produce what we call Implicit Agreement Votes ($IAV$), which in this case produce mutual positive votes between A and C, and between E and F. We denote the implicit positive votes between A and C by $IAV^{+}$, and those between E and F by $IAV^{-}$. By considering $IOV$s and $IAV$s, the number of users in the graph having incoming votes increases considerably, making the computation of their Proficiency Rank possible.
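The extraction of implicit votes from the voters of a single answer can be sketched as below. This is a hedged illustration: the function name `implicit_votes`, the `+1`/`-1` encoding of votes, and the convention of labelling an opposition vote by the sign of the source voter's own explicit vote are assumptions derived from the example with users A, C, E and F, not a confirmed description of the authors' procedure.

```python
from itertools import permutations


def implicit_votes(votes):
    """Derive implicit votes from the explicit votes cast on ONE answer.

    votes maps each voter to +1 (approval) or -1 (disapproval).
    Returns a dict with four lists of directed (source, target) pairs.
    """
    out = {"IAV+": [], "IAV-": [], "IOV+": [], "IOV-": []}
    for src, dst in permutations(sorted(votes), 2):
        if votes[src] == votes[dst]:
            # Agreement: mutual votes between two voters with the same sign.
            key = "IAV+" if votes[src] > 0 else "IAV-"
        else:
            # Opposition: labelled by the sign of the source voter's own vote.
            key = "IOV+" if votes[src] > 0 else "IOV-"
        out[key].append((src, dst))
    return out


# Votes on the answer posted by D in the example: A and C approve, E and F disapprove.
example = implicit_votes({"A": 1, "C": 1, "E": -1, "F": -1})
for kind, pairs in example.items():
    print(kind, pairs)
```

Four explicit votes on one answer yield twelve implicit directed votes here, which shows how quickly the number of users with incoming votes grows.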
To build the matrices $A_{+}$ and $A_{-}$ we again combine explicit and implicit votes using a weighted linear combination:
$$A_{+} = \gamma_{+} E_{+} + (1-\gamma_{+})\, I_{+}, \qquad A_{-} = \gamma_{-} E_{-} + (1-\gamma_{-})\, I_{-} \qquad (4)$$
Here $E_{+}$ is the adjacency matrix built using the explicit positive votes, and $E_{-}$ the equivalent with explicit negative votes. Similarly, $I_{+}$ is the adjacency matrix built using the $IAV$s, and $I_{-}$ the equivalent for $IOV$s. It is important to note that the entries of all the matrices contain the total number of votes given by the users indexed by the columns towards the users indexed by the rows; the columns are then normalized to sum to 1 (see Fig. 3). This differs from the traditional setting of Page Rank, where several links from one node to another are treated as a single one. In addition, the matrices in eq. 4 need to be transformed to their “hat” versions using eq. 2.
In summary, our method has four parameters, namely: the damping factor $\beta$ of Page Rank, the weighting parameter $\alpha$ between positive and negative votes, the weighting parameter $\gamma_{+}$ between explicit and implicit positive votes, and the analogous parameter $\gamma_{-}$ for negative votes. In addition, for the construction of $I_{+}$ it is necessary to determine which combination of $IAV^{+}$ and $IAV^{-}$ should be used (analogously for $I_{-}$ with $IOV^{+}$ and $IOV^{-}$).
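The construction of a combined vote matrix from explicit and implicit votes can be sketched as follows. This is a minimal sketch under assumptions: the helper names `count_matrix`, `normalize_columns` and `combine`, the weight name `gamma`, the choice of normalizing each matrix before combining them, and the sample votes are all hypothetical.

```python
def count_matrix(votes, users):
    """Count votes: entry [i][j] is the number of votes from users[j] to users[i]."""
    index = {u: k for k, u in enumerate(users)}
    mat = [[0.0] * len(users) for _ in users]
    for src, dst in votes:
        # Repeated votes between the same pair accumulate instead of collapsing.
        mat[index[dst]][index[src]] += 1.0
    return mat


def normalize_columns(mat):
    """Scale every non-empty column to sum to 1; empty columns stay at zero."""
    n = len(mat)
    sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
    return [[mat[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def combine(explicit, implicit, gamma):
    """Weighted linear combination of two column-normalized matrices."""
    e, m = normalize_columns(explicit), normalize_columns(implicit)
    n = len(e)
    return [[gamma * e[i][j] + (1 - gamma) * m[i][j] for j in range(n)]
            for i in range(n)]


users = ["A", "C", "D"]
explicit_pos = [("A", "D"), ("C", "D"), ("A", "D")]  # (voter, answer author)
implicit_pos = [("A", "C"), ("C", "A")]              # mutual agreement votes
a_pos = combine(count_matrix(explicit_pos, users),
                count_matrix(implicit_pos, users), gamma=0.7)
for row in a_pos:
    print([round(x, 2) for x in row])
```

Because both inputs are column-normalized, every column that has explicit and implicit votes still sums to 1 after the combination; the result would then be passed through the damping step of eq. 2.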
IV Experimental Validation
Our experiments aim to address two questions. First, to what extent are the ranking methods presented in Section III correlated with the English proficiency level of the users in Yask? Second, how much user interaction in Yask is needed to adequately measure the English proficiency level of the users?
IV-A Dataset Description
The data used in the experiments were extracted manually using the Yask application for smartphones. First, we created a user and interacted actively for approximately one month by posting answers in English to other users’ requests. Next, we recorded all the users and votes related to all of our answers. Finally, we selected 10 users with native-level Spanish and English between beginner and advanced level, and recorded all posts, users and votes related to their requests. The result was a graph with 377 users (nodes), 1,571 positive votes (arcs), and 490 negative votes (arcs). The English proficiency level of the users, as declared by the users themselves, is distributed as follows: 69 ‘Native’, 52 ‘Fluid’, 66 ‘Advanced’, 140 ‘Intermediate’, and 50 ‘Beginner’. The number of requests asked by the users is 179, while the number of answers to those requests is 412. These 412 answers were written by only 107 users, who are the only ones able to receive votes. Approximately 50% of the answers were posted by 10 users, among them Google Translate and the Yask Bot, which automatically responds with the answer to a previously answered request in Yask. This observation confirmed our assumption that users who contribute answers are a minority in comparison with those who contribute votes. The extraction was performed during January 2019.
IV-B Experimental Setup
The gold standard for comparing the rankings produced by the methods is the English level that the users freely declared when they signed up for Yask, which we assume to be true. We replaced the categorical levels with a simple numerical scale as follows: 5 for ‘Native’, 4 for ‘Fluid’, 3 for ‘Advanced’, 2 for ‘Intermediate’, and 1 for ‘Beginner’. The evaluation measure used to compare the degree of agreement between the gold standard and the produced rankings is Spearman’s rank correlation.
Each particular configuration of Proficiency Rank consists of a selection of types of positive and negative votes. Positive votes can be a combination of a selection from explicit positive votes, $IAV^{+}$, and $IAV^{-}$. Similarly, negative votes come from explicit negative votes, $IOV^{+}$, and $IOV^{-}$. Once the types of votes to use have been established for a configuration, the parameters are determined with a grid search at a resolution of 0.1. Then, the grid resolution is increased to 0.05 in the vicinity of the current best configuration, and the resolution is increased again until it reaches 0.01. The function to optimize is the average of Spearman’s correlations between Proficiency Rank and the gold standard for different subsets of the users, filtered by a threshold representing the minimum number of incoming votes. The threshold is incremented from 1 to the maximum number of votes obtained by the top-voted user. Clearly, as the threshold increases, the number of users that surpass it decreases. For the average calculation, we considered only significant correlations with .
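The objective described above, the average Spearman correlation over increasing vote thresholds, can be sketched in plain Python. This is a simplified illustration: the names `ranks`, `spearman`, `objective` and `min_users`, the sample scores and the vote counts are hypothetical, and the significance filter mentioned in the text is omitted, so every defined correlation enters the average.

```python
def ranks(values):
    """Average ranks (1-based), so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5


def objective(scores, gold, incoming, min_users=3):
    """Mean Spearman correlation over all thresholds on incoming votes."""
    values = []
    for threshold in range(1, max(incoming) + 1):
        keep = [k for k, votes in enumerate(incoming) if votes >= threshold]
        if len(keep) < min_users:
            break  # too few users left for a meaningful correlation
        rho = spearman([scores[k] for k in keep], [gold[k] for k in keep])
        if rho is not None:
            values.append(rho)
    return sum(values) / len(values) if values else None


levels = {"Native": 5, "Fluid": 4, "Advanced": 3, "Intermediate": 2, "Beginner": 1}
gold = [levels[name] for name in
        ["Native", "Fluid", "Advanced", "Intermediate", "Beginner"]]
scores = [0.9, 0.7, 0.4, 0.5, 0.1]  # hypothetical Proficiency Rank values
incoming = [9, 7, 5, 3, 1]          # hypothetical incoming-vote counts
result = objective(scores, gold, incoming)
print(result)
```

A coarse-to-fine grid search would call `objective` once per parameter setting and keep the setting with the largest value.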
Table I shows seven possible configurations that we consider interesting to discuss. The last row reports the total number of votes found in the data for each type of vote. The baseline method consists of the total number of incoming votes per user for each configuration. This measure quantifies the possible undesired effect of the users’ amount of activity in the network being correlated with their language proficiency. The results for all 377 users for each configuration can be seen in the last column of Table I.
| | Positive Votes | | | Negative Votes | | | |
| Num. of votes: | 1,571 | 7,398 | 1,146 | 490 | 666 | 666 | 11,937 |
* significant .
IV-C CEFR Baseline
We provide an additional test bed for comparison that reflects the methods of current language teaching curricula. For that, we used the English Vocabulary Profiles for the Common European Framework of Reference (CEFR) provided by EnglishProfile (https://englishprofile.org/wordlists), a non-profit organization devoted to producing resources for teaching English aligned with the CEFR levels (i.e., A1, A2, up to C2). They provide manually curated word lists obtained from the Cambridge Learner Corpus, which represent the vocabulary profile for each CEFR level (we use the lists compiled at https://www.toe.gr/course/view.php?id=27). Table II shows the number of words in the vocabulary profile of each level (the table’s diagonal) and the number of common words between levels.
We used the texts of the answers written by the Yask users in combination with the CEFR vocabulary profiles to determine the level of each user. For that, we obtained all $c_{u,l}$, the number of common words between the set of words derived from the answers written by user $u$ and the vocabulary profile corresponding to level $l$. Since the CEFR levels are meant to correspond to a linear progression of proficiency in English, we assigned increasing weights to each level. Thus, the proficiency level $\mathit{CEFR}_u$ of a particular user $u$ is computed as a weighted average:
$$\mathit{CEFR}_u = \frac{\sum_{l=1}^{6} l \cdot c_{u,l}}{\sum_{l=1}^{6} c_{u,l}} \qquad (5)$$
where $l=1$ corresponds to level A1, $l=2$ to A2, and so on until $l=6$ for C2. Therefore, $\mathit{CEFR}_u$ is a number between 1 and 6…
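The weighted average of this baseline can be illustrated with a short Python sketch. It is only a sketch under assumptions: the function name `cefr_score`, the simple regular-expression tokenizer, the handling of users without any overlapping word, and the tiny made-up word lists are hypothetical and much smaller than the real vocabulary profiles.

```python
import re


def cefr_score(answers, profiles):
    """Weighted average of CEFR levels (1 = A1 ... 6 = C2) for one user."""
    words = set()
    for text in answers:
        # Crude tokenizer (assumption): lowercase runs of letters and apostrophes.
        words.update(re.findall(r"[a-z']+", text.lower()))
    # One count per level: distinct user words found in that level's profile.
    counts = [len(words & profiles[level])
              for level in ("A1", "A2", "B1", "B2", "C1", "C2")]
    total = sum(counts)
    if total == 0:
        return None  # no overlap with any profile: the score is undefined
    return sum(weight * c for weight, c in enumerate(counts, start=1)) / total


# Tiny made-up word lists; the real profiles contain far more words.
profiles = {
    "A1": {"cat", "house", "big"},
    "A2": {"travel", "weather"},
    "B1": {"opinion"},
    "B2": {"nevertheless"},
    "C1": {"albeit"},
    "C2": {"ubiquitous"},
}
score = cefr_score(
    ["The big cat likes to travel.", "Nevertheless, my opinion stands."], profiles
)
print(score)
```

Here two words match A1 and one word each matches A2, B1 and B2, so the score is (1·2 + 2·1 + 3·1 + 4·1) / 5 = 2.2.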
To evaluate the soundness of this baseline, we applied eq. 5 to the 27,306 texts from the CAp2018 training dataset (http://cap2018.litislab.fr/competition-en.html). The obtained values were compared against the gold standard levels in the same dataset. We observed a Spearman’s rank correlation of 0.65 with . This indicates that the proposed baseline reflects the CEFR levels of English proficiency to a substantial degree.
Figure 4 shows the results obtained by the Proficiency Rank configurations conf1 to conf7. The vertical axis corresponds to the Spearman’s rank correlation between the Proficiency Rank produced by each configuration and the gold standard, measured on the subset of users filtered by the vote threshold (horizontal axis). In Figure 4, the correlations of all seven configurations increase as the threshold increases. The total number of votes considered for applying the threshold varies with each configuration. For instance, configuration conf1 considered only 2,061 votes (1,571+490), and conf6 used all 11,937 available explicit and implicit votes. As the threshold increases, the number of users that meet it decreases. Figure 5 shows the same results with the abscissa replaced by the number of users, producing a decreasing tendency for all configurations. In general, the best configurations are those that shape the upper bound in both figures. Note that all configurations outperformed their corresponding baselines by a wide margin.
IV-E Reproducibility of Experiments
To reproduce the experiments of this study, it is necessary to ask Yask’s representatives for the data used in this study or for consent to extract a data sample from Yask with a methodology similar to the one described in subsection IV-A. The code, written for Python 3.7, for reproducing the experiments and a sample of the data format is available at https:
Figure 6 shows a comparison of the results obtained by Proficiency Rank and the CEFR baseline proposed in eq. 5. This comparison is only possible against conf1 because the users who received votes are the only ones who wrote answers. Thus, the baseline for each user is computed by aggregating all the answers written by each user. In addition, this figure includes a line with the critical values for the Spearman’s rank coefficient for .
Let us discuss the results obtained by conf1, a configuration composed only of explicit votes and therefore the one with the smallest number of votes. Conf1 achieved the best results when the vote threshold varies between 1 and 25. This result indicates that explicit votes are the strongest signal in the social graph. However, explicit votes are in short supply, producing significant Proficiency Ranks for a maximum of only 91 users out of 377. Figure 5 shows that other configurations produce Proficiency Ranks for far more users. The optimal value of the damping parameter, when using conf1, coincides with the default value of the equivalent damping parameter for Page Rank (0.85). The parameter that weights positive against negative votes indicates that, even though the number of explicit negative votes is relatively small (), they weigh much more than the explicit positive votes (). In addition to the experiments shown, we tested the results of Page Rank for positive and negative votes separately. However, no significant correlation was observed () for any possible value of the damping parameter. In the first test (only explicit positive votes), the number of explicit positive votes seemed to be sufficient for 377 users, but their lack of informativeness could explain the poor results. In contrast, the explicit negative votes are highly informative, but only 490 edges for 377 nodes produce a very sparse graph. This result indicates that our method of combining the positive and negative votes using a controlled linear combination is effective in comparison with Page Rank using positive or negative votes alone.
Figure 4 shows that configurations conf1, conf2, conf3, and conf7 outperformed the others. These configurations have in common the use of and the disregard of . Again, this result shows the preponderance of a few negative signals over a large number of positive signals. We consider conf7 to be the best configuration. Figure 5 shows that conf7 performs among the best configurations in the range from 10 to 100 users, and it is the best for more than 100 users. Although conf5 and conf6 produce Proficiency Ranks for more than 200 users, their correlations are considerably lower than those of conf7.
The top result () was obtained using conf5 and . That result corresponds to a set of 12 users, each having more than 75 incoming votes. In spite of being a small subset of users, the observed correlation was highly significant, . This result is somewhat unexpected given the poor performance of conf5 for larger sets of users. However, it suggests that high correlations can be achieved if there is enough information (votes) associated with each user. Providing conclusive proof of that point would require experiments with a considerably larger dataset.
Regarding the baseline results (see the last column in Table I), it is clear that the amount of user activity in the collaborative social network is poorly correlated with language proficiency. Only configurations conf1 and conf6 obtained significant correlations (), but with a considerable margin below the lowest Proficiency Rank results. This result indicates that the measure of proficiency is mostly independent of the amount of activity of the users.
The results of the comparison between Proficiency Rank and the CEFR baseline show that there is a possible mismatch between the targets that each measure aims at. Figure 6 shows that for all subsets of users (controlled by the vote threshold) Proficiency Rank produces significant correlations, while the CEFR baseline does not. The fact that the CEFR baseline is strongly correlated with a large corpus produced by learners in a curriculum based on CEFR, but poorly correlated with our gold standard, provides empirical evidence of the misalignment of that curriculum with the language proficiency perceived by Yask’s users. It means that there is only a loose relation between the CEFR vocabulary profiles and the self-perceived proficiency in the written modality. Therefore, it seems that native speakers are not represented by these profiles, nor are beginners or users of any intermediate level. However, the validity of the current curricula and their standardized tests has long been discussed and accepted by academia and the language teaching and assessment industry. Although some academics criticize current approaches, our results seem to be the first empirical evidence of the disagreement between the “wisdom of the crowd” and the written language proficiency tests based on current curricula. This result confirms the difficulty of constructing valid language assessment tests and the potential of social computing technologies in that area.
IV-G Research Perspectives
In general, it is possible to say that the results provided by Proficiency Rank are good predictors of the self-evaluations of the users. That is, the users ordered by Proficiency Rank are arranged meaningfully from “Native” to “Beginner” level. It is important to note that the method does not depend on the English language and is not tied to any learning curriculum. However, the construction of an assessment tool based on this finding requires more research. For instance, in our experiments, the Yask users did not expect to be evaluated; therefore, fraud and artificial preparation were not considered issues. The control of these issues in proficiency evaluation based on social interaction is an interesting research perspective. Similarly, the degree of complexity and difficulty of the requests in Yask is distributed according to the proficiency of the users, which is roughly uniform. In an evaluation scenario, the requests should be provided by an evaluation authority and their difficulty should be controlled. Clearly, extremely difficult or trivial requests can hinder the overall evaluation of the users. In addition, those artificial requests should promote divided voting pools to produce enough negative votes for Proficiency Rank. The determination of an appropriate set of initial requests is also an interesting research topic.
It is also important to note that our method is independent of the modality of the requests. In our experiments, we used only the written modality, but that feature was not used by Proficiency Rank. Therefore, the requests could include any type of media, opening the perspective of constructing requests based on listening, pronunciation, conversation, translation, etc. The use of these modalities in Proficiency Rank is another potential research direction.
We presented Proficiency Rank, a method for measuring user importance in a collaborative social network. By testing Proficiency Rank on a sample of the Yask community, we observed that the rankings obtained by our method are highly correlated with the language proficiency of the users. We carried out experiments that revealed how different amounts of user interaction (postings and votes) produce different intensities of that correlation. In addition, we observed that the most informative signal in a collaborative social network is the negative votes. However, the best configurations of Proficiency Rank were obtained by linear combinations of positive, negative, explicit and implicit votes.
-  J Charles Alderson. The CEFR and the need for more research. The Modern Language Journal, 91(4):659–663, 2007.
-  Theodora Alexopoulou, Carlos Balhana, Nicolas Ballier, Stéphane Canu, Thomas Gaillat, Gilles Gasso, and Caroline Petitjean. Machine learning for learners of English “a plea for creating learner data challenges”. Submitted to the International Journal of Learner Corpus Research, 2007.
-  Lyle F Bachman, Adrian S Palmer, et al. Language testing in practice: Designing and developing useful language tests, volume 1. Oxford University Press, 1996.
-  Thomas R Black. Doing quantitative research in the social sciences: An integrated approach to research design, measurement and statistics. Sage, 1999.
-  Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. Computer Networks, 33(1-6):309–320, 2000.
-  Carol A Chapelle. The TOEFL validity argument. In Building a validity argument for the Test of English as a Foreign Language™, pages 333–366. Routledge, 2011.
-  CNN Chile. Lanzan novedosa app que facilita la comunicación entre chilenos y haitianos. March 2018. Accessed: 2019-01-30.
-  Jim Cummins. Cognitive/academic language proficiency, linguistic interdependence, the optimum age question and some other matters. Working Papers on Bilingualism, 19:121–129, 1979.
-  Christopher Douce, David Livingstone, and James Orwell. Automatic test-based assessment of programming: A review. Journal on Educational Resources in Computing (JERIC), 5(3):4, 2005.
-  Benjamin Golub and Matthew O Jackson. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics, 2(1):112–49, 2010.
-  Adriana González Moncada. On alternative and additional certifications in English language teaching: The case of Colombian EFL teachers’ professional development. Íkala, Revista de Lenguaje y Cultura, 14(22):183–209, 2009.
-  Yo-Sub Han, Laehyun Kim, and Jeong-Won Cha. Evaluation of user reputation on YouTube. In International Conference on Online Communities and Social Computing, pages 346–353. Springer, 2009.
-  Yo-Sub Han, Laehyun Kim, and Jeong-Won Cha. Computing user reputation in a social network of Web 2.0. Computing and Informatics, 31(2):447–462, 2012.
-  Claudia Harsch. Proficiency. ELT Journal, 71(2):250–253, 2016.
-  Dell Hymes. On communicative competence. Sociolinguistics, pages 269–293, 1972.
-  Steven J Matthiesen. Essential Words for the TOEFL. Simon and Schuster, 2017.
-  Kate Menken. Teaching to the test: How No Child Left Behind impacts language policy, curriculum, and instruction for English language learners. Bilingual Research Journal, 30(2):521–546, 2006.
-  Dana Movshovitz-Attias, Yair Movshovitz-Attias, Peter Steenkiste, and Christos Faloutsos. Analysis of the reputation system and user contributions on a question answering website: StackOverflow. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 886–893. ACM, 2013.
-  Isar Nassiri, Ali Masoudi-Nejad, Mahdi Jalili, and Ali Moeini. Normalized similarity index: An adjusted index to prioritize article citations. Journal of Informetrics, 7(1):91–98, 2013.
-  Diane Nicholls. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 conference, volume 16, pages 572–581, 2003.
-  Council of Europe. Council for Cultural Co-operation. Education Committee. Modern Languages Division. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press, 2001.
-  John W Oller and Kyle Perkins. Language tests at school: A pragmatic approach. Longman London, 1979.
-  Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
-  Adrian S Palmer and Lyle F Bachman. Basic concerns in test validation. ELT Documents, 111, 1981.
-  Olle Persson. Identifying research themes with weighted direct citation links. Journal of Informetrics, 4(3):415–422, 2010.
-  Ildikó Pilán, Elena Volodina, and Torsten Zesch. Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2101–2111, 2016.
-  Constance Elise Porter. A typology of virtual communities: A multi-disciplinary foundation for future research. Journal of Computer-Mediated Communication, 10(1):JCMC1011, 2004.
-  Karen L Sandberg and Amy L Reschly. English learners: Challenges in assessment and the promise of curriculum-based measurement. Remedial and Special Education, 32(2):144–154, 2011.
-  Anaïs Tack, Thomas François, Sophie Roekhaut, and Cédrick Fairon. Human and automated CEFR-based grading of short answers. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 169–179, 2017.
-  Lucien Tesnière. Éléments de syntaxe structurale. Klincksieck, 1959.
-  Helen Yannakoudakis, Øistein E Andersen, Ardeshir Geranpayeh, Ted Briscoe, and Diane Nicholls. Developing an automated writing placement system for ESL learners. Applied Measurement in Education, 31(3):251–267, 2018.