Tracing Forum Posts to MOOC Content using Topic Analysis

04/15/2019
by Alexander William Wong, et al.
University of Alberta

Massive Open Online Courses are educational programs that are open and accessible to a large number of people through the internet. To facilitate learning, MOOC discussion forums exist where students and instructors communicate questions, answers, and thoughts related to the course. The primary objective of this paper is to investigate tracing discussion forum posts back to course lecture videos and readings using topic analysis. We utilize both unsupervised and supervised variants of Latent Dirichlet Allocation (LDA) to extract topics from course material and classify forum posts. We validate our approach on posts bootstrapped from five Coursera courses and determine that topic models can be used to map student discussion posts back to the underlying course lecture or reading. Labeled LDA outperforms unsupervised Hierarchical Dirichlet Process LDA and base LDA for our traceability task. This research is useful as it provides an automated approach for clustering student discussions by course material, enabling instructors to quickly evaluate student misunderstanding of content and clarify materials accordingly.



10/04/2020

Unification of HDP and LDA Models for Optimal Topic Clustering of Subject Specific Question Banks

There has been an increasingly popular trend in Universities for curricu...
01/10/2021

Learning Student Interest Trajectory for MOOCThread Recommendation

In recent years, Massive Open Online Courses (MOOCs) have witnessed imme...
06/22/2018

Personalized Thread Recommendation for MOOC Discussion Forums

Social learning, i.e., students learning from each other through social ...
09/24/2019

Diachronic Topics in New High German Poetry

Statistical topic models are increasingly and popularly used by Digital ...
09/19/2018

Clustering students' open-ended questionnaire answers

Open responses form a rich but underused source of information in educat...
12/16/2014

Application of Topic Models to Judgments from Public Procurement Domain

In this work, automatic analysis of themes contained in a large corpora ...
10/18/2020

Studying Visualization Guidelines According to Grounded Theory

Visualization guidelines, if defined properly, are invaluable to both pr...

1. Introduction and Motivation

The primary feature distinguishing a Massive Open Online Course (MOOC) from a traditional course is the MOOC's capacity to scale to a seemingly limitless number of concurrent students (Pappano, 2012). The scalability of a MOOC can be represented by the ratio of instructors to students. When students vastly outnumber the instructional staff, the opportunity for a student to interact meaningfully with the instructors diminishes (Huang et al., 2014). Furthermore, the instructional staff's ability to gauge student understanding is hampered. It is impractical for a fixed number of instructors to respond to the individual needs of an unbounded, growing number of students (Mackness et al., 2010).

One method for students and instructors to interact is through the discussion forums. This medium allows relevant conversation where students ask questions, express their thoughts, and seek help from their peers and instructors about the course material. Stephens-Martinez et al.’s research surveyed 92 MOOC instructors and determined that the conversations students have on course discussion forums are a useful repository of data for course self-evaluation (Stephens-Martinez et al., 2014). Unfortunately, basic visualizations showing discussion board summary statistics are not useful to instructors, as the rich semantic information about the discussions is lost (Stephens-Martinez et al., 2014).

Figure 1. Sample discussion post from ”Agile Planning for Software Products” classified by our topic models. The TF-IDF and Labeled LDA models matched the researcher-given label.

This paper presents a methodology to trace discussion forum activity back to the underlying course material, as demonstrated in Figure 1. One motivation is to show instructors potential areas of misunderstanding with respect to a given lecture video or reading. Instructors could view discussions grouped by lecture and ordered by post count, allowing them to respond more quickly when multiple students struggle with the same issue. A further benefit of our approach is that no additional labelling or overhead is required to perform this mapping; we restrict ourselves to data that is readily available in the course material and discussion forum.

In this paper, we define a topic as a set of similar words that tend to occur together in a collection of text documents. A topic model like LDA receives a set of input documents, a pre-defined number of topics, and some additional prior parameters. LDA then attempts to find a set of topics that describe the input documents; each input document can then be represented as a linear combination of the discovered word distributions. A topic itself can be thought of as a ranked list of words, ordered from high to low topic relevance.

We utilize four different variants of Latent Dirichlet Allocation (LDA) to extract topic-document distribution feature vectors from discussion forum content and course material (Blei et al., 2003). Term Frequency-Inverse Document Frequency (TF-IDF), which scores each word by its frequency within a document, discounted by how common the word is across all documents, is used as the baseline model for forum content feature extraction (Salton and Buckley, 1988).
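As an illustration, such a baseline could be computed with gensim's TfidfModel; the following is a minimal sketch under our own naming and data assumptions (course_docs, post_tokens), not the paper's published implementation:

    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    # Toy stand-ins for pre-processed, tokenized course documents and a forum post.
    course_docs = [["agil", "plan", "softwar"], ["user", "stori", "backlog"]]
    post_tokens = ["agil", "stori", "sprint"]

    dictionary = Dictionary(course_docs)
    corpus = [dictionary.doc2bow(doc) for doc in course_docs]

    tfidf = TfidfModel(corpus)                  # learns IDF weights from the corpus
    post_vec = tfidf[dictionary.doc2bow(post_tokens)]
    print(post_vec)                             # sparse (word id, tf-idf weight) pairs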

We compare the effectiveness of unsupervised LDA approaches with labeled extensions of LDA for discussion traceability. For labels, we utilize the defined course module, lesson, and item headings already associated with each course document. Author Topic models train on text with associated authors, enabling association of extracted topics to the individuals that wrote the document (Rosen-Zvi et al., 2004). Labeled LDA trains on text with a set of tags, which enables inferring word association to the defined tags that exist within the input corpus (Ramage et al., 2009). Our study aims to answer the following research questions:

  RQ1: Are course materials an appropriate corpus for training topic models that evaluate discussion forum posts?

  RQ2: Are topic extraction models useful in tracing discussion forum conversations back to MOOC content?

  RQ3: Do supervised topic models outperform unsupervised topic models in tracing and classifying forum activity?

2. Methodology

We first relate MOOC discussion forum posts to topics extracted from MOOC course content. Afterwards, we evaluate whether topic-mapping vectors are appropriate features for tracing discussions back to course material.

Our methodology is to extract course material from the Coursera platform and convert the unstructured text into a natural language processing (NLP) ready format. Then we perform topic analysis using unsupervised LDA, unsupervised HDP-LDA, supervised Author-Topic models, and Labeled LDA. We perform topic model inference on discussion posts. A post is given weights indicating topic contributions to the post. We compare our topic models to our baseline TF-IDF model to determine the utility of topic extraction models to create features from discussion posts and course material. Finally, we present our results and analysis.

2.1. Data Mining Coursera

Coursera is an online learning platform that allows universities and other organizations to offer MOOCs, specializations, and degrees. A MOOC can be versioned by branches or be composed of a single branch. Figure 2 shows the structure of course material within a branch. A branch is composed of modules, which are groups of lessons intended to encompass a week’s worth of material. A lesson is a more focused group containing items on a specific subject matter. An item is the smallest document for Coursera MOOCs. Items are used to encapsulate specific lecture videos, readings, quizzes, or assignments.

Multiple forums may exist for a single course. The forums are curated by the instructional staff and exist to tailor discussion to a general domain. Common course forum titles include ”Introductions”, ”General Course Discussion”, and ”Technical Issues”. When interacting with the forums, students are limited to either posting Questions or Answers. Questions are top level discussion entities that exist immediately underneath a forum, usually seeking subject matter clarification and help. Answers are replies to Questions and typically contain hints and guidance for the related question. In our study, we did not distinguish between the forums a user posted in or the type of discussion.

Only the most recent, active branch was evaluated for each course. We extracted the hierarchical structure for each course, adding lecture video subtitles and readings to our available corpus. Unstructured text from quizzes and assignments was ignored due to limitations in converting interactive student experiences into static documents. All mined textual data was pre-processed by stripping XML tags (characters encapsulated by ”<” and ”>”), punctuation, consecutive white-space, numeric digits (0 to 9), stop words (”this”, ”and”, ”the”, etc.), and words less than three characters in length. The remaining words were then Porter-stemmed to their root form (removal of ”-ing”, ”-s”, ”-ed”, ”-ly”, etc.).
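A minimal sketch of this cleaning pipeline is shown below; the regular expressions, the abbreviated stop word set, and the use of NLTK's PorterStemmer are our illustrative assumptions:

    import re
    from nltk.stem import PorterStemmer  # assumes nltk is installed

    STOP_WORDS = {"this", "and", "the", "that", "with", "from", "have"}  # abbreviated
    stemmer = PorterStemmer()

    def preprocess(text):
        text = re.sub(r"<[^>]*>", " ", text)        # strip XML tags
        text = re.sub(r"[^A-Za-z\s]", " ", text)    # strip punctuation and digits (0 to 9)
        tokens = text.lower().split()               # splitting collapses consecutive whitespace
        tokens = [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]
        return [stemmer.stem(t) for t in tokens]    # Porter-stem words to their root form

    print(preprocess("<p>This lecture covers 3 agile planning ceremonies.</p>"))
    # ['lectur', 'cover', 'agil', 'plan', 'ceremoni']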

The hierarchical structure of the course was obtained through privileged access to course material by course administrators. The raw document information, such as lecture video subtitles and readings, was mined through polling the Coursera On-Demand API endpoint as an authenticated user enrolled in all of the relevant courses. Figure 2 shows the scope of our data model for our research.

We performed supervised and unsupervised topic extraction on five MOOCs run on the Coursera platform, described in Table 1. All courses studied were related to Computer Science and Software Engineering. After pre-processing the course material, we obtained 190 documents, with 8,805 unique stemmed word stubs. The average document contained 158 word stubs. An example of the document structure and label hierarchy can be found in Figure 3.

Coursera Course Name                 Num. Modules   Num. Lessons   Num. Items
Agile Planning for Software Prod.    4              20             38
Client Needs and Software Reqs.      4              21             42
Design Patterns                      4              9              37
Introduction to SPM                  3              19             30
Object Oriented Design               4              15             43
Table 1. Five analyzed Coursera courses and corresponding number of extracted modules, lessons, and items.
Figure 2. Course material consists of modules, lessons, and items. Topics infer discussion board activity. Topics have a many-to-many relationship to course material.
Figure 3. Two documents (D) from ”Agile Planning for Software Products” with corresponding labels (M, L, I).

2.2. Topic Analysis Approaches

Topics were extracted from the course content material using unsupervised and supervised LDA. New models were trained for each course.

We used the gensim@3.7.1 implementations of LDA, Hierarchical Dirichlet Process LDA (HDP-LDA), and the Author-Topic model (Řehůřek and Sojka, 2010). To form our unsupervised set of topics, we used both LDA and HDP-LDA. All LDA parameters were left at the library-provided defaults. HDP-LDA addresses one shortcoming of traditional LDA by using a Dirichlet process to infer the number of topics, rather than requiring the number of topics to be defined a priori (Wang et al., 2011). We ran HDP-LDA with the default configuration provided by the library.
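A sketch of the corresponding gensim training calls, assuming course_docs holds one course's pre-processed token lists (per the cleaning step above):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, HdpModel

    dictionary = Dictionary(course_docs)
    corpus = [dictionary.doc2bow(doc) for doc in course_docs]

    # Unsupervised LDA; num_topics=100 matches the vector size reported in Table 2.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=100)

    # HDP-LDA infers the number of topics itself, capped at T=150 by default.
    hdp = HdpModel(corpus=corpus, id2word=dictionary)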

The Author-Topic model, an extension of base LDA, was introduced by Rosen-Zvi et al. to correlate documents with authorship information, providing more detail on the subject knowledge of a given author (Rosen-Zvi et al., 2004). The Author component of the Author-Topic model is classically represented by labeling an existing corpus of documents with the author(s) of each document. However, we modified the standard usage and replaced authors with modules, lessons, and items. The underlying assumption was that course content labeled by subject headings could be treated analogously to course content labeled by its authors. The parameter choices for the Author-Topic model were left at the library-provided defaults.
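In gensim this substitution is mechanical: the author2doc mapping that normally lists each author's documents instead lists the documents under each heading. A sketch with illustrative label names:

    from gensim.models import AuthorTopicModel

    # Each module, lesson, and item heading maps to the indices of the course
    # documents it covers; headings stand in for the model's authors.
    label2doc = {
        "module: Agile Planning":   [0, 1, 2],
        "lesson: User Stories":     [1],
        "item: Story Point Basics": [1],
    }
    at_model = AuthorTopicModel(corpus=corpus, author2doc=label2doc,
                                id2word=dictionary, num_topics=100)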

We also performed topic analysis using Labeled LDA. Unlike Author-Topic models, Labeled LDA outputs topics constrained to the labels defined in the corpus (Ramage et al., 2009). Classical applications of Labeled LDA include analyzing tagged blog entries, where a given blog post may have multiple associated tags. We extend an existing implementation of Labeled LDA found on GitHub (https://github.com/fann1993814/llda), with slight modifications to support topic inference without training. Our Labeled LDA runs used 50 iterations. The differences between the four studied models are summarized in Table 2. We use the module, lesson, and item names as labels for our Author-Topic and Labeled LDA approaches.

Model          Labels   Feature Characteristic           Vector Size
TF-IDF         False    # unique words / all words       # words
LDA            False    topics: bounded, latent set      100
HDP-LDA        False    topics: unbounded, latent cap    150
Author-Topic   True     topics: label distributed set    100
Labeled LDA    True     topics: restricted to labels     # labels
Table 2. Comparison of the baseline and four different topic analysis models used in our study.

2.3. Relating Topics to Discussion Activity

To derive a relationship between discussion forum activity and the course material, we used our trained topic models to infer the topic distribution of the discussion forum posts. We inferred the relationship between the word distribution within a discussion post and the word distribution within each topic. We applied the same pre-processing steps used for the course material to the discussion forum activity: we removed tags, punctuation, consecutive white-space, numbers, stop words, and words less than three characters in length before stemming all remaining words to their root form.

LDA inference takes as input a set of documents and estimates the topic weight distribution for each document. Because LDA inference does not modify the existing model and no learning occurs, we performed inference using only the subset of words in the discussion entity that have appeared in our course material vocabulary. No out of vocabulary tokens are used for evaluation. These discussion questions and answers were related to pre-existing topics extracted from the course material.
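In gensim this restriction comes for free, since a bag-of-words built from the course-material dictionary silently drops unknown tokens; a sketch for the LDA-family models that expose get_document_topics:

    from gensim.matutils import sparse2full

    def infer_topic_vector(model, dictionary, post_tokens):
        # Tokens outside the course-material vocabulary are discarded here,
        # so no out-of-vocabulary words contribute to the inference.
        bow = dictionary.doc2bow(post_tokens)
        topics = model.get_document_topics(bow, minimum_probability=0.0)
        return sparse2full(topics, model.num_topics)  # dense topic-weight vector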

(1)   \mathrm{dist}_{\cos}(\mathbf{u}, \mathbf{v}) = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}

If a discussion post has similar topics to a given lecture, we want to suggest that lecture as a likely candidate for the post. Specifically, the document-topic vector of the discussion post is compared with the document-topic vector of each course item. To determine topic similarity, we used the cosine distance defined in Equation 1. The cosine distance function returns a floating-point value between 0 and 1: as the distance approaches 1, the two elements are more distinct, and as it approaches 0, they are more similar. We chose cosine distance as our measurement due to its success in prior literature in ranking topically similar documents (Asuncion et al., 2010).
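Ranking candidate course items then reduces to sorting by cosine distance between the inferred vectors; a sketch using SciPy (the function and variable names are ours):

    import numpy as np
    from scipy.spatial.distance import cosine  # returns 1 - cosine similarity

    def rank_course_items(post_vec, item_vecs):
        # Return course-item indices ordered from most to least similar.
        distances = [cosine(post_vec, vec) for vec in item_vecs]
        return np.argsort(distances)

    # Example: the post's topic vector is closest to the second course item.
    print(rank_course_items(np.array([0.9, 0.1]),
                            [np.array([0.1, 0.9]), np.array([0.8, 0.2])]))  # [1 0]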

2.4. Evaluating Discussion Classification

To determine the effectiveness of our models in tracing discussion forum text back to the underlying course content, we randomly sampled from the discussion activity and manually assigned each discussion to an underlying course item. Posts were queried for manual labelling using SQLite's ORDER BY RANDOM(). Only the primary author labeled the discussion posts. Discussion posts were assigned a module, lesson, and item name. Labelling the discussion posts was challenging, as many discussions did not have a clear mapping to one course item. Discussion posts that were blatantly off topic were excluded from the topic model traceability evaluation, using the researcher's judgment. We evaluated our models on 100 manually classified posts for each of the five courses.

(2)   \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}

We evaluated each model-provided forum post ordering against our manually labeled discussion data using mean reciprocal rank (MRR), as shown in Equation 2. Given a set of queries Q, we average the multiplicative inverse of the rank of the correct answer. For example, if a model ranked the correct label for a discussion question in first place, its reciprocal rank would be 1; second and third place would contribute 1/2 and 1/3, respectively.

To evaluate performance we produced test sets (not training sets) by bootstrapping 1000 samples from 100 discussion-material document pairs that we had manually labeled for each of the five courses. We bootstrap in order to provide a better estimate of performance based on limited manual labelling and sampling.
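A sketch of this bootstrapped evaluation (the helper name and fixed seed are ours):

    import numpy as np

    def bootstrapped_mrr(ranks, n_boot=1000, seed=0):
        # ranks: 1-based rank of the correct course item for each labeled post.
        ranks = np.asarray(ranks, dtype=float)
        rng = np.random.default_rng(seed)
        # Resample the labeled posts with replacement, n_boot times.
        samples = rng.choice(ranks, size=(n_boot, ranks.size), replace=True)
        return float((1.0 / samples).mean())  # mean reciprocal rank over all samples

    print(bootstrapped_mrr([1, 2, 3, 1, 10]))  # approximately 0.59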

2.5. Qualitative Course Vocabulary Evaluation

To help answer the first research question, we summarized the underlying themes of manually labeled off-topic discussions in the course context. We looked at the vocabulary intersection of words used by students with words used in the lecture videos and readings. Given common words used by students that did not appear in the course material, we used our judgment to explain the recurring off-topic themes that did not map cleanly back to course material.

3. Results

Course                                        TF-IDF   LDA     HDP-LDA   Author-Topic   Labeled LDA
Agile Planning for Software Products          0.824    0.388   0.400     0.072          0.436
Client Needs and Software Requirements        0.654    0.163   0.478     0.362          0.246
Design Patterns                               0.654    0.266   0.173     0.066          0.390
Introduction to Software Product Management   0.187    0.077   0.172     0.050          0.217
Object Oriented Design                        0.927    0.400   0.292     0.087          0.432
All Courses Combined                          0.649    0.259   0.303     0.127          0.344

Table 3. Bootstrapped mean reciprocal ranks for the baseline and topic models on our courses. Best values are bolded.
Figure 4. Reciprocal rank violin plot for ”Agile Planning for Software Products”. Horizontal line indicates model MRR.

The calculated MRRs derived from our analyzed courses are shown in Table 3. The baseline TF-IDF model outperforms topic models in all courses except for ”Introduction to Software Product Management”. The least effective model for discussion forum traceability was the Author-Topic model, as it had the lowest MRR for four of the five courses.

A sample violin plot for one of our courses is shown in Figure 4. With the MRR closest to 1, the plot indicates that TF-IDF is the most accurate model for tracing discussion posts back to relevant course material. Labeled LDA, HDP-LDA, and base LDA have MRRs around 0.4, meaning that the correct output is frequently ranked second or third in the output.

3.1. Course Vocabulary Applicability to Forums

We looked at the frequently occurring words used by students that were not encompassed within our course vocabulary. Within ”Agile Planning for Software Products”, these common words included:

peer, coursera, thank, gykcv, rntxi, submiss, ukr, submit, upload, kindli, gui, dev, lectur, regard, bug, request, happi, amazonaw, pdf

The five courses analyzed in our study contained peer assessments, where students were required to evaluate other students' submitted assignments. The majority of discussion forum posts were requests to trade peer assessment grading. These types of posts could still be mapped back to course material manually, as assignment grading requests usually included plain text URLs containing the associated module and lesson.

Other student discussion posts often included clarifications about lecture videos and readings, and the applicability of learned course material to real-world scenarios. Many students expressed their fondness for the Agile methodology and gave concrete examples of their experiences after introducing the approach in their workplace and daily lives. These two types of posts could also be mapped to relevant course lecture videos and readings, as they used clearly defined terminology that gave appropriate context for the conversation.

Students also created many posts where they introduced themselves to the other members of the course. These types of posts were manually labelled as off topic, as no course material existed for individual student introductions.

RQ1: For peer assessment requests and discussion of lecture videos and readings, the vocabulary derived from course material was adequate for performing our traceability task.

3.2. Topic Model Utility in Forum Traceability

We compared the MRRs of our models with random chance using the Wilcoxon Rank Sum Test with Continuity Correction. Across all five courses, our trained topic models outperform random mapping at a statistically significant level.
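For reference, a sketch of one such comparison in Python; SciPy's mannwhitneyu with use_continuity=True corresponds to the rank sum test with continuity correction, and the toy reciprocal-rank data below is our own construction:

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    model_rr = 1.0 / rng.integers(1, 4, size=1000)    # model often ranks the answer in the top 3
    random_rr = 1.0 / rng.integers(1, 39, size=1000)  # uniform guessing over 38 course items

    stat, p_value = mannwhitneyu(model_rr, random_rr,
                                 use_continuity=True, alternative="two-sided")
    print(stat, p_value)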

Topic models are less effective than the baseline TF-IDF model; however, the correctly labeled output is still consistently within the top three ranks for the LDA, HDP-LDA, and Labeled LDA models. Author-Topic models performed the worst of all models, with the appropriate label usually ranked about ten places lower.

Our result suggests that overall, topic models are a promising candidate for generating features from discussion forum activity which allow mapping back to the course material.

RQ2: Topic models outperform randomly mapping discussion posts to course material. This suggests that topic models are an effective candidate for traceability analysis.

3.3. Supervised vs. Unsupervised Topic Models

To validate the effectiveness of supervised versus unsupervised topic models, we compared the Labeled LDA MRRs with the LDA and HDP-LDA models. We ignored comparison tests using Author-Topic models due to their relatively poor performance.

We compared Labeled LDA with base LDA and HDP-LDA for each course using the Wilcoxon Rank Sum Test with Continuity Correction. The resulting test statistics are displayed in Table 4, which shows that the mean reciprocal ranks of these models are statistically different from each other.

Labeled LDA provided statistically significantly higher mean reciprocal ranks compared with HDP-LDA and base LDA. These results suggest that the use of labels in topic model analysis is important for traceability. Although supervised topic models do not always outperform unsupervised approaches, labeled topic models appear to better infer traceability features for student discussion posts.

RQ3: Properly labeled, supervised topic models perform better than unsupervised topic models in tracing discussion posts back to MOOC content.

Comparison                 W
Labeled LDA vs. LDA        6478600
Labeled LDA vs. HDP-LDA    5589900
Table 4. Wilcoxon rank sum test results for Labeled LDA against HDP-LDA and LDA using combined course data.

4. Discussion

Topic models have shown promise in mapping student discussion posts back to the underlying course material. Compared to TF-IDF, our study showed that topic models could not achieve the same level of performance. However, topic models are advantageous in that their output vectors have a constant size, determined by the number of topics used, whereas TF-IDF vectors have a length upper bound equal to the size of the entire vocabulary. Topic models are therefore more efficient for documents with large vocabularies, as computing the document vector and performing the cosine similarity comparison is much slower with TF-IDF.

We did not attempt to label any of the model extracted topics directly, as they were only proxies for our traceability evaluation. Additionally, we did not train a comprehensive model on the entire set of course material and discussion forum documents, as we wanted comparable baselines between our labelled and unlabelled topic models.

5. Threats to Validity

We acknowledge construct validity threats, namely that we used courses from a single institution that covered computer science and software engineering material. We mitigated this threat by using multiple courses rather than a single course. The primary threat to our research validity was that only a single author labeled the evaluation dataset, so the evaluation relied on a single author's judgment. This potential bias could have been mitigated by having multiple individuals label the dataset and reporting inter-rater reliability.

6. Related Work

This paper contributes to research in software engineering traceability and discussion forum topic analysis.

6.1. Traceability & Topic Analysis

Tracing discussion forum content back to course material has similarities to the problem of tracing source code back to the specified requirements. Prior work done by Abadi et al. showed that term frequency-inverse document frequency (TF-IDF) models can encode documents into a rich vector space model, allowing for specification traceability across software artifacts (Abadi et al., 2008). Tata and Patel presented an approach for using TF-IDF models to compare two documents through cosine similarity (Tata and Patel, 2007). Hindle et al.’s research showed that topics extracted from LDA on software requirements documentation can be traced to corresponding version control commits (Hindle et al., 2012). Our research extends the state of the art in software engineering traceability by showing empirical results using labeled topic models for the same task.

6.2. Discussion Forum Analysis

The analysis of discussion forums in an educational context is a well-studied domain, spanning insights from sentiment analysis, social network interactions, user reputation, content popularity, and forum usage semantics (Wen et al., 2014; Wong et al., 2015; Coetzee et al., 2014; Breslow et al., 2013; Onah et al., 2014).

Of prior work utilizing topic analysis, Ezen-Can et al. explored an unsupervised approach for clustering MOOC discussion board activity. Their strategy involved using the k-medoids clustering algorithm on bag-of-words inputs derived from discussion forum entities, then applying LDA to extract topics to which the researchers could assign instructor-meaningful labels (Ezen-Can et al., 2015). Atapattu et al. used LDA to capture influential topic clusters and isolate discussions requiring intervention on Coursera MOOCs (Atapattu et al., 2016). Ramesh et al.'s research predicted MOOC student survival using features extracted from seeded topic models on discussion forum posts (Ramesh et al., 2014). Chen et al. performed topic analysis on students' reflection journals, exploring underlying themes and enabling prediction of journal grades (Chen et al., 2016).

7. Future Work

One limitation of our current approach is the inability to handle off topic discussions. Many discussions were either unrelated to the course, or were too general to be mapped appropriately to a single course item. Future relevant work includes investigating how to best model ”Off-Topic” discussion posts, and how to accommodate course material relevance granularity.

We provide the source code used to perform this experiment (https://github.com/awwong1/topic-traceability) with the intent that replication studies of this work can be performed on other MOOCs. Our study focuses entirely on computer science and software engineering courses. A potential replication study could measure whether discussion topicality matches course material across courses in Arts and Humanities, Business, Mathematics, Science, and other domains.

8. Conclusion

We evaluated multiple variants of LDA topic models for tracing discussion forum posts back to their corresponding course material in five MOOCs. We investigated the traceability accuracy and found that although topic models could not surpass the TF-IDF baseline, our trained topic models performed significantly better than random chance.

We have provided results demonstrating that topic models can be used as a feature extraction step for unstructured text traceability. These topic models do not require any data beyond what already exists in MOOCs. Unlike TF-IDF, they are fast and generate constant-size feature vectors for similarity comparisons. Our research can benefit MOOC stakeholders and advance other domains dealing with traceability.

References

  • Abadi et al. (2008) A. Abadi, M. Nisenson, and Y. Simionovici. 2008. A Traceability Technique for Specifications. In 2008 16th IEEE International Conference on Program Comprehension. 103–112. https://doi.org/10.1109/ICPC.2008.30
  • Asuncion et al. (2010) H. U. Asuncion, A. U. Asuncion, and R. N. Taylor. 2010. Software traceability with topic modeling. In 2010 ACM/IEEE 32nd International Conference on Software Engineering, Vol. 1. 95–104. https://doi.org/10.1145/1806799.1806817
  • Atapattu et al. (2016) Thushari Atapattu, Katrina Falkner, and Hamid Tarmazdi. 2016. Topic-wise Classification of MOOC Discussions: A Visual Analytics Approach.. In EDM. IEDMS, 276–281.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Breslow et al. (2013) Lori Breslow, David E Pritchard, Jennifer DeBoer, Glenda S Stump, Andrew D Ho, and Daniel T Seaton. 2013. Studying learning in the worldwide classroom research into edX’s first MOOC. Research & Practice in Assessment 8 (2013), 13–25.
  • Chen et al. (2016) Ye Chen, Bei Yu, Xuewei Zhang, and Yihan Yu. 2016. Topic Modeling for Evaluating Students’ Reflective Writing: A Case Study of Pre-service Teachers’ Journals. In Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK ’16). ACM, New York, NY, USA, 1–5. https://doi.org/10.1145/2883851.2883951
  • Coetzee et al. (2014) Derrick Coetzee, Armando Fox, Marti A Hearst, and Björn Hartmann. 2014. Should your MOOC forum use a reputation system?. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. ACM, 1176–1187.
  • Ezen-Can et al. (2015) Aysu Ezen-Can, Kristy Elizabeth Boyer, Shaun Kellogg, and Sherry Booth. 2015. Unsupervised modeling for understanding MOOC discussion forums: a learning analytics approach. In Proceedings of the fifth international conference on learning analytics and knowledge. ACM, 146–150.
  • Hindle et al. (2012) Abram Hindle, Christian Bird, Thomas Zimmermann, and Nachiappan Nagappan. 2012. Relating requirements to implementation via topic analysis: Do topics extracted from requirements make sense to managers and developers?. In 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 243–252.
  • Huang et al. (2014) Jonathan Huang, Anirban Dasgupta, Arpita Ghosh, Jane Manning, and Marc Sanders. 2014. Superposter behavior in MOOC forums. In Proceedings of the first ACM conference on Learning@ scale conference. ACM, 117–126.
  • Mackness et al. (2010) Jenny Mackness, Sui Mak, and Roy Williams. 2010. The ideals and reality of participating in a MOOC. In Proceedings of the 7th international conference on networked learning 2010. University of Lancaster.
  • Onah et al. (2014) Daniel FO Onah, Jane E Sinclair, and Russell Boyatt. 2014. Exploring the use of MOOC discussion forums. In Proceedings of London International Conference on Education. 1–4.
  • Pappano (2012) Laura Pappano. 2012. The Year of the MOOC. The New York Times 2, 12 (2012), 2012.
  • Ramage et al. (2009) Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 248–256.
  • Ramesh et al. (2014) Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daume, and Lise Getoor. 2014. Understanding MOOC discussion forums using seeded LDA. In Proceedings of the ninth workshop on innovative use of NLP for building educational applications. 28–33.
  • Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
  • Rosen-Zvi et al. (2004) Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487–494.
  • Salton and Buckley (1988) Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24, 5 (1988), 513 – 523. https://doi.org/10.1016/0306-4573(88)90021-0
  • Stephens-Martinez et al. (2014) Kristin Stephens-Martinez, Marti A Hearst, and Armando Fox. 2014. Monitoring moocs: which information sources do instructors value?. In Proceedings of the first ACM conference on Learning@ scale conference. ACM, 79–88.
  • Tata and Patel (2007) Sandeep Tata and Jignesh M. Patel. 2007. Estimating the Selectivity of Tf-idf Based Cosine Similarity Predicates. SIGMOD Rec. 36, 2 (June 2007), 7–12. https://doi.org/10.1145/1328854.1328855
  • Wang et al. (2011) Chong Wang, John Paisley, and David Blei. 2011. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 752–760.
  • Wen et al. (2014) Miaomiao Wen, Diyi Yang, and Carolyn Rose. 2014. Sentiment Analysis in MOOC Discussion Forums: What does it tell us?. In Educational data mining 2014. Citeseer.
  • Wong et al. (2015) Jian-Syuan Wong, Bart Pursel, Anna Divinsky, and Bernard J Jansen. 2015. An analysis of MOOC discussion forum interactions from the most active users. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer, 452–457.