With rising concerns over the utilization of machine learning (ML) algorithms in everyday activities as well as in high-stakes environments such as the law (Angwin et al., 2016), new directions for the development and deployment of algorithmic systems for non-technical groups have emerged. Legislative imperatives such as the GDPR’s implicit right to explanation (Selbst and Powles, 2018) have led to design strategies that supply explanations for ML algorithms in user interfaces. However, these explanations are predominantly created by technical experts and can prove unsuitable for non-technical groups (Miller et al., 2017; Edwards and Veale, 2017). As an extension to explanations, interactive approaches, such as sample review, feedback assignment, model inspection, and task overview, have been suggested in order to foster more meaningful participation in systems that use ML algorithms (Dudley and Kristensson, 2018). We see these interactive approaches as integral to the goal of making ML more accessible. In our paper, we focus on the model inspection facet of machine learning systems (cf. Figure 1). We present an interactive user interface for the ML back-end service ORES that supports developers in selecting a model configuration that meets their requirements.
2. Use Case: ORES
Only a few years after its inception, the number of active volunteers in Wikipedia grew exponentially. At the same time, this success led to increasing vandalism in Wikipedia. The English Wikipedia, for example, receives over 150 thousand new edits every day, which go live immediately and without verification. Wikipedians accept this risk of an open encyclopedia but work tirelessly to maintain quality. However, it is no longer possible to do so manually. Due to its ongoing growth, Wikipedia entered a phase of automation, and many quality control tools, such as ClueBot NG, have emerged. ClueBot NG pre-classifies edits in Wikipedia with a Bayesian classifier to reduce the percentage of false positives; an artificial neural network is then used to classify the detected vandalism. ClueBot NG generates a vandalism probability for each edit.
Developers who want to apply the ORES damaging prediction need to choose a confidence threshold that supports the work practices they are designing for. But inspecting the model and determining an appropriate choice for a specific purpose is not well supported for developers without ML experience. In the next section, we describe existing challenges that occur when employing ORES as a quality control system.
3. Human-Centered Optimization of Model Configuration
Halfaker et al. (Halfaker et al., 2018) describe the case of PatruBot from the Spanish Wikipedia. An editor developed PatruBot based on ORES to automatically revert damaging edits in the Spanish Wikipedia. However, soon after its launch, the Wikimedia Scoring Platform team received complaints from editors who did not understand why PatruBot had reverted their edits. Investigation showed that the bot reverted edits that merely passed a low threshold likelihood of being damaging. For a fully automated quality control process, the model needs to be optimized for high precision, i.e., only damaging edits are flagged, which results in a lower recall, i.e., some suspicious edits remain undetected. What we derive from this case is that even with knowledge about ORES, it is not straightforward for people to come up with a confidence threshold that meets their operational requirements (e.g., high precision at the cost of recall). The interplay of model fitness metrics and expectations requires interpretation on a case-by-case basis.
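To make this interplay concrete, consider the following minimal sketch. The scores and labels are invented sample data, and `precision_recall` is our own illustrative helper, not part of ORES; ORES returns comparable per-edit damaging probabilities from which such metrics are derived.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when edits scoring >= threshold are
    flagged as damaging. labels: True = edit is actually damaging."""
    tp = sum(1 for s, d in zip(scores, labels) if s >= threshold and d)
    fp = sum(1 for s, d in zip(scores, labels) if s >= threshold and not d)
    fn = sum(labels) - tp
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Invented scores for eight edits; the first four are actually damaging.
scores = [0.9, 0.8, 0.7, 0.45, 0.35, 0.3, 0.2, 0.1]
labels = [True, True, True, True, False, False, False, False]

# A low threshold (as PatruBot effectively used) catches all damage but
# also reverts good edits: precision ~0.67, recall 1.0.
print(precision_recall(scores, labels, 0.3))
# A high threshold flags only clear cases: precision 1.0, recall 0.75.
print(precision_recall(scores, labels, 0.7))
```

The same scored edits yield very different operational behavior depending solely on where the threshold is placed, which is exactly the interpretation burden that falls on tool developers.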
We see ORES as a particularly fruitful setting for developing interfaces that lower the barrier for non-technical community access to ML. Accordingly, we were motivated to prototype PreCall, an interactive visual interface to support non-technical experts in developing a mental model of the ORES classifier when selecting a suitable model configuration for their application. Previous research has shown that interactive visualizations enabling people to tweak ML systems help them to make more effective use of ML-services (Kapoor et al., 2010; Amershi et al., 2015). In our research, we build upon this line of research and seek to support people in finding optimal model configurations for ORES that meet their requirements, without having to understand how exactly the system works internally.
4. The PreCall Visual Interface Design
The visual interface aims to support the interpretation of different configurations of the damaging classifier expressed by model fitness metrics, and the confidence threshold that defines which score separates good from damaging edits. We designed two views covering the main tasks: a parameter view to inform a person about possible configurations of the damaging model, and a preview of the expected outcome of the classifier (Figure 3).
4.1. Parameter View
The first aim of the visual interface was to show the relationship of the three major fitness metrics of the ORES damaging model: recall, precision, and false-positive rate. In the GUI they are represented as three axes of a radar chart (Figure 3, top left). A person can vary any metric and the other two are updated instantly. The second aim was to demonstrate how the confidence threshold relates to the model metrics. A slider next to the radar chart represents the threshold which determines if an edit is declared as good or damaging (Figure 3, top right). A color gradient illustrates the fact that the transition from good to damaging edits is fluid, i.e., there is a range of uncertainty. Changing the threshold in the slider also immediately changes the values in the radar chart. This way, interaction facilitates the exploration of different thresholds and model metrics, as well as their interdependence.
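The coupling that the parameter view makes interactive can be sketched as follows: every threshold fixes all three metrics at once, so pinning one metric (e.g., a target recall) implicitly selects a threshold and thereby the other two. The function names and sample data below are our own illustrative assumptions; PreCall draws the underlying statistics from ORES’s test data.

```python
def metrics_at(scores, labels, threshold):
    """Recall, precision, and false-positive rate at a given threshold."""
    tp = fp = tn = fn = 0
    for s, damaging in zip(scores, labels):
        flagged = s >= threshold
        if flagged and damaging: tp += 1
        elif flagged: fp += 1
        elif damaging: fn += 1
        else: tn += 1
    recall = tp / (tp + fn) if tp + fn else 1.0
    precision = tp / (tp + fp) if tp + fp else 1.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return recall, precision, fpr

def threshold_for_recall(scores, labels, target):
    """Highest threshold that still reaches the target recall --
    the inverse lookup behind dragging a single radar-chart axis."""
    for t in sorted(set(scores), reverse=True):
        if metrics_at(scores, labels, t)[0] >= target:
            return t
    return 0.0

# Invented sample: five scored edits, three of them actually damaging.
scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]

print(threshold_for_recall(scores, labels, 1.0))  # 0.4
print(metrics_at(scores, labels, 0.4))            # (1.0, 0.75, 0.5)
```

Demanding full recall pushes the threshold down, which simultaneously lowers precision and raises the false-positive rate; this is the interdependence the linked radar chart and slider are meant to convey.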
4.2. Preview of Results
Another crucial goal of PreCall is to demonstrate how the outcome of the model changes with different configurations. The view on the bottom (Figure 3, bottom) shows the predicted outcome for the chosen configuration as stacked symbols. This view is designed to provide an intuitive representation of the expected result to let the user quickly grasp the number of elements belonging to the different groups: true negative, false positive, true positive, and false negative flags of edits. Color expresses how the algorithm tagged the edits: good (blue) and damaging (red). The shape of the elements represents their true state: good (circle) and actually damaging (triangle) edits. Compared to the common way of showing classification results in a confusion matrix (e.g., Kapoor et al., 2010), we hypothesize that this visualization provides a more intuitive representation of a classification outcome. Moreover, by also adapting instantly, this preview further strengthens PreCall’s interpretive support by describing the relationship between model configuration and expected output.
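The grouping behind this preview can be sketched as follows; the color/shape encoding mirrors the description above, while the helper names and sample data are invented for illustration.

```python
# (tagged_damaging, actually_damaging) -> outcome group and its visual glyph.
# Color encodes the algorithm's verdict (blue = tagged good, red = tagged
# damaging); shape encodes the true state (circle = good, triangle = damaging).
GROUPS = {
    (False, False): ("TN", "blue circle"),    # good edit, tagged good
    (True,  False): ("FP", "red circle"),     # good edit, tagged damaging
    (True,  True):  ("TP", "red triangle"),   # damaging edit, tagged damaging
    (False, True):  ("FN", "blue triangle"),  # damaging edit, tagged good
}

def preview_counts(scores, labels, threshold):
    """Count edits per outcome group for a given confidence threshold."""
    counts = {"TN": 0, "FP": 0, "TP": 0, "FN": 0}
    for score, damaging in zip(scores, labels):
        group, _glyph = GROUPS[(score >= threshold, damaging)]
        counts[group] += 1
    return counts

# Invented sample: one edit per group at threshold 0.5.
print(preview_counts([0.9, 0.6, 0.4, 0.1], [True, False, True, False], 0.5))
```

Rendering each counted edit as a colored, shaped symbol rather than a cell in a matrix is the design choice we hypothesize to be more intuitive.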
4.3. Determining a Suitable Model Configuration for Semi-Automated Edit Review
Based on the use case described above, the semi-automated review of edits, we demonstrate how PreCall can help find the optimal model configuration for a specific application:
We start with a threshold of 0.5, which results in a recall of , a precision of , and a false-positive rate of . With this threshold, the number of damaging edits falsely detected as good is still quite high (”2% wrongly detected as good”), as we have the same amount of correctly detected damaging edits.
In order to let the system find further damaging edits, we decrease the decision threshold to 0.3. The parameter view reveals that recall goes up () and precision goes down (). The fraction of “wrongly detected as good” edits goes down to 1%; however, there are still 12% of edits altogether that are (correctly and falsely) detected as damaging and would have to be reviewed manually.
Trying out other thresholds, we find a better choice: with a threshold of , the number of edits that are detected as damaging is minimized to 8% (with 6% wrongly and 2% correctly detected, see Figure 4). Given that 91% of edits are correctly detected as good, this is a better outcome for our purpose of reviewing a small number of uncertain edits among a large set of edits.
After gaining a better understanding of the model characteristics, we are satisfied with this payoff and decide to use the chosen configuration to check new data for damaging edits.
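The manual exploration above can also be expressed as a simple threshold sweep that reports, for each candidate, the review workload (share of edits flagged as damaging) and the residual risk (share of damaging edits missed). The scored sample below is invented, so the percentages do not correspond to the real ORES figures in the scenario.

```python
def sweep(scores, labels, thresholds):
    """For each candidate threshold, return (threshold, share of edits
    flagged for review, share of damaging edits missed)."""
    n = len(scores)
    results = []
    for t in thresholds:
        flagged_for_review = sum(1 for s in scores if s >= t)
        missed_damage = sum(1 for s, d in zip(scores, labels) if d and s < t)
        results.append((t, flagged_for_review / n, missed_damage / n))
    return results

# Invented sample: ten scored edits, three of them actually damaging.
scores = [0.95, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02]
labels = [True, True, False, True, False, False, False, False, False, False]

for t, workload, missed in sweep(scores, labels, [0.3, 0.5, 0.7]):
    print(f"threshold {t:.1f}: review {workload:.0%} of edits, "
          f"miss {missed:.0%} damage")
```

Each row corresponds to one step of the exploration: lowering the threshold misses less damage but inflates the review workload, and the configuration worth keeping is the one whose trade-off matches the intended work practice.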
This scenario shows how PreCall, with its integrated visual approach, is intended to support the configuration of the ORES damaging model. We hope we can show in planned user studies that PreCall helps people build a meaningful understanding of model metrics and confidence threshold, their relationship, and how they affect the possible outcome.
In this paper we described the context, the requirements, and the current design rationales of the work-in-progress development of PreCall. The main goal of the approach is to support editors in Wikimedia projects, i.e., non-technical experts, in arriving at a case-specific, meaningful interpretation when selecting a configuration of the ORES damaging model that fits their requirements. The current prototype serves as a demonstration of the concept and as a testing platform for the wider community.
To evaluate PreCall’s potential to support the interpretation of the ML model for specific case-by-case usage of ORES, we envision a qualitative user study with Wikipedia editors. A particular concern is the level of abstraction PreCall should provide, such as whether our inclusion of measures like precision and recall is interpretable for Wikipedia tool developers. Therefore, our study should also compare our approach to more abstract ones, such as the interactive confusion matrix proposed by Kapoor et al. (Kapoor et al., 2010). Another possible qualitative dimension to our studies is comparing the understanding gained by using PreCall as opposed to reading the officially supplied documentation for ORES parameters (e.g., https://www.mediawiki.org/wiki/ORES/Thresholds). If our approach turns out to be useful, a future goal would be to provide the Wikimedia community with an enhanced version of PreCall for long-term field studies. In this way, we hope to improve our understanding of how such visual interfaces can impact the acceptance and usage rate of ML systems in the community.
We see visual parameter selection support approaches like PreCall as valuable contributions to participatory use of machine learning systems. In this workshop we would like to discuss our strategy of facilitating better interpretability of machine learning systems, without necessarily pursuing the goal of making them entirely transparent. We are convinced that following this strategy, visual approaches have the potential to foster a better understanding of machine learning-based decision making.
- Wik ([n. d.]) [n. d.]. Wikimedia Scoring Platform Team. https://www.mediawiki.org/wiki/Wikimedia_Scoring_Platform_team. Accessed: 2019-02-07.
- Amershi et al. (2015) Saleema Amershi, Max Chickering, Steven M Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. 2015. Modeltracker: Redesigning performance analysis tools for machine learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. ACM, 337–346.
- Angwin et al. (2016) Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
- Dudley and Kristensson (2018) John Dudley and Per Ola Kristensson. 2018. A Review of User Interface Design for Interactive Machine Learning. (March 2018). https://doi.org/10.17863/CAM.21110
- Edwards and Veale (2017) Lilian Edwards and Michael Veale. 2017. Slave to the Algorithm? Why a ’Right to an Explanation’ Is Probably Not the Remedy You Are Looking For. SSRN Scholarly Paper ID 2972855. Social Science Research Network, Rochester, NY. https://papers.ssrn.com/abstract=2972855
- Halfaker et al. (2018) Aaron Halfaker, R. Stuart Geiger, Jonathan T. Morgan, Amir Sarabadani, and Adam Wight. 2018. ORES: Facilitating re-mediation of Wikipedia’s socio-technical problems. (2018).
- Kapoor et al. (2010) Ashish Kapoor, Bongshin Lee, Desney S Tan, and Eric Horvitz. 2010. Interactive optimization for steering machine classification. CHI (2010), 1343.
- Miller et al. (2017) Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of Inmates Running the Asylum Or: How I Learnt to Stop Worrying and Love the Social and Behavioural Sciences. arXiv:1712.00547 [cs] (Dec. 2017). http://arxiv.org/abs/1712.00547 arXiv: 1712.00547.
- Selbst and Powles (2018) Andrew Selbst and Julia Powles. 2018. “Meaningful Information” and the Right to Explanation. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR, New York, NY, USA, 48–48. http://proceedings.mlr.press/v81/selbst18a.html