Time flies in the domain of deep learning language modeling, indeed: The day this paper was submitted (August 13th, 2019) for internal review, NVIDIA published yet another, larger language model of the type used in this paper. The MegatronLM (apart from taking a bite out of the pun in my article's title) is currently the largest language model based on the transformer architecture (https://nvidianews.nvidia.com/news/nvidia-achieves-breakthroughs-in-language-understanding-to-enable-real-time-conversational-ai). This latest neural network language model has more than 8 billion parameters, an incomprehensible number compared to the type of neural networks we used only two decades ago.
At that time, in the winter semester 1999-2000, I taught classes about artificial neural networks (NNs; e.g., Rosenblatt, 1958). Back then, Artificial Intelligence (AI) had already entered what was referred to as the AI winter, as most network sizes were limited to rather small architectures unless supercomputers were employed. On the smaller machines that were available to most researchers, only rather limited versions of these NNs could be trained and used, so successful applications were rare, even though one of the key contributions that enabled deep learning and a renaissance of NN-based AI, the Long Short-Term Memory (LSTM) design (Hochreiter & Schmidhuber, 1997), was made in those years. In 2017, I started looking into neural networks again because I wanted to learn how to program graphical processing units (GPUs) for computational statistics and high-performance computing (HPC) used in estimating psychometric models (von Davier, 2016). This finally led me to write a paper on using deep neural networks for automated item generation (von Davier, 2018), a field that has seen many different attempts, most of which were only partially successful, involved a lot of human preparation, and ended up being more or less fill-in-the-blanks approaches such as we see, in simple form, in Mad Libs books for learners.
While I was able to generate something that resembled human-written personality items, using a public database that contains some 3,000 of them, and while several of the (cherry-picked) generated items sounded and functioned a lot like those found in personality inventories (e.g., Goldberg, 1999; Goldberg et al., 2006), I was somewhat skeptical whether one would be able to properly train neural networks for this task, given that it would require a very large number of items, and I assumed that each network for that purpose would need to be trained solely on items of the form it is supposed to generate. Part of my concern was that the generated items had to be hand-picked, as many of the generated character or word sequences turned out not to be properly formed statements. However, those that were selected for an empirical comparison with human-coded items were found to show the same dimensionality (von Davier, 2018) and hence to be fully useful as replacements for human-authored items. Nevertheless, some doubt remained due to the needed hand-picking and the limited supply of training material; after all, AI and neural networks have a long history (e.g., Rosenblatt, 1958; also MIT's 1961 TV program "The Thinking Machine", https://www.youtube.com/watch?v=5YBIrc-6G-0) and have repeatedly been hyped as the next big thing that may soon replace humans and take our jobs.
As mentioned, the items I generated using deep learning (von Davier, 2018) passed empirical evaluations and hence functioned much like the human-written items in an online data collection. However, many of the generated items were either not the properly formed statements typical for this domain or, if the network was trained too long on too little data, almost exact copies of what was entered as training material. Therefore, I concluded I would need a lot more data, or an unforeseen qualitative jump in deep learning that I expected to be years away. I was wrong: it turns out that time indeed flies, and the field of deep learning did not rest. While in the paper published in 2018 I stated that operational use could be years away, I am not so sure anymore that we have to wait that long (so am I allowing the recent deep-learning hype to change my views?).
It may well be that we will see automated item generation based on deep learning systems as soon as 2021 in tools that support item writers in developing test questions for high-stakes exams, and that deep neural networks will be used to generate questions or distractors for multiple-choice questions used in test preparation and practice exams much sooner. The reason I believe this has to do with a graduate student who developed a software tool for programmers based on a product that was released by OpenAI (Radford et al., 2018; https://github.com/openai/gpt-2). The software that supposedly makes programmers' lives so much better is called TabNine (https://tabnine.com/), and it provides context-sensitive (intelligent?) auto-completion based on indexed source code files. The author of the software estimates that TabNine will save programmers at least 1 second per minute by suggesting how lines of program code are completed, or what the most likely next line of code may be, based on the code that the programmer provides and that the software uses to improve a predictive model.
The title of this article is a reference to two relevant lines of inquiry. One is an article with the title "Doctor A.I." (Choi et al., 2017), which described a deep learning approach using generative adversarial networks (GANs) to generate electronic health records (EHRs) that can pass as plausible EHRs; the other is the recently ignited race around language models that use a specific neural network structure called 'Transformer', which was an obvious trigger for many references to the sci-fi toys and movies. The remainder of this article is structured as follows: The next section introduces language models based on approaches that can be used to generate the probability of a next word or language token using information about a previously observed sequence of words. The following section outlines potential areas of application and shows select examples of how NN-based language models could be utilized in medical licensure and other assessment domains for AIG.
2 Neural Networks as Language Models
The basis of these predictive approaches is sequential models that provide the probability of the next word (or other language token, such as a full stop, newline, etc.) given a number of previous words. These models are not new; I recall that my first encounter with this type of model was an article in 'Scientific American' before 1985, when I was still a high school student and part-time programmer working for a small educational gaming company located in Northern Germany (yes, game-based learning existed back then). This 1980s version actually goes back to the seminal paper by Shannon (1948) and constitutes a primitive language model. This simple model did of course not have the many layers and the complex network architecture of the deep learning applications that are nowadays used for machine translation, picture annotation, or automated item generation (von Davier, 2018), but was rather based on a single layer that connected an input word (previous encounter) to an output word (next encounter). Technically, the basis of this model was a transition matrix, with input (previous) and output (next) words coded as binary vectors, and the model basically implemented the Markov assumption as a model for natural language.
2.1 Markov Models
The model mentioned above is a simple language model that can be viewed as a direct translation of the Markov assumption for modeling a sequence of words $w_1, w_2, \ldots$ with index $t = 1, 2, \ldots$ and $w_t \in V$. Here, $V$ is a finite set of words, the vocabulary of a language, and $|V|$ denotes the size of the vocabulary. Let $\iota: \{1, \ldots, |V|\} \rightarrow V$ be an index, i.e., a bijective function that maps integers to words. That is, we can obtain the integer that represents a word $w$ by applying $\iota^{-1}(w)$, and the associated word can be retrieved from any integer $k \in \{1, \ldots, |V|\}$ through $\iota(k)$.
In this most simple case of a language model we assume that
$$P(w_{t+1} \mid w_1, \ldots, w_t) = P(w_{t+1} \mid w_t)$$
for any $t \geq 1$, namely that the probability of observing a next word at position $t+1$ of the sequence depends only on the last observed word $w_t$, and nothing else. The whole sequence preceding the last observed word is ignored in this model. Then we assume homogeneity of the transitions, i.e., $P(w_{t+1} = b \mid w_t = a) = P(w_{s+1} = b \mid w_s = a)$ whenever $a, b \in V$ and $s, t \geq 1$. With this we can define
$$T_{ij} = P\left(w_{t+1} = \iota(j) \mid w_t = \iota(i)\right),$$
which is a transition matrix that provides a conditional probability distribution over next words for any $i \in \{1, \ldots, |V|\}$. If there are no constraints, this transition matrix has $|V| \times (|V|-1)$ free parameters, i.e., roughly the square of the vocabulary size. The parameters can be obtained by estimating simple sample statistics, or by some other methods (reference).
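As a concrete illustration, such a transition-matrix model can be estimated from simple counts in a few lines of Python; the toy corpus below is invented purely for illustration and is not data from this study.

```python
from collections import defaultdict

def train_bigram_model(tokens):
    """Estimate the transition probabilities P(next word | previous word)
    from simple sample statistics (relative frequencies of word pairs)."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        # Each row of the transition matrix is a conditional distribution.
        model[prev] = {word: c / total for word, c in nxts.items()}
    return model

tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram_model(tokens)
# After "the", the model assigns probability 2/3 to "cat" and 1/3 to "mat".
```

Sampling from such a model simply means drawing the next word from the row of the transition matrix that corresponds to the current word.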
A more complex language model would take into account more than one previous word. This can be implemented as follows. In order to take the previous $n$ words into account, define
$$c_t = (w_{t-n+1}, \ldots, w_t),$$
which is an $n$-gram of length $n$. Then assume for $t \geq n$ that
$$P(w_{t+1} \mid w_1, \ldots, w_t) = P(w_{t+1} \mid c_t).$$
While this is a perfectly sound definition, it has practical implications that may make applications impossible as soon as the vocabulary contains more than a few handfuls of words and the length of the conditioning sequence, $n$, grows larger than, say, 3. The issue is that the mini-sequence $c_t$ is an element of $V^n$, a much larger set with $|V|^n$ elements. For a vocabulary of only 100 words and three-word sequences, there are already $100^3 = 1{,}000{,}000$ different elements.
For a transition matrix that contains all conditional probabilities of the next word given the previous three, we would need to train, estimate, or otherwise obtain $|V|^{n+1} = 100^4 = 100{,}000{,}000$ probabilities. Therefore, traditional approaches that construct such a large transition matrix directly have mostly not been pursued, as this would require very large amounts of data.
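The combinatorial growth can be checked with a line of arithmetic; the vocabulary size of 100 and context length of 3 are the toy figures from the text.

```python
V = 100  # toy vocabulary size from the example above
n = 3    # number of previous words used as conditioning context

contexts = V ** n             # number of distinct three-word contexts
table_entries = V ** (n + 1)  # probabilities in the full transition table

print(contexts)       # prints 1000000
print(table_entries)  # prints 100000000
```

For a realistic vocabulary of tens of thousands of words, the same calculation yields table sizes far beyond anything that could be estimated from data, which is exactly why the neural network approximations discussed next are attractive.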
One way of circumventing the need for classical statistical estimation methods, and of sidestepping some of their more rigorous requirements, is to use NNs for the purpose of language modeling. NNs have been shown to be universal function approximators (e.g., Hornik, 1991; Hanin, 2017). This means that an NN with a proper design can be used to plug in an estimate of a function that is otherwise hard to calculate, or hard to specify based on more traditional approximation or estimation methods. This advantage is paid for by having only vague knowledge about the actual form of the function that is being approximated, as NNs operate as black boxes and do not easily reveal how the approximation is achieved.
In order to further reduce demands, one can model the sequence of characters rather than words, as natural languages often contain many thousands of words, while alphabetic languages can be expressed using a much smaller character set. Therefore, an alternative to word-based language models using neural networks can be implemented as a character-based language model. A few years ago, Google released TensorFlow (Abadi et al., 2015), a powerful software toolbox to design, train, and sample from neural networks. This triggered implementations of a variety of deep learning approaches using this new tool, among them a character-based deep recurrent neural network (Char-RNN; e.g., Ozair, 2016) and, more recently, other architectures that will be described below. Obviously, there are many more tools for deep learning, and the models released for further analyses and fine-tuning, as done in the current study, are typically available in more than one framework. Wikipedia provides a list of neural-network-oriented toolkits.
2.3 Attention is all you need
Recent language models introduced the concept of attention as a structure within the neural network architecture aimed at keeping certain concepts more salient. This was initially implemented in addition to the recurrent structures of deep learning models designed for sequence-to-sequence and language modeling. However, a recent article (Vaswani et al., 2017; https://arxiv.org/pdf/1706.03762.pdf) proposed an alternative, much simpler structure in which the context and the attention mechanism replace the sequential structures of RNNs. The title of that article is mirrored in the subsection title, and the article led to multiple language models published in short succession, one of which was recently released by OpenAI and forms the basis of the retrained/fine-tuned model presented in the current study.
In the article, Vaswani et al. (2017) describe the new network structure, the 'transformer', as consisting only of an encoder with multi-headed attention, which provides a distribution of most likely language tokens given a context of a certain length (say, 1024 words and information about their positions), mirrored by a similar structure on the decoder side. Psychoanalysts would probably say that transformers conduct some form of free association. Interestingly, the attention architecture used in transformer-based models is simpler than what was previously deemed necessary in language models based on recurrent neural networks, such as the one used in von Davier (2018). This simpler structure allows much faster training, as the transformer architecture permits parallel processing by simultaneously using word and position encodings rather than encoding the text sequentially. The drawback is that (currently) only limited lengths of text can be encoded, as the parallel processing requires the sequence to be encoded (input), as well as the output, to be present as a whole (for example, sentence by sentence) rather than word by word.
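As a rough sketch (not the actual GPT-2 implementation), the scaled dot-product attention that the transformer stacks in multiple 'heads' can be written in plain NumPy; the small random matrices below stand in for learned query, key, and value projections and are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 token positions, embedding dimension 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w is a probability distribution over the 4 positions,
# i.e., how much each position "attends" to every other position.
```

Because every position is compared with every other position in one matrix product, all positions can be processed in parallel, which is the source of the training speedup described above.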
2.4 Reincarnations of the Transformers: GPT-2, Transformer-XL, Grover, now MegatronLM
The GPT-2 model was trained by a team of researchers at OpenAI (https://openai.com/blog/better-language-models/) using four different levels of complexity of the transformer architecture. In an unprecedented move, OpenAI only released the two smallest models, which comprise network weights amounting to 117 million and 345 million parameters, respectively. The larger models, which contain up to 1.4 billion (!) parameters, were not published due to concerns about malicious use cases. However, this number was recently topped by NVIDIA, which published the MegatronLM model that includes more than 8 billion parameters and made the code available on GitHub (https://github.com/NVIDIA/Megatron-LM). For the GPT-2 model, the OpenAI website states:
"Due to our concerns about malicious applications of the technology, we are not releasing the trained model. As an experiment in responsible disclosure, we are instead releasing a much smaller model for researchers to experiment with, as well as a technical paper.",
All GPT-2 models were trained on what OpenAI called WebText, a 40GB database of text scraped from the WWW, excluding Wikipedia, as OpenAI researchers assumed that Wikipedia might be used by secondary analysts to retrain/fine-tune for specific topics. As the full model is not available, the actual performance of the GPT-2 transformer model cannot be verified independently, and other researchers can only use and modify (retrain) the smaller models. The examples presented in this paper are based on experiments with the medium-sized model that contains 345M parameters.
There are several other transformer-based language models that are currently under active development and are being made available to researchers for fine-tuning and adaptation to different applications. Among these are the Transformer-XL (Dai et al., 2019), Grover (Zellers et al., 2019), and most recently MegatronLM (NVIDIA, August 13th, 2019; https://github.com/NVIDIA/Megatron-LM). While the NVIDIA model used a corpus called WebText that contains 40GB of data and was modeled after the corpus used by OpenAI, Grover was trained on 46GB of real news and can be used to either generate or detect fake news.
This ability to both detect and generate is based on the fact that all of these approaches can be viewed as probabilistic models that predict a sequence of new words (fake news, a translation, next poem lines, the next syntax line in a software program) based on the previous sentence(s) or lines of code. More formally, we can calculate the loss function
$$L(w_1, \ldots, w_T \mid c_0) = -\frac{1}{T} \sum_{t=1}^{T} \log \hat{P}(w_t \mid c_{t-1}),$$
where $\hat{P}(w_t \mid c_{t-1})$ is the estimated distribution of word $w_t$ given context (history) $c_{t-1}$. This is an estimate of the cross entropy, or logarithmic entropy (Shannon, 1948), of the observed sequence given some initial context $c_0$. This quantity can be used to evaluate generated sequences relative to the loss based on true (human-generated) sequences in order to help distinguish them. The cross entropy is a measure of how well predicted (in terms of expected log-likelihood, e.g., Gilula & Haberman, 1994) an observed sequence is if a certain model is assumed to hold. This loss function is also used during training or fine-tuning in order to evaluate how well the network predicts new batches of data that are submitted to the training algorithm.
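This loss can be computed directly once the model's probability for each observed word is available; the probability values below are invented purely for illustration and do not come from any actual model.

```python
import math

def cross_entropy(probs):
    """Average negative log-probability the model assigned to each observed
    word given its context: an estimate of the cross entropy of the sequence."""
    return -sum(math.log(p) for p in probs) / len(probs)

# Model-assigned probabilities for each next word in two short sequences
# (illustrative values): one plausible under the model, one implausible.
plausible = [0.30, 0.25, 0.40, 0.20]
implausible = [0.01, 0.02, 0.01, 0.05]

# The plausible sequence has the lower loss (about 1.28 versus about 4.03),
# which is how generated and true sequences can be compared.
```

A detector such as Grover exploits exactly this contrast: sequences whose loss under the model is suspiciously low (or high) relative to human-written text can be flagged as machine-generated.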
It is worth mentioning that while all of these are variations on a theme, the transformer architecture for language modeling has shown great potential in improving over previous designs in terms of performance on a number of tasks (e.g. https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html). In terms of the use for generating test questions, Grover (Zellers et al. 2019) may prove useful in future applications as it was designed to produce and detect fake news by using 46GB worth of data based on actual news scraped from the internet. Retraining Grover with targeted assessment materials around a content domain is one of the future directions to take for applied research into automated item generation using NN-based language models.
3 Areas of Applications
The applications of deep learning with recurrent as well as convolutional neural networks range from computer vision and picture annotation to summarizing, text generation, question answering, and generating new instances of trained material. In some sense, RNNs can be viewed as the imputation model of deep learning. One example of a medical application is medGAN (https://github.com/mp2893/medgan), a generative adversarial network (reference) that can be trained on a public database of EHRs and then be used to generate new, synthetic health records. However, medGAN can also be considered an 'old style' approach, just like the approach I used for generating personality items (von Davier, 2018), as medGAN was not based on a pre-trained network that already includes a large body of material to give it general capabilities that would be fine-tuned later.
The latest generation of language models, as represented by GPT-2, are pre-trained on large amounts of material that is available online. GPT-2 was trained on 40GB of text collected from the internet, excluding Wikipedia, as it was considered that some researchers may want to use this resource to retrain the base GPT-2 model. These types of language models are considered multitask learners by their creators, i.e., they claim these models are systems that can be trained to perform a number of different language-related tasks such as summarization, question answering, and translation (e.g., Radford et al., 2018). This means that a trained model can be used as the basis for further targeted improvement, and that the rudimentary capabilities already trained into the model can be improved by presenting further task-specific material.
3.1 AI based AIG trained on Workstations with Gaming GPUs
While this should not distract from the aim of the article, it is important to note that some considerations have to be made with respect to how and where calculations will be conducted. Software tools used for deep learning are free (Abadi et al., 2015), and preconfigured servers and cloud services exist that facilitate the use of these tools (reference). At the same time, there are significant costs involved, and in particular researchers who develop new models and approaches may need many times more time and resources than standard applications used to analyze data. The dilemma is that while most tools for training deep learning systems are made freely available, these tools are worthless without powerful computers. And pointing to the cloud is not helpful, as the cloud is "just someone else's computer" (as geek merchandise on Amazon proves: https://www.amazon.com/s?k=Geek+Tshirt+There+Is+No+Cloud+Computer+Science+Meme+T+Shirt&ref=nb_sb_noss): high-performance hardware and algorithms that employ parallelism are needed to train these kinds of networks, either in the form of hardware on site, in a data center, or rented through the cloud. The training of RNNs as well as transformer-based language models takes many hours of GPU time, which comes at significant cost if the cloud is used. For recent language models of the type of GPT-2 large (1.4B parameters), Grover-Mega, or XLNet, the estimated costs were around $30K-$245K (XLNet) and $25K (Grover-Mega).
Obviously, cloud computing services come at a cost, and while new preconfigured systems pop up daily and prices will decrease due to reduced hardware cost and competition, any more involved project that requires training specialized systems, or re-training existing large models, will incur significant costs as well. The model used in the current paper was pretrained on several TPUs (specialized Google hardware for tensor computations) for over a week, and retraining as well as fine-tuning takes weeks of GPU time in order to produce a system that is useful for a specific purpose. Therefore, building or purchasing a deep learning computer is one of the options that should be carefully considered, alongside the use of cloud computing or on-demand GPU time such as Vast.AI. Nowadays, even modest hardware such as gaming desktops can be utilized, as most of these contain powerful GPUs for graphical processing, which can be turned into thousands of processing units through toolkits such as CUDA, provided by the makers of these graphics cards (e.g., Chevitarese et al., 2012).
The hardware needed for training large NNs can be found at specialized vendors (e.g., https://lambdalabs.com), who often also provide turnkey solutions such as operating system images that include all the common machine learning toolkits such as Keras, TensorFlow, PyTorch, and others. An alternative is to DIY, using the many web resources that describe which workstations can be obtained cheaply and how many of the essential graphical processing units (GPUs) can be housed, with or without modifications. In addition, there are free web resources, for example Google Colab (https://colab.research.google.com/notebooks/welcome.ipynb), which is essentially a Jupyter notebook that anyone with a Google account can use for deep learning and machine learning experiments (free for short-term use), or time-shared on-demand GPU services such as Vast.AI (https://www.vast.ai), which can be used for a fee.
Without further digression, we now turn to how these systems, either purchased as fully configured turnkey solutions or put together from used parts, can be utilized to produce text that, to a much greater extent than imaginable only two years ago, can facilitate the automated generation of assessment materials, including the generation of electronic health records, the production of suggestions for distractor choices in multiple-choice items, and the drafting of patient vignettes based on prompts provided by item writers.
3.2 Electronic Health Records and Deep Learning
The fact that medicine uses IT for storing and managing patient data meant that computer scientists were needed and hired to work on systems for this purpose. At the same time, data on patients, as stored in electronic health records (EHRs), is highly sensitive, so developers working in this area looked for ways to use databases that would not directly reflect anyone's real data. One way was to use the same data, carefully anonymized so that individuals cannot be identified. A second approach was to generate health data of non-existent patients using the regularities found in real health data.
This was the birth of synthetic EHRs, either in the form of expert-generated models (Synthea: Walonoski et al. 2018) or in the form of deep learning based models that either predict the next clinical event or generate (plausible) synthetic EHRs based on a training data set of real EHRs (Dr. AI, Choi et al. 2017; MedGAN, Choi et al. 2017). These models can be used to generate data that can be considered a limitless resource, and they are anonymous by design, so that concerns about privacy and data protection are alleviated when using these synthetic records in design and analysis work.
A recent systematic review (Xiao et al. 2018) describes 98 studies that use deep learning in conjunction with EHRs. The studies range from generating synthetic EHRs to enable users to experiment with these data without any privacy issues to building predictive models that are used to forecast the next clinical encounter based on existing EHR history. EHRs are an important source of information, and can be used to look at systematic differences in trajectories between patient groups, as well as how different treatments play out by comparing the prevalence of subsequent encounters.
3.3 Distractor Generation through Question Answering
The utility of the GPT-2 language model is currently being explored by means of retraining for specific purposes. One application that was mentioned in the introduction is the TabNine software, which predicts line completions and programming-code continuations. For that purpose, each user provides data in the form of their own project-related code, which can further improve prediction.
The ability to generate distractors for existing multiple-choice items is already present even in the original, non-retrained 345M GPT-2 model. The training material contained a large number of cases in which a question was followed by the prompt "A:" and an answer. By including this type of training material, the model was enabled to react to a prompt that ends in "A:" with a continuation of the text that in many cases represents a meaningful response to a question contained in the prompt. Here, we show two examples that were generated using a local copy of GPT-2 on the author's workstation, equipped with a GPU and TensorFlow (Abadi et al., 2015), which facilitates sampling from the language model.
The GPT-2 model was trained with material that includes text of the structure "Q: What is X? A: X is a Y.", i.e., the model is prompted to associate a sequence of words bracketed by "Q:" and "A:" with a question that requires an answer. The next figure shows an example of output generated using the 345M model. Note that these examples are far from perfect, but they could serve as inspiration for human item writers. The first example was generated without any retraining, using the downloadable version of the 345M GPT-2 model.
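A sketch of the final sampling step in such a setup: after the model scores every vocabulary entry as a possible continuation of a prompt ending in "A:", one common scheme (the top-k truncation used when sampling from GPT-2) restricts sampling to the k highest-scoring tokens. The toy vocabulary and logits below are invented for illustration and do not come from GPT-2.

```python
import numpy as np

def sample_top_k(logits, k, rng):
    """Sample one token index from the k highest-scoring entries of `logits`,
    applying a softmax over only those k candidates (top-k truncation)."""
    top = np.argsort(logits)[-k:]               # indices of the k best tokens
    z = np.exp(logits[top] - logits[top].max()) # numerically stable softmax
    return int(rng.choice(top, p=z / z.sum()))

# Toy next-token scores for a prompt ending in "A:" (illustrative values).
vocab = ["nausea", "headache", "fatigue", "rash", "the", "of"]
logits = np.array([2.1, 1.9, 1.7, 1.5, 0.2, 0.1])

rng = np.random.default_rng(0)
token = vocab[sample_top_k(logits, k=4, rng=rng)]
# `token` is always one of the four highest-scoring candidates.
```

Repeating this step token by token, each time feeding the chosen token back into the model, is what produces the answer continuations shown in the examples below.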
It is clear that not all of the listed side effects are ones patients may actually experience; however, there is some overlap with side effects commonly listed in online resources, and some others may be 'plausible enough' to potentially serve as wrong options in a multiple-choice test. The next example asks about common symptoms of IBS; the selection of responses was not cherry-picked, and of the two sets of four answers, most are on topic.
It is important to note that the responses are based on a general language model that has not been trained specifically to answer questions about medical content. This model is, on top of that, the second smallest of the GPT-2 models and contains only(?) 345 million parameters, while the larger variants contain much more complex model layers and approximately 1.4 billion parameters (Radford et al., 2018). Again, note that these responses, which could potentially be used as distractor suggestions, were generated without any retraining on specifically medical assessment materials.
3.4 Automatic Item Generation
The tests reported in this section are based on the GPT-2 (345M) pre-trained language model and roughly 800,000 open access subset articles from the PubMed collection (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/) used for retraining. The data was encoded using the gpt-2 toolbox (https://github.com/nshepperd/gpt-2) for accessing the vocabulary used for pretraining and for fine-tuning GPT-2 with TensorFlow. The 800,000 articles roughly equate to 8GB worth of text from a variety of scientific journals that provide open access to some or all of their articles. Training took 6 days on a Dell T7610 equipped with 128GB RAM, two 10-core Intel Xeon processors, and two Nvidia 1080 Ti GPUs, using CUDA 10.0, TensorFlow 1.14, and Python 3.6.8, running Ubuntu 18.04 LTS. It was necessary to use the memory-efficient gradient-storing options (Chen et al., 2016; Gruslys et al., 2016; also https://github.com/cybertronai/gradient-checkpointing), as without them the size of the data structures for the 345M model used in the retraining exceeded the 11GB memory of the GPUs.
The amount of training data available through open access papers that can be downloaded from PubMed repositories is quite impressive: The number of OA articles exceeds 800,000, and the compressed pre-processed database used for retraining in this study exceeds 8GB. However, free medical texts are available in abundance; a 2011 survey (Singh et al., 2011) lists many resources. Language models for data of this size could not be processed on customary hardware only a few years ago, while nowadays (with a few tricks) even the medium-size (345 million parameters) GPT-2 model can be retrained on decent gaming GPUs.
Incidentally, during the 6 days of training there is some downtime, which allowed me to find a recent arXiv submission on automated patent application generation using GPT-2 pre-trained with (public, granted) patent applications available online (https://arxiv.org/pdf/1907.02052.pdf). Other applications include the syntax completion software TabNine described in the introduction, as well as experiments aimed at the automatic generation of poems (https://www.gwern.net/GPT-2). The authors of the GPT-2 patent retraining study used Google Colab, a free online deep learning platform that allows users access to GPUs for NN training for up to 12 hours. This is insufficient for the 8GB of PubMed data to be fully retrained on GPT-2 medium, so the author of this paper resorted to upgrading and using a dual-GPU workstation.
The following table shows exemplary results after two days of retraining with the 800,000-article PubMed database; other publicly available medical text and article databases are listed by Singh et al. (2011). While the results are encouraging, the raw output of the network certainly cannot be used as is, and these early examples would not pass as real items. However, with some editing and human expert input, the raw output could serve as inspiration for authoring clinical vignettes. Results should come closer to human-authored item stems once a transformer is trained on a large number of real medical licensure and certification item stems and distractors, and as larger pre-trained transformer models are published. The quality of the text is already such that it can be assumed a larger transformer model, trained on real medical licensure items, would produce source material that medical experts could use as prompts to facilitate authoring expert-generated items.
A second example uses two sentences as a prompt, which provides a bit more context for the 'attention' circuits of the transformer network. The network 'checkpoint' used here is the retrained GPT-2 after 200,000 training steps on the 800,000-article PubMed open access database.
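To make the sampling step concrete: once a checkpoint is loaded, each generated token is drawn from the model's output distribution after temperature scaling and top-k truncation. The standard-library sketch below illustrates that decoding step in isolation; the function name, defaults, and toy logits are assumptions made for this illustration, not code from the gpt-2 toolbox.

```python
import math
import random

def top_k_sample(logits, k=40, temperature=1.0, rng=random):
    """Sample one token index after temperature scaling and top-k truncation."""
    scaled = [l / temperature for l in logits]
    # Keep only the k highest-scoring tokens (ties at the cutoff survive).
    cutoff = sorted(scaled, reverse=True)[min(k, len(scaled)) - 1]
    filtered = [s if s >= cutoff else float("-inf") for s in scaled]
    # Softmax over the surviving tokens (exp(-inf) contributes 0).
    m = max(filtered)
    exps = [math.exp(s - m) for s in filtered]
    total = sum(exps)
    # Inverse-CDF sampling from the truncated distribution.
    r, acc = rng.random() * total, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1
```

Lower temperatures sharpen the distribution toward the most likely token, and k = 1 degenerates to greedy decoding; the gpt-2 sampling scripts expose similar knobs as temperature and top-k arguments.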
The point to be made here is that the existing network architecture can be used for question answering and, to a limited extent, as 'inspiration' for human test developers, who could enter ideas as prompts and have the neural network generate candidate text. Similar current applications have retrained the GPT-2 model on openly available patent texts, poems, and source code files. It appears plausible that further fine-tuning with targeted assessment material, for example all available items in a subject domain such as cardiology, would improve the results dramatically. It is not claimed that the current system is fully useful as is, but the quality of text produced by currently available transformer architectures makes it rather likely that correctly formed item stems can be produced by deep-learning-based language models in the very near future.
4 Summary and Outlook
We are at the frontier of AI entering many domains of daily life. While phone makers contribute to the hype and advertise the next generation of smartphones as running neural networks, there are industrial domains in which these applications are essential: computer vision and assisted driving, recommender systems for e-commerce, and applications trained to detect deep fakes, that is, video material made by a machine and programmed with malicious intent to fool humans. There are also many applications that support human creativity in more benign ways, such as gauGAN (http://nvidia-research-mingyuliu.com/gaugan), a tool that helps illustrators compose landscapes with only a few clicks, and AI-based tools that support wellness (https://www.qualcomm.com/news/onq/2019/07/11/ai-your-supportive-wellness-companion), using the same technologies to analyze health data that are used to predict what music one may like based on past purchase and download behavior.
The prospects of this technology become really exciting when looking at how these pre-trained models could be deployed. Efforts are underway to develop toolkits that run language models, currently GPT-2 and BERT, another transformer-based language model developed by Google (Devlin et al., 2018), on iOS devices (https://github.com/huggingface/swift-coreml-transformers). These would not train the networks on phones, but would use an already trained network to generate new text from a context sentence or paragraph entered by a user. For automated item generation, apps could use this on-device language generation to support item developers writing new content on their mobile devices. Once pre-trained models for medical specialties are available, it would be straightforward to develop a tool in which medical experts enter a draft vignette or even a few keywords that the app wraps into a draft case description, which the human expert can then finalize and submit for further editing by item writers at the testing agency that assembles, administers, and scores the certification tests. There, the newly developed case vignette could be finalized using yet another set of machine learning tools to generate correct and incorrect response options, which are either used in multiple-choice formats or for training an automated scoring system for short constructed responses.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jozefowicz, R., Jia, Y., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Schuster, M., Monga, R., Moore, S., Murray, D., Olah, C., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org (Google Research). https://arxiv.org/pdf/1603.04467.pdf
Chen, T., Xu, B., Zhang, C. & Guestrin, C., (2016). Training deep nets with sublinear memory cost. arXiv preprint. arXiv:1604.06174
Chevitarese, D. S., Szwarcman, D., & Vellasco, M. (2012). Speeding up the training of neural networks with CUDA technology. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), Artificial Intelligence and Soft Computing. ICAISC 2012. Lecture Notes in Computer Science, vol. 7267. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-29347-4_4
Choi, E., Bahadori, M.T., Schuetz, A., Stewart, W.F., Sun, J. (2017) Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. arXiv preprint arXiv:1511.05942
Choi, E., Schuetz, A., Stewart, W.F., Sun, J. (2017). Medical Concept Representation Learning from Electronic Health Records and its Application on Heart Failure Prediction. arXiv preprint arXiv:1602.03686
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Google AI Language. https://arxiv.org/pdf/1810.04805.pdf
Gilula, Z., & Haberman, S. J. (1994). Models for analyzing categorical panel data. Journal of the American Statistical Association, 89, 645–656.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality Psychology in Europe, Vol. 7 (pp. 7-28). Tilburg, The Netherlands: Tilburg University Press.
Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. C. (2006). The International Personality Item Pool and the future of public-domain personality measures. Journal of Research in Personality, 40, 84-96.
Gruslys, A., Munos, R., Danihelka, I., Lanctot, M., & Graves, A. (2016). Memory-efficient backpropagation through time. In Advances in Neural Information Processing Systems 29 (NIPS 2016).
Hanin, B. (2017). Universal function approximation by deep neural nets with bounded width and ReLU activations. arXiv preprint arXiv:1708.02691
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. doi:10.1016/0893-6080(91)90009-T
Ozair, S. (2016). char-rnn for Tensorflow. https://github.com/sherjilozair/char-rnn-tensorflow
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018). Language models are unsupervised multitask learners. Technical report, OpenAI. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, Vol 65(6), Nov, 386-408.
Roy, Y. (2019). Deep learning-based electroencephalography analysis: a systematic review. J. Neural Eng. in press https://doi.org/10.1088/1741-2552/ab260c
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
Shannon, C.E. (1948) A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423, 623-656. http://dx.doi.org/10.1002/j.1538-7305.1948.tb01338.
Singh, A., Singh, M., Singh, A. K., Singh, D., Singh, P., & Sharma, A. (2011). "Free full text articles": where to search for them?. International Journal of Trichology, 3(2), 75–79. doi:10.4103/0974-7753.90803
von Davier, M. (2016). High-Performance Psychometrics: The Parallel-E Parallel-M Algorithm for Generalized Latent Variable Models. ETS Research Report Series, 2016: 1–11. doi:10.1002/ets2.12120
von Davier, M. (2018). Automated Item Generation with Recurrent Neural Networks. Psychometrika. 2018 Mar 12. doi: 10.1007/s11336-018-9608-y
Walonoski, J., Kramer, M., Nichols, J., Quina, A., Moesel, C., Hall, D., Duffett, C., Dube, K., Gallagher, T., McLachlan, S. (2018). Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record, Journal of the American Medical Informatics Association, Volume 25, Issue 3, March 2018, Pages 230–238, https://doi.org/10.1093/jamia/ocx079
Xiao,C., Choi, E., Sun, J. (2018). Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association, 25(10), 1419–1428. doi: 10.1093/jamia/ocy068
Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., & Choi, Y. (2019). Defending against neural fake news. arXiv preprint arXiv:1905.12616