
Pseudo-OOD training for robust language models

While pre-trained large-scale deep models have garnered attention as an important topic for many downstream natural language processing (NLP) tasks, such models often make unreliable predictions on out-of-distribution (OOD) inputs. As such, OOD detection is a key component of a reliable machine-learning model for any industry-scale application. Common approaches often assume access to additional OOD samples during the training stage; however, the outlier distribution is often unknown in advance. Instead, we propose a post hoc framework called POORE - POsthoc pseudo-Ood REgularization - that generates pseudo-OOD samples using in-distribution (IND) data. The model is fine-tuned by introducing a new regularization loss that separates the embeddings of IND and OOD data, which leads to significant gains on the OOD prediction task during testing. We extensively evaluate our framework on three real-world dialogue systems, achieving a new state-of-the-art in OOD detection.





1 Introduction

The authors contributed equally to this work

Detecting Out-of-Distribution (OOD) samples (Goodfellow et al., 2014; Hendrycks and Gimpel, 2016; Yang et al., 2021) is vital for developing reliable machine learning systems for various industry-scale applications of natural language processing (NLP) (Shen et al., 2019; Sundararaman et al., 2020), including intent understanding in conversational dialogues (Zheng et al., 2020; Li et al., 2017), language translation (Denkowski and Lavie, 2011; Sundararaman et al., 2019), and text classification (Aggarwal and Zhai, 2012; Sundararaman et al., 2022). For instance, a language understanding model deployed to support a chat system for medical inquiries should reliably detect whether the symptoms reported in a conversation constitute an OOD query, so that the model can abstain from making an incorrect diagnosis (Siedlikowski et al., 2021).

Although OOD detection has attracted a great deal of interest from the research community (Goodfellow et al., 2014; Hendrycks and Gimpel, 2017; Lee et al., 2018), these approaches are not specifically designed to leverage the structure of textual inputs. Consequently, commonly used OOD approaches often have limited success in real-world NLP applications. Most prior OOD methods for NLP systems (Larson et al., 2019; Chen and Yu, 2021; Kamath et al., 2020) typically assume access to additional OOD data for outlier exposure (Hendrycks et al., 2018). However, such methods risk overfitting to the chosen OOD set, while assuming that a relevant OOD set is available during training. Other methods (Gangal et al., 2020; Li et al., 2021; Kamath et al., 2020) train a calibration model, in addition to the classifier, for detecting OOD inputs. These methods are computationally expensive, as they often require re-training the model on the downstream task.

Motivated by the above limitations, we propose a framework called POsthoc pseudo Ood REgularization (POORE) that generates pseudo-OOD data using the trained classifier and the In-Distribution (IND) samples. As opposed to methods that use outlier exposure, our framework does not rely on any external OOD set. Moreover, POORE can be easily applied to already-deployed large-scale models trained on a classification task, without re-training the classifier from scratch. In summary, we make the following contributions:

  1. We propose a Mahalanobis-based context masking scheme for generating pseudo-OOD samples that can be used during the fine-tuning.

  2. We introduce a new Pseudo Ood Regularization (POR) loss that maximizes the distance between IND and generated pseudo-OOD samples to improve the OOD detection.

  3. Through extensive experiments on three benchmarks, we show that our approach performs significantly better than existing baselines.

2 Related Works

OOD Detection. OOD detection is a binary classification problem that seeks to identify unfamiliar inputs at inference time, as distinct from the in-distribution (IND) data observed during training. Standard OOD methods can be divided into two categories. The first category (Lee et al., 2018; Podolskiy et al., 2021; Nalisnick et al., 2019; Ren et al., 2019) approximates a density over inputs, which is then used as a confidence estimate for the binary classification. The second category (Hendrycks and Gimpel, 2016, 2017; Li et al., 2017; Gal and Ghahramani, 2016) uses the model's predictive probabilities to estimate confidence scores. In our experiments, we compare against approaches from both categories.

OOD Detection in NLP. Several methods have been developed for OOD detection in NLP. Li et al. (2021) proposed using sub-models, where each model is trained with differently masked inputs. Kamath et al. (2020) use an external OOD set to train an additional calibration model for OOD detection. Most related to our proposed framework is MASKER (Moon et al., 2021), which leverages IND data to generate pseudo-OOD samples and uses a self-supervision loss inspired by Devlin et al. (2018) together with predictive-entropy regularization for pseudo-OOD inputs. We also use BERT-inspired self-supervised keyword masking; however, we propose a novel keyword selection criterion. Moreover, we introduce a novel model regularization loss that directly increases the distance between IND and pseudo-OOD samples.

3 Preliminaries and Notations

We consider a deep learning model $f = g \circ \phi : \mathcal{X} \to \mathcal{Y}$ composed of an encoder $\phi$ and a classifier $g$ that maps to the output space $\mathcal{Y}$, where $\mathcal{X}$ corresponds to natural sentences composed of a sequence of tokens $x = (x_1, \dots, x_T)$, $x_i \in \mathcal{V}$, $T$ is the length of the sequence, and $\mathcal{V}$ is the token vocabulary. For a downstream classification task, the class prediction is defined as $\hat{y} = \arg\max_y f_y(x)$.

Architecture. In this work, we construct $\phi$ using the bi-directional Transformer architecture (Vaswani et al., 2017). Specifically, we use the encoder architecture proposed in Devlin et al. (2018), such that $\phi(x) \in \mathbb{R}^d$ is the final hidden representation of the CLS token. We use a two-layer multi-layer perceptron (MLP) as the classifier $g$.
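As a sketch of this setup, the classifier head $g$ is a two-layer MLP over the CLS embedding. The dimensions below and the random stand-in for $\phi(x)$ are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_classifier(h, W1, b1, W2, b2):
    """Two-layer MLP head g mapping the CLS embedding h to class logits."""
    z = np.maximum(h @ W1 + b1, 0.0)   # ReLU hidden layer
    return z @ W2 + b2                  # class logits f(x)

d, hidden, n_classes = 8, 16, 3        # illustrative sizes
W1 = rng.normal(size=(d, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, n_classes)); b2 = np.zeros(n_classes)

h_cls = rng.normal(size=d)             # stand-in for phi(x), the CLS embedding
logits = mlp_classifier(h_cls, W1, b1, W2, b2)
y_hat = int(np.argmax(logits))         # class prediction argmax_y f_y(x)
```

In the actual framework, `h_cls` would come from the final layer of a BERT encoder rather than a random draw.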


Mahalanobis OOD Scoring. OOD methods typically learn a confidence estimator $s(\cdot)$ that outputs a score such that $s(x_{\text{ind}}) > s(x_{\text{ood}})$, where $x_{\text{ind}}$ and $x_{\text{ood}}$ are sampled from the IND distribution ($P_{\text{ind}}$) and the OOD distribution ($P_{\text{ood}}$) respectively. Lee et al. (2018) proposed a Mahalanobis distance estimator for OOD detection that uses the pre-trained features of the softmax neural classifier. Namely, given the feature $\phi(x)$ of a test sample $x$, the Mahalanobis score is computed as follows:

$$M(x) = \max_c \; -\big(\phi(x) - \mu_c\big)^\top \Sigma^{-1} \big(\phi(x) - \mu_c\big),$$

where $\phi$ is an intermediate layer of the neural classifier and $c$ denotes the class. The parameters $\mu_c$ and $\Sigma$ of the estimator denote the class-conditional mean and the tied covariance of the IND features.
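A minimal numpy sketch of this estimator on synthetic Gaussian features (the two-class toy data and dimensions are illustrative, not the paper's setup):

```python
import numpy as np

def fit_mahalanobis(features, labels):
    """Estimate class-conditional means mu_c and the tied covariance Sigma
    from IND training features, as in Lee et al. (2018)."""
    classes = np.unique(labels)
    mus = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.vstack([features[labels == c] - mus[c] for c in classes])
    sigma = centered.T @ centered / len(features)      # tied covariance
    return mus, np.linalg.inv(sigma)

def mahalanobis_score(f, mus, sigma_inv):
    """M(x) = max_c -(f - mu_c)^T Sigma^{-1} (f - mu_c); higher = more IND-like."""
    return max(-(f - mu) @ sigma_inv @ (f - mu) for mu in mus.values())

rng = np.random.default_rng(0)
ind = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
mus, sigma_inv = fit_mahalanobis(ind, labels)

ind_score = mahalanobis_score(np.zeros(4), mus, sigma_inv)       # near class-0 mean
ood_score = mahalanobis_score(np.full(4, 20.0), mus, sigma_inv)  # far from both classes
```

Points close to a class mean receive scores near zero; points far from every class mean receive large negative scores.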

4 Post hoc Pseudo-OOD Regularization

In this section, we describe our framework, POsthoc pseudo Ood REgularization (POORE), which uses pseudo-OOD samples for fine-tuning a pre-trained classifier. We first describe our masking-based approach for generating pseudo-OOD samples from the IND samples available during training. These generated pseudo-OOD samples are then used to regularize the encoder during post hoc training of a pre-trained classifier, which improves the model's robustness to OOD samples.

4.1 Masking for Pseudo-OOD Generation

We perform context masking of IND samples to generate pseudo-OOD samples. To do so, we first identify a set of tokens that have high attention scores and, consequently, a higher influence on model predictions. Given this set of keywords, we randomly mask the non-keyword tokens in a given IND sample $x$ to generate a pseudo-OOD sample $\tilde{x}$.

Keyword Selection. We follow the attention-based keyword identification method proposed in Moon et al. (2021). Token importance is measured using the average model attention values computed in the final layer of the pre-trained Transformer encoder. While this approach generates context-deprived inputs, the identified tokens are selected uniformly from all the IND samples in the training data. Instead, we propose a novel weighting criterion for keyword selection, giving higher weight to tokens belonging to training inputs that lie farther from the overall IND distribution as measured by the Mahalanobis score $M(\cdot)$. This encourages the selection of keywords that belong to IND inputs far from the estimated IND distribution. Specifically, we propose the importance score criterion

$$s_w = \sum_{x \in \mathcal{D}_{\text{ind}}} \alpha_w(x)\,\omega(x), \quad (3) \qquad \text{where } \omega(x) \propto \exp\!\big(-M(x)\big), \quad (4)$$

where $\alpha_w(x)$ is the attention value for token $w$ in the last self-attention layer of $\phi$. The keyword set $\mathcal{K}$ is formed by selecting the top-$k$ tokens by importance score $s_w$. Note that in (3), tokens in IND samples having a lower Mahalanobis score (i.e., a higher Mahalanobis distance) are more likely to be selected as keywords.
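A toy sketch of this weighted keyword selection. The `(tokens, attentions, maha_score)` interface and the exponential weight with a temperature are our illustrative assumptions about how the per-sample attention values might be aggregated, not the paper's exact implementation:

```python
import numpy as np
from collections import defaultdict

def keyword_scores(samples, k, temperature=1.0):
    """Aggregate per-token attention, weighting each IND sample by
    exp(-M(x)/T) so that samples far from the IND distribution
    (low Mahalanobis score M) contribute more to keyword importance."""
    scores = defaultdict(float)
    for tokens, attn, m in samples:
        w = np.exp(-m / temperature)          # higher weight for lower M(x)
        for tok, a in zip(tokens, attn):
            scores[tok] += w * a
    # keyword set K: top-k tokens by importance score
    return sorted(scores, key=scores.get, reverse=True)[:k]

samples = [
    # (tokens, last-layer attentions, Mahalanobis score) -- toy values
    (["book", "a", "flight"], [0.6, 0.1, 0.3], -2.0),   # far from IND: up-weighted
    (["what", "is", "the", "time"], [0.1, 0.1, 0.2, 0.6], -0.1),
]
K = keyword_scores(samples, k=2)
```

The far-from-IND utterance dominates, so its high-attention tokens are chosen as keywords.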

Context Masking. Given the set of keywords $\mathcal{K}$, pseudo-OOD samples can be generated by randomly masking the context, where the context of an input refers to its non-keyword tokens. We randomly mask the context tokens to generate a pseudo-OOD input $\tilde{x}$, which we use for regularization to improve the model's reliability on OOD inputs. More specifically, we create $\tilde{x}$ as follows:

$$\tilde{x}_i = \begin{cases} \texttt{MASK} & \text{with probability } p, \text{ if } x_i \notin \mathcal{K}, \\ x_i & \text{otherwise}, \end{cases}$$

where MASK is the masking token, $p$ is the masking probability, and $x_i$ is the $i$-th token in the corresponding IND sample $x$.
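The masking rule can be sketched directly; the keyword set, the masking probability, and the sample sentence below are illustrative:

```python
import random

def context_mask(tokens, keywords, p=0.5, seed=0):
    """Generate a pseudo-OOD sample by masking each non-keyword
    (context) token with probability p; keyword tokens are kept intact."""
    rng = random.Random(seed)
    return [t if t in keywords or rng.random() >= p else "[MASK]"
            for t in tokens]

ind = ["i", "want", "to", "book", "a", "flight", "to", "boston"]
pseudo_ood = context_mask(ind, keywords={"book", "flight"}, p=0.9)
```

With a high masking probability the result keeps only the keywords plus a few surviving context tokens, yielding a context-deprived input far from the IND distribution.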

4.2 Pseudo-OOD Regularization

To increase the Mahalanobis distance of OOD inputs relative to their IND counterparts, we propose the Pseudo-OOD Regularization (POR) loss, which maximizes the distance between each IND sample and its corresponding pseudo-OOD sample. The POR loss is defined as

$$\mathcal{L}_{\text{POR}} = -\,\mathbb{E}_{x \sim \mathcal{D}_{\text{ind}}}\!\left[\, \big\| \phi(x) - \phi(\tilde{x}) \big\|_2^2 \,\right],$$

where $\tilde{x}$ is the context-masked version of $x$.

Post hoc Training. The total loss used during post hoc fine-tuning is defined as

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_{\text{SKL}}\,\mathcal{L}_{\text{SKL}} + \lambda_{\text{POR}}\,\mathcal{L}_{\text{POR}},$$

where $\mathcal{L}_{\text{CE}}$ is the standard cross-entropy (CE) loss and $\mathcal{L}_{\text{SKL}}$ is the self-supervised keyword loss (SKL) proposed in Devlin et al. (2018). The SKL loss has been found to improve generalization by preventing the model from overfitting to certain tokens in the training data (Moon et al., 2021). The post hoc training process fine-tunes the model using this total loss. Note that post hoc fine-tuning is carried out on a model previously trained on the downstream task using the standard loss.
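A numpy sketch of the POR term and the combined objective. The loss weights `lam_skl` and `lam_por` are placeholders, since the tuned values are task-specific, and the random embeddings stand in for encoder outputs:

```python
import numpy as np

def por_loss(ind_emb, pseudo_ood_emb):
    """Pseudo-OOD Regularization: negative mean squared distance between
    each IND embedding phi(x) and its masked counterpart phi(x~).
    Minimizing this loss pushes the paired embeddings apart."""
    return -np.mean(np.sum((ind_emb - pseudo_ood_emb) ** 2, axis=1))

def total_loss(ce, skl, por, lam_skl=0.1, lam_por=0.1):
    """Combined post hoc objective: CE + weighted SKL + weighted POR."""
    return ce + lam_skl * skl + lam_por * por

rng = np.random.default_rng(0)
ind_emb = rng.normal(size=(4, 8))                  # stand-in phi(x) batch
near = ind_emb + 0.01 * rng.normal(size=(4, 8))    # barely separated pairs
far = ind_emb + 3.0 * rng.normal(size=(4, 8))      # well-separated pairs
```

Because the loss is the negative distance, well-separated IND/pseudo-OOD pairs yield a lower (better) POR value, which is exactly what the optimizer rewards.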

Method                                  STAR             FLOW             ROSTD
                                        AUROC  FPR@90    AUROC  FPR@90    AUROC  FPR@90
Maxprob (Hendrycks and Gimpel, 2017)    68.27  77.18     61.10  84.23     91.49  54.30
Dropout (Gal and Ghahramani, 2016)      52.77  100.0     51.86  100.0     55.25  100.0
Entropy (Lewis and Gale, 1994)          70.29  77.84     62.02  79.45     91.86  53.83
Gradient Embed                          67.61  80.80     71.21  70.25     98.53   2.58
BERT Embed (Podolskiy et al., 2021)     71.96  73.56     61.16  87.26     98.88   2.27
Mahalanobis (Lee et al., 2018)          76.89  65.25     73.13  63.84     99.45   1.00
MASKER (Moon et al., 2021)              71.54  72.82     68.16  67.52     86.95  54.26
MASKER-Maha (Enhanced)                  79.38  59.97     72.99  65.23     99.41   1.15
POORE (Ours)                            81.11  48.30     74.08  69.26     99.51   0.97
Table 1: AUROC and FPR@90 of the baselines and POORE on the three target benchmarks. MASKER uses Maxprob for inference; MASKER-Maha and POORE use Mahalanobis for OOD detection during inference.
[Figure 1: four panels — (a) AUROC for STAR, (b) AUROC for FLOW, (c) FPR@90 for STAR, (d) FPR@90 for FLOW.]
Figure 1: AUROC (higher is better) and FPR@90 (lower is better) using Maxprob, ODIN, Entropy, and BERT Embed. The average improvements across estimators on AUROC are 5% on STAR (panel a) and 4% on FLOW (panel b). The average FPR@90 improvements are 8% on STAR (panel c) and 2% on FLOW (panel d).

5 Experiments

We demonstrate the effectiveness of our proposed approach in this section. For reproducibility, we include the codebase in the supplementary material.

5.1 Datasets

We use three task-oriented dialogue datasets for OOD detection: the Schema-Guided Dialog Dataset for Transfer Learning (STAR) (Mosig et al., 2020), SM Calendar flow (FLOW) (Andreas et al., 2020), and Real Out-of-Domain Sentences From Task-oriented Dialog (ROSTD) (Gangal et al., 2020). We follow the data splits and pre-processing steps described in Chen and Yu (2021). A detailed description of these tasks is provided in Appendix A.

5.2 Experimental Setup

Our approach is demonstrated on the BERT pre-trained model (Devlin et al., 2018) with around 110M parameters, trained on a single Titan-X GPU. We tune the POR loss weight for each task through a grid search over a logarithmic range (varying by factors of 10), selecting a separate value for each of STAR, FLOW, and ROSTD.

This paper compares our approach (POORE) with existing baseline OOD detection methods, including Maxprob, Entropy, Mahalanobis, BERT Embed, Gradient Embed, and Dropout. We also consider MASKER (Moon et al., 2021) as a baseline. A detailed analysis of the differences in performance between all the inference methods and our approach appears in Section 5.3. The baseline methods are trained for 25 epochs using the AdamW optimizer with learning rates of 1e-5, 3e-5, and 1e-5 for STAR, FLOW, and ROSTD respectively. We use minimal post hoc fine-tuning of only one additional epoch for MASKER and POORE. We use AUROC and FPR@90 to evaluate OOD detection performance; for more details on these metrics, refer to Appendix B.


5.3 Results

Table 1 shows the performance gains from our approach relative to all the baseline methods on the three target tasks, namely STAR, FLOW, and ROSTD. POORE outperforms the existing baselines by significant margins. Specifically, on the STAR dataset, relative to the BERT Embed and Mahalanobis baselines, we observe 9% and 4% absolute improvements in AUROC respectively, along with 26% and 17% absolute reductions in FPR@90. Similarly, on FLOW, the AUROC gains are 13% and 1% relative to BERT Embed and Mahalanobis, with POORE doing worse only on the FPR@90 metric compared to the Mahalanobis baseline. We note similarly consistent gains on ROSTD.

We also evaluate our POORE framework by pairing it with other confidence estimators: Maxprob, ODIN (Liang et al., 2017), Entropy, and BERT Embed. Figure 1 compares a model trained using POORE with a standard model, while using the various confidence estimators during inference. As shown in Figure 1, we observe significant gains with our framework over the baseline model for all the confidence estimators. In particular, the AUROC on FLOW using BERT Embed with POORE improves by 9%. We also pair the above estimators with the MASKER baseline and evaluate these combinations in the ablation shown in Appendix C. In Appendix D, we show an ablation comparing our novel keyword selection criterion with the keyword selection criterion used in the MASKER baseline.

6 Conclusions

In this paper, we propose a novel framework, POORE, for improving the robustness of models to OOD data. Using a combination of the Mahalanobis distance and the POR regularization that maximizes the distance between IND and OOD representations, we demonstrate significant performance gains on a number of target benchmark tasks. Future work could tap the potential of external OOD data to achieve even greater gains over baselines that use outlier exposure.


Limitations

While our work POORE has shown significant gains on the three benchmarks with minimal fine-tuning of a trained classifier, our proposed framework has a few limitations. The Euclidean-distance-based regularization in our approach requires a pair-wise correspondence between IND and OOD samples: the KL-divergence-based losses used by other approaches in the related work do not depend on such a correspondence, but a Euclidean distance assumes that both vectors in the distance computation lie in the same space. Because our approach generates pseudo-OOD data from the IND distribution, this correspondence holds by construction; it may not hold for external OOD data.


  • C. C. Aggarwal and C. Zhai (2012) A survey of text classification algorithms. In Mining text data, pp. 163–222. Cited by: §1.
  • J. Andreas, J. Bufe, D. Burkett, C. Chen, J. Clausman, J. Crawford, K. Crim, J. DeLoach, L. Dorner, J. Eisner, et al. (2020) Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics 8, pp. 556–571. Cited by: Appendix A, §5.1.
  • D. Chen and Z. Yu (2021) GOLD: improving out-of-scope detection in dialogues using data augmentation. arXiv preprint arXiv:2109.03079. Cited by: §1, §5.1.
  • M. Denkowski and A. Lavie (2011) Meteor 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the sixth workshop on statistical machine translation, pp. 85–91. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2, §3, §4.2, §5.2.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §2, Table 1.
  • V. Gangal, A. Arora, A. Einolghozati, and S. Gupta (2020) Likelihood ratios and generative classifiers for unsupervised out-of-domain detection in task oriented dialog. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 7764–7771. Cited by: Appendix A, §1, §5.1.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §1, §2.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations. Cited by: §1, §2, Table 1.
  • D. Hendrycks, M. Mazeika, and T. Dietterich (2018) Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606. Cited by: §1.
  • A. Kamath, R. Jia, and P. Liang (2020) Selective question answering under domain shift. arXiv preprint arXiv:2006.09462. Cited by: §1, §2.
  • S. Larson, A. Mahendran, J. Peper, C. Clarke, A. Lee, P. Hill, J. K. Kummerfeld, K. Leach, M. Laurenzano, L. Tang, and J. Mars (2019) An evaluation dataset for intent classification and out-of-scope prediction. ArXiv abs/1909.02027. Cited by: §1.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31. Cited by: §1, §2, §3, Table 1.
  • D. D. Lewis and W. A. Gale (1994) A sequential algorithm for training text classifiers. In SIGIR’94, pp. 3–12. Cited by: Table 1.
  • J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, and D. Jurafsky (2017) Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547. Cited by: §1, §2.
  • X. Li, J. Li, X. Sun, C. Fan, T. Zhang, F. Wu, Y. Meng, and J. Zhang (2021) kFolden: k-fold ensemble for out-of-distribution detection. arXiv preprint arXiv:2108.12731. Cited by: §1, §2.
  • S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §5.3.
  • S. J. Moon, S. Mo, K. Lee, J. Lee, and J. Shin (2021) Masker: masked keyword regularization for reliable text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 13578–13586. Cited by: §2, §4.1, §4.2, Table 1, §5.2.
  • J. E. Mosig, S. Mehri, and T. Kober (2020) Star: a schema-guided dialog dataset for transfer learning. arXiv preprint arXiv:2010.11853. Cited by: Appendix A, §5.1.
  • E. T. Nalisnick, A. Matsukawa, Y. W. Teh, and B. Lakshminarayanan (2019) Detecting out-of-distribution inputs to deep generative models using a test for typicality. ArXiv abs/1906.02994. Cited by: §2.
  • A. Podolskiy, D. Lipin, A. Bout, E. Artemova, and I. Piontkovskaya (2021) Revisiting mahalanobis distance for transformer-based out-of-domain detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 13675–13682. Cited by: §2, Table 1.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Cited by: §2.
  • D. Shen, P. Cheng, D. Sundararaman, X. Zhang, Q. Yang, M. Tang, A. Celikyilmaz, and L. Carin (2019) Learning compressed sentence representations for on-device text processing. arXiv preprint arXiv:1906.08340. Cited by: §1.
  • S. Siedlikowski, L. Noël, S. A. Moynihan, M. Robin, et al. (2021) Chloe for covid-19: evolution of an intelligent conversational agent to address infodemic management needs during the covid-19 pandemic. Journal of Medical Internet Research 23 (9), pp. e27283. Cited by: §1.
  • D. Sundararaman, S. Si, V. Subramanian, G. Wang, D. Hazarika, and L. Carin (2020) Methods for numeracy-preserving word embeddings. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4742–4753. Cited by: §1.
  • D. Sundararaman, V. Subramanian, G. Wang, S. Si, D. Shen, D. Wang, and L. Carin (2019) Syntax-infused transformer and bert models for machine translation and natural language understanding. arXiv preprint arXiv:1911.06156. Cited by: §1.
  • D. Sundararaman, V. Subramanian, G. Wang, L. Xu, and L. Carin (2022) Number entity recognition. arXiv preprint arXiv:2205.03559. Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.
  • J. Yang, K. Zhou, Y. Li, and Z. Liu (2021) Generalized out-of-distribution detection: a survey. arXiv preprint arXiv:2110.11334. Cited by: §1.
  • Y. Zheng, G. Chen, and M. Huang (2020) Out-of-domain detection for natural language understanding in dialog systems. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 28, pp. 1198–1209. External Links: ISSN 2329-9290, Link, Document Cited by: §1.

Appendix A Tasks

STAR. This is a dialog dataset with 6651 dialogues spanning multiple domains and intents Mosig et al. (2020). Responses to dialogs that were marked either “ambiguous” or “out-of-scope” are used as OOD examples. The dataset has 29,104 examples with 104 intent labels.

FLOW. The FLOW dataset is a semantic parsing dataset with annotations for each turn of a dialog (Andreas et al., 2020). In FLOW, the OOD samples come from discussions where the user strays far from the central topic. The dataset has 71,551 examples spanning 44 intents.

ROSTD. Gangal et al. (2020) designed ROSTD, a dataset proposed for OOD detection. OOD samples are drawn from an external source, while the internal data serves as IND. ROSTD contains 47,913 examples with 13 classes.

Appendix B OOD Evaluation Metrics

OOD detection is evaluated using the Area Under the Receiver-Operating Curve (AUROC) for the binary IND-vs-OOD classification task based on the estimated confidence score. An OOD method that perfectly separates $P_{\text{ind}}$ from $P_{\text{ood}}$ achieves an AUROC of 100%. Another common metric for OOD detection is the false positive rate (FPR) at a fixed recall; FPR@90 reports the fraction of OOD inputs accepted when the score threshold retains 90% of the IND inputs.
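Both metrics can be sketched in a few lines, assuming higher confidence scores indicate IND (the toy scores below are illustrative):

```python
import numpy as np

def auroc(ind_scores, ood_scores):
    """Probability that a random IND sample scores above a random OOD
    sample (ties count half) -- the AUROC of IND-vs-OOD separation."""
    ind, ood = np.asarray(ind_scores), np.asarray(ood_scores)
    greater = (ind[:, None] > ood[None, :]).sum()
    ties = (ind[:, None] == ood[None, :]).sum()
    return (greater + 0.5 * ties) / (len(ind) * len(ood))

def fpr_at_recall(ind_scores, ood_scores, recall=0.9):
    """FPR at fixed recall: fraction of OOD samples scoring above the
    threshold that retains `recall` of the IND samples (FPR@90 here)."""
    thresh = np.quantile(ind_scores, 1 - recall)   # keep top `recall` of IND
    return float(np.mean(np.asarray(ood_scores) >= thresh))

ind = [0.9, 0.8, 0.85, 0.95, 0.7]   # toy IND confidence scores
ood = [0.2, 0.3, 0.1, 0.4, 0.6]     # toy OOD confidence scores
```

With these perfectly separated toy scores, AUROC is 1.0 and FPR@90 is 0.0.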

Appendix C Adaptation of MASKER Baseline

Table 2 reports the results of the adapted MASKER baseline paired with a number of confidence estimators. While the adapted baseline performs better than the GOLD baseline, our approach beats it considerably.

Estimator          STAR             FLOW             ROSTD
                   AUROC  FPR@90    AUROC  FPR@90    AUROC  FPR@90
Maxprob            71.54  72.82     68.16  67.52     86.95  54.26
ODIN               72.45  71.86     68.57  66.81     86.86  54.25
BERT               75.93  62.89     69.79  70.07     99.16   1.73
Mahalanobis        79.38  59.97     72.99  65.23     99.41   1.15
POORE (Ours)       81.11  48.30     74.08  69.26     99.51   0.97
Table 2: The MASKER baseline adapted with different confidence estimators, compared with POORE.

Appendix D Ablation for choosing keywords

Table 3 compares the OOD detection performance of our proposed keyword selection approach, described in Section 4.1, with the keyword selection criterion of the baseline MASKER, both used within our POORE framework.

Method                          STAR             FLOW             ROSTD
                                AUROC  FPR@90    AUROC  FPR@90    AUROC  FPR@90
POORE with baseline keywords    80.69  58.71     73.41  68.08     99.52   1.06
POORE (Ours)                    81.11  48.30     74.08  69.26     99.51   0.97
Table 3: Ablation of the keyword selection criterion.