SCOTT: Self-Consistent Chain-of-Thought Distillation

05/03/2023
by Peifeng Wang, et al.

Large language models (LMs) beyond a certain scale demonstrate the emergent capability of generating free-text rationales for their predictions via chain-of-thought (CoT) prompting. While CoT can yield dramatically improved performance, such gains are only observed for sufficiently large LMs. Even more concerning, there is little guarantee that the generated rationales are consistent with the LM's predictions or faithfully justify its decisions. In this work, we propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To form better supervision, we elicit rationales supporting the gold answers from a large LM (the teacher) via contrastive decoding, which encourages the teacher to generate tokens that become more plausible only when the answer is considered. To ensure faithful distillation, we use the teacher-generated rationales to train a student LM with a counterfactual reasoning objective, which prevents the student from ignoring the rationales and making inconsistent predictions. Experiments show that, while yielding comparable end-task performance, our method generates CoT rationales that are more faithful than those of baselines. Further analysis suggests that such a model respects the rationales more when making decisions; thus, its performance can be improved further by refining its rationales.
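The abstract's teacher-side mechanism is answer-grounded contrastive decoding: at each step, the teacher prefers rationale tokens whose probability rises once the gold answer is in the context. Below is a minimal sketch of that idea, assuming a Hugging Face causal LM (GPT-2 as a stand-in teacher), simple prompt templates, and a plain with-answer vs. without-answer contrast; the function names, templates, and the `alpha` weight are illustrative assumptions, and the paper's actual recipe (e.g., contrasting against a perturbed answer) may differ. The student-side counterfactual reasoning objective is not sketched here.

```python
# Sketch: answer-grounded contrastive decoding for eliciting teacher rationales.
# Assumptions: GPT-2 as a stand-in teacher, greedy decoding, with/without-answer contrast.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def next_token_logprobs(prompt_ids):
    """Log-probabilities over the next token given a prompt."""
    with torch.no_grad():
        logits = model(prompt_ids).logits[:, -1, :]
    return torch.log_softmax(logits, dim=-1)


def contrastive_rationale(question, answer, max_new_tokens=40, alpha=1.0):
    """Greedily decode a rationale, scoring each candidate token by how much
    more plausible it becomes once the gold answer is in the context."""
    with_answer = f"Question: {question}\nAnswer: {answer}\nRationale:"
    without_answer = f"Question: {question}\nRationale:"
    ids_w = tokenizer(with_answer, return_tensors="pt").input_ids
    ids_wo = tokenizer(without_answer, return_tensors="pt").input_ids

    generated = []
    for _ in range(max_new_tokens):
        lp_w = next_token_logprobs(ids_w)    # answer-aware distribution
        lp_wo = next_token_logprobs(ids_wo)  # answer-agnostic distribution
        # Contrastive score: keep the answer-aware log-prob as the anchor and
        # boost tokens whose probability increases when the answer is given.
        score = lp_w + alpha * (lp_w - lp_wo)
        next_id = score.argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break
        generated.append(next_id.item())
        ids_w = torch.cat([ids_w, next_id], dim=-1)
        ids_wo = torch.cat([ids_wo, next_id], dim=-1)
    return tokenizer.decode(generated)


print(contrastive_rationale("Do houses have roofs?", "yes"))
```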


Related research

12/16/2022
Teaching Small Language Models to Reason
Chain of thought prompting successfully improves the reasoning capabilit...

10/31/2021
Rethinking the Knowledge Distillation From the Perspective of Model Calibration
Recent years have witnessed dramatic improvements in the knowledge d...

06/19/2021
Teacher's Pet: Understanding and Mitigating Biases in Distillation
Knowledge distillation is widely used as a means of improving the perfor...

12/10/2021
DisCo: Effective Knowledge Distillation For Contrastive Learning of Sentence Embeddings
Contrastive learning has been proven suitable for learning sentence embe...

02/21/2023
MaskedKD: Efficient Distillation of Vision Transformers with Masked Images
Knowledge distillation is a popular and effective regularization techniq...

08/09/2023
Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
Large Language Models (LLMs) have shown outstanding performance across w...

06/24/2023
Can GPT-4 Support Analysis of Textual Data in Tasks Requiring Highly Specialized Domain Expertise?
We evaluated the capability of generative pre-trained transformers (GPT-...
