On-Policy Distillation
Useful Papers/Articles

- On-Policy Distillation (from Thinking Machines) Link
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes Link

Abstract

Training Stages

- Pre-training: teaches general capabilities such as language use, broad reasoning, and world knowledge.
- Mid-training: imparts domain knowledge, such as code, medical databases, or internal company documents.
- Post-training: elicits targeted behavior, such as instruction following, reasoning through math problems, or chat.

So it seems that injecting specialized knowledge is filed under mid-training? My sense is that if the pre-training corpus already has broad enough coverage, it effectively subsumes mid-training. But when facing a new domain, could I inject domain-specific knowledge through another round of mid-training after post-training? A few guesses:

- Mid-training does not require very clean text; relatively noisy data that carries the knowledge is fine. For that reason, running such training directly after post-training does not look feasible, since it could easily damage instruction-following ability.
- I also suspect that if the post-training data volume is large enough, some of the knowledge injected during mid-training may be forgotten, or at least become hard to elicit.
- So the solution is probably either to build a high-quality domain-knowledge post-training set, or to take a different route (knowledge distillation and the like).

On/Off-Policy KD

| KD Type | Mechanism | Signal |
| --- | --- | --- |
| Off-Policy KD | Teacher does, Student learns | Teacher trajectories / logits |
| On-Policy KD (GKD) | Student does, Teacher corrects | Teacher logits on student samples |

(A minimal sketch of one on-policy training step is given at the end of this note.)

KL Divergences

Forward KL (mode-covering): $D_{KL}(P||Q_\theta)=\sum_{c\in C}P(c)\log\frac{P(c)}{Q_\theta(c)}$

Reverse KL (mode-seeking): $D_{KL}(Q_\theta||P)=\sum_{c\in C}Q_\theta(c)\log\frac{Q_\theta(c)}{P(c)}$

where $P$ is the empirical (teacher/data) distribution and $Q_\theta$ is the model distribution.

...
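To make the asymmetry of the two formulas above concrete, here is a tiny numeric check on two made-up categorical distributions; the values are purely illustrative.

```python
# Toy check of the asymmetry between forward and reverse KL.
# The distributions below are made up for illustration only.
import math

P = [0.7, 0.2, 0.1]   # empirical / teacher distribution P(c)
Q = [0.4, 0.4, 0.2]   # model distribution Q_theta(c)

# Forward KL D_KL(P || Q): expectation under P; Q is penalized for missing
# mass anywhere P has support (mode-covering pressure on Q).
forward_kl = sum(p * math.log(p / q) for p, q in zip(P, Q))

# Reverse KL D_KL(Q || P): expectation under Q; Q is penalized for putting
# mass where P is small (mode-seeking pressure on Q).
reverse_kl = sum(q * math.log(q / p) for p, q in zip(P, Q))

print(f"forward KL = {forward_kl:.4f}")
print(f"reverse KL = {reverse_kl:.4f}")   # differs from the forward value: KL is asymmetric
```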
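And here is the training-step sketch referenced under the On/Off-Policy KD table: the student samples its own continuation and the teacher's per-token logits define a reverse-KL loss on that sample. This is a minimal sketch, assuming `student` and `teacher` are causal LMs that map token ids `[B, T]` directly to logits `[B, T, V]` (Hugging Face models would need the `.logits` field unpacked first); the function names and hyperparameters are illustrative, not taken from the cited papers.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample_from_student(student, prompt_ids, max_new_tokens=64):
    """Autoregressively sample a continuation from the student policy."""
    ids = prompt_ids
    for _ in range(max_new_tokens):
        next_logits = student(ids)[:, -1, :]                    # next-token logits
        next_id = torch.multinomial(F.softmax(next_logits, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids


def on_policy_distill_step(student, teacher, prompt_ids, optimizer):
    """One update: student generates, teacher grades every sampled token."""
    full_ids = sample_from_student(student, prompt_ids)
    prompt_len = prompt_ids.shape[1]

    # Re-run the student WITH gradients on its own sample; the teacher stays frozen.
    student_logits = student(full_ids)[:, prompt_len - 1 : -1, :]
    with torch.no_grad():
        teacher_logits = teacher(full_ids)[:, prompt_len - 1 : -1, :]

    # Per-token reverse KL D_KL(student || teacher) over the sampled positions,
    # i.e. the mode-seeking direction from the formulas above.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    reverse_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)
    loss = reverse_kl.mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice one would also batch prompts, mask padding, and possibly mix in a forward-KL or SFT term; the sketch only shows the core "student does, teacher corrects" loop from the table.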