
I've recently been watching a course on model quantization.
When quantizing the model below, the instructor recommends not quantizing the final lm_head:
CodeGenForCausalLM(
  (transformer): CodeGenModel(
    (wte): Embedding(51200, 1024)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-19): 20 x CodeGenBlock(
        (ln_1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): CodeGenAttention(
          (attn_dropout): Dropout(p=0.0, inplace=False)
          (resid_dropout): Dropout(p=0.0, inplace=False)
          (qkv_proj): W8A16LinearLayer()
          (out_proj): W8A16LinearLayer()
        )
        (mlp): CodeGenMLP(
          (fc_in): W8A16LinearLayer()
          (fc_out): W8A16LinearLayer()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1024, out_features=51200, bias=True)
)
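For context, the course swaps every nn.Linear for a W8A16LinearLayer (int8 weights, full-precision activations) but leaves lm_head as a plain Linear, as you can see in the printout above. I don't have the course code at hand, so the sketch below is my own rough approximation of that replacement step, just to show what "skip lm_head" means in practice; the class name is taken from the printout, its internals are my guess.

import torch
import torch.nn as nn

class W8A16LinearLayer(nn.Module):
    # My guess at what the layer does: store int8 weights with per-output-channel
    # scales, dequantize on the fly, keep activations in full precision.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.detach()
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # symmetric per-row scale
        self.register_buffer("int8_weight", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale)
        self.bias = linear.bias

    def forward(self, x):
        w = self.int8_weight.to(x.dtype) * self.scale.to(x.dtype)  # dequantized weight
        return nn.functional.linear(x, w, self.bias)

def replace_linears(module: nn.Module, skip_names=("lm_head",)):
    # Recursively swap nn.Linear children, except the ones named in skip_names.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name not in skip_names:
            setattr(module, name, W8A16LinearLayer(child))
        else:
            replace_linears(child, skip_names)

Calling replace_linears(model) and printing the model gives a structure like the one above: qkv_proj, out_proj, fc_in, fc_out become W8A16LinearLayer, while lm_head stays a regular Linear.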
His original explanation (from the lecture transcript, around 2:14):

"And as I said we're not going to quantize the language model head, because since the model is an autoregressive model, it uses the output from the previous iteration to get the output of the next iteration. If you quantize the language model head, a lot of errors might be accumulating over the generation steps. And you will most likely end up having some gibberish after some tokens."

I don't follow his reasoning here. Why would quantizing lm_head cause errors to accumulate across generation steps? Could someone explain it in simple terms?
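To make the question concrete, this is the generation loop I have in mind (my own sketch, not code from the course; the checkpoint name is a guess based on the printed shapes, and any causal LM behaves the same way). The part the instructor seems to stress is that the token chosen from this step's lm_head logits becomes part of the input of the next step.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint is a guess from the printed shapes (51200 vocab, 1024 hidden, 20 blocks).
model_id = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tokenizer("def hello_world():", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits            # (1, seq_len, 51200), produced by lm_head
        next_id = logits[:, -1, :].argmax(dim=-1)   # greedy pick of the next token
        # The token picked from this step's lm_head output is appended to the
        # input of the next step -- the feedback loop he is talking about.
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))

So if I understand him correctly, any change in the lm_head logits can change which token gets picked, and that token then feeds every later step. What I still don't see is why this makes lm_head especially sensitive compared with the other layers that were quantized.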