LarsDu88 4 days ago

A recent pre-print from Intel Habana Labs describes training instability they encountered when training a Llama model in FP8, which they traced to the SwiGLU activation from Noam Shazeer's 2020 paper: https://arxiv.org/abs/2002.05202
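
For anyone who hasn't seen it, here's a minimal PyTorch sketch of the SwiGLU feed-forward block from that paper. The class and dimension names are mine; the formula is the paper's FFN_SwiGLU (no biases):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLU(nn.Module):
        # FFN_SwiGLU(x) = (Swish(x W) * (x V)) W2, biases omitted (Shazeer 2020)
        def __init__(self, d_model: int, d_ff: int):
            super().__init__()
            self.w = nn.Linear(d_model, d_ff, bias=False)   # gate projection
            self.v = nn.Linear(d_model, d_ff, bias=False)   # linear branch
            self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Elementwise product of the Swish (SiLU) gate and the linear
            # branch; per the pre-print, this activation is where their
            # FP8 training instability originates.
            return self.w2(F.silu(self.w(x)) * self.v(x))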

zaptrem 3 days ago

A 34% throughput improvement seems lower than expected. Shouldn’t it be closer to 2x?
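
For scale: 2x matmul throughput only implies 2x end-to-end if the entire step were matmul, so Amdahl's law caps the gain. A back-of-the-envelope sketch, where the matmul fractions are illustrative assumptions and not numbers from the pre-print:

    def fp8_step_speedup(matmul_fraction: float, matmul_speedup: float = 2.0) -> float:
        # Amdahl's-law estimate: only the matmul share of step time speeds up,
        # the rest (attention softmax, norms, optimizer, comms, I/O) does not.
        return 1.0 / ((1.0 - matmul_fraction) + matmul_fraction / matmul_speedup)

    print(fp8_step_speedup(0.5))  # ~1.33x if half the step time is FP8 matmul
    print(fp8_step_speedup(0.9))  # ~1.82x, approaching the naive 2x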