SwiGLU activation function causes instability in FP8 LLM training arxiv.org 10 points by LarsDu88 4 days ago
LarsDu88 4 days ago Recent pre-print from Intel Habana Labs regarding training instability they encountered when training a Llama model in FP8, traced to the use of the SwiGLU activation from Noam Shazeer's 2020 paper: https://arxiv.org/abs/2002.05202
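For reference, SwiGLU as defined in the Shazeer paper gates one linear projection with the Swish of another: SwiGLU(x) = Swish(xW) ⊙ (xV). A minimal NumPy sketch (the weight names W and V are illustrative, not from the paper's code):

```python
import numpy as np

def swiglu(x, W, V):
    # SwiGLU(x) = Swish_1(xW) * (xV), where Swish_1(z) = z * sigmoid(z)
    a = x @ W
    return (a / (1.0 + np.exp(-a))) * (x @ V)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 token vectors, width 8
W = rng.standard_normal((8, 16))  # gate projection
V = rng.standard_normal((8, 16))  # value projection
out = swiglu(x, W, V)             # shape (4, 16)
```

The elementwise product of the two branches is what can produce large-magnitude outliers, which is the mechanism the pre-print links to FP8 instability.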
zaptrem 3 days ago 34% throughput improvement seems less than expected. Shouldn’t it be closer to 2x?