SwiGLU activation function causes instability in FP8 LLM training arxiv.org 10 points by LarsDu88 4 days ago
LarsDu88 4 days ago Recent pre-print from Intel Habana Labs regarding training instability they encountered when training a Llama model in FP8, traced to the use of the SwiGLU activation from Noam Shazeer's 2020 paper: https://arxiv.org/abs/2002.05202
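For reference, SwiGLU as defined in the Shazeer paper gates one linear projection with the Swish of another: SwiGLU(x) = Swish(xW) ⊙ (xV). A minimal NumPy sketch (the weight names W and V are illustrative, not from the paper's code):

```python
import numpy as np

def swiglu(x, W, V):
    # SwiGLU(x) = Swish_1(xW) * (xV), where Swish_1(z) = z * sigmoid(z)
    a = x @ W
    return (a / (1.0 + np.exp(-a))) * (x @ V)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))   # batch of 4 token vectors, width 8
W = rng.standard_normal((8, 16))  # gate projection
V = rng.standard_normal((8, 16))  # value projection
out = swiglu(x, W, V)             # shape (4, 16)
```

The elementwise product of the two branches is what can produce large-magnitude outliers, which is the mechanism the pre-print links to FP8 instability.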
zaptrem 3 days ago 34% throughput improvement seems less than expected. Shouldn’t it be closer to 2x?