I have been trying to get this repo working for several months, but the training loss keeps exploding somewhere between 30k and 100k iterations.
I have tried many things:
- Turning flash attention off (based on this issue: #524)
- Using fp16 (based on this issue: #468)
- Using the GPT-4 tokenizer (based on #468; quick sketch below)
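For reference, the tokenizer swap is roughly this (a minimal sketch using tiktoken; `cl100k_base` is the GPT-4 encoding, everything else here is just illustrative):

```python
import tiktoken

# GPT-4's encoding (cl100k_base) instead of GPT-2's "gpt2" encoding.
# Note the much larger vocab (~100k), so vocab_size in the model
# config has to be bumped to match before retraining.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary("hello world")  # token ids for the training data
```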
At first the loss was climbing back up to about 8-10; now it just goes straight to NaN with fp16.
I have also tinkered with other settings such as gradient clipping and the learning rate. I keep my configuration at roughly a 500k batch size.
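To be concrete, the core of my fp16 training step looks roughly like this (a minimal sketch; `model`, `loader`, and `optimizer` are assumed, and the clip value of 1.0 is just one of the settings I tried):

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)  # fp16 loss scaling

for step, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits, loss = model(x, y)
    scaler.scale(loss).backward()
    # unscale before clipping so the clip threshold is in true gradient units
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)  # skips the optimizer step if grads contain inf/NaN
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```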
I am lost on what to try next. Has anyone else run into this and fixed it? For context, I have gotten GPT-2 Small down to about 3.0 loss.