I have been trying to get this repo working for several months, but the training loss keeps exploding somewhere between 30k and 100k iterations.
I have tried many things:
- Turning flash attention off (based on this issue: #524)
- Using fp16 (based on this issue: #468)
- Using the GPT-4 tokenizer (based on #468; quick sketch below)
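For reference, the tokenizer swap is roughly this (a minimal sketch using tiktoken; `cl100k_base` is the GPT-4 encoding, everything else here is just illustrative):

```python
import tiktoken

# GPT-4's encoding (cl100k_base) instead of GPT-2's "gpt2" encoding.
# Note the much larger vocab (~100k), so vocab_size in the model
# config has to be bumped to match before retraining.
enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode_ordinary("hello world")  # token ids for the training data
```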
At first the loss was climbing back up to about 8-10; now it just goes straight to NaN with fp16.
I have also tinkered with other settings such as gradient clipping and the learning rate. I keep my configuration at roughly a 500k batch size.
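To be concrete, the core of my fp16 training step looks roughly like this (a minimal sketch; `model`, `loader`, and `optimizer` are assumed, and the clip value of 1.0 is just one of the settings I tried):

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=True)  # fp16 loss scaling

for step, (x, y) in enumerate(loader):
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits, loss = model(x, y)
    scaler.scale(loss).backward()
    # unscale before clipping so the clip threshold is in true gradient units
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)  # skips the optimizer step if grads contain inf/NaN
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```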
I am lost on what to try next. Has anyone else run into this and fixed it? For context, I have gotten GPT-2 Small down to about 3.0 loss.