ICLR 2025 ML papers
I am curating papers (mainly ICLR ’25 submissions) related to hyperparameter tuning for large-scale training. In the summaries below, $N$ denotes the number of model parameters and $T$ the number of training tokens; a small illustrative sketch combining these rules of thumb follows the table.
| Title | Summary |
|---|---|
| Scaling Optimal LR Across Token Horizons | $\mathrm{LR} \propto N^{-0.23}\, T^{-0.32}$ (fixed batch size) |
| How Does Critical Batch Size Scale in Pre-training? | $\mathrm{critical\ BS} \propto T$ (fixed LR) |
| Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit | Relations among BS, LR, and $T$ are complicated |
| How to set AdamW’s weight decay as you scale model and dataset size | the “timescale” $1/(\mathrm{LR} \cdot \mathrm{WD})$ should be constant |
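To make these rules of thumb concrete, here is a minimal sketch of how one might extrapolate hyperparameters from a tuned baseline. It naively combines rules that the papers derive under different assumptions (the LR rule holds at fixed batch size, the critical-BS rule at fixed LR), and the baseline numbers, the 7B-on-1T-tokens target, and the helper names are my own illustrative choices, not recommendations from the papers.

```python
# Illustrative sketch only: extrapolate hyperparameters from a tuned baseline
# using the rough scaling rules summarized in the table above.

def scale_lr(lr_ref: float, n_ref: float, t_ref: float,
             n: float, t: float) -> float:
    """LR ∝ N^-0.23 * T^-0.32 at fixed batch size (Scaling Optimal LR Across Token Horizons)."""
    return lr_ref * (n / n_ref) ** -0.23 * (t / t_ref) ** -0.32

def scale_critical_bs(bs_ref: float, t_ref: float, t: float) -> float:
    """Critical batch size ∝ T at fixed LR (How Does Critical Batch Size Scale in Pre-training?)."""
    return bs_ref * (t / t_ref)

def wd_from_timescale(lr: float, timescale: float) -> float:
    """Hold the "timescale" 1/(LR * WD) constant, i.e. WD = 1 / (timescale * LR)."""
    return 1.0 / (timescale * lr)

if __name__ == "__main__":
    # Hypothetical baseline: a 1B-parameter model tuned on 100B tokens.
    lr_ref, n_ref, t_ref, bs_ref = 3e-4, 1e9, 100e9, 2 ** 20  # batch size in tokens
    timescale = 1.0 / (lr_ref * 0.1)  # timescale implied by WD = 0.1 at the baseline LR

    # Extrapolate to a 7B-parameter run on 1T tokens.
    n, t = 7e9, 1e12
    lr = scale_lr(lr_ref, n_ref, t_ref, n, t)
    bs = scale_critical_bs(bs_ref, t_ref, t)
    wd = wd_from_timescale(lr, timescale)
    print(f"LR ≈ {lr:.2e}, critical BS ≈ {bs:.3g} tokens, WD ≈ {wd:.3g}")
```

The exponents and proportionalities are only as transferable as the settings in which each paper measured them, so the baseline run should be as close as possible to the target regime before extrapolating.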