ICLR 2025 ML papers

I am curating papers (mainly ICLR ‘25 submissions) on hyperparameter tuning for large-scale training. Throughout, $N$ is the model size in parameters, $T$ is the number of training tokens, LR is the learning rate, BS is the batch size, and WD is the AdamW weight decay.

| Title | Summary |
| --- | --- |
| Scaling Optimal LR Across Token Horizons | ${\rm LR} \propto N^{-0.23}\,T^{-0.32}$ (fixed batch size) |
| How Does Critical Batch Size Scale in Pre-training? | ${\rm critical\ BS} \propto T$ (fixed LR) |
| Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit | Relations among BS, LR, and $T$ are complicated |
| How to set AdamW’s weight decay as you scale model and dataset size | The “timescale” $1/({\rm LR} \cdot {\rm WD})$ should be held constant |
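As a quick illustration of the first row, here is a minimal sketch (my own, not from the paper) of using the fitted power law to extrapolate a tuned learning rate from a small proxy run to a larger model and longer token horizon; the baseline numbers in the example are hypothetical.

```python
def scale_lr(base_lr, base_params, base_tokens, target_params, target_tokens):
    """Extrapolate the optimal LR via LR ∝ N^-0.23 · T^-0.32 (batch size fixed).

    N = model size in parameters, T = training tokens; the exponents are the
    fitted values quoted in the table above.
    """
    return (base_lr
            * (target_params / base_params) ** -0.23
            * (target_tokens / base_tokens) ** -0.32)

# Hypothetical example: LR tuned to 3e-4 on a 125M-parameter model trained on
# 2B tokens, extrapolated to a 1.3B-parameter model trained on 100B tokens.
print(scale_lr(3e-4, 125e6, 2e9, 1.3e9, 100e9))  # ≈ 5e-5
```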