Optimizing ML training with metagradient descent

(arxiv.org)

72 points | by ladberg 15 hours ago

2 comments

  • munchler 10 hours ago
    In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.

    And that's not even mentioning more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.

    • pama 4 minutes ago
      Optuna often works fine in this context (even with the memory errors or, with some tuning, with the non-halting runs): https://github.com/optuna/optuna
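
      For example, a rough sketch of that setup (the train_model generator and the search ranges here are hypothetical stand-ins for your own training loop; the catch/pruner settings are just one way to absorb OOM failures and cut off runaway runs):

        import optuna

        def objective(trial):
            # Hypothetical search space; replace with your own config knobs.
            lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
            batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
            # train_model is a stand-in: assume it yields a validation loss per epoch.
            for step, val_loss in enumerate(train_model(lr=lr, batch_size=batch_size)):
                trial.report(val_loss, step)
                if trial.should_prune():  # stop runs whose learning curve looks hopeless
                    raise optuna.TrialPruned()
            return val_loss

        study = optuna.create_study(direction="minimize",
                                    pruner=optuna.pruners.MedianPruner())
        # catch=... records out-of-memory trials as failed instead of killing the study
        study.optimize(objective, n_trials=50, catch=(RuntimeError, MemoryError))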
    • lamename 10 hours ago
      You might find this helpful for prioritizing which knobs to turn first: https://github.com/google-research/tuning_playbook
      • OccamsMirror 7 hours ago
        Starting to get a bit out of date. Pity they stopped updating it.
        • ayepif 1 hour ago
          Pity indeed! Do you have any suggested resources that are more up-to-date?
    • jampekka 4 hours ago
      I've been wondering how the training process of the huge models works in practice. If an optimization run costs millions, they probably don't just run a grid of hyperparameters.
      • yorwba 1 hour ago
        Run a grid of hyperparameters for small models of different sizes to find out how the optimal values change as you scale up (the "scaling laws"), then extrapolate to predict performance at even larger scales, then do a single large run and hope that your predictions aren't too far off.
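
        Roughly, the extrapolation step amounts to fitting a power law to the small-run results and reading off the predicted value at the target scale. A minimal sketch (the sizes, losses, and target scale below are made-up numbers for illustration):

          import numpy as np
          from scipy.optimize import curve_fit

          # Made-up results from small runs: model size vs. best validation loss found.
          sizes = np.array([1e7, 3e7, 1e8, 3e8])   # parameters
          losses = np.array([3.9, 3.5, 3.1, 2.8])  # best loss per size

          # Power law with an irreducible floor: L(N) = a * N**(-b) + c
          def power_law(n, a, b, c):
              return a * n ** (-b) + c

          (a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.0))

          # Extrapolate to the one large run you can only afford once.
          print("predicted loss at 7e10 params:", power_law(7e10, a, b, c))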
    • brandonpelfrey 7 hours ago
      Are you already employing Bayesian optimization techniques? These are commonly used to explore spaces where evaluation is expensive.
      • riedel 6 hours ago
        They also depend on the design space being somewhat friendly in nature, so that it can be modelled by a surrogate and the explore/exploit trade-off can be captured in an acquisition function.

        Successive halving, for example, likewise builds on assumptions about how the learning curve develops.

        The bottom line is that there are hyperparameters for the hyperparameter search itself, so one ends up building heuristics on top of the hyperparameter search.

        In the end there is no free lunch. But if a hyperparameter search strategy works reasonably well in a domain, it is a great tool. The good thing is that the design space can typically be encoded more easily in black-box optimization algorithms.
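
        A bare-bones successive halving loop, to make that learning-curve assumption concrete (the evaluate lambda below is a toy stand-in for a partial training run at a given budget):

          import random

          def successive_halving(configs, evaluate, min_budget=1, eta=2, rounds=4):
              """Keep the best 1/eta of configs each round, giving survivors eta times more budget.
              Assumes cheap early performance is predictive of expensive final performance."""
              budget = min_budget
              survivors = list(configs)
              for _ in range(rounds):
                  scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
                  survivors = scored[:max(1, len(scored) // eta)]
                  budget *= eta
              return survivors[0]

          # Hypothetical use: configs are learning rates; evaluate would normally train
          # for `budget` epochs and return a validation loss. Here it is a toy proxy.
          configs = [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(16)]
          best = successive_halving(configs,
                                    evaluate=lambda cfg, budget: abs(cfg["lr"] - 1e-3) / budget)
          print(best)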

    • logicchains 2 hours ago
      That's the advantage of deep learning over traditional ML: if you've got enough data, you don't need domain knowledge or hyperparameter tuning; just throw a large enough universal approximator at it. The challenge lies in generating good enough artificial data for domains without enough data, and in getting deep models to perform competitively with simpler models.
  • jn2clark 9 hours ago
    How does it compare to previous work on learning to learn? I don't see it referenced: https://arxiv.org/abs/1606.04474