Optimizing ML training with metagradient descent

(arxiv.org)

72 points | by ladberg 15 hours ago

2 comments

  • munchler 10 hours ago
    In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.

    And that's not even mentioning more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.

    • pama 4 minutes ago
      Optuna often works fine in this context (even with the memory errors or, with some tuning, with the non-halting runs): https://github.com/optuna/optuna
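
      For example, a rough sketch of that setup (the train_model generator and the search ranges here are hypothetical stand-ins for your own training loop; the catch/pruner settings are just one way to absorb OOM failures and cut off runaway runs):

        import optuna

        def objective(trial):
            # Hypothetical search space; replace with your own config knobs.
            lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
            batch_size = trial.suggest_categorical("batch_size", [16, 32, 64, 128])
            # train_model is a stand-in: assume it yields a validation loss per epoch.
            for step, val_loss in enumerate(train_model(lr=lr, batch_size=batch_size)):
                trial.report(val_loss, step)
                if trial.should_prune():  # stop runs whose learning curve looks hopeless
                    raise optuna.TrialPruned()
            return val_loss

        study = optuna.create_study(direction="minimize",
                                    pruner=optuna.pruners.MedianPruner())
        # catch=... records out-of-memory trials as failed instead of killing the study
        study.optimize(objective, n_trials=50, catch=(RuntimeError, MemoryError))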
    • lamename 10 hours ago
      You might find this helpful for prioritizing which knobs to turn first: https://github.com/google-research/tuning_playbook
      • OccamsMirror 7 hours ago
        Starting to get a bit out of date. Pity they stopped updating it.
        • ayepif 1 hour ago
          Pity indeed! Do you have any suggested resources that are more up-to-date?
    • jampekka 4 hours ago
      I've been wondering how the training process of the huge models works in practice. If an optimization run costs millions, they probably don't just run a grid of hyperparameters.
      • yorwba 1 hour ago
        Run a grid of hyperparameters for small models of different sizes to find out how the optimal values change as you scale up (the "scaling laws"), then extrapolate to predict performance at even larger scales, then do a single large run and hope that your predictions aren't too far off.
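
        Roughly, the extrapolation step amounts to fitting a power law to the small-run results and reading off the predicted value at the target scale. A minimal sketch (the sizes, losses, and target scale below are made-up numbers for illustration):

          import numpy as np
          from scipy.optimize import curve_fit

          # Made-up results from small runs: model size vs. best validation loss found.
          sizes = np.array([1e7, 3e7, 1e8, 3e8])   # parameters
          losses = np.array([3.9, 3.5, 3.1, 2.8])  # best loss per size

          # Power law with an irreducible floor: L(N) = a * N**(-b) + c
          def power_law(n, a, b, c):
              return a * n ** (-b) + c

          (a, b, c), _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.0))

          # Extrapolate to the one large run you can only afford once.
          print("predicted loss at 7e10 params:", power_law(7e10, a, b, c))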
    • brandonpelfrey 7 hours ago
      Are you already employing Bayesian optimization techniques? These are commonly used to explore spaces where evaluation is expensive.
      • riedel 6 hours ago
        They also depend on the design space being somewhat friendly in nature, so that it can be modelled by a surrogate and the explore/exploit trade-off can be captured in an acquisition function.

        Successive halving, for example, likewise builds on assumptions about how the learning curve develops.

        The bottom line is that there are hyperparameters for the hyperparameter search itself, so one ends up building heuristics on top of the hyperparameter search.

        In the end there is no free lunch. But if a hyperparameter search strategy works reasonably well in a domain, it is a great tool. The good thing is that the design space can typically be encoded more easily in black-box optimization algorithms.
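
        A bare-bones successive halving loop, to make that learning-curve assumption concrete (the evaluate lambda below is a toy stand-in for a partial training run at a given budget):

          import random

          def successive_halving(configs, evaluate, min_budget=1, eta=2, rounds=4):
              """Keep the best 1/eta of configs each round, giving survivors eta times more budget.
              Assumes cheap early performance is predictive of expensive final performance."""
              budget = min_budget
              survivors = list(configs)
              for _ in range(rounds):
                  scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
                  survivors = scored[:max(1, len(scored) // eta)]
                  budget *= eta
              return survivors[0]

          # Hypothetical use: configs are learning rates; evaluate would normally train
          # for `budget` epochs and return a validation loss. Here it is a toy proxy.
          configs = [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(16)]
          best = successive_halving(configs,
                                    evaluate=lambda cfg, budget: abs(cfg["lr"] - 1e-3) / budget)
          print(best)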

    • logicchains 2 hours ago
      That's the advantage of deep learning over traditional ML: if you've got enough data, you don't need domain knowledge or hyperparameter tuning; just throw a large enough universal approximator at it. The challenge lies in generating good enough artificial data for domains without enough data, and in getting deep models to perform competitively with simpler models.
  • jn2clark 9 hours ago
    How does it compare to previous work on learning to learn? I don't see it referenced: https://arxiv.org/abs/1606.04474