In the ML problem I'm working on now, there are about a dozen simple hyperparameters, and each training run takes hours or even days. I don't think there's any good way to search the space of hyperparameters without a deep understanding of the problem domain, and even then I'm often surprised when a minor config tweak yields better results (or fails to). Many of these hyperparameters affect performance directly and are very sensitive to hardware limits, so a bad value leads to an out-of-memory error in one direction or a runtime measured in years in the other. It's a real-world halting problem on steroids.
And that's not even to mention more complex design decisions, like the architecture of the model, which can't be captured in a simple hyperparameter.
Optuna often works fine in this context (even with the memory errors or, with some tuning, with the non-halting runs): https://github.com/optuna/optuna
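A minimal sketch of what that can look like, assuming a PyTorch-style training loop (train_one_epoch and the search ranges are placeholders I made up, not anything from Optuna itself): OOM trials get pruned instead of killing the study, and a median pruner cuts off runs whose learning curves lag behind.

    # Sketch: Optuna with a pruner for hopeless runs, and OOM treated as a pruned trial.
    # train_one_epoch() and the search ranges are hypothetical placeholders.
    import optuna

    def objective(trial):
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        batch_size = trial.suggest_int("batch_size", 16, 512, log=True)
        try:
            for epoch in range(20):
                val_loss = train_one_epoch(lr, batch_size)  # placeholder for real training
                trial.report(val_loss, epoch)
                if trial.should_prune():          # pruner kills runs whose curve lags behind
                    raise optuna.TrialPruned()
            return val_loss
        except RuntimeError as e:                 # CUDA OOM surfaces as a RuntimeError
            if "out of memory" in str(e).lower():
                raise optuna.TrialPruned()        # count the OOM config as pruned, keep searching
            raise

    study = optuna.create_study(direction="minimize",
                                pruner=optuna.pruners.MedianPruner())
    study.optimize(objective, n_trials=50, timeout=24 * 3600)

The timeout only caps the study's total wall-clock time; a single non-halting trial still needs its own guard, like the fixed epoch cap above.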
I've been wondering how the training process for these huge models works in practice. If an optimization run costs millions, they presumably don't just run a grid search over hyperparameters.
Run a grid of hyperparameters for small models of different sizes to find out how the optimal values change as you scale up (the "scaling laws"), then extrapolate to predict performance at even larger scales, then do a single large run and hope that your predictions aren't too far off.
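Concretely, the extrapolation step can be as simple as a power-law fit in log-log space. A toy sketch (the sizes and losses below are made-up illustrative numbers, and real scaling-law fits usually include an irreducible-loss offset):

    import numpy as np

    # Made-up small-run results: model size (params) vs. final validation loss.
    sizes  = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
    losses = np.array([3.9, 3.5, 3.1, 2.8, 2.6])

    # Fit loss ~ a * size**b, i.e. a straight line in log-log space.
    b, log_a = np.polyfit(np.log(sizes), np.log(losses), 1)

    # Extrapolate to a model far larger than anything actually trained.
    target = 1e11
    predicted_loss = np.exp(log_a) * target ** b
    print(f"predicted loss at {target:.0e} params: {predicted_loss:.2f}")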
They also depend on the design space being somewhat friendly in nature, so that it can be modelled by a surrogate and the explore/exploit trade-off can be encoded in an acquisition function.
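In its simplest form, "surrogate plus acquisition function" looks roughly like this toy 1-D loop (the objective is a stand-in for a real training run; actual libraries do the same thing with better machinery):

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def expected_improvement(candidates, gp, best_y):
        # Surrogate's posterior mean/std at each candidate point.
        mu, sigma = gp.predict(candidates, return_std=True)
        sigma = np.maximum(sigma, 1e-9)
        z = (best_y - mu) / sigma               # minimizing, so improvement = best_y - mu
        return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    def objective(x):                           # placeholder for "train with hyperparameter x"
        return (x - 0.3) ** 2

    X = np.array([[0.1], [0.5], [0.9]])         # a few already-evaluated configs
    y = np.array([objective(x[0]) for x in X])

    for _ in range(10):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
        cand = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
        x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
        X = np.vstack([X, [x_next]])
        y = np.append(y, objective(x_next[0]))

    print("best hyperparameter found:", X[np.argmin(y)][0])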
Successive halving, for example, builds on assumptions about how the learning curve develops.
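Roughly: a config that looks bad on a small budget is assumed to also look bad on the full budget. The loop itself is tiny; a sketch with hypothetical sample_config/evaluate helpers:

    # Sketch of successive halving: start with many configs on a tiny budget,
    # keep the best 1/eta fraction, give the survivors eta times more budget, repeat.
    # sample_config() and evaluate(config, budget) are placeholders standing in for
    # "draw a random hyperparameter set" and "train for `budget` epochs, return the loss".

    def successive_halving(n_configs=27, min_budget=1, eta=3):
        configs = [sample_config() for _ in range(n_configs)]
        budget = min_budget
        while len(configs) > 1:
            ranked = sorted(configs, key=lambda cfg: evaluate(cfg, budget))  # lower loss first
            configs = ranked[: max(1, len(configs) // eta)]                  # keep the top 1/eta
            budget *= eta                                                    # survivors train longer
        return configs[0]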
The bottom line is that there are hyperparameters for the hyperparameter search itself, so one ends up building hyperparameter heuristics on top of the hyperparameter search.
In the end there is no free lunch. But when a hyperparameter search strategy works reasonably well in a domain, it's a great tool. The nice thing is that the design space can typically be encoded in black-box optimization algorithms fairly easily.
That's the advantage of deep learning over traditional ML: if you've got enough data, you don't need domain knowledge or hyperparameter tuning; you just throw a large enough universal approximator at it. The challenge lies in generating good enough artificial data for domains that lack it, and in getting deep models to perform competitively with simpler models.