Isn't it bad design in the first place if you require "right order" of boot up? What if some, but not all, servers crash and reboot? How do you ensure the correct order in that case?
That’s a good point. Ideally, if we had complete control over how applications behave on startup, we could design them to “self-heal” and avoid dependency issues altogether. However, many of the components in the systems we work with are either closed source or require more experience and expertise to modify.
We’ve also noticed that, in practice, many of our customers running data-critical production systems prefer to control the boot process manually so they can confirm data integrity.
The scenario of partial crashes is interesting. I need to think about how to handle that a bit more. Thanks for the feedback!
One thing that might be useful is to allow a given server in the chain to fail. For example, if you had a Proxmox (or other hypervisor) cluster, in the event that a single node fails to come up, you'd probably want everything else to still boot. Or maybe it would be easier if there was a separate category for VM vs. hypervisor?
Either way, neat project, and thank you for sharing.
Thanks for the feedback! So would a tagging system be useful? Right now you can declare dependencies on a single server, but maybe we can have it depend on at least one of the machines in a tagged group booting up?
I could see a group being useful, yeah. Must have one of [server0, server1, server2] to continue. Though there is a lot of bleed-over when talking about hypervisors and the boot order of VMs, since hypervisors generally can handle that, at least on their own node.
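To make the "must have one of [server0, server1, server2]" idea concrete, here is a rough Python sketch of how an any-of group check could sit next to the existing all-of dependencies. None of this is rallyup's actual code or config format; `Server`, `depends_on_any`, and `ready_to_boot` are made-up names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical model, not rallyup's real schema.
    name: str
    # Every server listed here must be up before this one boots.
    depends_on: list[str] = field(default_factory=list)
    # At least one server listed here must be up (e.g. a tagged group).
    depends_on_any: list[str] = field(default_factory=list)


def ready_to_boot(server: Server, up: set[str]) -> bool:
    """Return True if the server's dependencies are satisfied by the set of hosts that are up."""
    all_ok = all(dep in up for dep in server.depends_on)
    # An empty any-of group imposes no constraint.
    any_ok = not server.depends_on_any or any(dep in up for dep in server.depends_on_any)
    return all_ok and any_ok


app = Server(
    "app",
    depends_on=["switch"],
    depends_on_any=["server0", "server1", "server2"],
)
print(ready_to_boot(app, {"switch", "server1"}))
print(ready_to_boot(app, {"switch"}))
```

With a check like this, a single hypervisor node failing to come up would not block everything behind the group, as long as one of its siblings made it.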
A very relatable struggle. Cool project! I remember getting into WoL as a kid playing with our home PC, felt like magic to press a button on my phone and watch the fully powered off machine come to life.
Never sorted out a reliable enough system for it to be practically useful, but this gives me some ideas...
Equip your machines with whack-on-LAN so you can remotely reset them by the same mechanism, and you've got a reasonably complete remote management setup!
I only had one problem with it: enabling it in the BIOS isn't enough. I also needed to flip a switch in Windows and set up a systemd service on Linux (I dual boot).
I always wondered: why make it so difficult to turn on? Is it a security issue? I mean, an off-by-default OS setting and an off-by-default BIOS setting? How dangerous is this thing??
It draws more power because the NIC can't power off completely. So Microsoft and every hardware vendor are incentivized to turn it off to look good. (And probably to comply with regulations.)
I feel that defaulting it to off is justified, but making it so hard to turn on is pretty pathetic.
The fury Microsoft generated by turning it off in a Windows update still fuels me. I had a remote PC that I needed to access during the holiday season, and Microsoft turned off my ability to power it on, leaving me trying to figure out why I couldn't access the machine anymore.
Yep, pretty sure. m is for multicast.
From the Arch Wiki: d (disabled), p (PHY activity), u (unicast activity), m (multicast activity), b (broadcast activity), a (ARP activity), g (magic packet activity)
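For anyone wondering what the "magic packet" behind the g flag actually looks like: it is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. A minimal Python sketch, assuming UDP port 9 and the 255.255.255.255 broadcast address (both common choices, but your network may need something else); the function names are just for illustration.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 bytes of 0xFF, then the MAC repeated 16 times: 102 bytes total.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP. Port and address are assumptions."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))


if __name__ == "__main__":
    # Placeholder MAC; replace with the address of the machine to wake.
    send_magic_packet("aa:bb:cc:dd:ee:ff")
```

The NIC only reacts to this if wake-on-magic-packet is enabled in both the firmware and the OS, which is where most of the trouble in this thread comes from.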
I still have a lot of weirdness when it comes to making things actually stay asleep and wake up again when wanted. There's a lot of hardware that often boots up, or always boots but only if it's been switched on once before, or keeps trying to wake itself up for no good reason.
The absolute most consistent way I've found is using cheap Zigbee smart switches or even second-hand smart PDUs. Set the machine to boot when power is restored and actually switch the thing on and off from the wall. It saves a whole lot of messing around, you can force the reboot issue, and for a tiny amount more in upfront costs you can have power monitoring as well. It also works for network gear that doesn't sleep or anything else that has a physical switch.
edit: Ideally give me hardware with proper out-of-band management (IPMI or AMT at a pinch), but for everything else having control of the power is as good as it gets.
It's pretty heavily used in some on-premises HPC contexts... I used to run a large Supermicro cluster which we would power down when not needed, which saved a fair amount of electricity (and by extension emissions and money). It's quite solid.
I have a weird desktop system on an ASRock motherboard.
It has two Ethernet ports (10GbE and 1GbE) and built-in WiFi.
I have a 10GbE network, so of course I want to use the 10GbE port.
The issue I discovered after many hours of debugging is that the 10GbE port is powered down completely on suspend/power-off, so WoL has no way to work on it.
Because I had a limited number of Ethernet ports available, I set up the system to wake over WiFi (with wake also on key rotation or disconnect).
I haven't had problems with it the past few years, on SuperMicros.
EDIT: as the sibling comment reminded me, I'm using IPMI, not WoL. That said, I have tested WoL and had no issues with it doing its job – I only switched because I had a server that would randomly fail to find its NVMe drive at boot; rebooting (which IPMI allowed me to do) would fix it.
Been fine for me from the mid-00s onwards. From memory, with SuperMicro, HP, Dell kit etc. Usually set up via IPMI. Not done it for a while, but don't recall issues.
In my experience it depends a lot on your hardware. For some versions, my MSI consumer mainboard just would not respond to WoL packets, no matter what I tried.
Nice project, thanks for sharing!
What type of API are you thinking of? It already runs on a YAML config, so maybe a web server that takes the config as a JSON body instead?
https://github.com/darwindarak/rallyup/commit/36f9c474c13644...
https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
https://www.i3detroit.org/reset-on-lan-an-ethernet-aware-rem...
For Linux set it to "g": https://wiki.archlinux.org/title/Wake-on-LAN#Make_it_persist...
For Windows you need to enable "Wake on Magic Packet": https://www.windowscentral.com/how-enable-and-use-wake-lan-w...
G for maGic, I presume?