In my day job, I do a lot of work in moving things to the cloud. But in my free time, I focus on running things at home. This offers me a lot of opportunity to experiment and play, but sometimes things become ‘critical’ and failures can be a little stressful.
You may be asking yourself, ‘Joe, how can you have something critical running at home?’ Well, I have some younger kids who are very dependent on specific routines. Our house also doesn’t heat very evenly, and we’re in the cold months, so I have space heaters in some bedrooms controlled by Home Assistant with temperature sensors in the rooms. Little things that started as experiments quickly chained together to become ‘production’ in the house. No one is immune to the pattern of things accidentally becoming production.
All of this is running on low-cost hardware: old, small-form-factor desktops I bought off eBay. I added memory to them, put Kubernetes on them, and then started moving services off an even older desktop system that has been ‘the server’ for over 8 years. For the most part, it’s been an easy thing to maintain. I use Rook/Ceph for storage so that data stays redundant, and most of my services don’t require a lot of storage. It all works well.
Until it doesn’t.
Yesterday, just before my kid’s bedtime, one of the storage drives that came with the machines finally started throwing more errors than it could deal with and would only mount in read-only mode. Normally, I would get into recovery mode and try to clean up the disk to see if I could make it hobble along a bit longer, until a convenient time to replace it. But with bedtime approaching and me laid up on the couch after a minor procedure, I just needed to get things working.
My first attempt was to just drain the node. But with the disk in read-only mode, nothing ever finished terminating. I needed to get the cluster to properly realize that the node was gone and never coming back. This meant removing the node from the etcd member list (it is one of the controller nodes), dropping the Ceph OSD (it’s also a storage node; my budget for this is pretty small), and running in a degraded mode until I can get that node rebuilt.
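For anyone facing the same situation, here is a rough sketch of what that sequence can look like, written as a small Python script that only builds and prints the commands instead of running them. It is not a record of exactly what I typed. The node name, etcd member ID, and OSD ID are made-up placeholders, the ordering is just one reasonable choice, and it assumes `etcdctl` already has its endpoints and certificates configured and that the `ceph` commands are run somewhere with access to the cluster, such as the Rook toolbox pod.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadNode:
    """Identifiers for the failed node. All values used below are hypothetical."""
    name: str            # Kubernetes node name
    etcd_member_id: str  # hex ID as shown by `etcdctl member list`
    osd_id: int          # numeric ID of the Ceph OSD that lived on the node


def removal_plan(node: DeadNode) -> list[list[str]]:
    """Return the commands, in order, as argv lists. Nothing is executed."""
    osd = str(node.osd_id)
    return [
        # Drop the dead controller from the etcd member list.
        ["etcdctl", "member", "remove", node.etcd_member_id],
        # Tell Kubernetes the node is gone so stuck pods get rescheduled.
        ["kubectl", "delete", "node", node.name],
        # Mark the OSD out so Ceph re-replicates its data elsewhere.
        ["ceph", "osd", "out", osd],
        # Remove the OSD from the cluster map for good.
        ["ceph", "osd", "purge", osd, "--yes-i-really-mean-it"],
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    dead = DeadNode(name="node-3", etcd_member_id="8e9e05c52164694d", osd_id=2)
    for argv in removal_plan(dead):
        print(shlex.join(argv))
```

Printing the plan first is deliberate: when you are tired and in a hurry, reading the commands before running them is cheap insurance. With Rook, the operator may also need to be told about the removed OSD so it does not try to bring it back, so treat this as a starting point rather than a complete runbook.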
I managed to get it all done in time for the bedtime routine to proceed and the heaters to work. In the process, I realized that some of my backups for databases running on this cluster need to be better. So consider this a reminder: old hardware is great and can do a lot of things. But it will fail eventually. Be ready with a plan!