Hold on. People make it sound as if a disk failure requires immediate intervention, but if the system has already booted, it stays up. Who reboots a production system without monitoring it? And if you boot with a degraded disk, aren't you asking for massive trouble? If another disk fails you can end up with data loss, which is IMO much worse on a production system than not being able to boot until you add another disk.
The risks are the same whether you keep running with a failed disk or reboot/boot with one. Also, given a system where the OS is not running and needs to boot, how exactly are you going to resync a replacement drive if you don't boot the system with a degraded RAID volume?
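For context, this is roughly what that recovery looks like on btrfs: a minimal sketch, assuming `/dev/sdb` (devid 2) is the dead device and `/dev/sdc` is its replacement; all device names and mount points here are placeholders.

```shell
# Explicitly allow mounting the degraded volume (btrfs refuses by default)
mount -o degraded /dev/sda2 /mnt

# Confirm which devid is missing before replacing it
btrfs filesystem show /mnt

# Rebuild onto the new disk (devid 2 is an assumption for this example)
btrfs replace start 2 /dev/sdc /mnt

# Watch resync progress
btrfs replace status /mnt
```

Without that initial degraded mount, there is no running filesystem to run the replace against, which is the point being made above.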
> Who reboots a production system without monitoring it?
Anyone.
> Also if you boot with a degraded disk are you not asking for massive trouble
> not being able to boot until you add another disk
Great.
I'm just a [sys]admin who was given a task to do something on $server.
I jump through the red tape, claw out a 15-minute downtime window because the task requires a reboot, and go through the whole corporate dance of notification emails.
I do my thing, reboot the server, and it doesn't come back online.
Suddenly I've broken the server and missed the maintenance window, the number of emails with CC: and RE: in my mailbox grows geometrically, and most important of all, now I need to find out who is responsible for the server, contact them, and [kick their ass] ask them to diagnose what is going on.
Bonus points if:
- the server doesn't have a meaningful BMC/iLO/iDRAC with a KVM console
- it does have one, but it's broken for some reason, e.g. it requires Java 6 on Vista
- the server was configured 10 years ago by a greybeard who is not only retired but has already died of old age
- the server is 6000km away from any place with replacement disks, and the earliest you can ship one is next spring, when the ice breaks and thaws enough for ships to move. Of course you can hire a helo to deliver it, which gets your ass chewed for an unplanned $50k expense
It sounds like you're hypothesizing a long chain of bad decisions, and then ridiculing btrfs for taking the choice that means your next bad decision only culminates in (predictable, preventable) downtime rather than data loss.
These are all things I encountered in my admin days.
Including a BL670 with incorrectly connected drives, so despite everything saying (and indicating) that the failed drive was in bay 2, it was actually in bay 1.
> then ridiculing btrfs
I ridicule btrfs for its RAID mode not being a RAID mode by default.
RAID is about availability of data.
If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone comes and investigates what happened, with no data loss.
Right?
> If you're so hell-bent on data safety, then btrfs should kernel panic as soon as one drive degrades. THAT would make sure someone comes and investigates what happened, with no data loss.
> Right?
Panicking an already-running kernel would only prevent userspace from handling the failure through mechanisms that are inherently beyond the scope and capabilities of the kernel alone (i.e., things like alerting a sysadmin, activating a hot spare, or initiating a rebalance to restore redundancy among the remaining drives).
Perhaps the kernel should default to freezing a non-root filesystem when it becomes degraded, absent an explicit configuration permitting otherwise. But for the root filesystem, that would be counterproductive and prevent the failure from being handled gracefully.
Obviously, the tradeoffs are different for a system that is still trying to boot as opposed to one that is fully up and running.
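The userspace handling mentioned above can be as simple as a periodic check. A minimal sketch, assuming a btrfs filesystem mounted at /srv/data and a working local mailer; the mount point and address are placeholders:

```shell
#!/bin/sh
# Hypothetical cron job: `btrfs device stats --check` exits non-zero
# when any device error counter on the filesystem is above zero.
MOUNT=/srv/data                     # illustrative mount point
if ! btrfs device stats --check "$MOUNT" >/dev/null 2>&1; then
    echo "btrfs device errors on $(hostname):$MOUNT" \
        | mail -s "btrfs device errors" ops@example.com   # placeholder address
fi
```

A kernel panic would make exactly this kind of handling impossible; a running system can notice the degradation and escalate it instead.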
As someone who also had all of these happen at one point or another over the years: these are all process issues, not technical issues. If your processes are broken, you're gonna have a bad time no matter your RAID config or FS features.