Software raid check killing Ubuntu 16.04 servers
While investigating why our automated on-site Q&A tests failed once a month (HTTP timeouts and similar symptoms of heavy load), I eventually discovered that the failures coincided with the monthly RAID array check. It turns out that Ubuntu 16.04 ships with:
- the deadline I/O scheduler by default (check with cat /sys/block/sda/queue/scheduler)
- a monthly software RAID check (launched by /etc/cron.d/mdadm), which runs /usr/share/mdadm/checkarray with the --idle argument (which uses ionice)
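To see which scheduler each disk is actually using, something like the following sketch may help. It assumes the usual sysfs format, where the active scheduler is the bracketed entry (for example "noop [deadline] cfq"); the function names active_scheduler and report_schedulers are made up for this illustration.

```shell
#!/bin/sh
# Print the active scheduler from a line such as "noop [deadline] cfq".
# The kernel marks the scheduler currently in use with square brackets.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Report the active scheduler for every block device that exposes one.
report_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue      # skip if the glob matched nothing
        dev=${f#/sys/block/}         # strip the leading path
        dev=${dev%%/*}               # keep only the device name
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}

report_schedulers
```

On a stock 16.04 install this should list deadline for the physical disks, matching what the cat command above shows for sda.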
From /usr/share/mdadm/checkarray:
# queue request for the array. The kernel will make sure that these requests
# are properly queued so as to not kill one of the array.
echo $action > $MDBASE/sync_action
[ $quiet -lt 1 ] && echo "$PROGNAME: I: check queued for array $array." >&2

case "$ionice" in
  idle) ioarg='-c3'; renice=15;;
  low) ioarg='-c2 -n7'; renice=5;;
  high) ioarg='-c2 -n0'; renice=0;;
  realtime) ioarg='-c1 -n4'; renice=-5;;
  *) break;;
esac

resync_pid= wait=5
while [ $wait -gt 0 ]; do
  wait=$((wait - 1))
  resync_pid=$(ps -ef | awk -v dev=$array 'BEGIN { pattern = "^\\[" dev "_resync]$" } $8 ~ pattern { print $2 }')
  if [ -n "$resync_pid" ]; then
    [ $quiet -lt 1 ] && echo "$PROGNAME: I: selecting $ionice I/O scheduling class and $renice niceness for resync of $array." >&2
    ionice -p "$resync_pid" $ioarg || :
    renice -n $renice -p "$resync_pid" 1>/dev/null || :
    break
  fi
  sleep 1
done
However, the deadline I/O scheduler ignores ionice (on these kernels, only CFQ honors I/O scheduling classes). So even though the --idle argument is passed, the RAID check (which takes a very long time) does not actually run at low I/O priority…
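A rough way to spot affected disks is sketched below. It rests on a simplifying assumption that fits 16.04-era kernels, namely that CFQ is the only scheduler honoring ionice; the helper name honors_ionice is invented for this example, and the sd* glob assumes SCSI/SATA-style device names.

```shell
#!/bin/sh
# Return 0 if the named scheduler honors ionice classes, 1 otherwise.
# Simplifying assumption for 16.04-era kernels: only CFQ does.
honors_ionice() {
    case "$1" in
        cfq) return 0 ;;
        *)   return 1 ;;
    esac
}

# Warn about each disk whose active scheduler would ignore "ionice -c3".
for f in /sys/block/sd*/queue/scheduler; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    # The active scheduler is the bracketed entry, e.g. "noop [deadline] cfq".
    sched=$(sed -n 's/.*\[\([^]]*\)].*/\1/p' "$f")
    if ! honors_ionice "$sched"; then
        echo "$f: scheduler '$sched' ignores ionice, so --idle has no effect" >&2
    fi
done
```

One possible workaround, which I have not benchmarked, is to switch the member disks to CFQ (as root: echo cfq > /sys/block/sda/queue/scheduler) so that the idle class set by checkarray is actually respected.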
The most incredible part is how undocumented this all is…
