In this post we look at Linux disk I/O performance using either ZFS raidz or the Linux mdadm software RAID-0. It is important to understand that RAID-0 is not reliable for data storage: the loss of a single disk can destroy the whole array. ZFS raidz1, on the other hand, behaves similarly to RAID-5, while creating a ZFS pool without specifying raidz1 is effectively RAID-0.
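
For reference (not part of the setup below), the difference is only in whether raidz1 appears in the zpool create command; with a placeholder pool name:

Striped, effectively RAID-0: zpool create pool0 /dev/nvme0n1 /dev/nvme1n1
Single parity, RAID-5-like: zpool create pool0 raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1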

ZFS Setup:

zpool create -f -o ashift=12 -m /bigdata bigdata raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
zfs set recordsize=1024K bigdata
zfs set checksum=off bigdata
zfs set atime=off bigdata

Check the pool, drives and their status:
zpool status
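
The dataset properties set above can be double-checked as well, using the standard zfs get command:
zfs get recordsize,checksum,atime bigdata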

Mdadm Setup:

mdadm --create /dev/md0 -l 0 -n 5 /dev/nvme[0-4]n1

Create file system:
mkfs.xfs -f /dev/md0
mkdir /bigdata
mount -t xfs -o noatime,nobarrier,discard /dev/md0 /bigdata

Check RAID-0 status:
mdadm --detail /dev/md0
cat /proc/mdstat
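
If the array should come back automatically after a reboot, the usual extra step (not part of the original setup above) is to record it in the mdadm configuration file, whose path may be /etc/mdadm.conf or /etc/mdadm/mdadm.conf depending on the distribution:
mdadm --detail --scan >> /etc/mdadm.conf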

IRQ Pinning for NVME Drives

None of the following steps are needed if the kernel version is 4.4 or higher. Earlier kernels have SMP affinity bugs in the nvme driver that leave all of its IRQs defaulting to CPU 0.

We can get the NUMA node where a particular NVME device is connected using:
# cat /sys/class/nvme/nvme0/device/numa_node
0
So device nvme0 is attached to NUMA node 0, i.e. the first physical processor.

# cat /sys/class/nvme/nvme4/device/numa_node
1
Here, however, nvme4 is attached to NUMA node 1, the second physical processor.
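
The IRQ numbers belonging to a drive can be listed straight from /proc/interrupts, which is exactly what the script below parses; for example, for nvme0:
grep nvme0 /proc/interrupts | awk '{print $1}' | sed 's/://'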

I wrote this small script, which pins an individual drive's IRQs to the desired CPU core(s):

#!/bin/sh
# nvme_smp.sh
# Pin a drive's IRQs to CPU cores cpu1..cpu2 in round-robin fashion.
# Usage: nvme_smp.sh <device> <cpu1> <cpu2>
device=$1
cpu1=$2
cpu2=$3
cpu=$cpu1
# list the IRQ numbers that belong to this device
grep "$device" /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
do
  echo "$cpu" > /proc/irq/$int/smp_affinity_list
  echo "echo $cpu > /proc/irq/$int/smp_affinity_list"
  # wrap back to cpu1 once cpu2 has been used
  if [ "$cpu" -eq "$cpu2" ]
  then
     cpu=$cpu1
  else
     cpu=$((cpu + 1))
  fi
done

Pin all of a drive's IRQs to a single CPU: nvme_smp.sh nvme0 7 7
Distribute the IRQs round-robin across CPUs 7 through 9: nvme_smp.sh nvme0 7 9
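
To confirm that the pinning took effect, the same IRQ list can be read back; a small check loop, assuming the script was just run for nvme0:

for int in $(grep nvme0 /proc/interrupts | awk '{print $1}' | sed 's/://')
do
  echo "IRQ $int -> CPU(s) $(cat /proc/irq/$int/smp_affinity_list)"
done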

Performance Results

I used the fio tool to benchmark write throughput on the NVME devices. The drives are mx6300-270 cards from Mangstor; I am using five of them in a SuperMicro X10DRi server with Intel Haswell processors. Drive throughput is monitored with dstat, which reports real-time CPU usage and disk write throughput.

These NVME drives, like most others, require very good airflow; I had to mount two spare fans in the chassis as shown below:

[Image: Mangstor-Fans]

If the airflow is not adequate, the drives throttle their write speed, as we noticed below (three drives throttling their speed):

[Image: plot5drives]
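
If throttling is suspected to be thermal, the drive's reported temperature can be checked; a quick example assuming the nvme-cli package is installed:
nvme smart-log /dev/nvme0n1 | grep -i temp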

The fio jobs are balanced across the NUMA nodes where the devices are physically attached, using the two scripts below. The first script calls fios.sh with a mount point and a NUMA node; since the devices are pooled through either ZFS or mdadm, there is only one mount point available. Using the numactl command, fio is balanced so that three processes start on the first processor and two on the second.

#!/bin/sh
# wrapper: three fio runs bound to NUMA node 0, two bound to node 1
/root/fios.sh bigdata 0
/root/fios.sh bigdata 0
/root/fios.sh bigdata 0
/root/fios.sh bigdata 1
/root/fios.sh bigdata 1

#!/bin/sh
# fios.sh: $1 = mount point (without the leading /), $2 = NUMA node
numactl --cpubind=$2 --membind=$2 fio --name=global --directory=/$1 --rw=write --ioengine=libaio --direct=1 \
 --size=100G --numjobs=16 --iodepth=16 --bs=128k --name=job &
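
A quick way to sanity-check the CPU placement (an extra check, not from the original runs) is to look at which processors the fio workers actually land on:
ps -o pid,psr,comm -C fio
The psr column shows the CPU each process last ran on; the values should fall within the core ranges of the intended NUMA node.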

#dstat command line is: dstat -D,total,bigdata,nvme0n1,nvme1n1,nvme2n1,nvme3n1,nvme4n1

Its report looks similar to this:

# dstat -D,total,bigdata,nvme0n1,nvme1n1,nvme2n1,nvme3n1,nvme4n1          
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total--dsk/nvme0n1-dsk/nvme1n1-dsk/nvme2n1-dsk/nvme3n1-dsk/nvme4n1 -net/total- ---paging-- ---system--
usr sys idl wai hiq siq|   read  writ: read  writ: read  writ: read  writ: read  writ: read  writ|   recv  send|    in   out |   int   csw 
  4  31  63   2   0   0|    10M 2680M:     0   574M:     0   550M:     0   559M:     0   494M:     0   503M|   947B 2962B|     0     0 |    40k   48k
  0  55  42   3   0   0|     0  8471M:     0  1872M:     0  1864M:     0  1912M:     0  1449M:     0  1375M|   240B  598B|     0     0 |    75k  133k
  0  53  44   3   0   0|     0  7731M:     0  1158M:     0  1122M:     0  1041M:     0  2180M:     0  2230M|   360B  828B|     0     0 |    69k  121k
  0  33  63   4   0   0|     0  9671M:     0  2149M:     0  2209M:     0  2240M:     0  1544M:     0  1529M|   377B  598B|     0     0 |    76k  150k


Write throughput using Mdadm

[Image: fio-mdadm]

CPU Usage with Mdadm

[Image: mdadm-cpu]

Write throughput using ZFS


CPU Usage with ZFS

[Image: ZFS-cpu]


Linux ZFS vs Mdadm performance difference