Friday, August 26, 2016

Dell, how I hate thee (let me count the ways)

A Dell "feature" that appears to be designed to force customers to use only Dell parts reduced the speed of a set of SSDs one of my customers installed on their rack-mountable R900 server by a factor of 1000.

Before I get into this, there are some provisos. This server was using Linux kernel 2.6.32. The SSDs involved are Samsung 850 Pro SATA-style solid state disks. SSD is not quite ready for prime time in the 2.6.32 kernel; NVMe support was first added in 3.3, TRIM wasn't available at all until 2.6.33,
and a ton of other things we all take for granted like the device mapper are part of the 4.* kernel.

Consumer-level Samsung drivers bring their own issues. Despite what the knuckle-heads on Reddit have to say about the topic, the Linux kernel still blacklists queued TRIM functions from every Samsung SSD in the 8** series. As of the latest Github commit as of this writing for kernel 4.8 queued TRIM still doesn't work for these devices.

More importantly, the R900 isn't a new server. This is an 8 year old box. There is a SAS backplane involved which, although having a theoretical max data transfer of 3.0 Gbps, was designed before SSDs were widely available, and introduces a bunch of contacts, wiring and complexity that is likely all screwed up and almost certainly not optimized for fat-guy Peta Belly Flops of computing power.

Initial benchmarking with fio and ioping in addition to monitoring CPU iowait times with top and checking out iostat had this server's SSDs performing *slower* than a similar server with 7500 RPM sata disks in a ZFS pool.

I did a bunch of stuff to this box hoping to shake a few extra IOPS out of it. I installed Dell's dsu to get my hands on the latest drivers & firmware (under the mistaken belief that an update on either front had been released in the last decade). I had never physically seen this server; so there was a lot of lspci-ing and modprobe-ing.

Luckily, I stayed focused on the controller & backplane a SAS 6/iR (FW 00.25.47.00.06.22.03.00) and LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08) (FW 1.06), respecticely. Eventually I stumbled upon this post that described - accurately - how the Dell was automatically stepping down the SATA port speed on non-Dell-certified disks from SATA II (3gbps) to SATA I (1.5gbps).

Many moons ago, Dell's RAID cards would simply not allow users to install non-Dell disks. My experience with the R900 using BIOS version 1.2.0 would indicate that - although I am able to use non-certified disks without fatal errors, the backplane deliberately slows these disks down without reason, and in a way that is almost always transparent to the end user. I will hold off on making accusations here until I get my hands on the source code for this firmware, but the evidence up to this point is fairly damning. If anyone from Dell has an explanation for this sort of behavior, I would be happy to publish your feedback here.

There is a workaround for this issue, albeit its incredibly hack-y. It involves the use of the (now defunct) lsiutil application (available here or direct mirror here or you can now find the Linux packages on my GitHub page).  This application allows us to make calls directly to the backplane. In this case, the fix involves resetting the minimum link speed on the back plane from 1.5Gbps to 3.0Gbps. 

Heres a step-by-step:

   - download the zip file
   - unzip the file in a directory of your choice; # unzip LSIUtil_1.62.zip -d /home/joshw/lsiutil/
   - navigate to the directory referencing your OS; # cd /home/joshw/lsiutil/Linux/
   - identify the version of the application matching your processor/OS bit type. For linux, there is a 32 bit, AMD64 and x86_64 version. I selected the x86_64 and applied an executable bit: # chmod +x lsiutil.x86_64
   - make sure youre root: # sudo su
   - run the application: # ./lsiutil.x86_64

You should see something like this:

# ./lsiutil.x86_64

LSI Logic MPT Configuration Utility, Version 1.62, January 14, 2009

1 MPT Port found

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E B3     105      00192f00     0

Select a device:  [1-1 or 0 to quit]

 Most likely you will only have one option here, so select it by pressing 1:

Select a device:  [1-1 or 0 to quit] 1

 1.  Identify firmware, BIOS, and/or FCode
 2.  Download firmware (update the FLASH)
 4.  Download/erase BIOS and/or FCode (update the FLASH)
 8.  Scan for devices
10.  Change IOC settings (interrupt coalescing)
13.  Change SAS IO Unit settings
16.  Display attached devices
20.  Diagnostics
21.  RAID actions
22.  Reset bus
23.  Reset target
42.  Display operating system names for devices
45.  Concatenate SAS firmware and NVDATA files
59.  Dump PCI config space
60.  Show non-default settings
61.  Restore default settings
66.  Show SAS discovery errors
69.  Show board manufacturing information
97.  Reset SAS link, HARD RESET
98.  Reset SAS link
99.  Reset port
 e   Enable expert mode in menus
 p   Enable paged mode
 w   Enable logging

From here select option 13

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 13

You will be immediately prompted with some configuration questions. Just press RETURN to keep the current / default values:
SATA Maximum Queue Depth:  [0 to 255, default is 8]
Device Missing Report Delay:  [0 to 2047, default is 0]
Device Missing I/O Delay:  [0 to 255, default is 0]

Eventually you will be dumped out here:

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     1.5      3.0    Enabled    Disabled  Auto
   1    Enabled     1.5      3.0    Enabled    Disabled  Auto
   2    Enabled     1.5      3.0    Enabled    Disabled  Auto
   3    Enabled     1.5      3.0    Enabled    Disabled  Auto
   4    Enabled     1.5      3.0    Enabled    Disabled  Auto
   5    Enabled     1.5      3.0    Enabled    Disabled  Auto
   6    Enabled     1.5      3.0    Enabled    Disabled  Auto
   7    Enabled     1.5      3.0    Enabled    Disabled  Auto

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit]

Select 8 to make changes to all of the available ports simultaneously

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit] 8

Again, you will be prompted for several values. You want to be very careful here as we only want to change one value - MinRate (this should be the second value you are prompted to modify. Every other value should remain default by pressing RETURN.

Link:  [0=Disabled, 1=Enabled, or RETURN to not change]
MinRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change] 1
MaxRate:  [0=1.5 Gbps, 1=3.0 Gbps, or RETURN to not change]
Initiator:  [0=Disabled, 1=Enabled, or RETURN to not change]
Target:  [0=Disabled, 1=Enabled, or RETURN to not change]
Port configuration:  [1=Auto, 2=Narrow, 3=Wide, or RETURN to not change]

Once you've finished you will be dumped back to the port menu:

PhyNum  Link      MinRate  MaxRate  Initiator  Target    Port
   0    Enabled     3.0      3.0    Enabled    Disabled  Auto
   1    Enabled     3.0      3.0    Enabled    Disabled  Auto
   2    Enabled     3.0      3.0    Enabled    Disabled  Auto
   3    Enabled     3.0      3.0    Enabled    Disabled  Auto
   4    Enabled     3.0      3.0    Enabled    Disabled  Auto
   5    Enabled     3.0      3.0    Enabled    Disabled  Auto
   6    Enabled     3.0      3.0    Enabled    Disabled  Auto
   7    Enabled     3.0      3.0    Enabled    Disabled  Auto

Press RETURN from here to save your changes.

Select a Phy:  [0-7, 8=AllPhys, RETURN to quit]

You'll be prompted again for some other values; again keep the defaults or current values by pressing RETURN:

Persistence:  [0=Disabled, 1=Enabled, default is 1]
Physical mapping:  [0=None, 1=DirectAttach, 2=EnclosureSlot, default is 2]
Number of Target IDs to reserve:  [0 to 32, default is 8]

This will take you back to the main menu. Select 0 from here to save & quit:

Main menu, select an option:  [1-99 or e/p/w or 0 to quit] 0

... Which takes you back to the device menu. Hit 0 again and you are finally done:

     Port Name         Chip Vendor/Type/Rev    MPT Rev  Firmware Rev  IOC
 1.  /proc/mpt/ioc0    LSI Logic SAS1068E B3     105      00192f00     0

Select a device:  [1-1 or 0 to quit] 0
root at someServer in /home/joshw/Linux
#


Here are some real-world benchmarks using fio showing before & after metrics.


BEFORE

1X 4GB FILE RANDOM READ/WRITE

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m(1)] [100.0% done] [21804KB/7072KB/0KB /s] [5451/1768/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=490: Fri Aug 26 13:37:42 2016
  read : io=3071.7MB, bw=20940KB/s, iops=5235, runt=150207msec
  write: io=1024.4MB, bw=6983.2KB/s, iops=1745, runt=150207msec
  cpu          : usr=1.79%, sys=11.75%, ctx=786417, majf=0, minf=24
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=786347/w=262229/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=64


SEVERAL SMALLER FILES / RANDOM WRITE ONLY

fio --name=randwrite --ioengine=libaio --iodepth=1 --rw=randwrite --bs=4k --direct=0 --size=64M --numjobs=32 --runtime=60 --group_reporting --iodepth=16
randwrite: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=16
...
fio-2.13-83-g747b
Starting 32 processes
Jobs: 28 (f=28): [_(2),w(16),_(1),w(7),E(1),w(5)] [89.8% done] [0KB/311.7MB/0KB /s] [0/79.8K/0 iops] [eta 00m:05s]
randwrite: (groupid=0, jobs=32): err= 0: pid=4447: Fri Aug 26 13:05:05 2016
  write: io=2048.0MB, bw=47437KB/s, iops=11859, runt= 44209msec
    slat (usec): min=19, max=13715K, avg=2573.62, stdev=166998.51
    clat (usec): min=5, max=13728K, avg=38699.79, stdev=646546.36
     lat (usec): min=26, max=13729K, avg=41273.42, stdev=667725.13
    clat percentiles (usec):
     |  1.00th=[  454],  5.00th=[  540], 10.00th=[  580], 20.00th=[  636],
     | 30.00th=[  692], 40.00th=[  756], 50.00th=[  868], 60.00th=[ 1160],
     | 70.00th=[ 5536], 80.00th=[11328], 90.00th=[17536], 95.00th=[24704],
     | 99.00th=[41216], 99.50th=[52480], 99.90th=[12779520], 99.95th=[13697024],
     | 99.99th=[13697024]
    lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.02%
    lat (usec) : 500=2.57%, 750=36.65%, 1000=17.06%
    lat (msec) : 2=8.94%, 4=3.35%, 10=8.24%, 20=15.22%, 50=7.37%
    lat (msec) : 100=0.29%, 250=0.01%, >=2000=0.26%
  cpu          : usr=0.13%, sys=2.11%, ctx=67366, majf=0, minf=982
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=99.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=16


AFTER

1X 4GB FILE RANDOM READ/WRITE

test: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 4096MB)
Jobs: 1 (f=1): [m] [100.0% done] [74324K/24284K/0K /s] [18.6K/6071 /0  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=59026: Fri Aug 26 19:10:52 2016
  read : io=3070.6MB, bw=74180KB/s, iops=18545 , runt= 42386msec
  write: io=1025.5MB, bw=24775KB/s, iops=6193 , runt= 42386msec
  cpu          : usr=8.79%, sys=55.13%, ctx=796212, majf=0, minf=20
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued    : total=r=786053/w=262523/d=0, short=r=0/w=0/d=0



SEVERAL SMALLER FILES / RANDOM WRITE ONLY
Jobs: 32 (f=32)
randwrite: (groupid=0, jobs=32): err= 0: pid=36168: Fri Aug 26 17:09:26 2016
  write: io=2048.0MB, bw=1168.1MB/s, iops=299251 , runt=  1752msec
    slat (usec): min=7 , max=21074 , avg=91.01, stdev=468.57
    clat (usec): min=7 , max=21729 , avg=697.25, stdev=1302.17
     lat (usec): min=15 , max=21799 , avg=789.95, stdev=1385.85
    clat percentiles (usec):
     |  1.00th=[  118],  5.00th=[  390], 10.00th=[  482], 20.00th=[  506],
     | 30.00th=[  524], 40.00th=[  540], 50.00th=[  556], 60.00th=[  580],
     | 70.00th=[  588], 80.00th=[  604], 90.00th=[  620], 95.00th=[  636],
     | 99.00th=[10688], 99.50th=[10688], 99.90th=[14656], 99.95th=[20608],
     | 99.99th=[20864]
    bw (KB/s)  : min=24175, max=50576, per=3.10%, avg=37155.51, stdev=7603.34
    lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.27%, 250=2.72%
    lat (usec) : 500=13.04%, 750=82.28%, 1000=0.11%
    lat (msec) : 2=0.04%, 4=0.05%, 10=0.20%, 20=1.20%, 50=0.07%
  cpu          : usr=7.61%, sys=70.17%, ctx=2186, majf=0, minf=913
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=524288/d=0, short=r=0/w=0/d=0

No comments:

Post a Comment