Wednesday, December 8, 2010

Update on SATA Expanders

So we've done some more research, largely following up on work done by Richard Elling, and I have an update on the SAS/SATA expander problem. There is at least some good news here.

The problems that we've had in the past with these have centered around "reset storms", where a single reset expands into a great number of resets, and I/O throughput quickly diminishes to zero.

The root cause is that when a reset occurs on an expander, it aborts any in-flight operations, and they fail. Unfortunately, the *way* in which they fail is to generate a generic "hardware error". The sd(7D) driver's response to this is to ... issue another reset, in a futile effort to correct things.

Worse, sd behaves the same way, by default, for media errors, e.g. when a disk has a bad sector on it. Of course, if your disk is mostly idle, it won't be a problem. But if you have a lot of I/O going on, it's going to result in a meltdown.
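To make the feedback loop concrete, here is a toy model of it. This is only a sketch under my own simplifying assumptions (time advances in rounds, every device behind the expander keeps a fixed queue depth, and each device that sees an aborted command asks for exactly one more reset). The names `storm_result` and `simulate_storm` are made up for illustration and do not appear in sd.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: names and numbers are illustrative, not taken from sd.c. */
struct storm_result {
    int resets;     /* resets that reached the expander */
    int completed;  /* commands that finished successfully */
};

/*
 * Each round, every device behind the expander has queue_depth commands
 * in flight. A reset aborts all of them. If reset_on_error is set, every
 * device that saw an aborted command asks for one more reset next round.
 */
static struct storm_result
simulate_storm(int devices, int queue_depth, int rounds, bool reset_on_error)
{
    struct storm_result r = { 0, 0 };
    int pending_resets = 1;     /* the single reset that starts it all */

    assert(devices > 0 && queue_depth > 0 && rounds > 0);

    for (int round = 0; round < rounds; round++) {
        if (pending_resets > 0) {
            /* The expander is reset: nothing completes this round. */
            r.resets += pending_resets;
            pending_resets = reset_on_error ? devices : 0;
        } else {
            /* Quiet round: everything in flight completes. */
            r.completed += devices * queue_depth;
        }
    }
    return (r);
}
```

In this model, 24 devices at a queue depth of 10 over 5 rounds give 97 resets and zero completed commands when errors trigger resets, versus 1 reset and 960 completed commands when they do not. The real numbers depend on timing and retry counts, but the shape is the same: one reset keeps feeding itself.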

There is good news though, because of the way LSI's drivers are designed.

The LSI mptsas driver at least (and I suspect mpt as well, though I don't have code to look at it) treats "bus-level" resets and "target-level" resets as the same. Both of them do a reset, which will of course reset the expander.

But we can disable the most pernicious reset in sd with the following line in sd.conf:

allow-bus-device-reset=0;

This still allows bus-wide resets to occur, but it specifically disables the reset that sd issues in response to generic hardware and media errors. The relevant section of code in sd.c is this:

    if ((un->un_reset_retry_count != 0) &&
        (xp->xb_retry_count == un->un_reset_retry_count)) {
        mutex_exit(SD_MUTEX(un));
        /* Do NOT do a RESET_ALL here: too intrusive. (4112858) */
        if (un->un_f_allow_bus_device_reset == TRUE) {

            boolean_t try_resetting_target = B_TRUE;

            /*
             * We need to be able to handle specific ASC when we are
             * handling a KEY_HARDWARE_ERROR. In particular
             * taking the default action of resetting the target may
             * not be the appropriate way to attempt recovery.
             * Resetting a target because of a single LUN failure
             * victimizes all LUNs on that target.
             *
             * This is true for the LSI arrays, if an LSI
             * array controller returns an ASC of 0x84 (LUN Dead) we
             * should trust it.
             */

            if (sense_key == KEY_HARDWARE_ERROR) {
                switch (asc) {
                case 0x84:
                    if (SD_IS_LSI(un)) {
                        try_resetting_target = B_FALSE;
                    }
                    break;
                default:
                    break;
                }
            }

            if (try_resetting_target == B_TRUE) {
                int reset_retval = 0;
                if (un->un_f_lun_reset_enabled == TRUE) {
                    SD_TRACE(SD_LOG_IO_CORE, un,
                        "sd_sense_key_medium_or_hardware_"
                        "error: issuing RESET_LUN\n");
                    reset_retval =
                        scsi_reset(SD_ADDRESS(un),
                        RESET_LUN);
                }
                if (reset_retval == 0) {
                    SD_TRACE(SD_LOG_IO_CORE, un,
                        "sd_sense_key_medium_or_hardware_"
                        "error: issuing RESET_TARGET\n");
                    (void) scsi_reset(SD_ADDRESS(un),
                        RESET_TARGET);
                }
            }
        }
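If you want to play with the decision logic outside the kernel, the excerpt above can be modeled as a small standalone function. This is a sketch, not the driver code: `struct sd_model`, `model_resets_attempted` and the `MODEL_*` constants are hypothetical names of mine, and the outcome of the LUN reset is passed in as an assumption rather than coming from a real scsi_reset() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* SCSI sense key values for MEDIUM ERROR and HARDWARE ERROR. */
#define MODEL_KEY_MEDIUM_ERROR   0x03
#define MODEL_KEY_HARDWARE_ERROR 0x04

/* Bit flags describing which resets the model would attempt. */
#define MODEL_TRY_LUN    0x1
#define MODEL_TRY_TARGET 0x2

/* Hypothetical stand-in for the handful of per-unit fields used above. */
struct sd_model {
    int  reset_retry_count;       /* un_reset_retry_count */
    bool allow_bus_device_reset;  /* un_f_allow_bus_device_reset */
    bool lun_reset_enabled;       /* un_f_lun_reset_enabled */
    bool is_lsi;                  /* SD_IS_LSI(un) */
    bool lun_reset_succeeds;      /* assumed result of the LUN reset */
};

/* Returns a mask of MODEL_TRY_* flags for one failed command. */
static unsigned
model_resets_attempted(const struct sd_model *m, int retry_count,
    int sense_key, int asc)
{
    unsigned attempted = 0;

    assert(m != NULL);

    /* Only act on the retry that matches the configured count. */
    if (m->reset_retry_count == 0 || retry_count != m->reset_retry_count)
        return (0);
    /* allow-bus-device-reset=0 skips the whole block. */
    if (!m->allow_bus_device_reset)
        return (0);
    /* An LSI array reporting ASC 0x84 (LUN Dead) is trusted. */
    if (sense_key == MODEL_KEY_HARDWARE_ERROR && asc == 0x84 && m->is_lsi)
        return (0);

    if (m->lun_reset_enabled) {
        attempted |= MODEL_TRY_LUN;
        if (m->lun_reset_succeeds)
            return (attempted);
    }
    /* LUN reset disabled or failed: fall back to a target reset. */
    return (attempted | MODEL_TRY_TARGET);
}
```

The point to notice is that a plain media error with the default settings ends in a target reset, and that clearing `allow_bus_device_reset` is the only switch in this block that turns it off for every sense key.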

The savvy folks here might notice that this is a global setting, which is true. You can set it on a specific instance of sd instead, but that requires more effort. There is also a better way to do this, by setting the reset_retry_count property to zero. However, setting the sd.conf property for that properly is considerably more complex, because of the byzantine syntax that sd uses to set up target-specific property values.

So, I still recommend avoiding these SATA expanders. But if you have no choice, then using this sd.conf tunable may be a reasonable workaround.

At the same time, I'm investigating the possibility of having this disabled by default for all of Nexenta's customers -- and possibly even in illumos. If you're a SCSI expert and have opinions on the matter, please let me know.

9 comments:

Ray Van Dolson said...

Garrett, great info... obviously a lot of vendors sell drives that internally are SATA but have SAS interconnects (Dell for example). We have many such systems in production in our data centers with no (apparent) issues (note that these systems aren't typically running Solaris-based OSes).

I presume the problem could exist there as well though as the issue pops up when we must convert between the SATA protocol and SAS? Regardless of whether or not the drive or backplane is doing the conversion.... the issue wouldn't be limited to only LSI based SAS controllers either I assume?

Our issue manifested itself mostly with SSDs we were using as ZIL, even though we had 22 other 1TB SATA drives on the same expander. I am guessing the extra-high IOPS the ZIL SSDs saw triggered the problem there. We potentially could have seen the same on the 1TB SATA drives as well had the correct workload conditions been met.

Thanks again.

Garrett D'Amore said...

I think yes, the very high IOPS is what makes this problem so tragic. Given a more reasonable workload, you'd probably only notice some resets, and maybe a modest degradation in performance that would self-correct.

As far as LSI vs others? I'm not sure -- I've not investigated fully.

I do think we were being too free with the resets, and I think having multiple devices sitting behind an expander has a lot to do with the penalties involved.

I'm hoping to provide some better long term answers here soon.

Craig said...

Garrett, another possibility to explore is indirectly related to the high IOPS of SATA SSDs behind the expanders, namely that the point-to-point channel nature of comms to SATA drives, combined with a high IOPS workload to same, may conspire to 'hog' the bus … what other devices and expander firmware will do on this occasion is suspect … routinely overreaching and issuing a bus reset rather than a more targeted target reset is certainly a possibility.

It would be interesting to deploy a consistent config with an alternate OS if we could monitor at the protocol layer. There are certainly a few companies over here which have the necessary equipment to achieve protocol tracing at the right level to identify this sort of operation.

Richard and I had a few more ideas last week, will drop you a line privately ...

Ravi said...

The problem is that SAS is a connection-oriented protocol and the expander simply forwards primitives/frames to the disks. Maybe some day we will see more powerful expanders which terminate SAS connections and handle error recovery without confusing the HBA. As the SAS topology gets large, simply forwarding connections is not enough. The SAS flow control primitives (such as RRDY) have a typical timeout of 1ms, so lengthy cables and daisy-chained expanders would create more problems.

nadav said...

Hello Garrett,
Is this (sd.conf) something that can currently be used in NexentaStor 3.0.4 to fix SATA->SAS interposer problems with SATA SSDs connected to SAS backplanes (for ZIL and L2ARC)?
If yes, how?

Thanks,

Garrett D'Amore said...

Not entirely. You really want the fixed drive firmware for the complete fix. I still don't like SATA drives on expanders, to be fully honest.

Donald said...

How can you tell if you are running into a reset storm? What log would show you this information?

I've got SAS-only expanders with SAS disks throughout, except for our SSDs, which are X25-Es on AAMUXes in the first disk shelf. I've been running into a lot of oddball problems recently and would love to track down any potential problems.

IstarUSA said...

A SAS controller can take both SATA and SAS drives. Some higher-end SAS controllers support SAS expanders.
