Update on SATA Expanders
So we've done some more research, largely following up on work done by Richard Elling, and I have an update on the SAS/SATA expander problem. There is at least some good news here.
The problems we've had with these expanders in the past have centered on "reset storms", where a single reset cascades into a great many resets and I/O throughput quickly drops to zero.
When a reset occurs on an expander, it aborts any in-flight operations, and they fail. Unfortunately, the *way* in which they fail is by reporting a generic "hardware error", and the sd(7D) driver's response to that is to ... issue another reset, in a futile effort to correct things.
Worse, sd does the same thing, by default, for media errors, e.g. when a disk has a bad sector. If the disk is mostly idle, that isn't a problem. But if you have a lot of I/O going on, it's likely to end in a meltdown.
There is good news, though, because of the way LSI's drivers are designed.
The LSI mptsas driver at least (and I suspect mpt as well, though I don't have its code to look at) treats "bus-level" and "target-level" resets the same. Both perform a reset which will, of course, reset the expander.
But we can disable the most pernicious reset in sd with the following line in sd.conf:
allow-bus-device-reset=0;
This still allows bus-wide resets to occur, but it specifically disables the reset that sd issues in response to generic hardware and media errors. The relevant section of code in sd.c is this:
if ((un->un_reset_retry_count != 0) &&
    (xp->xb_retry_count == un->un_reset_retry_count)) {
    mutex_exit(SD_MUTEX(un));
    /* Do NOT do a RESET_ALL here: too intrusive. (4112858) */
    if (un->un_f_allow_bus_device_reset == TRUE) {
        boolean_t try_resetting_target = B_TRUE;
        /*
         * We need to be able to handle specific ASC when we are
         * handling a KEY_HARDWARE_ERROR. In particular
         * taking the default action of resetting the target may
         * not be the appropriate way to attempt recovery.
         * Resetting a target because of a single LUN failure
         * victimizes all LUNs on that target.
         *
         * This is true for the LSI arrays, if an LSI
         * array controller returns an ASC of 0x84 (LUN Dead) we
         * should trust it.
         */
        if (sense_key == KEY_HARDWARE_ERROR) {
            switch (asc) {
            case 0x84:
                if (SD_IS_LSI(un)) {
                    try_resetting_target = B_FALSE;
                }
                break;
            default:
                break;
            }
        }
        if (try_resetting_target == B_TRUE) {
            int reset_retval = 0;
            if (un->un_f_lun_reset_enabled == TRUE) {
                SD_TRACE(SD_LOG_IO_CORE, un,
                    "sd_sense_key_medium_or_hardware_"
                    "error: issuing RESET_LUN\n");
                reset_retval =
                    scsi_reset(SD_ADDRESS(un),
                    RESET_LUN);
            }
            if (reset_retval == 0) {
                SD_TRACE(SD_LOG_IO_CORE, un,
                    "sd_sense_key_medium_or_hardware_"
                    "error: issuing RESET_TARGET\n");
                (void) scsi_reset(SD_ADDRESS(un),
                    RESET_TARGET);
            }
        }
    }
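To make the effect of the tunable easier to see, here is a small user-space sketch of the decision logic in that excerpt. It is not driver code. The struct sd_model type, the resets_issued() function, the RST_* flags, and the lun_reset_ok parameter are all hypothetical names made up for illustration. The sketch assumes the excerpt is the whole story: a reset is only considered once the retry count reaches a nonzero reset_retry_count, and scsi_reset() returning 0 means the reset did not succeed.

```c
#include <stdbool.h>

/* SCSI sense key value for "hardware error" (KEY_HARDWARE_ERROR in the driver). */
#define SKEY_HARDWARE_ERROR 0x04

/* Hypothetical flags describing which resets the model would issue. */
#define RST_NONE   0u
#define RST_LUN    1u
#define RST_TARGET 2u

/* Hypothetical stand-in for the handful of per-unit fields the excerpt reads. */
struct sd_model {
    int  reset_retry_count;       /* models un_reset_retry_count */
    bool allow_bus_device_reset;  /* models un_f_allow_bus_device_reset */
    bool lun_reset_enabled;       /* models un_f_lun_reset_enabled */
    bool is_lsi;                  /* models SD_IS_LSI(un) */
};

/*
 * Returns a bitmask of the resets the excerpt would issue for one failed
 * command. lun_reset_ok is an assumption of this sketch: it says whether
 * a LUN reset, if attempted, would succeed.
 */
unsigned
resets_issued(const struct sd_model *un, int retry_count, int sense_key,
    int asc, bool lun_reset_ok)
{
    unsigned issued = RST_NONE;
    bool lun_ok = false;

    /* No reset unless the retry count has hit a nonzero threshold. */
    if (un->reset_retry_count == 0 ||
        retry_count != un->reset_retry_count)
        return (RST_NONE);

    /* allow-bus-device-reset=0 short-circuits everything below. */
    if (!un->allow_bus_device_reset)
        return (RST_NONE);

    /* An LSI array reporting ASC 0x84 (LUN Dead) is trusted: no reset. */
    if (sense_key == SKEY_HARDWARE_ERROR && asc == 0x84 && un->is_lsi)
        return (RST_NONE);

    /* Try a LUN reset first if enabled; fall back to a target reset. */
    if (un->lun_reset_enabled) {
        issued |= RST_LUN;
        lun_ok = lun_reset_ok;
    }
    if (!lun_ok)
        issued |= RST_TARGET;

    return (issued);
}
```

In this model, both clearing allow_bus_device_reset and setting reset_retry_count to zero make the function return RST_NONE for every error. That is why either setting should keep sd from issuing the reset that ends up hitting the expander.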
The savvy folks here might notice that this is a global setting, which is true. You can set it on a specific instance of sd, but that requires more effort. There is also a better way to do this: setting the reset_retry_count property to zero. However, setting that property correctly in sd.conf is considerably more complex, because of the byzantine syntax that sd uses for target-specific property values.
So, I still recommend avoiding these SATA expanders. But if you have no choice, then using this sd.conf tunable may be a reasonable workaround.
At the same time, I'm investigating the possibility of having this disabled by default for all of Nexenta's customers -- and possibly even in illumos. If you're a SCSI expert and have opinions on the matter, please let me know.
Comments
I presume the problem could exist there as well, though, since the issue pops up whenever we must convert between the SATA protocol and SAS, regardless of whether the drive or the backplane is doing the conversion? I assume the issue wouldn't be limited to LSI-based SAS controllers either?
Our issue manifested itself mostly with the SSDs we were using as ZIL, even though we had 22 other 1TB SATA drives on the same expander. I am guessing the extra-high IOPS the ZIL SSDs saw triggered the problem there. We potentially could have seen the same on the 1TB SATA drives as well, had the right workload conditions been met.
Thanks again.
As far as LSI vs others? I'm not sure -- I've not investigated fully.
I do think we were being too free with the resets, and I think having multiple devices sitting behind an expander has a lot to do with the penalties involved.
I'm hoping to provide some better long term answers here soon.
It would be interesting to deploy a consistent config with an alternate OS if we could monitor at the protocol layer. There are certainly a few companies over here that have the equipment needed to trace the protocol at the right level to identify this sort of operation.
Richard and I had a few more ideas last week, will drop you a line privately ...
Is this (sd.conf) something that can currently be used in NexentaStor 3.0.4 to fix SATA->SAS interposer problems with SATA SSDs connected to SAS backplanes (for ZIL and L2ARC)?
If yes, how?
Thanks,
I've got SAS-only expanders with SAS disks throughout, except for our SSDs, which are X25-Es on AAMUXes in the first disk shelf. I've been running into a lot of oddball problems recently and would love to track down any potential causes.