Bug 194522 - smartmontools misbehaving (self-test timestamps do not match reality)
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.2-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-bugs (Nobody)
Reported: 2014-10-22 00:35 UTC by Jeremy Chadwick
Modified: 2014-11-27 22:57 UTC

Attachments
dmesg.txt (9.64 KB, text/plain)
2014-10-22 00:35 UTC, Jeremy Chadwick
pciconf -lvbc output (11.37 KB, text/plain)
2014-10-22 00:35 UTC, Jeremy Chadwick

Description Jeremy Chadwick 2014-10-22 00:35:04 UTC
Created attachment 148554
dmesg.txt

NOTE: FreeBSD version involved is technically 9.3-STABLE, but the Bugzilla GUI only offers up to 9.3-RELEASE.  Rephrased: I run base/stable/9.

I'm filing this PR in connection with an open ticket I filed with the smartmontools folks:

http://www.smartmontools.org/ticket/466

Full details, including all output and lots of examples, are there.

Basically, there is something going on within FreeBSD (or possibly within smartmontools, although rebuilding smartmontools 6.2 and using that shows the same problems) where certain SMART attribute values do not seem to appear/behave correctly when compared to values in the SMART extended self-test log.  The issue is 100% reproducible on FreeBSD, and cannot be reproduced on Windows.

Specifically, a self-test issued at a certain Power_On_Hours count does not appear that way -- e.g. if Power_On_Hours is 12345 and a self-test is run, the SMART extended self-test log might show the test completed at hour 12189, or possibly at an hour count in the future.
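
The comparison described above can be sketched in Python. This is only an illustration: it assumes the usual column layout of `smartctl -a` text output (raw value in the tenth column of the attribute table, lifetime hours right after the "Remaining" percentage in the self-test log), and the sample text is hypothetical, built from the 12345 / 12189 example rather than copied from a real drive.

```python
import re

# Hypothetical sample modelled on typical `smartctl -a` output; the numbers
# come from the example in the description, not from an actual drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       12345
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12189         -
"""

def power_on_hours(text):
    """Return the raw value of SMART attribute 9, or None if it is absent."""
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows begin with the numeric ID; RAW_VALUE is column 10.
        if len(fields) >= 10 and fields[0] == "9":
            match = re.match(r"\d+", fields[9])
            if match:
                return int(match.group())
    return None

def selftest_hours(text):
    """Return the LifeTime(hours) value of every self-test log entry."""
    hours = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("#"):
            continue
        # Entry rows end with: <remaining>% <lifetime hours> <LBA or "-">
        match = re.search(r"\d+%\s+(\d+)\s+\S+$", line)
        if match:
            hours.append(int(match.group(1)))
    return hours

def hour_deltas(text):
    """Attribute 9 minus each logged self-test hour (positive = log is behind)."""
    poh = power_on_hours(text)
    if poh is None:
        raise ValueError("attribute 9 (Power_On_Hours) not found")
    return [poh - h for h in selftest_hours(text)]

print(hour_deltas(SAMPLE))
```

For a test that has just finished, a delta near zero is what one would expect; a large positive or negative delta is the symptom reported here.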

This issue affects multiple models of MHDDs and SSDs, and makes data recovery and diagnostics extremely difficult given the behaviour.

I can try to capture the ATA CDBs being submitted to the drive, plus the full response payload, using CAM debugging, but I wanted to at least file a PR on the matter because at this point it's looking to be FreeBSD-centric in some way.

I'll attach dmesg output, as well as pciconf -lvbc output.
Comment 1 Jeremy Chadwick 2014-10-22 00:35:19 UTC
Created attachment 148555
pciconf -lvbc output
Comment 2 Jeremy Chadwick 2014-10-29 08:34:55 UTC
After the system was power-cycled, the issue can no longer be reproduced.  My opinion is that the AHCI controller involved (ICH9) may have some kind of option ROM bug or general issue that was rectified via a power-cycle.  See my most recent/final comment in the smartmontools ticket for details.
Comment 3 Jeremy Chadwick 2014-11-27 19:08:26 UTC
Reopening -- this issue has returned.  Again to recap: rebooting (not power-cycling) the system into Linux alleviates the problem, but it appears again once FreeBSD is running.  So the issue does appear specific to FreeBSD, and is affecting all drives of different versions/models so it isn't a drive-firmware-specific thing.

My gut feeling is that one of these commits is responsible for the problem, particularly r249850, but I have no actual proof/evidence to back that up; it's purely speculation on my part.

http://www.freshbsd.org/commit/freebsd/r249850
http://www.freshbsd.org/commit/freebsd/r251874
http://www.freshbsd.org/commit/freebsd/r251897

Can someone CC mav@ on this ticket?  I'd like to get his insights into this.
Comment 4 Jeremy Chadwick 2014-11-27 22:57:04 UTC
Okay, strike that.  It appears I was witnessing two separate issues simultaneously, which sounds crazy but seems to be the case after some further analysis.

The first issue related to SMART attributes and self-test log entries affecting all drives in the system (Samsung SSD and several WD MHDD drives).  That issue was rectified by power-cycling the system.  This seems to imply the AHCI controller was experiencing some oddity/bug/quirk brought on by who-knows-what.  After power-cycling, everything seemed fine -- except for the Samsung SSD.  Which leads me to:

The second issue relates to the Samsung drive and SMART attribute 9 vs. what the "LifeTime(hours)" field shows in a SMART self-test log entry.  And I believe I've figured out what's going on there[1]:

I recently updated my Samsung 840 EVO's firmware to fix a performance-related bug[2][3][4] that Samsung and many review sites found.  It appears that after doing the firmware update, SMART attributes did not get reset back to zero/factory defaults, but "internal counters" used for calculating the power-on hour count within self-tests **did** get reset to 0.  So there's now a permanent delta between SMART attribute 9 and the hour shown in a self-test log entry because of this mistake on Samsung's part.
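
One rough way to tell such a fixed firmware-induced offset apart from the erratic controller-related behaviour is to compare the delta across several self-tests. The Python sketch below is an assumption-laden illustration: the function name, the one-hour tolerance, and all of the numbers are hypothetical and not taken from the drives in this PR.

```python
def offset_is_constant(observations, tolerance=1):
    """Check whether attribute 9 and the self-test hour differ by a fixed amount.

    observations: list of (attribute_9_hours, selftest_lifetime_hours) pairs,
    each recorded right after a self-test completed.
    tolerance: allowed spread in hours, to absorb rounding of the two counters.
    Returns (is_constant, deltas).
    """
    if not observations:
        raise ValueError("need at least one observation")
    deltas = [attr - logged for attr, logged in observations]
    return max(deltas) - min(deltas) <= tolerance, deltas

# Hypothetical drive whose self-test counter restarted at zero: the delta
# stays essentially the same from one test to the next.
print(offset_is_constant([(5000, 12), (5024, 36), (5100, 111)]))

# Hypothetical erratic case: the logged hour lands in the past for one test
# and in the future for the next, so the delta jumps around.
print(offset_is_constant([(12345, 12189), (12350, 12400)]))
```

A constant delta points at the counters themselves having diverged, while a fluctuating delta points at something in the path between the drive and the tool.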

[1]: http://www.dslreports.com/forum/r29694516-
[2]: http://www.computerworld.com/article/2836082/samsung-delivers-fix-for-ssd-slowdowns.html
[3]: http://www.anandtech.com/show/8617/samsung-releases-firmware-update-to-fix-the-ssd-840-evo-read-performance-bug
[4]: http://beta.slashdot.org/story/208795

So, we can close this out.  There's nothing I can do about this newly-induced quirk in the Samsung firmware, and it obviously has nothing to do with FreeBSD.