Hello,

While adding hardware CRC32C support to FreeBSD (#216467), we found that iSCSI CRC errors may not be handled correctly.

When one version of the #216467 patch generated bad CRCs, the iSCSI test disk behaved as follows: the disk connected successfully (the CRCs were presumably correct at that point), but as soon as there was some traffic the disk hung. It could not be recovered; a reboot was needed.

From /var/log/messages:

kernel: WARNING: 192.168.2.1 (iqn.2012-06.srv1:rT1): no ping reply (NOP-In) after 5 seconds; reconnecting
srv1 last message repeated 31 times
srv1 last message repeated 42 times

There was no "digest failed" message.

Should we try to reproduce and investigate this?

Many thanks!
Ben
I believe this was due to an infinite loop (bug) in that CRC code, not bad checksums. However, if someone wants to verify this one way or another, I'd suggest adding a fail point (fail(9)) in iSCSI to forcibly fail checksums in testing.
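To illustrate the fault-injection idea, here is a minimal user-space sketch in C. It is not the kernel iSCSI code and does not use the fail(9) API: `crc32c_sw`, `digest_ok`, and the `inject_bad_digest` knob are hypothetical names made up for this example, with the knob standing in for what a real fail point would do in the receive path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise software CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow, but enough to show the digest check.
 */
uint32_t
crc32c_sw(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return (crc ^ 0xFFFFFFFFu);
}

/* Hypothetical test knob; a kernel version would use a fail point instead. */
bool inject_bad_digest = false;

/*
 * Compare the digest received with a PDU against the one computed locally.
 * When the knob is set, the computed value is corrupted so that the
 * comparison fails even though the data is intact.
 */
bool
digest_ok(const void *data, size_t len, uint32_t received)
{
	uint32_t computed = crc32c_sw(data, len);

	if (inject_bad_digest)
		computed ^= 1;	/* force a mismatch */
	return (computed == received);
}
```

With the knob off, `digest_ok("123456789", 9, 0xE3069283u)` succeeds (0xE3069283 is the standard CRC32C check value for that string); with the knob on, the same call fails. That gives a test a way to watch how the caller recovers from a digest error.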
Ben, can you reproduce this?
I would say no: the issue was triggered by a bug in the #216467 patch (which has of course been corrected since). Conrad says he believes this was due to an infinite loop. But don't you think it would be worthwhile to make sure iSCSI handles bad checksums correctly, without hanging disks and without requiring a server reboot? I am not sure, however, how to force bad checksums in order to run this test. Thank you!
The code was tested at some point (before it was first committed), it's quite straightforward, and I see no reason to believe it has somehow been broken since then.