| Summary: | ata device driver fails to abort queued commands when device disappears | ||
|---|---|---|---|
| Product: | Base System | Reporter: | Tor Egge <tegge> |
| Component: | kern | Assignee: | Søren Schmidt <sos> |
| Status: | Closed FIXED | ||
| Severity: | Affects Only Me | ||
| Priority: | Normal | ||
| Version: | 4.2-RELEASE | ||
| Hardware: | Any | ||
| OS: | Any | ||
Responsible Changed From-To: freebsd-bugs->sos Over to ata(4) maintainer. State Changed From-To: open->closed This is fixed in -current.. |
When a drive completely hangs, it apparently partially disappears from the ata configuration while resetting the devices. ad0: READ command timeout tag=0 serv=0 - resetting ata0: resetting devices .. done ad0: READ command timeout tag=0 serv=0 - resetting ata0: resetting devices .. done ad0: READ command timeout tag=0 serv=0 - resetting ad0: READ command timeout tag=0 serv=0 - resetting ad0: READ command timeout tag=0 serv=0 - resetting ata0-master: timeout waiting for command=ef s=00 e=00 ad0: trying fallback to PIO mode ad0: READ command timeout tag=0 serv=0 - resetting ad0: READ command timeout tag=0 serv=0 - resetting ad0: READ command timeout tag=0 serv=0 - resetting vinum2.p0.s0: fatal read I/O error vinum: vinum2.p0.s0 is crashed by force vinum: vinum2.p0 is faulty (kgdb) print *(struct ata_softc *) (((device_t) ata_devclass->devices[0])->softc) $47 = {dev = 0xc3286780, channel = 0, r_io = 0xc3282700, r_altio = 0xc3282680, r_bmio = 0xc3286804, r_irq = 0xc3282600, ih = 0xc0e6e920, ioaddr = 496, altioaddr = 1014, bmaddr = 65440, chiptype = 1896972422, alignment = 1, dev_param = {0xc3284200, 0x0}, dev_softc = {0xc331b400, 0x0}, mode = {0, 0}, flags = 16, devices = 0, status = 0 '\000', error = 0 '\000', active = 0, ata_queue = {tqh_first = 0x0, tqh_last = 0xc32866d8}, atapi_queue = { tqh_first = 0x0, tqh_last = 0xc32866e0}, running = 0x0} (kgdb) print *((struct ad_softc *) ((struct ata_softc *) (((device_t) ata_devclass->devices[0])->softc))->dev_softc[0]) $48 = {controller = 0xc3286680, unit = 0, lun = 0, total_secs = 60036480, heads = 16 '\020', sectors = 63 '?', transfersize = 8192, num_tags = 0, flags = 2, tags = {0x0 <repeats 32 times>}, outstanding = -41300580, queue = {queue = {tqh_first = 0xc4d5ae20, tqh_last = 0xc4d59638}, last_pblkno = 21131160, insert_point = 0x0, switch_point = 0xcb30a2d8}, stats = {dev_links = {stqe_next = 0xc331b0b8}, device_number = 1, device_name = "ad", '\000' <repeats 13 times>, unit_number = 0, bytes_read = 176262452224, bytes_written = 14137288704, bytes_freed = 0, num_reads = 40563316, num_writes = 737265, num_frees = 0, num_other = 0, busy_count = 7, block_size = 512, tag_types = {0, 0, 0}, dev_creation_time = {tv_sec = 0, tv_usec = 30618}, busy_time = { tv_sec = 204575, tv_usec = 389409}, start_time = {tv_sec = 1625444, tv_usec = 220943}, last_comp_time = {tv_sec = 1626835, tv_usec = 368335}, flags = DEVSTAT_NO_ORDERED_TAGS, device_type = DEVSTAT_TYPE_IF_IDE, priority = DEVSTAT_PRIORITY_DISK}, disk = {d_flags = 0, d_dsflags = 0, d_devsw = 0xc02d8720, d_dev = 0xc3318380, d_slice = 0xc3349800, d_label = {d_magic = 0, d_type = 0, d_subtype = 0, d_typename = '\000' <repeats 15 times>, d_un = {un_d_packname = '\000' <repeats 15 times>, un_b = { un_d_boot0 = 0x0, un_d_boot1 = 0x0}}, d_secsize = 512, d_nsectors = 63, d_ntracks = 16, d_ncylinders = 59560, d_secpercyl = 1008, d_secperunit = 60036480, d_sparespertrack = 0, d_sparespercyl = 0, d_acylinders = 0, d_rpm = 0, d_interleave = 0, d_trackskew = 0, d_cylskew = 0, d_headswitch = 0, d_trkseek = 0, d_flags = 0, d_drivedata = {0, 0, 0, 0, 0}, d_spare = {0, 0, 0, 0, 0}, d_magic2 = 0, d_checksum = 0, d_npartitions = 0, d_bbsize = 0, d_sbsize = 0, d_partitions = {{p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}, {p_size = 0, p_offset = 0, p_fsize = 0, p_fstype = 0 '\000', p_frag = 0 '\000', __partition_u1 = {cpg = 0, sgs = 0}}}}}, dev1 = 0xc3318480, dev2 = 0xc3318380} Note that devices is now 0 on ata0, thus queued commands for ad0 are never removed from the queue. This is bad, since access to other vinum drives on the same physical disk will now never fail, just block infinitely. bash-2.04$ ps axl -N kernel.4 -M vmcore.4 UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND 0 238 1 1 3 0 928 0 ttyin Is+ #C5 0:00.01 (getty) 0 245 1 1 3 0 928 0 ttyin Is+ #C2 0:00.01 (getty) 0 244 1 0 3 0 928 0 ttyin Is+ #C2 0:00.01 (getty) 0 243 1 0 3 0 928 0 ttyin Is+ #C2 0:00.01 (getty) 0 242 1 1 3 0 928 0 ttyin Is+ #C2 0:00.02 (getty) 0 241 1 1 3 0 928 0 ttyin Is+ #C2 0:00.02 (getty) 0 240 1 0 3 0 928 0 ttyin Is+ #C2 0:00.02 (getty) 0 239 1 0 3 0 928 0 ttyin Is+ #C2 0:00.02 (getty) 0 0 0 0 -18 0 0 0 sched DLs ?? 0:02.45 (swapper) 0 1 0 0 10 0 528 0 wait ILs ?? 0:00.43 (init) 0 2 0 0 -18 0 0 0 psleep DL ?? 13:22.04 (pagedaemon) 0 3 0 0 18 0 0 0 psleep DL ?? 0:00.00 (vmdaemon) 0 4 0 0 -18 0 0 0 psleep DL ?? 0:04.34 (bufdaemon) 0 5 0 0 -2 0 0 0 getblk DL ?? 105:11.54 (syncer) 0 63 1 0 -6 0 596 0 biowr DLs ?? 0:00.01 (vinum) 0 166 1 0 -2 0 924 0 ffsfsn Ds ?? 0:26.47 (syslogd) 0 173 1 0 -6 -12 1260 0 biord D<s ?? 3:38.62 (ntpd) 0 194 1 0 -6 0 972 0 biord Ds ?? 0:19.50 (cron) 0 197 1 0 -18 0 2096 0 spread DLs ?? 0:19.77 (sshd) 1001 232 1 0 -6 0 1636 0 biord Ds ?? 0:35.23 (cvsupd) 1001 53151 232 28 -6 0 2740 0 biord D ?? 2:12.81 (cvsupd) 1001 53155 232 1 -6 0 3352 0 biord D ?? 0:34.44 (cvsupd) 1001 53159 232 2 -14 0 3084 0 inode D ?? 0:05.40 (cvsupd) Fix: Change ata_reinit to check for scp->devices being changed during the ata_reset call and flush the request queues for the 'gone' devices by setting b_error to ENXIO, setting the B_ERROR bit in b_flags and calling biodone. Change adstrategy to check for the device being 'gone' and return ENXIO at once if so. How-To-Repeat: Use 2 IDE disks (one slightly bad), use disk partitioning to get more than one vinum logical drive on each physical disk.