Bug 166193 - [hang] FB 8.0 freeze during the kernel dump
Summary: [hang] FB 8.0 freeze during the kernel dump
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: Unspecified
Hardware: Any Any
Importance: Normal, Affects Only Me
Assignee: Andriy Gapon
 
Reported: 2012-03-17 04:00 UTC by Zhouyi Zhou
Modified: 2012-07-16 12:15 UTC

Description Zhouyi Zhou 2012-03-17 04:00:23 UTC
FreeBSD 8.0 freezes while dumping memory after a kernel panic.

Fix: 

Using self-developed instrumentation code to track down the deadlock, I found that the CPU doing the dump is taken over by ehci_interrupt, which then locks up.

The fix is to add a critical_enter() call in doadump().
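Why critical_enter() helps can be illustrated with a small userland model. FreeBSD counts critical-section nesting per thread (td_critnest); a preemption requested while that count is non-zero is only recorded (td_owepreempt) and carried out at the outermost critical_exit(). Interrupts still arrive on the dumping CPU, but, assuming ehci_interrupt runs in an interrupt thread rather than as a filter, that thread can no longer preempt the dumper. The sketch below imitates only this bookkeeping; the toy_* names are hypothetical and this is a simplified model, not kernel code.

```c
#include <stdbool.h>

/* Toy per-thread state. The fields mirror FreeBSD's td_critnest and
 * td_owepreempt, but this is a userland model, not kernel code. */
struct toy_thread {
    int critnest;     /* critical-section nesting depth */
    bool owepreempt;  /* preemption requested while inside a critical section */
    int switches;     /* times this thread was switched off the CPU */
};

/* Stand-in for a context switch away from the thread. */
static void toy_switch(struct toy_thread *td)
{
    td->switches++;
}

void toy_critical_enter(struct toy_thread *td)
{
    td->critnest++;
}

/* A higher-priority thread (e.g. a USB interrupt thread) became runnable. */
void toy_preempt_request(struct toy_thread *td)
{
    if (td->critnest > 0) {
        td->owepreempt = true;   /* defer: only remember the request */
        return;
    }
    toy_switch(td);              /* not in a critical section: preempt now */
}

void toy_critical_exit(struct toy_thread *td)
{
    td->critnest--;
    if (td->critnest == 0 && td->owepreempt) {
        td->owepreempt = false;
        toy_switch(td);          /* honour the deferred preemption */
    }
}
```

Without toy_critical_enter() a request switches the dumping thread out immediately; inside the critical section the request is only recorded, and the switch happens at the outermost toy_critical_exit().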


btw: the following is my quick-and-dirty instrumentation code:
#define TRACELEVEL 5    /* number of caller frames to show per CPU */

#ifdef TRACELEVEL
/* VGA text buffer (physical 0xb8000) as seen through the KERNBASE mapping. */
#define TRACE_VGA       ((volatile unsigned char *)0xffffffff800b8000UL)
#define TRACE_CPU_AREA  300     /* bytes of screen per CPU */
#define TRACE_ATTR_A    121     /* 0x79 */
#define TRACE_ATTR_B    120     /* 0x78 */

/*
 * Show the low 32 bits of addr as 8 hex digits plus a space in slot
 * `slot` of this CPU's screen area (9 cells of 2 bytes per slot).
 * The digit attributes toggle on every call, so a CPU that stops
 * taking interrupts leaves a frozen pattern behind.
 */
static void
trace_put(int cpuid, int slot, unsigned long addr)
{
        volatile unsigned char *cell;
        char hex[9];
        int i;

        /* Was "%x" with a long argument; print the low 32 bits explicitly
         * (for kernel text the upper half is always 0xffffffff). */
        snprintf(hex, sizeof(hex), "%08x", (unsigned int)addr);
        cell = TRACE_VGA + cpuid * TRACE_CPU_AREA + slot * 9 * 2;
        for (i = 0; i < 8; i++) {
                cell[i * 2] = hex[i];
                cell[i * 2 + 1] = (cell[i * 2 + 1] != TRACE_ATTR_A) ?
                    TRACE_ATTR_A : TRACE_ATTR_B;
        }
        cell[8 * 2] = ' ';
        cell[8 * 2 + 1] = TRACE_ATTR_A;
}
#endif

void
lapic_handle_intr(int vector, struct trapframe *frame)
{
        struct intsrc *isrc;
#ifdef TRACELEVEL
        struct amd64_frame *frame1;
        int cpuid, j;

        if (1 /* vector == 0x30 */) {
                if (!INKERNEL(frame->tf_rip))
                        goto out;
                cpuid = PCPU_GET(cpuid);
                /* Slot 0: where this CPU was interrupted. */
                trace_put(cpuid, 0, frame->tf_rip);
                /* Slots 1..TRACELEVEL: return addresses along the %rbp chain. */
                frame1 = (struct amd64_frame *)frame->tf_rbp;
                for (j = 1; j <= TRACELEVEL; j++) {
                        if (!INKERNEL((long)frame1))
                                goto out;
                        trace_put(cpuid, j, frame1->f_retaddr);
                        frame1 = frame1->f_frame;
                }
        }
out:
#endif
        if (vector == -1)
                panic("Couldn't get vector from ISR!");
        isrc = intr_lookup_source(apic_idt_to_irq(PCPU_GET(apic_id),
            vector));
        intr_execute_handlers(isrc, frame);
}
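The stack walk in lapic_handle_intr() follows the chain of saved frame pointers: with frame pointers in use, [%rbp] holds the caller's saved %rbp and [%rbp + 8] the return address, which is the layout struct amd64_frame describes with f_frame and f_retaddr. A minimal userland model of just that walk follows; struct toy_frame and walk_frames() are hypothetical names, and a NULL pointer stands in for the INKERNEL() check that ends the real walk.

```c
#include <stddef.h>

/* Modeled on struct amd64_frame (without its f_arg0 field):
 * the caller's saved frame pointer followed by the return address. */
struct toy_frame {
    const struct toy_frame *f_frame;  /* caller's saved frame pointer */
    unsigned long f_retaddr;          /* return address into the caller */
};

/* Collect at most `max` return addresses by following the f_frame chain.
 * A NULL frame pointer plays the role of the INKERNEL() check that stops
 * the walk in the real handler. Returns the number of addresses stored. */
size_t walk_frames(const struct toy_frame *fp, unsigned long *out, size_t max)
{
    size_t n = 0;

    while (fp != NULL && n < max) {
        out[n++] = fp->f_retaddr;
        fp = fp->f_frame;
    }
    return n;
}
```

On three chained frames, walk_frames() with max = 5 returns all three return addresses in call-chain order; with max = 2 it stops after two, the way TRACELEVEL caps the walk in the interrupt handler.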
How-To-Repeat: Allocate a large amount of memory, trigger a kernel panic, and let the system dump.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-03-17 04:41:07 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s). Apparently the fix is simple (patch doadump()).
Comment 2 Andriy Gapon freebsd_committer freebsd_triage 2012-03-21 10:22:25 UTC
Responsible Changed
From-To: freebsd-fs->avg

This PR looks like a duplicate of PR 139614.
Comment 3 Andriy Gapon freebsd_committer freebsd_triage 2012-03-21 10:28:26 UTC
Do you have a patch? Have you tested it? Can you explain how it works?

Can you also review the following PR
http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/139614 and commits referenced
by it? - Do those match your problem?

-- 
Andriy Gapon
Comment 4 Zhouyi Zhou 2012-03-23 04:19:03 UTC
Thanks for your attention.
On Wed, Mar 21, 2012 at 6:28 PM, Andriy Gapon <avg@freebsd.org> wrote:
> Do you have a patch?
The following is the patch:
---    kern_shutdown.c~
+++ kern_shutdown.c
@@ -242,6 +242,7 @@
        }

        savectx(&dumppcb);
+       critical_enter();
        dumptid = curthread->td_tid;
        dumping++;
 #ifdef DDB
@@ -263,6 +264,9 @@
        return (0);
 }
> Have you tested it?
I tested it many times; with my patch, the dump no longer freezes.
> Can you explain how it works?
1) How the patch works
Assume the panicking CPU is CPU 0. Soon after CPU 0 begins to dump,
the kernel scheduler preempts the dumping thread on CPU 0 to run another
thread, which soon locks up (in the USB subsystem).
My patch prevents the scheduler from preempting the dumping thread.
2) How my instrumentation code works
On each interrupt, the interrupted CPU writes its current execution
stack to VGA memory, so the stack of every CPU in the system is shown.
3) What I found when the dump freezes
Every time the dump froze, the instrumentation showed that the USB
subsystem was locked up.
4) Either my patch or disabling USB in the BIOS prevents the dump
from freezing.
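The on-screen layout used by the instrumentation can be reproduced against an ordinary buffer. Each CPU owns a 300-byte region of the text buffer, each stack slot takes 9 character cells (8 hex digits and a space), and every cell is a character byte followed by an attribute byte. The digit attributes toggle between 121 (0x79) and 120 (0x78) on every update, presumably so that a CPU still taking interrupts keeps changing color while a stuck CPU's last stack stays frozen on screen. This is a userland sketch with hypothetical names (vga_offset, trace_put_buf), assuming an 80x25 text mode (4000 bytes).

```c
#include <stdio.h>

#define VGA_COLS   80
#define VGA_ROWS   25
#define VGA_BYTES  (VGA_COLS * VGA_ROWS * 2)  /* char + attribute per cell */
#define CPU_AREA   300                        /* bytes per CPU, as in the report */
#define SLOT_CELLS 9                          /* 8 hex digits + 1 space */
#define ATTR_A     121                        /* 0x79 */
#define ATTR_B     120                        /* 0x78 */

/* Byte offset of character cell `cell` in stack slot `slot` of CPU `cpuid`. */
int vga_offset(int cpuid, int slot, int cell)
{
    return cpuid * CPU_AREA + slot * SLOT_CELLS * 2 + cell * 2;
}

/* Write `value` as 8 hex digits plus a space, toggling the digit attributes. */
void trace_put_buf(unsigned char *vga, int cpuid, int slot, unsigned int value)
{
    char hex[9];
    int i;

    snprintf(hex, sizeof(hex), "%08x", value);
    for (i = 0; i < 8; i++) {
        unsigned char *cell = vga + vga_offset(cpuid, slot, i);
        cell[0] = (unsigned char)hex[i];
        cell[1] = (cell[1] != ATTR_A) ? ATTR_A : ATTR_B;
    }
    vga[vga_offset(cpuid, slot, 8)] = ' ';
    vga[vga_offset(cpuid, slot, 8) + 1] = ATTR_A;
}
```

With TRACELEVEL 5 a CPU uses six slots (108 bytes), so under this layout the visible 80x25 page holds the areas of CPUs 0 through 12; the last cell of CPU 13 would already lie past byte 4000.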
>
> Can you also review the following PR
> http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/139614 and commits referenced
> by it? - Do those match your problem?
1) My PR is a subproblem of PR 139614; the commits referenced there fix
both the dump freeze problem and the problem of incorrect dumps.
2) Before submitting the PR I reviewed the current code and found the
changes that stop the other CPUs and the scheduler. I submitted the PR
anyway for the benefit of those who do not want to apply heavy patches
to their current work. Note that my patch does not fix incorrect dumps
under heavy interrupt load.
>
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2012-05-24 07:43:18 UTC
on 23/03/2012 06:19 Zhouyi Zhou said the following:
> Thanks for attention

Sorry for taking so long to reply...

> On Wed, Mar 21, 2012 at 6:28 PM, Andriy Gapon <avg@freebsd.org> wrote:
>> Can you explain how it works?
> 1) how the patch works
> first let's assume the panic cpu id is 0, soon after cpu 0 is begin to dump,
> kernel scheduler preempt cpu 0 to execute other thread which soon
> locked up (usb subsystem).
> my patch is to prevent scheduler to preempt the dumping thread.

OK. Now I see what your patch does, and I think that it is a good workaround,
although it won't help in all cases: e.g., if a thread running on a different
CPU performs memory-related operations, that can still confuse the dumping code.
But at least the panicking/dumping CPU won't get stuck indefinitely.

[snip]

>> Can you also review the following PR
>> http://www.freebsd.org/cgi/query-pr.cgi?pr=amd64/139614 and commits referenced
>> by it? - Do those match your problem?
> 1)my PR is a sub problem of PR 139614, the commits settles the dumping
> freeze problem and dumping mistake problem
> altogether.
> 2)I have reviewed current code before submit the PR and find the
> stopping other CPUs and scheduler treatment, and submit the PR for
> goodness of who do want patch heavily on their current work, and my
> patch don't settle the dumping mistake under heavy interrupt
> condition.

Yes, thank you for the investigation and the patch.
And sorry for taking too long to act on your report.

Now, I have MFCed the main part of the CPU/scheduler stopping commits to
stable/8. The new behavior is disabled by default, but can be enabled via a
tunable. In stable/9 the changes are fully MFCed and enabled by default.
What is your opinion: is that good enough, or is your patch still needed?

Assuming that you can try stable/8, could you please test whether the latest
code there correctly handles your environment (hardware, interrupt load, etc.)?

Thank you!

-- 
Andriy Gapon
Comment 6 Andriy Gapon freebsd_committer freebsd_triage 2012-06-07 09:12:59 UTC
State Changed
From-To: open->patched

Update to match the state of PR 139614.
Comment 7 Zhouyi Zhou 2012-06-23 04:20:43 UTC
Andriy,

I cvsupped the FB8 stable branch and set sysctl kern.stop_scheduler_on_panic=1.
Dumping succeeded in all 3 attempts (2000M memory dump).
Many thanks,
Best Wishes
Zhouyi
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2012-07-16 12:14:51 UTC
State Changed
From-To: patched->closed

See PR 139614.