System is a uniprocessor 2GHz Pentium-M base Thinkpad T43. After upgrade of dbus last weekend, hald fails to run. Here is the tail end of the output of 'hald --verbose=yes --daemon=no': 11:50:54.104 [I] hald.c:108: Added device to GDL; udi=/org/freedesktop/Hal/devices/usb_device_0_0_noserial Run started hald-probe-usb2-interface (20000) (0) ! full path is '/usr/local/libexec/hald-probe-usb2-interface', program_dir is '/usr/local/libexec' 11:50:54.129 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.class 11:50:54.129 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.subclass 11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.protocol 11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.description 11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.freebsd.devname 11:50:54.131 [I] hald_runner.c:110: runner_server_message_handler: destination=(null) obj_path=/org/freedesktop/DBus/Local interface=org.freedesktop.DBus.Local method=Disconnected 11:50:54.131 [I] hald_runner.c:151: runner process disconnected *** [DIE] hald_runner.c:runner_died():204 : Runner died I can make the full log available, but it looks prettty normal to me. On exactly one of 6 attempts, hald did run, so there may be a race involved. dbus does appear to be running normally. 94475 ?? Is 0:00.44 /usr/local/bin/dbus-daemon --system 94550 ?? Is 0:00.12 /usr/local/bin/dbus-daemon --fork --print-pid 5 --pri 94526 v0 I 0:00.00 ck-launch-session dbus-launch --exit-with-session gno 94549 v0 I 0:00.00 dbus-launch --exit-with-session gnome-session Fix: Roll back to dbus-1.2.24_2 How-To-Repeat: Upgrade to dbus-1.4.0. Start dbus and then hald.
Responsible Changed From-To: freebsd-ports-bugs->gnome Over to maintainer(s).
It seems that I _might_ have the same or related problem. I have both dbus and hald enabled for automatic startup via rc system. After reboot I see that dbus daemon is running, but hald is not. I start hald (via its rc script) manually, but it still doesn't run. I repeat that step again and finally hald runs. Due to some unrelated problems I had to reboot a few times today and all the times it was like described above. Unfortunately, because of those unrelated problems I haven't had a chance yet to debug this situation properly. -- Andriy Gapon
> Date: Fri, 05 Nov 2010 15:54:39 +0200 > From: Andriy Gapon <avg@icyb.net.ua> > > > It seems that I _might_ have the same or related problem. > I have both dbus and hald enabled for automatic startup via rc system. > After reboot I see that dbus daemon is running, but hald is not. > I start hald (via its rc script) manually, but it still doesn't run. > I repeat that step again and finally hald runs. > > Due to some unrelated problems I had to reboot a few times today and > all the times it was like described above. Unfortunately, because of > those unrelated problems I haven't had a chance yet to debug this > situation properly. Andriy, Can you try running hald with debug and see if you get the same failure I have reported? '/usr/local/sbin/hald --daemon=no --debug=yes' See if the failure is with hal-runner exiting during the USB scan. That will confirm that it is the same issue. I need to have some time to do debug to see where this is failing and I'm rather swamped both at work and at home. (My home Windows system died last week and the replacement just arrived yesterday, so I am busy getting software installed, moving data files from the old disk and learning what does not work. (E.g. No USB support for Palm devices on 64-bit Windows. I really hate Windows, but I need software that has no other support. I do hope that I can work on it this weekend after I get the new system into reasonable shape and get about a yard of redwood bark spread on the hill in my back yard. One of Joe's suggestions was to unplug everything from the USB, but I have the problem with nothing plugged in. There are only two internal devices, the fingerprint reader and an IBM serial converter which FreeBSD simply does not recognize. Biometric Coprocessor STMicroelectronics product 0x4482 vendor 0x04b3 (This is the IBM serial converter) BIOS has no options to disable these. -- R. Kevin Oberman, Network Engineer Energy Sciences Network (ESnet) Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab) E-mail: oberman@es.net Phone: +1 510 486-8634 Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
on 05/11/2010 17:13 Kevin Oberman said the following: > Can you try running hald with debug and see if you get the same failure > I have reported? '/usr/local/sbin/hald --daemon=no --debug=yes' > See if the failure is with hal-runner exiting during the USB scan. That > will confirm that it is the same issue. Here's output on my system: http://people.freebsd.org/~avg/hald.script.txt Please note that this is a desktop with only a mouse connected by USB. -- Andriy Gapon
on 05/11/2010 18:34 Andriy Gapon said the following: > Run started hald-probe-scsi (20000) (0) > ! full path is '/usr/local/libexec/hald-probe-scsi', program_dir is > '/usr/local/libexec' > 18:30:33.136 [I] hald_runner.c:110: runner_server_message_handler: > destination=(null) obj_path=/org/freedesktop/DBus/Local > interface=org.freedesktop.DBus.Local method=Disconnected > 18:30:33.136 [I] hald_runner.c:151: runner process disconnected > *** [DIE] hald_runner.c:runner_died():204 : Runner died > Not sure what is the problem, perhaps someone who knows the code better can come up with an explanation, but the following informally specified patch seems to fix the issue for me (verified with 24 hald start-ups): In hald-runner/runner.c, function run_request_run(): g_free (program_dir); -if (r->input) { +if (r->input && strlen(r->input) > 0) { if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input)) printf("Warning: Error while writing r->input (%s) to stdin_v.\n", r->input); close(stdin_v); } I hope that it's easy to apply this change by hand. I will have a proper patch once I clean out all my debugging modifications from the sources. It looked like the issue was caused by the above write() call after a sufficiently quick child process had already exited. -- Andriy Gapon
on 05/11/2010 19:24 Andriy Gapon said the following: > It looked like the issue was caused by the above write() call after a sufficiently > quick child process had already exited. The proper patch. --- hald-runner/runner.c.orig 2010-11-05 19:26:05.673393765 +0200 +++ hald-runner/runner.c 2010-11-05 19:26:33.471393058 +0200 @@ -281,7 +281,7 @@ run_request_run (run_request *r, DBusCon } g_free (program_dir); - if (r->input) { + if (r->input && strlen(r->input) > 0) { if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input)) printf("Warning: Error while writing r->input (%s) to stdin_v.\n", r->input); close(stdin_v); -- Andriy Gapon
> Date: Fri, 05 Nov 2010 19:33:08 +0200 > From: Andriy Gapon <avg@freebsd.org> > > on 05/11/2010 19:24 Andriy Gapon said the following: > > It looked like the issue was caused by the above write() call after a sufficiently > > quick child process had already exited. > > The proper patch. > > --- hald-runner/runner.c.orig 2010-11-05 19:26:05.673393765 +0200 > +++ hald-runner/runner.c 2010-11-05 19:26:33.471393058 +0200 > @@ -281,7 +281,7 @@ run_request_run (run_request *r, DBusCon > } > g_free (program_dir); > > - if (r->input) { > + if (r->input && strlen(r->input) > 0) { > if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input)) > printf("Warning: Error while writing r->input (%s) to stdin_v.\n", r->input); > close(stdin_v); Looks like you have it! My system now runs hald with the dbus-1.4.0 without any problems. Thanks very much, Andriy! -- R. Kevin Oberman, Network Engineer Energy Sciences Network (ESnet) Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab) E-mail: oberman@es.net Phone: +1 510 486-8634 Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
on 05/11/2010 19:24 Andriy Gapon said the following: > It looked like the issue was caused by the above write() call after a sufficiently > quick child process had already exited. The problem could be that zero-sized write() into a closed pipe (after child's exit) still generates SIGPIPE. This is what POSIX have to say about this: "If nbyte is zero and the file is not a regular file, the results are unspecified." So hald code should take into account this possibility and not write anything into the pipe unless it is sure that the child process does expect input (and obviously it can't expect "null" input) and thus won't exit before the input is fed to it. -- Andriy Gapon
On 11/5/10 1:33 PM, Andriy Gapon wrote: > on 05/11/2010 19:24 Andriy Gapon said the following: >> It looked like the issue was caused by the above write() call after a sufficiently >> quick child process had already exited. > > The proper patch. While I don't really have a problem with this patch, I'm am a bit curious as to why it's now needed. Could either of you downgrade to dbus-1.2, then add something like: g_warning ("XXX: Trying to write '%s' to stdin", r->input); At line 285 (right below the "if"). I'd love to know what is being written there normally to try and see if there is a hidden problem with the dbus upgrade. Thanks. Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome
> Date: Mon, 08 Nov 2010 19:02:13 -0500 > From: Joe Marcus Clarke <marcus@freebsd.org> > > On 11/5/10 1:33 PM, Andriy Gapon wrote: > > on 05/11/2010 19:24 Andriy Gapon said the following: > >> It looked like the issue was caused by the above write() call after a sufficiently > >> quick child process had already exited. > > > > The proper patch. > > While I don't really have a problem with this patch, I'm am a bit > curious as to why it's now needed. Could either of you downgrade to > dbus-1.2, then add something like: > > g_warning ("XXX: Trying to write '%s' to stdin", r->input); > > At line 285 (right below the "if"). I'd love to know what is being > written there normally to try and see if there is a hidden problem with > the dbus upgrade. Thanks. I'll try this as soon as I can. I'm not too sure that it will happen as I think that this is somehow timing related. I suspect that the entry is disappearing too quickly with 1.4 in some cases but is not a problem with 1.2. Perhaps some optimization? I suggest this because on at least rare occasion, 1.4 did run successfully, not because I have any clue what was happening under the covers. The systems involved are VERY different as his is a fast 64-bit SMP while mine is a slow (2 GHz) 32-bit uniprocessor. At least in my case I typically have only the two internal USB connected devices, neither of which have drivers. Not sure what USB devices Andriy might have. I'll let you know the results ASAP, but I in the middle of brining up DNS anycast for my employer and it is eating all of my free cycles until it is on-line tomorrow. -- R. Kevin Oberman, Network Engineer Energy Sciences Network (ESnet) Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab) E-mail: oberman@es.net Phone: +1 510 486-8634 Key fingerprint:059B 2DDF 031C 9BA3 14A4 EADA 927D EBB3 987B 3751
on 09/11/2010 02:14 Kevin Oberman said the following: > I'll try this as soon as I can. I'm not too sure that it will happen as > I think that this is somehow timing related. I suspect that the entry is > disappearing too quickly with 1.4 in some cases but is not a problem > with 1.2. Perhaps some optimization? > > I suggest this because on at least rare occasion, 1.4 did run > successfully, not because I have any clue what was happening under the > covers. I guess that I already explained this part. The problem happened because we tried to write something (even if it's just zero sized something) into stdin of a child process that already exited. Sometimes the child process was quicker, sometimes the parent process was quicker, hence the non-determinism. -- Andriy Gapon
On 11/9/10 12:36 AM, Andriy Gapon wrote: > on 09/11/2010 02:14 Kevin Oberman said the following: >> I'll try this as soon as I can. I'm not too sure that it will happen as >> I think that this is somehow timing related. I suspect that the entry is >> disappearing too quickly with 1.4 in some cases but is not a problem >> with 1.2. Perhaps some optimization? >> >> I suggest this because on at least rare occasion, 1.4 did run >> successfully, not because I have any clue what was happening under the >> covers. > > I guess that I already explained this part. > The problem happened because we tried to write something (even if it's just zero > sized something) into stdin of a child process that already exited. > Sometimes the child process was quicker, sometimes the parent process was > quicker, hence the non-determinism. > Ah, I missed that. I wonder if it would be safer then to ignore SIGPIPE around the write block. Joe -- Joe Marcus Clarke FreeBSD GNOME Team :: gnome@FreeBSD.org FreeNode / #freebsd-gnome http://www.FreeBSD.org/gnome
on 09/11/2010 07:47 Joe Marcus Clarke said the following: > On 11/9/10 12:36 AM, Andriy Gapon wrote: >> on 09/11/2010 02:14 Kevin Oberman said the following: >>> I'll try this as soon as I can. I'm not too sure that it will happen as >>> I think that this is somehow timing related. I suspect that the entry is >>> disappearing too quickly with 1.4 in some cases but is not a problem >>> with 1.2. Perhaps some optimization? >>> >>> I suggest this because on at least rare occasion, 1.4 did run >>> successfully, not because I have any clue what was happening under the >>> covers. >> >> I guess that I already explained this part. >> The problem happened because we tried to write something (even if it's just zero >> sized something) into stdin of a child process that already exited. >> Sometimes the child process was quicker, sometimes the parent process was >> quicker, hence the non-determinism. >> > > Ah, I missed that. I wonder if it would be safer then to ignore SIGPIPE > around the write block. Maybe. But not calling write(2) when we don't have anything to write (zero length) also looks like a good solution (for me personally). My point is: zero-sized write in nothing but testing OS implementation details of handling zero-sized writes, it doesn't perform any useful function. OTOH, if a child process is supposed to get any actual input, then it won't exit prematurely, but would block reading from its stdin until the input arrives. But I think I am starting to repeat what I have already wrote before. -- Andriy Gapon
on 09/11/2010 07:52 Andriy Gapon said the following: > on 09/11/2010 07:47 Joe Marcus Clarke said the following: >> On 11/9/10 12:36 AM, Andriy Gapon wrote: >>> on 09/11/2010 02:14 Kevin Oberman said the following: >>>> I'll try this as soon as I can. I'm not too sure that it will happen as >>>> I think that this is somehow timing related. I suspect that the entry is >>>> disappearing too quickly with 1.4 in some cases but is not a problem >>>> with 1.2. Perhaps some optimization? >>>> >>>> I suggest this because on at least rare occasion, 1.4 did run >>>> successfully, not because I have any clue what was happening under the >>>> covers. >>> >>> I guess that I already explained this part. >>> The problem happened because we tried to write something (even if it's just zero >>> sized something) into stdin of a child process that already exited. >>> Sometimes the child process was quicker, sometimes the parent process was >>> quicker, hence the non-determinism. >>> >> >> Ah, I missed that. I wonder if it would be safer then to ignore SIGPIPE >> around the write block. > > Maybe. Actually, please read the above as "probably no". If a child process that is supposed to get input would crash, then such a change would obfuscate diagnostics. > But not calling write(2) when we don't have anything to write (zero > length) also looks like a good solution (for me personally). > > My point is: zero-sized write in nothing but testing OS implementation details > of handling zero-sized writes, it doesn't perform any useful function. > OTOH, if a child process is supposed to get any actual input, then it won't exit > prematurely, but would block reading from its stdin until the input arrives. > > But I think I am starting to repeat what I have already wrote before. > -- Andriy Gapon
marcus 2010-11-09 06:12:05 UTC FreeBSD ports repository Modified files: sysutils/hal Makefile Added files: sysutils/hal/files patch-hald-runner_runner.c Log: Fix a potential race with helper processes terminating too quickly. When this happens, hald dies. PR: 151725 Submitted by: avg Revision Changes Path 1.65 +1 -1 ports/sysutils/hal/Makefile 1.1 +11 -0 ports/sysutils/hal/files/patch-hald-runner_runner.c (new) _______________________________________________ cvs-all@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/cvs-all To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
State Changed From-To: open->closed Andriy's fix has been committed.