Bug 151725

Summary: sysutils/hal: hald fails to start with dbus-1.4
Product: Ports & Packages Reporter: Kevin Oberman <oberman>
Component: Individual Port(s)Assignee: freebsd-gnome (Nobody) <gnome>
Status: Closed FIXED    
Severity: Affects Only Me    
Priority: Normal    
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
patch-hald-runner_runner.c none

Description Kevin Oberman 2010-10-25 21:00:24 UTC
	

System is a uniprocessor 2GHz Pentium-M base Thinkpad T43.

After upgrade of dbus last weekend, hald fails to run. Here is the
tail end of the output of 'hald --verbose=yes --daemon=no':
11:50:54.104 [I] hald.c:108: Added device to GDL; udi=/org/freedesktop/Hal/devices/usb_device_0_0_noserial
Run started hald-probe-usb2-interface (20000) (0) 
!  full path is '/usr/local/libexec/hald-probe-usb2-interface', program_dir is '/usr/local/libexec'
11:50:54.129 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.class
11:50:54.129 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.subclass
11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.protocol
11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.interface.description
11:50:54.130 [D] hald_dbus.c:1659: udi=/org/freedesktop/Hal/devices/temp/42, key=usb.freebsd.devname
11:50:54.131 [I] hald_runner.c:110: runner_server_message_handler: destination=(null) obj_path=/org/freedesktop/DBus/Local interface=org.freedesktop.DBus.Local method=Disconnected
11:50:54.131 [I] hald_runner.c:151: runner process disconnected
*** [DIE] hald_runner.c:runner_died():204 : Runner died

I can make the full log available, but it looks prettty normal to me.

On exactly one of 6 attempts, hald did run, so there may be a race
involved. dbus does appear to be running normally.
94475  ??  Is     0:00.44 /usr/local/bin/dbus-daemon --system
94550  ??  Is     0:00.12 /usr/local/bin/dbus-daemon --fork --print-pid 5 --pri
94526  v0  I      0:00.00 ck-launch-session dbus-launch --exit-with-session gno
94549  v0  I      0:00.00 dbus-launch --exit-with-session gnome-session

Fix: 

Roll back to dbus-1.2.24_2
How-To-Repeat: 	
Upgrade to dbus-1.4.0. Start dbus and then hald.
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2010-10-31 08:21:14 UTC
Responsible Changed
From-To: freebsd-ports-bugs->gnome

Over to maintainer(s).
Comment 2 Andriy Gapon 2010-11-05 13:54:39 UTC
It seems that I _might_ have the same or related problem.
I have both dbus and hald enabled for automatic startup via rc system.
After reboot I see that dbus daemon is running, but hald is not.
I start hald (via its rc script) manually, but it still doesn't run.
I repeat that step again and finally hald runs.

Due to some unrelated problems I had to reboot a few times today and all the times
it was like described above.  Unfortunately, because of those unrelated problems I
haven't had a chance yet to debug this situation properly.

-- 
Andriy Gapon
Comment 3 Kevin Oberman 2010-11-05 15:13:52 UTC
> Date: Fri, 05 Nov 2010 15:54:39 +0200
> From: Andriy Gapon <avg@icyb.net.ua>
> 
> 
> It seems that I _might_ have the same or related problem.
> I have both dbus and hald enabled for automatic startup via rc system.
> After reboot I see that dbus daemon is running, but hald is not.
> I start hald (via its rc script) manually, but it still doesn't run.
> I repeat that step again and finally hald runs.
> 
> Due to some unrelated problems I had to reboot a few times today and
> all the times it was like described above.  Unfortunately, because of
> those unrelated problems I haven't had a chance yet to debug this
> situation properly.

Andriy,

Can you try running hald with debug and see if you get the same failure
I have reported? '/usr/local/sbin/hald --daemon=no --debug=yes'
See if the failure is with hal-runner exiting during the USB scan. That
will confirm that it is the same issue.

I need to have some time to do debug to see where this is failing and
I'm rather swamped both at work and at home. (My home Windows system
died last week and the replacement just arrived yesterday, so I am busy
getting software installed, moving data files from the old disk and
learning what does not work. (E.g. No USB support for Palm devices on
64-bit Windows. I really hate Windows, but I need software that has no
other support.

I do hope that I can work on it this weekend after I get the new system
into reasonable shape and get about a yard of redwood bark spread on the
hill in my back yard.

One of Joe's suggestions was to unplug everything from the USB, but I
have the problem with nothing plugged in. There are only two internal
devices, the fingerprint reader and an IBM serial converter which
FreeBSD simply does not recognize. 
Biometric Coprocessor STMicroelectronics
product 0x4482 vendor 0x04b3 (This is the IBM serial converter)
BIOS has no options to disable these.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751
Comment 4 Andriy Gapon 2010-11-05 15:57:09 UTC
on 05/11/2010 17:13 Kevin Oberman said the following:
> Can you try running hald with debug and see if you get the same failure
> I have reported? '/usr/local/sbin/hald --daemon=no --debug=yes'
> See if the failure is with hal-runner exiting during the USB scan. That
> will confirm that it is the same issue.

Here's output on my system:
http://people.freebsd.org/~avg/hald.script.txt
Please note that this is a desktop with only a mouse connected by USB.

-- 
Andriy Gapon
Comment 5 Andriy Gapon freebsd_committer freebsd_triage 2010-11-05 17:24:10 UTC
on 05/11/2010 18:34 Andriy Gapon said the following:
> Run started hald-probe-scsi (20000) (0)
> !  full path is '/usr/local/libexec/hald-probe-scsi', program_dir is
> '/usr/local/libexec'
> 18:30:33.136 [I] hald_runner.c:110: runner_server_message_handler:
> destination=(null) obj_path=/org/freedesktop/DBus/Local
> interface=org.freedesktop.DBus.Local method=Disconnected
> 18:30:33.136 [I] hald_runner.c:151: runner process disconnected
> *** [DIE] hald_runner.c:runner_died():204 : Runner died
> 

Not sure what is the problem, perhaps someone who knows the code better can come
up with an explanation, but the following informally specified patch seems to fix
the issue for me (verified with 24 hald start-ups):

In hald-runner/runner.c, function run_request_run():

g_free (program_dir);

-if (r->input) {
+if (r->input && strlen(r->input) > 0) {
        if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input))
                printf("Warning: Error while writing r->input (%s) to stdin_v.\n",
r->input);
        close(stdin_v);
}

I hope that it's easy to apply this change by hand.
I will have a proper patch once I clean out all my debugging modifications from
the sources.

It looked like the issue was caused by the above write() call after a sufficiently
quick child process had already exited.

-- 
Andriy Gapon
Comment 6 Andriy Gapon freebsd_committer freebsd_triage 2010-11-05 17:33:08 UTC
on 05/11/2010 19:24 Andriy Gapon said the following:
> It looked like the issue was caused by the above write() call after a sufficiently
> quick child process had already exited.

The proper patch.

--- hald-runner/runner.c.orig	2010-11-05 19:26:05.673393765 +0200
+++ hald-runner/runner.c	2010-11-05 19:26:33.471393058 +0200
@@ -281,7 +281,7 @@ run_request_run (run_request *r, DBusCon
 	}
 	g_free (program_dir);

-	if (r->input) {
+	if (r->input && strlen(r->input) > 0) {
 		if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input))
 			printf("Warning: Error while writing r->input (%s) to stdin_v.\n", r->input);
 		close(stdin_v);


-- 
Andriy Gapon
Comment 7 Kevin Oberman 2010-11-05 20:32:43 UTC
> Date: Fri, 05 Nov 2010 19:33:08 +0200
> From: Andriy Gapon <avg@freebsd.org>
> 
> on 05/11/2010 19:24 Andriy Gapon said the following:
> > It looked like the issue was caused by the above write() call after a sufficiently
> > quick child process had already exited.
> 
> The proper patch.
> 
> --- hald-runner/runner.c.orig	2010-11-05 19:26:05.673393765 +0200
> +++ hald-runner/runner.c	2010-11-05 19:26:33.471393058 +0200
> @@ -281,7 +281,7 @@ run_request_run (run_request *r, DBusCon
>  	}
>  	g_free (program_dir);
> 
> -	if (r->input) {
> +	if (r->input && strlen(r->input) > 0) {
>  		if (write(stdin_v, r->input, strlen(r->input)) != (ssize_t) strlen(r->input))
>  			printf("Warning: Error while writing r->input (%s) to stdin_v.\n", r->input);
>  		close(stdin_v);

Looks like you have it! My system now runs hald with the dbus-1.4.0
without any problems.

Thanks very much, Andriy!
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751
Comment 8 Andriy Gapon freebsd_committer freebsd_triage 2010-11-06 11:06:23 UTC
on 05/11/2010 19:24 Andriy Gapon said the following:
> It looked like the issue was caused by the above write() call after a sufficiently
> quick child process had already exited.

The problem could be that zero-sized write() into a closed pipe (after child's
exit) still generates SIGPIPE.

This is what POSIX have to say about this:
"If nbyte is zero and the file is not a regular file, the results are unspecified."

So hald code should take into account this possibility and not write anything
into the pipe unless it is sure that the child process does expect input (and
obviously it can't expect "null" input) and thus won't exit before the input is
fed to it.

-- 
Andriy Gapon
Comment 9 Joe Marcus Clarke freebsd_committer freebsd_triage 2010-11-09 00:02:13 UTC
On 11/5/10 1:33 PM, Andriy Gapon wrote:
> on 05/11/2010 19:24 Andriy Gapon said the following:
>> It looked like the issue was caused by the above write() call after a sufficiently
>> quick child process had already exited.
> 
> The proper patch.

While I don't really have a problem with this patch, I'm am a bit
curious as to why it's now needed.  Could either of you downgrade to
dbus-1.2, then add something like:

g_warning ("XXX: Trying to write '%s' to stdin", r->input);

At line 285 (right below the "if").  I'd love to know what is being
written there normally to try and see if there is a hidden problem with
the dbus upgrade.  Thanks.

Joe

-- 
Joe Marcus Clarke
FreeBSD GNOME Team	::	gnome@FreeBSD.org
FreeNode / #freebsd-gnome
http://www.FreeBSD.org/gnome
Comment 10 Kevin Oberman 2010-11-09 00:14:42 UTC
> Date: Mon, 08 Nov 2010 19:02:13 -0500
> From: Joe Marcus Clarke <marcus@freebsd.org>
> 
> On 11/5/10 1:33 PM, Andriy Gapon wrote:
> > on 05/11/2010 19:24 Andriy Gapon said the following:
> >> It looked like the issue was caused by the above write() call after a sufficiently
> >> quick child process had already exited.
> > 
> > The proper patch.
> 
> While I don't really have a problem with this patch, I'm am a bit
> curious as to why it's now needed.  Could either of you downgrade to
> dbus-1.2, then add something like:
> 
> g_warning ("XXX: Trying to write '%s' to stdin", r->input);
> 
> At line 285 (right below the "if").  I'd love to know what is being
> written there normally to try and see if there is a hidden problem with
> the dbus upgrade.  Thanks.

I'll try this as soon as I can. I'm not too sure that it will happen as
I think that this is somehow timing related. I suspect that the entry is
disappearing too quickly with 1.4 in some cases but is not a problem
with 1.2. Perhaps some optimization? 

I suggest this because on at least rare occasion, 1.4 did run
successfully, not because I have any clue what was happening under the
covers. 

The systems involved are VERY different as his is a fast 64-bit SMP
while mine is a slow (2 GHz) 32-bit uniprocessor. At least in my case I
typically have only the two internal USB connected devices, neither of
which have drivers. Not sure what USB devices Andriy might have.

I'll let you know the results ASAP, but I in the middle of brining up
DNS anycast for my employer and it is eating all of my free cycles until
it is on-line tomorrow.
-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634
Key fingerprint:059B 2DDF 031C 9BA3 14A4  EADA 927D EBB3 987B 3751
Comment 11 Andriy Gapon freebsd_committer freebsd_triage 2010-11-09 05:36:13 UTC
on 09/11/2010 02:14 Kevin Oberman said the following:
> I'll try this as soon as I can. I'm not too sure that it will happen as
> I think that this is somehow timing related. I suspect that the entry is
> disappearing too quickly with 1.4 in some cases but is not a problem
> with 1.2. Perhaps some optimization? 
> 
> I suggest this because on at least rare occasion, 1.4 did run
> successfully, not because I have any clue what was happening under the
> covers. 

I guess that I already explained this part.
The problem happened because we tried to write something (even if it's just zero
sized something) into stdin of a child process that already exited.
Sometimes the child process was quicker, sometimes the parent process was
quicker, hence the non-determinism.

-- 
Andriy Gapon
Comment 12 Joe Marcus Clarke freebsd_committer freebsd_triage 2010-11-09 05:47:33 UTC
On 11/9/10 12:36 AM, Andriy Gapon wrote:
> on 09/11/2010 02:14 Kevin Oberman said the following:
>> I'll try this as soon as I can. I'm not too sure that it will happen as
>> I think that this is somehow timing related. I suspect that the entry is
>> disappearing too quickly with 1.4 in some cases but is not a problem
>> with 1.2. Perhaps some optimization? 
>>
>> I suggest this because on at least rare occasion, 1.4 did run
>> successfully, not because I have any clue what was happening under the
>> covers. 
> 
> I guess that I already explained this part.
> The problem happened because we tried to write something (even if it's just zero
> sized something) into stdin of a child process that already exited.
> Sometimes the child process was quicker, sometimes the parent process was
> quicker, hence the non-determinism.
> 

Ah, I missed that.  I wonder if it would be safer then to ignore SIGPIPE
around the write block.

Joe

-- 
Joe Marcus Clarke
FreeBSD GNOME Team	::	gnome@FreeBSD.org
FreeNode / #freebsd-gnome
http://www.FreeBSD.org/gnome
Comment 13 Andriy Gapon freebsd_committer freebsd_triage 2010-11-09 05:52:48 UTC
on 09/11/2010 07:47 Joe Marcus Clarke said the following:
> On 11/9/10 12:36 AM, Andriy Gapon wrote:
>> on 09/11/2010 02:14 Kevin Oberman said the following:
>>> I'll try this as soon as I can. I'm not too sure that it will happen as
>>> I think that this is somehow timing related. I suspect that the entry is
>>> disappearing too quickly with 1.4 in some cases but is not a problem
>>> with 1.2. Perhaps some optimization? 
>>>
>>> I suggest this because on at least rare occasion, 1.4 did run
>>> successfully, not because I have any clue what was happening under the
>>> covers. 
>>
>> I guess that I already explained this part.
>> The problem happened because we tried to write something (even if it's just zero
>> sized something) into stdin of a child process that already exited.
>> Sometimes the child process was quicker, sometimes the parent process was
>> quicker, hence the non-determinism.
>>
> 
> Ah, I missed that.  I wonder if it would be safer then to ignore SIGPIPE
> around the write block.

Maybe.  But not calling write(2) when we don't have anything to write (zero
length) also looks like a good solution (for me personally).

My point is: zero-sized write in nothing but testing OS implementation details
of handling zero-sized writes, it doesn't perform any useful function.
OTOH, if a child process is supposed to get any actual input, then it won't exit
prematurely, but would block reading from its stdin until the input arrives.

But I think I am starting to repeat what I have already wrote before.

-- 
Andriy Gapon
Comment 14 Andriy Gapon freebsd_committer freebsd_triage 2010-11-09 05:54:32 UTC
on 09/11/2010 07:52 Andriy Gapon said the following:
> on 09/11/2010 07:47 Joe Marcus Clarke said the following:
>> On 11/9/10 12:36 AM, Andriy Gapon wrote:
>>> on 09/11/2010 02:14 Kevin Oberman said the following:
>>>> I'll try this as soon as I can. I'm not too sure that it will happen as
>>>> I think that this is somehow timing related. I suspect that the entry is
>>>> disappearing too quickly with 1.4 in some cases but is not a problem
>>>> with 1.2. Perhaps some optimization? 
>>>>
>>>> I suggest this because on at least rare occasion, 1.4 did run
>>>> successfully, not because I have any clue what was happening under the
>>>> covers. 
>>>
>>> I guess that I already explained this part.
>>> The problem happened because we tried to write something (even if it's just zero
>>> sized something) into stdin of a child process that already exited.
>>> Sometimes the child process was quicker, sometimes the parent process was
>>> quicker, hence the non-determinism.
>>>
>>
>> Ah, I missed that.  I wonder if it would be safer then to ignore SIGPIPE
>> around the write block.
> 
> Maybe.

Actually, please read the above as "probably no".
If a child process that is supposed to get input would crash, then such a change
would obfuscate diagnostics.

> But not calling write(2) when we don't have anything to write (zero
> length) also looks like a good solution (for me personally).
> 
> My point is: zero-sized write in nothing but testing OS implementation details
> of handling zero-sized writes, it doesn't perform any useful function.
> OTOH, if a child process is supposed to get any actual input, then it won't exit
> prematurely, but would block reading from its stdin until the input arrives.
> 
> But I think I am starting to repeat what I have already wrote before.
> 


-- 
Andriy Gapon
Comment 15 dfilter service freebsd_committer freebsd_triage 2010-11-09 06:12:10 UTC
marcus      2010-11-09 06:12:05 UTC

  FreeBSD ports repository

  Modified files:
    sysutils/hal         Makefile 
  Added files:
    sysutils/hal/files   patch-hald-runner_runner.c 
  Log:
  Fix a potential race with helper processes terminating too quickly.  When
  this happens, hald dies.
  
  PR:             151725
  Submitted by:   avg
  
  Revision  Changes    Path
  1.65      +1 -1      ports/sysutils/hal/Makefile
  1.1       +11 -0     ports/sysutils/hal/files/patch-hald-runner_runner.c (new)
_______________________________________________
cvs-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/cvs-all
To unsubscribe, send any mail to "cvs-all-unsubscribe@freebsd.org"
Comment 16 Joe Marcus Clarke freebsd_committer freebsd_triage 2010-11-09 06:12:11 UTC
State Changed
From-To: open->closed

Andriy's fix has been committed.