Bug 165360

Summary: net-im/ejabberd: ejabberdctl fails to communicate with ejabberd process
Product: Ports & Packages Reporter: neil
Component: Individual Port(s) Assignee: Ashish SHUKLA <ashish>
Status: Closed Overcome By Events    
Severity: Affects Only Me CC: cs, miwi
Priority: Normal    
Version: Latest   
Hardware: Any   
OS: Any   

Description neil 2012-02-21 13:00:20 UTC
With ejabberd-2.1.9 installed and configured, it is possible to start ejabberd with /usr/local/etc/rc.d/ejabberd start, but /usr/local/etc/rc.d/ejabberd stop fails to stop ejabberd with an RPC error. Similarly, /usr/local/etc/rc.d/ejabberd status indicates the node is not running when it clearly is.

/etc/rc.conf.local:
ejabberd_enable="YES"
ejabberd_node="ejabberd@bumblebee"

/etc/hosts:
192.168.1.2             bumblebee.darlow.co.uk bumblebee

hostname -s
bumblebee

How-To-Repeat: 1) Start ejabberd
bumblebee# /usr/local/etc/rc.d/ejabberd start
Starting ejabberd.

bumblebee # ps ax
 2185  ??  S      0:00.00 /usr/local/lib/erlang/erts-5.8.5/bin/epmd -daemon
 2187  ??  S      0:03.18 [beam.smp]
 2188  ??  Ss     0:00.02 /usr/local/lib/erlang/lib/ssl-4.1.6/priv/bin/ssl_esoc
 2189  ??  Is     0:00.01 inet_gethost 4
 2190  ??  I      0:00.00 inet_gethost 4

2) Query ejabberd status
bumblebee# /usr/local/etc/rc.d/ejabberd status
ejabberd is not running.

bumblebee# ps ax
 2185  ??  S      0:00.00 /usr/local/lib/erlang/erts-5.8.5/bin/epmd -daemon
 2187  ??  S      0:03.21 [beam.smp]
 2188  ??  Ss     0:00.02 /usr/local/lib/erlang/lib/ssl-4.1.6/priv/bin/ssl_esoc
 2189  ??  Is     0:00.01 inet_gethost 4
 2190  ??  I      0:00.00 inet_gethost 4

3) Attempt to stop ejabberd
bumblebee# /usr/local/etc/rc.d/ejabberd stop
Stopping ejabberd.
Failed RPC connection to the node ejabberd@bumblebee: nodedown

Commands to start an ejabberd node:
  start  Start an ejabberd node in server mode
  debug  Attach an interactive Erlang shell to a running ejabberd node
  live   Start an ejabberd node in live (interactive) mode

Optional parameters when starting an ejabberd node:
  --config-dir dir   Config ejabberd:    /usr/local/etc/ejabberd
  --config file      Config ejabberd:    /usr/local/etc/ejabberd/ejabberd.cfg
  --ctl-config file  Config ejabberdctl: /usr/local/etc/ejabberd/ejabberdctl.cfg
  --logs dir         Directory for logs: /var/log/ejabberd
  --spool dir        Database spool dir: /var/spool/ejabberd
  --node nodename    ejabberd node name: ejabberd@bumblebee

ejabberd not running?

4) Verify that ejabberd is operational
bumblebee# netstat -an
tcp4       0      0 *.5280                 *.*                    LISTEN
tcp4       0      0 *.5269                 *.*                    LISTEN
tcp4       0      0 *.5222                 *.*                    LISTEN
Comment 1 Mark Linimon freebsd_committer freebsd_triage 2012-02-22 03:42:44 UTC
Responsible Changed
From-To: freebsd-ports-bugs->ashish

Fix synopsis and assign.
Comment 2 Ashish SHUKLA freebsd_committer freebsd_triage 2012-05-18 23:24:48 UTC
Hi,

I've updated net-im/ejabberd port to 2.1.11. Could you please check if it
solves your problem?

Thanks
-- 
Ashish SHUKLA      | GPG: F682 CDCC 39DC 0FEA E116  20B6 C746 CFA9 E74F A4B0
freebsd.org!ashish | http://people.freebsd.org/~ashish/

Sent from my Emacs
Comment 3 neil 2012-05-22 11:04:02 UTC
Hi Ashish,

No, the update to ejabberd-2.1.11 has no effect. The problem still persists.

Regards,
Neil Darlow
Comment 4 Ashish SHUKLA freebsd_committer freebsd_triage 2012-05-22 12:28:22 UTC
Neil Darlow writes:
> Hi Ashish,


> No, the update to ejabberd-2.1.11 has no effect. The problem still persists.


Hi Neil,

Okay, thanks for the update. I'm away from my FreeBSD host. I'll look
at it when I'm back next week.

Thanks
-- 
Ashish SHUKLA      | GPG: F682 CDCC 39DC 0FEA E116  20B6 C746 CFA9 E74F A4B0
freebsd.org!ashish | http://people.freebsd.org/~ashish/

Sent from my Emacs
Comment 5 neil 2013-03-28 11:02:37 UTC
The problem with RPC failure appears to be related to hostname resolution.

With the default ejabberd_epmd_address of 127.0.0.1 it fails, but it works when
this is set to the IP address that the hostname part of your ejabberd@hostname
node name resolves to, in my case 192.168.1.2.
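
As an illustration only, the host part of the node name and the address it resolves to can be checked with a few lines of sh. The function names node_host and node_address are made up for this sketch, and it assumes getent(1) is available, as it is in the FreeBSD base system.

```shell
#!/bin/sh
# Hypothetical helper: print the host part of an Erlang node name,
# e.g. "ejabberd@bumblebee" -> "bumblebee".
node_host() {
    printf '%s\n' "${1#*@}"
}

# Hypothetical helper: print the first address the host part resolves to,
# using getent(1); prints nothing if the lookup fails.
node_address() {
    getent hosts "$(node_host "$1")" | awk 'NR == 1 { print $1 }'
}

# Example (the result depends on the local /etc/hosts and DNS):
#   node_address ejabberd@bumblebee
```

The address printed by node_address is the candidate value for ejabberd_epmd_address; whether it is the right value on a given host remains something to verify.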

There are two problems I noticed at shutdown:

1) ejabberd spawns epmd if it is not already running but it does not kill it 
at shutdown.

This can be fixed by using erlang's epmd script (add epmd_enable="YES" to a 
preferred rc.conf file). To ensure correct startup and shutdown ordering it is 
necessary to change ejabberd's # REQUIRE: line as follows:

# REQUIRE: DAEMON epmd

2) epmd will not respond to the -kill option in its shutdown command until all 
registered Names are removed (the -relaxed_command_check option is not used 
when epmd is started).

This is compounded because, once ejabberd has been signalled to stop by 
ejabberdctl, the epmd script shutdown command can be executed before ejabberd 
has unregistered its Name with epmd. This causes epmd to fail shutdown.

A solution is to query epmd for ejabberd's Name, during shutdown, repeatedly 
for up to 10 seconds (on my 1.6GHz server it takes about 6 seconds) waiting 
for ejabberd to unregister itself, after which epmd will shutdown cleanly.

So, in summary:
1) Add epmd_enable="YES" to a preferred rc.conf file
2) Add ejabberd_epmd_address="a.real.ip.address" to a preferred rc.conf file
3) The ejabberd control script requires changes as per this after-install 
patch:
--- /usr/local/etc/rc.d/ejabberd.orig   2013-03-28 10:19:25.000000000 +0000
+++ /usr/local/etc/rc.d/ejabberd        2013-03-28 10:20:36.000000000 +0000
@@ -2,7 +2,7 @@
 # $FreeBSD: ports/net-im/ejabberd/files/ejabberd.in,v 1.10 2012/11/17 06:00:26 svnexp Exp $
 
 # PROVIDE: ejabberd
-# REQUIRE: DAEMON
+# REQUIRE: DAEMON epmd
 # BEFORE: LOGIN
 # KEYWORD: shutdown
 
@@ -58,8 +58,12 @@
 {
     echo "Stopping $name."
     if su $EJABBERDUSER -c "env ERL_EPMD_ADDRESS=\"${ejabberd_epmd_address}\" $EJABBERDCTL --node $ejabberd_node stop"; then
-#        sleep 2
-#        killall -u ejabberd -kill
+        SECS=0
+        while /usr/local/bin/epmd -names |grep 'ejabberd' >/dev/null; do
+                sleep 1
+                SECS=$(expr $SECS + 1)
+                [ $SECS -eq 10 ] && exit
+        done
     else
         _run_rc_notrunning
     fi

The timeout implementation could probably be cleaned-up but the logic works 
well for me.
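
One possible clean-up of the timeout, offered as a sketch rather than a tested replacement: wrap the polling in a function that returns non-zero on timeout instead of calling exit from inside the loop. The function name wait_unregistered and the EPMD_NAMES_CMD override are hypothetical, and the match assumes epmd -names lists registered nodes as lines of the form "name <node> at port <n>".

```shell
#!/bin/sh
# Command used to list registered names. EPMD_NAMES_CMD is a hypothetical
# override so the loop can be exercised without a running epmd.
: "${EPMD_NAMES_CMD:=/usr/local/bin/epmd -names}"

# Poll until node name $1 is no longer registered with epmd.
# Gives up after $2 seconds (default 10) and returns 1; returns 0 otherwise.
wait_unregistered() {
    _name=$1
    _timeout=${2:-10}
    _secs=0
    # Assumed output format: "name <node> at port <n>", one line per node.
    while $EPMD_NAMES_CMD 2>/dev/null | grep -q "^name ${_name} at port "; do
        [ "$_secs" -ge "$_timeout" ] && return 1
        sleep 1
        _secs=$((_secs + 1))
    done
    return 0
}
```

In the stop function this could be called as wait_unregistered ejabberd 10, leaving the caller to decide what to do when the name is still registered after the timeout.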

Regards,
Neil Darlow
Comment 6 Ashish SHUKLA freebsd_committer freebsd_triage 2013-03-29 13:29:56 UTC
Hi,

First of all, thank you for your time in investigating this.

On Thu, 28 Mar 2013 11:10:01 GMT, Neil Darlow <neil@darlow.co.uk> said:

>  The problem with RPC failure appears to be related to hostname resolution.

 
>  Rather than using the ejabberd_epmd_address of 127.0.0.1, it works when this 
>  is set to the IP address that your ejabberd@hostname nodename resolves to (the 
>  hostname bit) in my case 192.168.1.2


I'm not sure if that's the real reason. From what I noticed in my testing a few
minutes ago with ejabberd 2.1.11 and erlang 16.b[1], if I run the system
epmd (i.e. $PREFIX/rc.d/epmd), ejabberd is able to register its name only
intermittently, i.e.

#v+
% sudo /usr/local/etc/rc.d/epmd start
% sudo /usr/local/etc/rc.d/ejabberd start
% sudo /usr/local/etc/rc.d/ejabberd status
#v-

Two instances of epmd were running: the system one (running as root) had
bound itself to *:4369, whereas the one spawned by ejabberd had bound itself
to 127.0.0.1.

Sometimes the above steps succeed (ejabberd is running) and sometimes not
(ejabberd is not running), whereas when I stopped all epmd instances and then
started only ejabberd, i.e.:

#v+
% sudo pkill epmd
% sudo /usr/local/etc/rc.d/ejabberd start
% sudo /usr/local/etc/rc.d/ejabberd status
#v-

Then it succeeded. And now only one epmd was running, which was spawned by
ejabberd.

So what I observed is that because two processes are listening on the same
port with overlapping addresses, it is non-deterministic which process handles
a connection to 127.0.0.1:4369. If we can instruct ejabberd not to start
another epmd when one is already running, then maybe we can achieve some level
of determinism here.
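
A quick way to see whether more than one epmd is listening on port 4369 is to count the matching sockets. The following is only a sketch: the function name count_epmd_listeners is made up, and it assumes the FreeBSD sockstat(1) column layout (USER COMMAND PID FD PROTO LOCAL-ADDRESS FOREIGN-ADDRESS).

```shell
#!/bin/sh
# Count epmd sockets bound to TCP port 4369, reading sockstat-style output
# on stdin. Column 2 is assumed to be the command name and column 6 the
# local address (e.g. "*:4369" or "127.0.0.1:4369").
count_epmd_listeners() {
    awk '$2 == "epmd" && $6 ~ /:4369$/ { n++ } END { print n + 0 }'
}

# Typical use on a live FreeBSD system:
#   sockstat -4 -l -p 4369 | count_epmd_listeners
```

A result greater than 1 would correspond to the overlapping-listener situation described above.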

It might be possible that the epmd shipped with erlang 16b has some fixes which
make things slightly more reliable.

>  There are two problems I noticed at shutdown:

 
>  1) ejabberd spawns epmd if it is not already running but it does not kill it 
>  at shutdown.

 
>  This can be fixed by using erlang's epmd script (add epmd_enable="YES" to a 
>  preferred rc.conf file). To ensure correct startup and shutdown ordering it is 
>  necessary to change ejabberd's # REQUIRE: line as follows:

 
>  # REQUIRE: DAEMON epmd


Right, but I'm not sure if it'll fix this:

#v+
root      90075   0.0  0.0  14228   1620 ??  S     1:06PM      0:00.00 /usr/local/bin/epmd -daemon
ejabberd  90115   0.0  0.0  14228   1632 ??  S     1:06PM      0:00.00 /usr/local/lib/erlang/erts-5.10.1/bin/epmd -daemon
#v-

 
>  2) epmd will not respond to the -kill option in its shutdown command until all 
>  registered Names are removed (the -relaxed_command_check option is not used 
>  when epmd is started).

 
>  This is compounded because, once ejabberd has been signalled to stop by 
>  ejabberdctl, the epmd script shutdown command can be executed before ejabberd 
>  has unregistered its Name with epmd. This causes epmd to fail shutdown.

 
>  A solution is to query epmd for ejabberd's Name, during shutdown, repeatedly 
>  for up to 10 seconds (on my 1.6GHz server it takes about 6 seconds) waiting 
>  for ejabberd to unregister itself, after which epmd will shutdown cleanly.

 
>  So, in summary:
>  1) Add epmd_enable="YES" to a preferred rc.conf file
>  2) Add ejabberd_epmd_address="a.real.ip.address" to a preferred rc.conf file
>  3) The ejabberd control script requires changes as per this after-install 
>  patch:
>  --- /usr/local/etc/rc.d/ejabberd.orig   2013-03-28 10:19:25.000000000 +0000
>  +++ /usr/local/etc/rc.d/ejabberd        2013-03-28 10:20:36.000000000 +0000
>  @@ -2,7 +2,7 @@
>   # $FreeBSD: ports/net-im/ejabberd/files/ejabberd.in,v 1.10 2012/11/17 06:00:26 
>  svnexp Exp $

  
>   # PROVIDE: ejabberd
>  -# REQUIRE: DAEMON
>  +# REQUIRE: DAEMON epmd
>   # BEFORE: LOGIN
>   # KEYWORD: shutdown

  
>  @@ -58,8 +58,12 @@
>   {
>       echo "Stopping $name."
>       if su $EJABBERDUSER -c "env ERL_EPMD_ADDRESS=\"${ejabberd_epmd_address}\" 
>  $EJABBERDCTL --node $ejabberd_node stop"; then
>  -#        sleep 2
>  -#        killall -u ejabberd -kill
>  +        SECS=0
>  +        while /usr/local/bin/epmd -names |grep 'ejabberd' >/dev/null; do
>  +                sleep 1
>  +                SECS=$(expr $SECS + 1)
>  +                [ $SECS -eq 10 ] && exit
>  +        done
>       else
>           _run_rc_notrunning
>       fi


The patch seems good, though I've not tried it yet. I'll try it and provide
feedback, probably next week, though it can't be committed until the ports
freeze is over.

References:
[1]  https://svn.redports.org/olgeni/lang/erlang/

Thanks
-- 
Ashish SHUKLA      | GPG: F682 CDCC 39DC 0FEA E116  20B6 C746 CFA9 E74F A4B0
Sent from my Emacs
Comment 7 neil 2013-03-30 08:36:08 UTC
On Friday 29 March 2013 18:59:56 Ashish SHUKLA wrote:
> First of all, thank you for your time in investigating this.

You are welcome. I am aware that maintainers may not be able to reproduce 
specific problems experienced by users so it is only right that I should assist 
wherever possible.

> I'm not sure if that's the real reason. From what I noticed in my testing
> few minutes ago with ejabberd 2.1.11, and erlang 16.b[1], that if I run
> system epmd (i.e. $PREFIX/rc.d/epmd), it's able to register its name
> intermittently, i.e.
> 
> #v+
> % sudo /usr/local/etc/rc.d/epmd start
> % sudo /usr/local/etc/rc.d/ejabberd start
> % sudo /usr/local/etc/rc.d/ejabberd status
> #v-
> 
> Two instances of epmd were running, system one (the one running under root
> user) has bound itself to *:4369, whereas the ejabberd one has bound itself
> to 127.0.0.1.

I found this also, until I changed "ejabberd_epmd_address" to the physical IP 
address of my network interface (which is also what my hostname resolves to - 
coincidence?).

> So from what I observed is that because two processes are listening on same
> port with overlapping addresses, it adds a level of non-determinism on which
> process is going to handle connection for 127.0.0.1:4369. Probably if we
> can instruct ejabberd to not start another epmd if one is already running,
> then may be we can achieve some level of determinism with this thing.

Do you have faster hardware than my 1.6GHz system? Is it possible that epmd 
has not started before ejabberd starts and it does not see it as running? That 
might explain why you see two instances of epmd.

I must admit, in the heavy testing I performed, this situation was not 
arising. I did repeated "epmd start; ejabberd start", "ejabberd stop; epmd 
stop" and reboot sequences.

They all resulted in the following:
  786 ??  S       0:01.54 /usr/local/bin/epmd -daemon
  788 ??  S       0:32.13 /usr/local/sbin/winbindd -s /usr/local/etc/smb.conf
  789 ??  S       0:00.62 /usr/local/sbin/smbd -D -s /usr/local/etc/smb.conf
  803 ??  I       0:24.54 /usr/local/lib/erlang/erts-5.9.3.1/bin/beam.smp -P 250000 -- -root /usr/local/lib/erlang -progname erl -- -home /var/spool/ejabberd -- -sname ejabberd@bumblebee -noshell -noinput -noshell -noinput -pa /usr/local/lib/erlang/lib/ejabberd-2.1.11/ebin -mnesia dir "/var/spool/ejabberd" -kernel inet_dist_use_interface {192,168,1,2} -s ejabberd -sasl sasl_error_logger {file,"/var/log/ejabberd/erlang.log"} -smp auto start start
  807 ??  Is      0:00.02 inet_gethost 4
  808 ??  I       0:00.08 inet_gethost 4

This is from a boot invocation. Manual startup results in beam.smp immediately 
following epmd in the process listing, as you would expect.

> Right, but I'm not sure if it'll fix this:
> 
> #v+
> root      90075   0.0  0.0  14228   1620 ??  S     1:06PM      0:00.00 /usr/local/bin/epmd -daemon
> ejabberd  90115   0.0  0.0  14228   1632 ??  S     1:06PM      0:00.00 /usr/local/lib/erlang/erts-5.10.1/bin/epmd -daemon
> #v-

I have not seen two instances of epmd running since I changed 
ejabberd_epmd_address from its default value of 127.0.0.1. If you have done 
that too, then there must be another cause, like the one I mentioned earlier.

> The patch seems good, though I've not tried yet. I'll try it and provide a
> feedback probably next week, though it can't be committed until ports freeze
> is over.

Well, before that commit we still need to resolve the issue that you 
experience and I currently do not.

Regards,
Neil Darlow
Comment 8 Carlo Strub freebsd_committer freebsd_triage 2014-09-11 20:14:49 UTC
Is this PR still relevant?
Comment 9 neil 2014-09-12 09:36:58 UTC
It still appears to be relevant. I apply my patch at each update.
Comment 10 Carlo Strub freebsd_committer freebsd_triage 2014-09-13 13:27:16 UTC
Can you please add your patch as an attachment?
Comment 11 neil 2014-09-13 19:05:39 UTC
It is actually part of Comment 5.

I cannot usefully attach it as a patch because it modifies files/ejabberd.in

It really needs the maintainer to decide whether he wants to adopt the change and if he agrees with the logic behind it.

Regards,
Neil Darlow
Comment 12 Carlo Strub freebsd_committer freebsd_triage 2014-09-13 19:15:30 UTC
Ok. Make sure you hammer the maintainer. If he does not respond, please get back to me.
Comment 13 Ashish SHUKLA freebsd_committer freebsd_triage 2014-09-14 02:21:29 UTC
(In reply to neil from comment #11)
> It is actually part of Comment 5.
> 
> I cannot usefully attach it as a patch because it modifies files/ejabberd.in
> 
> It really needs the maintainer to decide whether he wants to adopt the
> change and if he agrees with the logic behind it.
> 
> Regards,
> Neil Darlow

Hi Neil,

Sorry for the extreme delay on my part. The patch works as intended if its prerequisites (epmd enabled in rc.conf, and ejabberd_epmd_address set) are met, which we don't control.

So, with the help of your changeset, I prepared the following diff, which explicitly kills epmd after all names are unregistered and does not require erlang's epmd:

Index: ejabberd.in
===================================================================
--- ejabberd.in	(revision 29885)
+++ ejabberd.in	(working copy)
@@ -58,8 +58,13 @@
 {
     echo "Stopping $name."
     if su $EJABBERDUSER -c "env ERL_EPMD_ADDRESS=\"${ejabberd_epmd_address}\" $EJABBERDCTL --node $ejabberd_node stop"; then
-#        sleep 2
-#        killall -u ejabberd -kill
+        SECS=0
+        while %%LOCALBASE%%/bin/epmd -names |fgrep -q ejabberd; do
+                sleep 1
+                SECS=$(expr $SECS + 1)
+                [ $SECS -eq 10 ] && exit
+        done
+        pkill -j none -u $EJABBERDUSER epmd
     else
         _run_rc_notrunning
     fi

How does this look to you?

Thanks,
Ashish

P.S. I'll only be able to commit it on/after Wednesday September 24, 2014.
Comment 14 neil 2014-09-28 09:58:47 UTC
Hi,

The problem I can see with that solution is that it assumes ejabberd spawned the epmd process.

What if another erlang-based port is using epmd launched from the epmd RC script?

I do not think it is a good idea that ejabberd's RC kills epmd unless it can be sure it is killing an instance launched by itself.
Comment 15 Ashish SHUKLA freebsd_committer freebsd_triage 2014-09-28 16:05:00 UTC
(In reply to neil from comment #14)
> Hi,
> 
> The problem I can see with that solution is that it assumes ejabberd spawned
> the epmd process.
> 
> What if another erlang-based port is using epmd launched from the epmd RC
> script?
> 
> I do not think it is a good idea that ejabberd's RC kills epmd unless it can
> be sure it is killing an instance launched by itself.

Although it's not guaranteed that it'll only kill the 'epmd' process spawned by ejabberd, it'll only kill an 'epmd' process owned by the 'ejabberd' user.

The only place this won't work is when the user runs multiple 'ejabberd' instances under the same 'ejabberd' user[1] on the same host. And since this rc.d script is used to manage a single 'ejabberd' instance, it would be useless in that case anyway.

Do you still see problem?

References:
[1] No idea whether it's really possible though, courtesy: $HOME/.erlang.cookie

Thanks!
Ashish
Comment 16 Ashish SHUKLA freebsd_committer freebsd_triage 2014-10-06 16:22:28 UTC
(In reply to Ashish SHUKLA from comment #15)
> (In reply to neil from comment #14)
> > Hi,
> > 
> > The problem I can see with that solution is that it assumes ejabberd spawned
> > the epmd process.
> > 
> > What if another erlang-based port is using epmd launched from the epmd RC
> > script?
> > 
> > I do not think it is a good idea that ejabberd's RC kills epmd unless it can
> > be sure it is killing an instance launched by itself.
> 
> Although, it's not guaranteed that it'll only kill the 'epmd' process
> spawned by ejabberd, but it'll only kill the `epmd' process owned by
> 'ejabberd' user.
> 
> The only place this won't work is when user runs multiple 'ejabberd'
> instances under same 'ejabberd' user[1] on same host. And since this rc.d
> script is used to  manage single 'ejabberd' instance, so this would be
> useless anyways.
> 
> Do you still see problem?

Hi Neil,

Sorry to bother you. I'm wondering if you got a chance to look at it, and whether it's fine to commit this diff?

Thanks,
Ashish
Comment 17 neil 2014-10-07 10:49:24 UTC
Hi Ashish,

I just had a quick look at your patch. Unfortunately it still presents the RPC error message and leaves epmd and ejabberd running.

The method I developed still seems to be the most reliable at shutting down ejabberd. Although it won't survive an Erlang update - you have to stop ejabberd before updating Erlang because ERTS version gets encoded into the runtime path.

Regards,
Neil Darlow
Comment 18 Ashish SHUKLA freebsd_committer freebsd_triage 2014-10-07 11:23:05 UTC
(In reply to neil from comment #17)
> Hi Ashish,
> 
> I just had a quick look at your patch. Unfortunately it still presents the
> RPC error message and leaves epmd and ejabberd running.
> 
> The method I developed still seems to be the most reliable at shutting down
> ejabberd. Although it won't survive an Erlang update - you have to stop
> ejabberd before updating Erlang because ERTS version gets encoded into the
> runtime path.
> 
> Regards,
> Neil Darlow

Okay, I'm able to reproduce it now as well. I was missing `ejabberd_node' variable from rc.conf(5). I'll see if I can incorporate your solution cleanly in the rc.d script.

Ashish
Comment 19 Martin Wilke freebsd_committer freebsd_triage 2016-01-18 06:42:59 UTC
Hi,

Any progress here?
Comment 20 neil 2016-01-19 11:42:23 UTC
I think we can close this as long as the following proviso is understood.

The erts version is encoded into paths used by executables. This means that when an underlying erlang port update changes the erts version it is necessary to stop ejabberd before erlang and ejabberd are updated. If this is not done then the epmd and beam processes will have to be terminated manually. Actually, the epmd process spawned by ejabberd is uncontrolled anyway and requires manual killing whenever a restart of it is required.
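
As a rough aid for spotting this situation before an update, the ERTS version can be pulled out of a process path and compared with the installed erts-* directory. This is a sketch only; the function name erts_version_of is made up here.

```shell
#!/bin/sh
# Hypothetical helper: extract the ERTS version from a path such as
# /usr/local/lib/erlang/erts-5.8.5/bin/epmd. Prints nothing if the path
# contains no erts-<version> component.
erts_version_of() {
    printf '%s\n' "$1" | sed -n 's|.*/erts-\([0-9][0-9.]*\)/.*|\1|p'
}

# Example:
#   erts_version_of /usr/local/lib/erlang/erts-5.8.5/bin/epmd
```

If the version reported for a running epmd or beam.smp differs from the erts-* directory currently installed under /usr/local/lib/erlang, those processes were started from the old runtime and would need the manual termination described above.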
Comment 21 Martin Wilke freebsd_committer freebsd_triage 2016-01-19 11:48:38 UTC
Close per request. Thank you.