Bug 224079

Summary: java/openjdk8: Elasticsearch won't start after OpenJDK upgrade
Product: Ports & Packages Reporter: nagy.attila
Component: Individual Port(s)Assignee: freebsd-java (Nobody) <java>
Status: Closed FIXED    
Severity: Affects Some People CC: 000.fbsd, elastic, feld, girgen, glewis, john
Priority: --- Flags: bugzilla: maintainer-feedback? (java)
Version: Latest   
Hardware: Any   
OS: Any   
Attachments:
Description Flags
java/openjdk8: Preserve OS-supplied IPv6 interface scope IDs
john: maintainer-approval?
DTrace script to monitor bind(2) calls
none
Program to demonstrate getifaddr(3) behavior
none
Program to demonstrate bind(2) behavior none

Description nagy.attila 2017-12-04 12:53:58 UTC
When I start Elasticsearch (5.6.4, with the default configuration) with OpenJDK8 8.152.16, it dies with:
[2017-12-04T13:49:57,971][INFO ][o.e.n.Node               ] initialized
[2017-12-04T13:49:57,972][INFO ][o.e.n.Node               ] [amnEplp] starting ...
[2017-12-04T13:49:58,317][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: BindTransportException[Failed to bind to [9300-9400]]; nested: BindException[Can't assign requested address];
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.6.4.jar:5.6.4]
Caused by: org.elasticsearch.transport.BindTransportException: Failed to bind to [9300-9400]
	at org.elasticsearch.transport.TcpTransport.bindToPort(TcpTransport.java:783) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.transport.TcpTransport.bindServer(TcpTransport.java:748) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.transport.netty4.Netty4Transport.doStart(Netty4Transport.java:174) ~[?:?]
	at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.transport.TransportService.doStart(TransportService.java:219) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:69) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.node.Node.start(Node.java:694) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:278) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:351) ~[elasticsearch-5.6.4.jar:5.6.4]
	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.4.jar:5.6.4]
	... 6 more
Caused by: java.net.BindException: Can't assign requested address
	at sun.nio.ch.Net.bind0(Native Method) ~[?:?]
	at sun.nio.ch.Net.bind(Net.java:433) ~[?:?]
	at sun.nio.ch.Net.bind(Net.java:425) ~[?:?]
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) ~[?:?]
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:554) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1258) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:980) ~[?:?]
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:250) ~[?:?]
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:365) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]
[2017-12-04T13:49:59,027][INFO ][o.e.n.Node               ] [amnEplp] stopping ...

Taking a look at it with ktrace shows that this is where it stops:
 99579 java     CALL  bind(0x79,0x7fffdeeec278,0x1c)
 99579 java     STRU  struct sockaddr { AF_INET6, [::1%2]:9300 }
 99579 java     RET   bind -1 errno 49 Can't assign requested address

Elasticsearch works fine with the previous OpenJDK version: 8.144.1
Comment 1 Palle Girgensohn freebsd_committer freebsd_triage 2017-12-04 14:56:01 UTC
Hi!

just a couple of things to check. It fails to open a socket. This could be because it tries IPv6 first. Could you try changing the /etc/hosts entry for localhost between 

127.0.0.1 localhost
::1       localhost


and just

127.0.0.1 localhost
# ::1     localhost

and just see if this is the culprit?

this indicates that it is trying IPv6 and fails:

99579 java     CALL  bind(0x79,0x7fffdeeec278,0x1c)
99579 java     STRU  struct sockaddr { AF_INET6, [::1%2]:9300 }
99579 java     RET   bind -1 errno 49 Can't assign requested address

perhaps running in IPv4 would be OK?




also, make sure, using `sockstat|grep 9300`, that no other process hogs the socket.

Palle
Comment 2 nagy.attila 2017-12-04 17:29:42 UTC
Hi,

Of course nothing listens on that port. I should've tell what I did so far.
Listening on directly specified addresses work (be it IPv4 or IPv6, setting network.host to a given address in elasticsearch.yml).

Only the default doesn't work, with the given scoped address.
OpenJDK 8.144 is fine, 152 is not. I'm not sure whether it's FreeBSD related or not.
Comment 3 Miroslav Lachman 2017-12-16 12:04:44 UTC
Exception
org.elasticsearch.transport.BindTransportException: Failed to bind to [9300-9400]

I can confirm this bug with:
elasticsearch5-5.3.0_1
elasticsearch5-x-pack-5.3.0
openjdk8-8.152.16_1

Elastic was working fine with previous Java version.

I had this in elasticsearch.yml
network.host: [ _local_, _site_ ]

It does not work anymore. I changed it to 
network.host: [ 127.0.0.1, 172.28.4.20, 172.28.4.28 ]


Removing ::1  localhost  from /etc/hosts did not help.

My Java knowledge is close to zero. Can anybody look in to OpenJDK FreeBSD version of port what is causing this bug? 
It causes unwanted downtime after pkg upgrade.
Comment 4 John W. O'Brien 2017-12-16 19:32:02 UTC
Created attachment 188887 [details]
java/openjdk8: Preserve OS-supplied IPv6 interface scope IDs

The problem is in the way u152 started handling IPv6 scope IDs in the java.net.NetworkInterface class. This patch corrects that defect and allows elasticsearch (ES) to start. Read on for details of my investigation and analysis.

My attention was drawn to "[::1%2]" in the ktrace output. This looked wrong to me.

I adapted the soconnect.d DTrace script from Gregg and Mauro [0] to get a look at the bind(2) calls (sobind.d).

On a machine with u144 where ES starts:

PID    PROCESS          FAM ADDR                                    SCOPE    PORT
44187  java             28  fe80::1                                 3        9300
44187  java             28  ::1                                     0        9300
44187  java             28  127.0.0.1                               0        9300
44187  java             28  fe80::1                                 3        9200
44187  java             28  ::1                                     0        9200
44187  java             28  127.0.0.1                               0        9200


On a machine with u152 where ES fails:

PID    PROCESS          FAM ADDR                                    SCOPE    PORT
65851  java             28  fe80::1                                 3        9300
65851  java             28  ::1                                     3        9300
65851  java             28  ::1                                     3        9301
65851  java             28  ::1                                     3        9302
65851  java             28  ::1                                     3        9303
65851  java             28  ::1                                     3        9304
[repeat through PORT 9400]

Next, I wrote a short C program to exercise getifaddr(3) and ran it on a few different systems. I also wrote a C program to try calling bind(2) on an arbitrary IPv6 address, scope ID, and port. This is the simplest, most direct way I could think of to demonstrate the problem and confirm my understanding of the applicable APIs.

FreeBSD 10.4-RELEASE-p3:

$ ./gifa lo0
iface flags       af addr                                           scope ifidx
lo0   0x00008049  18                                                   -1     3
lo0   0x00008049  28 ::1                                                0     3
lo0   0x00008049  28 fe80::1                                            3     3
lo0   0x00008049   2 127.0.0.1                                         -1     3
$ ./trybind ::1 0 9300 && echo OK
OK
$ ./trybind ::1 3 9300 && echo OK
Could not bind: Can't assign requested address
$ ./trybind ::1 999 9300 && echo OK
Could not bind: Can't assign requested address


RedHat Enterprise Linux 6.9:

$ ./gifa lo
iface flags       af addr                                           scope ifidx
lo    0x00010049  17                                                   -1     1
lo    0x00010049   2 127.0.0.1                                         -1     1
lo    0x00010049  10 ::1                                                0     1
$ ./trybind ::1 0 9300 && echo OK
OK
$ ./trybind ::1 1 9300 && echo OK
OK
$ ./trybind ::1 999 9300 && echo OK
OK


macOS Sierra 10.12.6:

$ ./gifa lo0
iface flags       af addr                                           scope ifidx
lo0   0x00008049  18                                                   -1     1
lo0   0x00008049   2 127.0.0.1                                         -1     1
lo0   0x00008049  30 ::1                                                0     1
lo0   0x00008049  30 fe80::1                                            1     1
$ ./trybind ::1 0 9300 && echo OK
OK
$ ./trybind ::1 1 9300 && echo OK
OK
$ ./trybind ::1 999 9300 && echo OK
OK


From these results I infer that RHEL and macOS ignore sin6_scope_id unless it's needed to disambiguate an address known to be scoped, while FreeBSD always considers the scope ID part of the address and treats scope 0 as the unscoped scope.

For reference, the POSIX spec [1] states:

"The sin6_scope_id field is a 32-bit integer that identifies a set of interfaces as appropriate for the scope of the address carried in the sin6_addr field. For a link scope sin6_addr, the application shall ensure that sin6_scope_id is a link index. For a site scope sin6_addr, the application shall ensure that sin6_scope_id is a site index. The mapping of sin6_scope_id to an interface or set of interfaces is implementation-defined."

Is the loopback address scoped? According to RFC-4007 [2], "::1, is treated as having link-local scope".

The OpenJDK patch that introduced the breakage is a changeset [3] that modifies the java.net.NetworkInterface class to unconditionally jam the interface index into sin6_scope_id (see diff lines 1.926, 1.1274, and 1.1662). This is unnecessary for any address that is scoped, because the OS will have already populated sin6_scope_id with the correct link index. This is also incorrect for any address that is not scoped, because it constitutes a JDK-defined mapping of scope ID to an interface or set of interfaces, whereas the OS is entitled to define that mapping.

I have found nowhere else where OpenJDK depends upon finding the interface index in the sin6_scope_id field.

[0] https://github.com/brendangregg/DTrace-book-scripts/blob/master/Chap6/soconnect.d
[1] http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
[2] https://tools.ietf.org/html/rfc4007#section-4
[3] http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/rev/3dc438e0c8e1
Comment 5 John W. O'Brien 2017-12-16 19:32:45 UTC
Created attachment 188888 [details]
DTrace script to monitor bind(2) calls
Comment 6 John W. O'Brien 2017-12-16 19:34:37 UTC
Created attachment 188889 [details]
Program to demonstrate getifaddr(3) behavior

To use:
$ cc -o gifa gifa.c
$ ./gifa            # all interfaces
$ ./gifa lo0        # only lo0
Comment 7 John W. O'Brien 2017-12-16 19:36:55 UTC
Created attachment 188890 [details]
Program to demonstrate bind(2) behavior

To use:
$ cc -o trybind trybind.c
$ ./trybind ::1 0 9300    # attempt to bind a TCP socket on [::1%0]:9300

Exit status is 0 on success, 1 on failure.
Comment 8 commit-hook freebsd_committer freebsd_triage 2017-12-25 06:26:54 UTC
A commit references this bug:

Author: glewis
Date: Mon Dec 25 06:25:53 UTC 2017
New revision: 457209
URL: https://svnweb.freebsd.org/changeset/ports/457209

Log:
  . Don't incorrectly set the scopeId of an IPv6 interface on BSD.

    This fixes binding sockets for applications running on IPv6.

  PR:		224079
  Submitted by:	John W. O'Brien <john@saltant.com>

Changes:
  head/java/openjdk8/Makefile
  head/java/openjdk8/files/patch-jdk-src-solaris-classes-native-java-net-NetworkInterface.c
Comment 9 Greg Lewis freebsd_committer freebsd_triage 2017-12-25 06:27:37 UTC
John, are you willing to sign an Oracle Contributor Agreement so this can be pushed back into the source repo?
Comment 10 John W. O'Brien 2017-12-25 17:54:21 UTC
(In reply to Greg Lewis from comment #9)

Yes, I am willing. I have contacted the OCA email account for instructions for securely submitting the OCA (OpenPGP or fax).
Comment 11 John W. O'Brien 2018-01-13 13:32:08 UTC
(In reply to Greg Lewis from comment #9)

I have received no response to my OCA inquiry. How do you suggest I proceed?
Comment 12 Greg Lewis freebsd_committer freebsd_triage 2018-03-04 17:38:13 UTC
Can you send me the details of the email that you sent?  I can take it up with someone at Oracle directly.
Comment 13 John W. O'Brien 2018-03-04 18:37:11 UTC
(In reply to Greg Lewis from comment #12)

I just forwarded to you my initial and follow-up emails to Oracle.
Comment 14 Greg Lewis freebsd_committer freebsd_triage 2019-08-02 04:26:21 UTC
Closing this because the issue has fixed and development has moved out of the OpenJDK mercurial repos and into Github.