Bug 282217 - databases/couchdb3: Crashes after a few minutes
Summary: databases/couchdb3: Crashes after a few minutes
Status: In Progress
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any
OS: Any
Importance: --- Affects Many People
Assignee: Dave Cottlehuber
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-10-20 01:24 UTC by Erik Nordstrøm
Modified: 2024-11-22 17:33 UTC
CC: 1 user

See Also:
bugzilla: maintainer-feedback? (dch)


Attachments
/usr/local/etc/couchdb3/local.ini config file (3.82 KB, text/plain)
2024-10-20 01:47 UTC, Erik Nordstrøm
no flags Details
/usr/local/etc/couchdb3/vm.args (3.82 KB, text/plain)
2024-10-23 23:24 UTC, Erik Nordstrøm
no flags Details

Description Erik Nordstrøm 2024-10-20 01:24:27 UTC
I am running CouchDB 3.3.3, installed via the latest package available in the quarterly package repository.

Package version couchdb3-3.3.3_1

Host OS: FreeBSD 14.1-RELEASE-p5

I created a jail, with a config that looks like this:

    db-4 {
      host.hostname = "db-4";

      vnet;
      vnet.interface = "e12b_db_4";
      vnet.interface += "e13b_db_4";

      exec.poststop += "ifconfig e12a_db_4 destroy";
      exec.poststop += "ifconfig e13a_gh_4 destroy";
      exec.poststop += "ifconfig epair12 create";
      exec.poststop += "ifconfig epair13 create";

      sysvsem = "new";
      sysvmsg = "new";
      sysvshm = "new";
    }

The jail has the same version of FreeBSD as the host.

I installed CouchDB 3.3.3 with `pkg install couchdb3`.

I edited the `/usr/local/etc/couchdb3/local.ini` config file according to the package message, and set it to listen on the IPv6 wildcard address `::`.
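To be explicit, the relevant excerpt of my `local.ini` looks roughly like this (a sketch; the section and option names are from the CouchDB documentation, and the admin line uses the same placeholder password as the curl commands below, not my real credentials):

```ini
; /usr/local/etc/couchdb3/local.ini (excerpt)
[chttpd]
; listen on the IPv6 wildcard address
bind_address = ::
port = 5984

[admins]
; plaintext here is hashed by CouchDB on first startup
admin = notmyactualpasswordobviously
```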

I enabled the couchdb3 service and started it.

I sent the HTTP PUT requests as instructed:

    curl -X PUT 'http://admin:notmyactualpasswordobviously@[::]:5984/_users'

    curl -X PUT 'http://admin:notmyactualpasswordobviously@[::]:5984/_replicator'

    curl -X PUT 'http://admin:notmyactualpasswordobviously@[::]:5984/_global_changes'

The first PUT command succeeded. The second and third ones returned error responses similar to:

    {"error":"unknown_error","reason":"badarg","ref":1926143316}

After a few minutes, CouchDB stops running: `ps wwwaux | grep couchdb3` no longer matches any process.

I turned on logging to file at level info.

The first few lines of the CouchDB log file look like the following:

    [info] 2024-10-20T00:46:44.941085Z couchdb@127.0.0.1 <0.254.0> -------- Preflight check: Checking For Monsters

    [info] 2024-10-20T00:46:44.941137Z couchdb@127.0.0.1 <0.254.0> -------- Preflight check: Asserting Admin Account

    [info] 2024-10-20T00:46:44.943129Z couchdb@127.0.0.1 <0.254.0> -------- Apache CouchDB 3.3.3 is starting.

    [info] 2024-10-20T00:46:44.943167Z couchdb@127.0.0.1 <0.255.0> -------- Starting couch_sup
    [info] 2024-10-20T00:46:44.974439Z couchdb@127.0.0.1 <0.254.0> -------- Apache CouchDB has started. Time to relax.

    [notice] 2024-10-20T00:46:44.978854Z couchdb@127.0.0.1 <0.329.0> -------- rexi_server : started servers
    [notice] 2024-10-20T00:46:44.979659Z couchdb@127.0.0.1 <0.333.0> -------- rexi_buffer : started servers
    [error] 2024-10-20T00:46:45.011115Z couchdb@127.0.0.1 emulator -------- Error in process <0.353.0> on node 'couchdb@127.0.0.1' with exit value:
    {badarg,[{erlang,list_to_integer,[[48,48,48,13,48,13,1,8],16],[{error_info,#{module => erl_erts_errors}}]},{mem3_util,'-build_shards_by_node/2-fun-0-',5,[{file,"src/mem3_util.erl"},{line,228}]},{lists,map,2,[{file,"lists.erl"},{line,1315}]},{lists,flatmap_1,2,[{file,"lists.erl"},{line,1335}]},{mem3_shards,fold_fun,2,[{file,"src/mem3_shards.erl"},{line,332}]},{couch_bt_engine,drop_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1161}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]}]}

    [error] 2024-10-20T00:46:45.011204Z couchdb@127.0.0.1 emulator -------- Error in process <0.353.0> on node 'couchdb@127.0.0.1' with exit value:
    {badarg,[{erlang,list_to_integer,[[48,48,48,13,48,13,1,8],16],[{error_info,#{module => erl_erts_errors}}]},{mem3_util,'-build_shards_by_node/2-fun-0-',5,[{file,"src/mem3_util.erl"},{line,228}]},{lists,map,2,[{file,"lists.erl"},{line,1315}]},{lists,flatmap_1,2,[{file,"lists.erl"},{line,1335}]},{mem3_shards,fold_fun,2,[{file,"src/mem3_shards.erl"},{line,332}]},{couch_bt_engine,drop_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1161}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]}]}

    [error] 2024-10-20T00:46:45.011323Z couchdb@127.0.0.1 emulator -------- Error in process <0.363.0> on node 'couchdb@127.0.0.1' with exit value:
    {badarg,[{erlang,list_to_integer,[[48,48,48,13,48,13,1,8],16],[{error_info,#{module => erl_erts_errors}}]},{mem3_util,'-build_shards_by_node/2-fun-0-',5,[{file,"src/mem3_util.erl"},{line,228}]},{lists,map,2,[{file,"lists.erl"},{line,1315}]},{lists,flatmap_1,2,[{file,"lists.erl"},{line,1335}]},{mem3_shards,fold_fun,2,[{file,"src/mem3_shards.erl"},{line,332}]},{couch_bt_engine,drop_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1161}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]}]}

    [error] 2024-10-20T00:46:45.011405Z couchdb@127.0.0.1 emulator -------- Error in process <0.363.0> on node 'couchdb@127.0.0.1' with exit value:
    {badarg,[{erlang,list_to_integer,[[48,48,48,13,48,13,1,8],16],[{error_info,#{module => erl_erts_errors}}]},{mem3_util,'-build_shards_by_node/2-fun-0-',5,[{file,"src/mem3_util.erl"},{line,228}]},{lists,map,2,[{file,"lists.erl"},{line,1315}]},{lists,flatmap_1,2,[{file,"lists.erl"},{line,1335}]},{mem3_shards,fold_fun,2,[{file,"src/mem3_shards.erl"},{line,332}]},{couch_bt_engine,drop_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1161}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]}]}

The log file quickly fills with error messages, growing to gigabytes in size within a few minutes.

One of the error messages that keeps showing up says "custodian shard checker died":

    [notice] 2024-10-20T01:19:03.987934Z couchdb@127.0.0.1 <0.2139.0> -------- custodian shard checker died {badarg,[{erlang,list_to_integer,[[48,48,48,13,48,13,1,8],16],[{error_info,#{module => erl_erts_errors}}]},{mem3_util,'-build_shards_by_node/2-fun-0-',5,[{file,"src/mem3_util.erl"},{line,228}]},{lists,map,2,[{file,"lists.erl"},{line,1315}]},{lists,flatmap_1,2,[{file,"lists.erl"},{line,1335}]},{custodian_util,fold_dbs1,2,[{file,"src/custodian_util.erl"},{line,83}]},{couch_bt_engine,drop_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1161}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1151}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,1167}]}]}

Aside from that one, the remaining errors look similar to the ones at the beginning of the log.
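Decoding the character list from the badarg by hand, it looks like a mangled copy of what I guess should be a hex shard-range half such as "00000000" (a quick sketch; the byte values are copied verbatim from the log above):

```python
# Erlang prints character lists as lists of byte values.
corrupted = bytes([48, 48, 48, 13, 48, 13, 1, 8])
expected = b"00000000"  # guessed: first half of a shard range like "00000000-7fffffff"

print(corrupted)  # b'000\r0\r\x01\x08'

# list_to_integer(..., 16) raises badarg on the non-hex bytes:
diff = [i for i in range(8) if corrupted[i] != expected[i]]
print(diff)  # [3, 5, 6, 7]
```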

Would appreciate some help here, as I cannot tell from these log files what is actually going wrong.
Comment 1 Erik Nordstrøm 2024-10-20 01:47:26 UTC
Created attachment 254379 [details]
/usr/local/etc/couchdb3/local.ini config file

Attaching full `/usr/local/etc/couchdb3/local.ini` config file.

(Logging to file is commented out again here, because I don't want to fill my disk with gigabytes of log files when I am not actively looking at the logs.)
Comment 2 Dave Cottlehuber freebsd_committer freebsd_triage 2024-10-22 10:49:29 UTC
thanks Erik, that's indeed odd; I can replicate it here too. Can you attach
your vm.args file as well, please? Redact the cookie, of course.
Comment 3 Erik Nordstrøm 2024-10-23 23:24:09 UTC
Created attachment 254470 [details]
/usr/local/etc/couchdb3/vm.args

Here is the vm.args file from within the jail. I haven't made any edits to it, so its contents are whatever was put there by the pkg install or by CouchDB itself.
Comment 4 Erik Nordstrøm 2024-10-23 23:26:51 UTC
I am wondering, however, whether the default 127.0.0.1 values from that vm.args file could be causing problems, since, as mentioned, I configured CouchDB to listen on the IPv6 address `::`. Could that be the case?
Comment 5 Erik Nordstrøm 2024-10-30 23:03:30 UTC
Hi again Dave, do you have any tips or advice for steps that I could take to pinpoint what is causing this problem?
Comment 6 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-13 22:34:26 UTC
thanks Erik

I can replicate it here on 3.4.1, and also on the next release, 3.4.2.

- it's nothing to do with jails
- if built from source (within the same jail), this issue doesn't occur

Next stage is a tiresome diff of all configs and files to see what's
different or missing.
Comment 7 benoitc 2024-11-15 14:57:11 UTC
I managed to reproduce it even with a fresh port built from source. I will try to reproduce from a fresh source build as well.
Comment 9 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-16 01:30:16 UTC
Erik, I think something is going wrong inside snappy, please test with:

```
; add to /usr/local/etc/couchdb3/local.ini
[couchdb]
file_compression = snappy
```

and report back; this fixes it for me.

Benoit, please share more details of how you built this; I didn't get a
reproduction from a plain source build, so I assume a C-related setting
used in the ports tree, but not in the production build, is responsible.

My poudriere build is https://pkg.skunkwerks.at/poudriere/data/14_1_x64-default/2024-11-14_11h51m41s/logs/couchdb3-3.4.2.log

reasoning:

shards doc looks good going in at creation time:

_db_create:create_shard_files/2 --> ok

0:15:47.446410 <0.2049.0> fabric_db_create:create_shard_db_doc({doc,<<"_users">>,
     {0,[]},
     {[{<<"shard_suffix">>,".1731716147"},
       {<<"changelog">>,
        [[<<"add">>,<<"00000000-7fffffff">>,<<"couchdb@127.0.0.1">>],
         [<<"add">>,<<"80000000-ffffffff">>,<<"couchdb@127.0.0.1">>]]},
       {<<"by_node">>,
        {[{<<"couchdb@127.0.0.1">>,
           [<<"00000000-7fffffff">>,<<"80000000-ffffffff">>]}]}},
       {<<"by_range">>,
        {[{<<"00000000-7fffffff">>,[<<"couchdb@127.0.0.1">>]},
          {<<"80000000-ffffffff">>,[<<"couchdb@127.0.0.1">>]}]}},
       {<<"props">>,{[]}}]},
     [],false,[]})


and then comes out as garbage:

0:15:47.485786 <0.334.0> mem3_shards:changes_callback({change,{[{<<"seq">>,1},
          {<<"id">>,<<"_users">>},
          {<<"changes">>,
           [{[{<<"rev">>,<<"1-39ee832b1fb208136e77da705cca37c3">>}]}]},
          {doc,{[{<<"_id">>,<<"_users">>},
                 {<<"_rev">>,<<"1-39ee832b1fb208136e77da705cca37c3">>},
                 {<<"shard_suffix">>,".1731716147"},
                 {<<"changelog">>,
                  [[<<"add">>,
                    <<48,48,48,13,48,13,1,8,45,55,102,102,102,9,102,9,1>>,
                    <<"couchdb@127.0.0.1">>],
                   [<<"add">>,
                    <<56,48,48,13,48,13,1,8,45,102,102,9,102,9,1,102,1>>,
                    <<"couchdb@127.0.0.1">>]]},
                 {<<"by_node">>,
                  {[{<<"couchdb@127.0.0.1">>,
                     [<<48,48,48,13,48,13,1,8,45,55,102,102,102,9,102,9,1>>,
                      <<56,48,48,13,48,13,1,8,45,102,102,9,102,9,1,102,1>>]}]}},
                 {<<"by_range">>,
                  {[{<<48,48,13,48,13,1,8,8,45,55,102,102,102,9,102,9,1>>,
                     [<<"couchdb@127.0.0.1">>]},
                    {<<56,48,48,13,48,13,1,8,45,102,102,9,102,9,1,102,1>>,
                     [<<"couchdb@127.0.0.1">>]}]}},
                 {<<"props">>,{[]}}]}}]},
        <<>>}, "continuous")
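comparing one garbled range against what was written at creation time shows the corruption is confined to a handful of byte positions (a sketch; byte values copied from the output above):

```python
# shard range as it came back out of the changes feed (garbled):
corrupted = bytes([48, 48, 48, 13, 48, 13, 1, 8, 45, 55,
                   102, 102, 102, 9, 102, 9, 1])
# shard range as written at creation time:
expected = b"00000000-7fffffff"

assert len(corrupted) == len(expected) == 17  # same length, so no bytes lost

diff = [i for i in range(17) if corrupted[i] != expected[i]]
print(diff)       # [3, 5, 6, 7, 13, 15, 16]
print(corrupted)  # b'000\r0\r\x01\x08-7fff\tf\t\x01'
```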


looking at snappy next:

0:44:49.399321 <0.582.0> snappy:decompress/1 --> {ok,<<131,104,1,108,0,0,0,5,104,2,109,0,0,0,12,115,104,97,114,100,95,115,117,
      102,102,105,120,107,0,11,46,49,55,51,49,55,49,55,56,56,57,104,2,109,0,0,
      0,9,99,104,97,110,103,101,108,111,103,108,0,0,0,2,108,0,0,0,3,109,0,0,0,
      3,97,100,100,109,0,0,0,17,48,48,48,13,48,13,1,8,45,55,102,102,102,9,102,
      9,1,109,0,0,0,17,99,111,117,99,104,100,98,64,49,50,55,46,48,46,48,46,49,
      106,108,0,0,0,3,109,0,0,0,3,97,100,100,109,0,0,0,17,56,48,48,13,48,13,1,
      8,45,102,102,9,102,9,1,102,1,109,0,0,0,17,99,111,117,99,104,100,98,64,
      49,50,55,46,48,46,48,46,49,106,106,104,2,109,0,0,0,7,98,121,95,110,111,
      100,101,104,1,108,0,0,0,1,104,2,109,0,0,0,17,99,111,117,99,104,100,98,
      64,49,50,55,46,48,46,48,46,49,108,0,0,0,2,109,0,0,0,17,48,48,48,13,48, <---- here 
      13,1,8,45,55,102,102,102,9,102,9,1,109,0,0,0,17,56,48,48,13,48,13,1,8,
      45,102,102,9,102,9,1,102,1,106,106,104,2,109,0,0,0,8,98,121,95,114,97,
      110,103,101,104,1,108,0,0,0,2,104,2,109,0,0,0,17,48,48,13,48,13,1,8,8,
      45,55,102,102,102,9,102,9,1,108,0,0,0,1,109,0,0,0,17,99,111,117,99,104,
      100,98,64,49,50,55,46,48,46,48,46,49,106,104,2,109,0,0,0,17,56,48,48,13,  <---- here
      48,13,1,8,45,102,102,9,102,9,1,102,1,108,0,0,0,1,109,0,0,0,17,99,111,
      117,99,104,100,98,64,49,50,55,46,48,46,48,46,49,106,106,104,2,109,0,0,0,
      5,112,114,111,112,115,104,1,106,106>>}

0:44:49.400392 <0.582.0> mem3_util:build_shards_by_node(<<"_users">>, [{<<"shard_suffix">>,".1731717889"},
 {<<"changelog">>,
  [[<<"add">>,
    <<48,48,48,13,48,13,1,8,45,55,102,102,102,9,102,9,1>>, <---- now visible here 
    <<"couchdb@127.0.0.1">>], ...
Comment 10 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-16 01:35:29 UTC
You can also try this approach:

- ensure `file_compression=snappy` again

Now, building a couchdb release normally and copying `rel/couchdb/lib/snappy-1.0.5/`
over the ports-built version also fixes the issue for me:

$ ./configure --skip-deps --user couchdb --rebar /usr/local/bin/rebar --spidermonkey-version 91
$ gmake release
$ cp -av rel/couchdb/lib/snappy-1.0.5/ /usr/local/libexec/couchdb3/lib/snappy-1.0.5/

so I assume some build-time CFLAGS or similar setting is causing problems for snappy.
Comment 11 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-16 15:45:27 UTC
looks like problems in snappy

1> S = fun(Doc) ->
	{ok, C} = snappy:compress(erlang:term_to_binary(Doc)),
 	true = snappy:is_valid(C),
	{ok, U} = snappy:decompress(C),
	New = erlang:binary_to_term(U),
	New == Doc
	end.

2> S([<<"00000000-7fffffff">>]).
false
3> S([<<"7fffffff">>]).
true
4> S([<<"00000000">>]).
true
5> S([<<"00000000-">>]).
true
6> S([<<"00000000-7">>]).
true
7> S([<<"00000000-7f">>]).
true
8> S([<<"00000000-7ff">>]).
true
9> S([<<"00000000-7fff">>]).
true
10> S([<<"00000000-7ffff">>]).
true
11> S([<<"00000000-7fffff">>]).
true
12> S([<<"00000000-7ffffff">>]).
true
13> S([<<"00000000-7fffffff">>]).
false
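The failure starting exactly at 17-character binaries suggests a size threshold in the compressed input. A rough hand-computed size of what `term_to_binary/1` produces for these terms (a sketch based on the documented external term format layout; I have not cross-checked these counts in Erlang itself):

```python
def etf_size(payload_len: int) -> int:
    """Hand-computed size of term_to_binary([<<Payload>>]) in bytes.

    External term format layout for a one-element list of one binary:
      131 version tag
      108 LIST_EXT + 4-byte element count
      109 BINARY_EXT + 4-byte length + payload
      106 NIL_EXT (proper-list tail)
    """
    return 1 + (1 + 4) + (1 + 4 + payload_len) + 1

print(etf_size(16))  # 28 bytes: round-trips fine above
print(etf_size(17))  # 29 bytes: round-trip corrupts
```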
Comment 12 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-17 19:25:38 UTC
looks like `-O3` (used in the ports tree build, but not in the source tree build)
causes this behaviour, which probably means it's a bug in snappy itself.

If you run couchdb interactively[1], you can test this:

S = fun(Doc) ->
	{ok, C} = snappy:compress(erlang:term_to_binary(Doc)),
 	true = snappy:is_valid(C),
	{ok, U} = snappy:decompress(C),
	New = erlang:binary_to_term(U),
	New == Doc
	end.

S([<<"00000000-7fffffff">>]).
false
S([<<"zzzzzzzz-7fffffff">>]).
false
S([<<"00000000-7ffffff">>]).
true

I have two choices going forward:

- build the port with CXXFLAGS to include `-O0` (see attached diff)
- update snappy to 1.1.8 or newer, where this issue has been fixed

The former is preferable for me, as bumping snappy should be done
within the CouchDB project as a whole, rather than in a single
downstream project.
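For illustration, the rebar override I have in mind looks roughly like this (a sketch only; the actual patch in the ports tree may differ in form):

```erlang
%% hypothetical addition to src/snappy/rebar.config:
%% force the C++ NIF to build without optimisation, dodging the
%% clang -O3 misbehaviour shown above
{port_env, [
    {"CXXFLAGS", "$CXXFLAGS -O0"}
]}.
```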
Comment 13 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-17 19:26:38 UTC
See https://git.sr.ht/~dch/ports/commit/235bd5f2bafa01bea2596ca685bf6d2a10522d72 for
a patch to test; it also bumps couchdb to 3.4.2 and removes the runtime dependency
on the lang/erlang* packages, shipping a full couchdb release instead.
Comment 14 commit-hook freebsd_committer freebsd_triage 2024-11-22 17:28:31 UTC
A commit in branch main references this bug:

URL: https://cgit.FreeBSD.org/ports/commit/?id=6924dfdf7ea3b4b511c7f987727cb6933175765c

commit 6924dfdf7ea3b4b511c7f987727cb6933175765c
Author:     Dave Cottlehuber <dch@FreeBSD.org>
AuthorDate: 2024-11-22 15:02:32 +0000
Commit:     Dave Cottlehuber <dch@FreeBSD.org>
CommitDate: 2024-11-22 17:27:52 +0000

    databases/couchdb3: update to 3.4.2, resolve PR282217

    - include QuickJS engine
    - support xxHash for data integrity checksums
    - improved mango keys-only indexes
    - https://blog.couchdb.org/2024/10/22/3-4-2/
    - https://docs.couchdb.org/en/stable/whatsnew/3.4.html

    - ensure snappy NIF is compiled without optimisations to avoid
        corrupted documents under newer clang versions, used in
        14.1-RELEASE (see PR282217 for details)

    Sponsored by:   SkunkWerks, GmbH

    PR:             282217

 UPDATING                                           |  18 +
 databases/couchdb3/Makefile                        |  34 +-
 databases/couchdb3/distinfo                        |   6 +-
 databases/couchdb3/files/couchdb3.in               |  12 +-
 databases/couchdb3/files/patch-configure           |  21 +-
 .../couchdb3/files/patch-rel_overlay_etc_vm.args   |   4 +-
 .../couchdb3/files/patch-rel_reltool.config (gone) |  10 -
 .../files/patch-src_snappy_rebar.config (new)      |  11 +
 databases/couchdb3/pkg-plist                       | 954 +++++++++++++++++----
 9 files changed, 849 insertions(+), 221 deletions(-)
Comment 15 Dave Cottlehuber freebsd_committer freebsd_triage 2024-11-22 17:33:08 UTC
Would appreciate some community feedback on this (binary packages should be available in 3 days on the latest package branch) before I push it back down to the quarterly branch.