Bug 242400 - databases/lmdb: cyrus-imapd30 database locking issues
Status: Closed Overcome By Events
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Xin LI
URL:
Keywords: regression
Depends on:
Blocks:
 
Reported: 2019-12-03 11:27 UTC by Volodymyr Kostyrko
Modified: 2019-12-19 20:26 UTC (History)

See Also:
bugzilla: maintainer-feedback? (delphij)


Description Volodymyr Kostyrko 2019-12-03 11:27:36 UTC
I'll try to summarize what I've been facing since the last lmdb update:

Dec  3 13:07:45 limbo httpd[65465]: unable to tell master 1: Broken pipe
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: skiplist: clean shutdown file missing, updating recovery stamp
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: recovering cyrus databases
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: cyrusdb_lmdb(/var/imap/quotas.db): No such file or directory
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: cyrusdb_lmdb(/var/imap/quotas.db): No such file or directory
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: DBERROR: opening /var/imap/quotas.db: cyrusdb error
Dec  3 13:08:33 limbo ctl_cyrusdb[76556]: cyrusdb_lmdb: closing stray database /var/imap/mailboxes.db

This has been happening to me for some time now. I checked the databases and nagged the Cyrus people on IRC, but without any results. The problem seems to be somewhere around LMDB usage. During the night, when backups start, the database gets locked and the lock is never released. All processes spawned by master die; the only ones that stay alive are 'master' and 'idled'. Starting new processes gives a "file not found" error. The file is indeed in place, it's just not detected or not properly locked...

Also, while those stray processes are still hanging around, restarting 'master' alone doesn't make everything work; I also need to kill all of the stray processes.

Currently I have only LMDB to blame, so I reverted to the previous version. Moving the quota db off LMDB also works, but that's kind of funny because the whole quota db is only about 8k.

I'll try to update if I find something else.
Comment 1 Volodymyr Kostyrko 2019-12-09 15:57:19 UTC
Some updates:

Rolling back one or two changes didn't fix anything; it even made things worse:

Dec  9 12:04:08 limbo kernel: Failed to fully fault in a core file segment at VA 0x802947000 with size 0x6000 to be written at offset 0xf8e000 for process lmtpd
Dec  9 12:04:08 limbo kernel: Failed to fully fault in a core file segment at VA 0x802a00000 with size 0x20000000 to be written at offset 0x102c000 for process lmtpd
Dec  9 12:04:08 limbo kernel: Failed to fully fault in a core file segment at VA 0x822e00000 with size 0x20000000 to be written at offset 0x213f6000 for process lmtpd
Dec  9 12:04:09 limbo kernel: Failed to fully fault in a core file segment at VA 0x843800000 with size 0x20000000 to be written at offset 0x41d38000 for process lmtpd
Dec  9 12:04:09 limbo kernel: Failed to fully fault in a core file segment at VA 0x863c00000 with size 0x20000000 to be written at offset 0x6207a000 for process lmtpd
Dec  9 12:04:09 limbo kernel: Failed to fully fault in a core file segment at VA 0x883c00000 with size 0x308000 to be written at offset 0x8207a000 for process lmtpd
Dec  9 12:04:09 limbo kernel: Failed to fully fault in a core file segment at VA 0x884200000 with size 0x20000000 to be written at offset 0x82382000 for process lmtpd
Dec  9 12:04:09 limbo kernel: Failed to fully fault in a core file segment at VA 0x8a4200000 with size 0x94a4000 to be written at offset 0xa2382000 for process lmtpd
Dec  9 12:04:09 limbo kernel: pid 81192 (lmtpd), jid 0, uid 60: exited on signal 6 (core dumped)

It looks like the lmdb databases were damaged, and instead of reporting errors when trying to open them, the services were crashing. Probably a single problem in one service caused the initial database damage, and after that the services kept failing whenever they tried to work with that file.

I'll continue comparing lmdb and non-lmdb setups to make sure lmdb really was the problem, just in case someone else reports the same issue.
Comment 2 Volodymyr Kostyrko 2019-12-12 16:21:40 UTC
Okay, I managed to separate the database corruption from the lmdb issues. The problem is the latest update, probably the part about robust mutexes. Everything works flawlessly on verified databases up to that revision, but once I update to it I start getting db corruption:

Dec 12 15:30:10 limbo imaps[2968]: cyrusdb_lmdb(/var/imap/user/a/arcade.conversations): Invalid argument

Soon the database becomes severely corrupted and cyrus starts crashing.
Comment 3 Volodymyr Kostyrko 2019-12-19 20:26:58 UTC
I guess this can be closed, as cyrus has actually dropped lmdb support upstream.