Bug 248804 - sysutils/zfstools: Doesn't clean up old snapshots that were in use
Summary: sysutils/zfstools: Doesn't clean up old snapshots that were in use
Status: New
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Individual Port(s)
Version: Latest
Hardware: Any
OS: Any
Importance: --- Affects Many People
Assignee: Bryan Drewery
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-08-21 08:35 UTC by Danny McGrath
Modified: 2021-03-29 19:23 UTC
CC List: 0 users

See Also:
bugzilla: maintainer-feedback? (bdrewery)


Description Danny McGrath 2020-08-21 08:35:31 UTC
Hi,

This is a bit of an old bug that I've been dealing with for quite some time, and I am not 100% sure on the cause, but have some theories. I am using zfstools 0.3.6_1.

The bug is that old snapshots that were locked against deletion (in use) during backups, I presume, prevent the script from cleaning them up. The problem is that the script never goes back over the snapshots it missed to clean them up later. The result is that you are left with a pile of snapshots lying around, in some cases from years ago, for something like an hourly snapshot.

eg: data/ezjail/my.jail.name.org@zfs-auto-snap_hourly-2019-08-02-08h00
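
For reference, a leftover like the one above can be inspected with plain zfs(8) commands to see whether it is still pinned by a hold. A minimal sketch (the dataset name is just the example from above; nothing here is zfstools-specific):

  # Show any user holds that would block deletion of the stale snapshot.
  zfs holds data/ezjail/my.jail.name.org@zfs-auto-snap_hourly-2019-08-02-08h00

  # A plain destroy fails while a hold (or a dependent clone) exists; a
  # deferred destroy (-d) marks the snapshot for removal once the last
  # hold or clone goes away.
  zfs destroy -d data/ezjail/my.jail.name.org@zfs-auto-snap_hourly-2019-08-02-08h00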

IMHO, this tool should be considered to "own" the entire "zfs-auto-snap_$timestamp" namespace, so there would be no harm in it going over the whole history to clean up older snapshots that stick around longer than intended.

Any ideas whether this is a bug, or something that could be considered for implementation? I know it wouldn't be the first time that our systems ran low on space because of this issue.

Thanks!
Comment 1 Danny McGrath 2020-09-15 08:34:47 UTC
Anyone there?
Comment 2 Danny McGrath 2021-03-29 17:19:06 UTC
Still bottoming out our storage because of this bug. Is there any progress on it yet?

Thanks
Comment 3 Bryan Drewery 2021-03-29 17:58:34 UTC
This is normal operation.
- It does own the entire namespace.
- There's no per-snapshot metadata as far as zfstools is concerned. It doesn't mark a snapshot as having been checked or ignored or anything like that.
- It doesn't keep a time period of snapshots per type, it keeps a number of snapshots per type. The frequency is defined by how often that type is snapshotted from crontab.
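
For illustration, the usual setup is a set of /etc/crontab entries along these lines (paths, schedule, and keep counts below are only example values, not the port's exact recommendation; the last argument is the number of snapshots of that type to keep):

  15,30,45 * * * * root /usr/local/sbin/zfs-auto-snapshot frequent  4
  0        * * * * root /usr/local/sbin/zfs-auto-snapshot hourly   24
  7        0 * * * root /usr/local/sbin/zfs-auto-snapshot daily     7
  14       0 * * 7 root /usr/local/sbin/zfs-auto-snapshot weekly    4
  28       0 1 * * root /usr/local/sbin/zfs-auto-snapshot monthly  12

The keep count is a number of snapshots, not an age, which is why the retention window in wall-clock terms depends entirely on how often cron fires.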

I think what you're asking for is a feature to limit a type by date range. Imagine having a dataset that you have not modified in 3 years, and it has 10 snapshots on it. Should a hypothetical limit of 1 year modify this dataset's snapshots? Should it delete all of them, keep the last, or something else? If it deletes all of them, then it has just killed every snapshot on the dataset, which seems surprising to me. If we go look at that dataset and see no snapshots, it would seem the tool is not working. Likewise, if it kept an arbitrary number of them, it would appear not to be keeping up with its intended number-per-type limit.

I think this is something you'll need to implement manually in a separate crontab.
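
As a rough, untested sketch of what such a job could look like (the script name, type, and keep count here are made up for the example; it only assumes the standard zfs-auto-snap_<type>-<timestamp> naming):

  #!/bin/sh
  # prune-stale-auto-snapshots.sh (hypothetical): destroy everything beyond
  # the N newest zfs-auto-snap_<type> snapshots on each dataset.
  TYPE="daily"   # hourly, daily, weekly, monthly, ...
  KEEP=7         # how many snapshots of this type to keep per dataset

  zfs list -H -o name -t filesystem,volume | while read -r ds; do
      # Snapshots of this dataset only (-d 1), of the given type,
      # newest first; skip the first $KEEP and destroy the rest.
      zfs list -H -o name -t snapshot -S creation -d 1 "$ds" |
          grep "@zfs-auto-snap_${TYPE}-" |
          tail -n "+$((KEEP + 1))" |
          while read -r snap; do
              echo "destroying $snap"
              zfs destroy -d "$snap"   # -d defers removal if the snapshot is still held
          done
  done

Since it keys purely off the snapshot names, a job like that would also pick up the ancient leftovers described in the report.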
Comment 4 Danny McGrath 2021-03-29 19:23:04 UTC
Hi,

Thanks for the quick reply!

Indeed, I can see some of the problem you mention with regard to keeping track of the number of snapshots for an older, stagnant system.

> It doesn't keep a time period of snapshots per type, it keeps a number of snapshots per type. The frequency is defined by how often that type is snapshotted from crontab.

By this definition it should be able to determine that there are snapshots that missed their pruning. If I told it to keep the last 10 snapshots for a type, why would it leave anything beyond the last 10? Admittedly, there is some ambiguity in what I said as to whether that 10 should or shouldn't include snapshots past the calculated end date.

> I think what you're asking for is a feature to limit a type by date range.

Not quite. What I am asking for is for it to properly clean up the snapshots it missed because they were in use, so that we don't have terabytes of space permanently tied up in snapshots from weeks, months, and years ago that escaped garbage collection because they were temporarily held in use by the system, for whatever reason.

>Imagine having a dataset that you have not modified in 3 years, and it has 10 snapshots on it.

But if I had a dataset that was being actively snapshotted by zfstools, it would have, for example, the last 12 monthly snapshots that I told it to keep, which would all be identical because the dataset hasn't changed.

> Should a hypothetical limit of 1 year modify this dataset's snapshots? Should it delete all of them, keep the last, or something else?

It should do what it does when nothing gets stuck: delete the tail-end snapshots and create a new one, IMHO. Perhaps I need to read the manual extremely carefully and pay attention to the exact wording here.

> If it deletes all then it just killed all snapshots which seems surprising to me.

Why would it delete all snapshots, though? It should only delete @zfs-auto-snap_$TYPE-$TIMESTAMP snapshots that fall outside the deterministic list of valid snapshots, where "valid" is defined as the N most recent, consecutive snapshots.

eg: I define a max of 7 daily snapshots on a dataset. My belief is that zfstools should calculate which 7 snapshots those should be, say March 10th..17th. From there it should consider anything outside that range as cruft to be pruned. So a snapshot from March 1st should be considered an anomaly, provided it is in the format "@zfs-auto-snap_daily-$TIMESTAMP".

> I think this is something you'll need to implement manually in a separate crontab.

Is there an existing function or command I can call that already does this cleanup? I was thinking about implementing something like that myself, but wanted to check first whether this was an actual bug, or whether I was missing something.