Bug 178388 - [zfs] [patch] allow up to 8MB recordsize
Summary: [zfs] [patch] allow up to 8MB recordsize
Status: In Progress
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 9.1-STABLE
Hardware: Any
OS: Any
Priority: Normal
Severity: Affects Only Me
Assignee: freebsd-fs mailing list
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-05-07 10:40 UTC by nowak
Modified: 2017-08-29 14:08 UTC
CC List: 1 user

See Also:


Attachments
file.diff (19.50 KB, patch)
2013-05-07 10:40 UTC, nowak

Description nowak 2013-05-07 10:40:00 UTC
Currently the ZFS recordsize is limited to 128k. This is very small for large raidz arrays.

The attached patch increases the recordsize limit to 8M while keeping the default recordsize at 128k for compatibility with other systems.

After applying the patch, ZFS datasets will remain compatible with non-patched systems as long as the recordsize is not increased (with zfs set recordsize=x) over 128k. The change only affects file data; dataset and pool metadata always remain compatible. Going back is also possible by setting the recordsize to 128k or below and then deleting all files created with the increased recordsize, or by destroying the entire dataset. Recreating the pool is not necessary.
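
For illustration, the administrator-facing workflow would look roughly like this (tank/data is a hypothetical dataset name; sizes above 128k require the patched kernel):

    # raise the record size for one dataset (patched systems only)
    zfs set recordsize=8M tank/data

    # revert to a compatible size before taking the pool back to a
    # stock system (existing large-block files must still be removed)
    zfs set recordsize=128K tank/data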

Possible issues:
1) booting is only supported up to a 128k recordsize (boot stages and loader); the ZFS pool can, however, be shared with increased-recordsize datasets as long as / or /boot remains at or below a 128k recordsize,
2) accessing increased-recordsize datasets on unpatched systems will likely cause kernel panics; a feature flag should probably be introduced for this to prevent pool import.

Fix: Patch attached with submission follows:
Comment 1 Matthew Rezny 2013-05-07 15:11:50 UTC
The proposed patch is rather ugly. Is there some reason to not simply
change the definition of SPA_MAXBLOCKSIZE?

The point of defining a constant is that it can then be changed in the
place it's defined rather than in every place it's used. Having to go
change every reference to it is error-prone, as missing a single
reference could wreak havoc.
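
For reference, the stock definitions in sys/spa.h are (if memory
serves) just a pair of shift/size macros, so the naive change would be
a one-liner:

    /* stock values in sys/spa.h (shown from memory, for reference) */
    #define SPA_MINBLOCKSHIFT   9
    #define SPA_MAXBLOCKSHIFT   17
    #define SPA_MINBLOCKSIZE    (1ULL << SPA_MINBLOCKSHIFT)  /* 512 bytes */
    #define SPA_MAXBLOCKSIZE    (1ULL << SPA_MAXBLOCKSHIFT)  /* 128k */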

Specifically, I call into question the effect this has on the
definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was not
replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is insufficiently
sized to represent all the possible block sizes that could be used.
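
As best I recall, SPA_BLOCKSIZES is derived from the shift constants,
counting the power-of-two sizes between the minimum and maximum, so
anything dimensioned by it would come up several slots short of the
patched maximum:

    /* recalled stock definition: 17 - 9 + 1 = 9 sizes (512b..128k) */
    #define SPA_BLOCKSIZES  (SPA_MAXBLOCKSHIFT - SPA_MINBLOCKSHIFT + 1)
    /* an 8M (2^23) maximum would need 23 - 9 + 1 = 15 */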

That one jumped out at me when I skimmed over the patch. I have not
reviewed all the ZFS code to look for other unchanged references that
are not part of the patch context.
Comment 2 nowak 2013-05-07 17:46:05 UTC
On 2013-05-07 16:11, Matthew Rezny wrote:
> The proposed patch is rather ugly. Is there some reason to not simply
> change the definition of SPA_MAXBLOCKSIZE?

Yes. Altering the value of SPA_MAXBLOCKSIZE will change the sizes of
certain metadata objects, which will break compatibility with
non-patched systems. Just importing the pool on a system with a
modified SPA_MAXBLOCKSIZE would result in the pool being inaccessible
on non-patched systems - forever. It would also prevent booting from
ZFS pools, as there is not enough memory available in the bootloader to
support large block sizes for metadata or the loader files.
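
A hypothetical but representative pattern (not a quote from the tree)
of the kind of use that makes this dangerous - metadata objects whose
block size is taken straight from the constant at creation time:

    /*
     * Hypothetical sketch: a metadata object created with the maximum
     * block size. With a bumped SPA_MAXBLOCKSIZE such objects would be
     * written with blocks that unpatched systems and the boot code
     * refuse to read.
     */
    dmu_object_set_blocksize(os, object, SPA_MAXBLOCKSIZE, 0, tx);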

> The point of defining a constant is that it can then be changed in
> the place it's defined rather than in every place it's used. Having
> to go change every reference to it is error-prone, as missing a
> single reference could wreak havoc.

SPA_MAXBLOCKSIZE is used for far more than just a limit - in many
places it is used as a default block size. I'm introducing
SPA_BIGBLOCKSIZE because of the above compatibility problems and using
it only in the places that are essential to supporting large block
sizes for file or zvol data, leaving default block sizes unmodified
(especially for pool metadata). The changed block size is only in
effect when the recordsize dataset property is modified by explicit
action of the administrator. Existing datasets, and new datasets
created post-patch, default to the backwards-compatible 128k block
size.

SPA_BIGBLOCKSIZE is used for asserts on the size of read/written
blocks, the ARC cache, recordsize property bounds checks, and block
size calculation logic.
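
As a concrete example of the bounds-check category, the recordsize
property validation ends up testing against the new constant -
paraphrased, not verbatim:

    /* paraphrased sketch of the recordsize bounds check after the patch */
    if (size < SPA_MINBLOCKSIZE || size > SPA_BIGBLOCKSIZE || !ISP2(size))
            return (EINVAL);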

The names of the constants could probably be changed:
current SPA_MAXBLOCKSIZE to SPA_DEFAULTBLOCKSIZE
and the new SPA_BIGBLOCKSIZE to SPA_MAXBLOCKSIZE.
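
In header terms the renaming would amount to something like this (a
sketch of the proposal, not the patch as posted):

    /* sketch of the proposed naming, not the posted patch */
    #define SPA_DEFAULTBLOCKSHIFT  17  /* 128k, on-disk compatible default */
    #define SPA_MAXBLOCKSHIFT      23  /* 8M, patched upper limit */
    #define SPA_DEFAULTBLOCKSIZE   (1ULL << SPA_DEFAULTBLOCKSHIFT)
    #define SPA_MAXBLOCKSIZE       (1ULL << SPA_MAXBLOCKSHIFT)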

> Specifically, I call into question the effect this has on the
> definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was not
> replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is insufficiently
> sized to represent all the possible block sizes that could be used.

The SPA_BLOCKSIZES define is never used in the code and should probably 
be removed.

> That one jumped out at me when I skimmed over the patch. I have not
> reviewed all the ZFS code to look for other unchanged references that
> are not part of the patch context.

Keep in mind that I have been using this for two months now on 3
systems and 5 zpools, with a total of over 50TB of data written
post-patch at varying record sizes (128k, 1MB, 4MB, 8MB). All systems
boot directly from the big pools using the unmodified (128k-limited)
bootloader.
Comment 3 Matthew Rezny 2013-05-08 12:09:44 UTC
On Tue, 07 May 2013 18:46:05 +0200
Adam Nowacki <nowak@tepeserwery.pl> wrote:

> On 2013-05-07 16:11, Matthew Rezny wrote:
> > The proposed patch is rather ugly. Is there some reason to not
> > simply change the definition of SPA_MAXBLOCKSIZE?
> 
> Yes. Altering the value of SPA_MAXBLOCKSIZE will change the sizes of
> certain metadata objects, which will break compatibility with
> non-patched systems. Just importing the pool on a system with a
> modified SPA_MAXBLOCKSIZE would result in the pool being inaccessible
> on non-patched systems - forever. It would also prevent booting from
> ZFS pools, as there is not enough memory available in the bootloader
> to support large block sizes for metadata or the loader files.
> 
That is understandable and something I had thought about but had not
verified.

> > The point of defining a constant is that it can then be changed in
> > the place it's defined rather than in every place it's used. Having
> > to go change every reference to it is error-prone, as missing a
> > single reference could wreak havoc.
> 
> SPA_MAXBLOCKSIZE is used for far more than just a limit - in many
> places it is used as a default block size. I'm introducing
> SPA_BIGBLOCKSIZE because of the above compatibility problems and
> using it only in the places that are essential to supporting large
> block sizes for file or zvol data, leaving default block sizes
> unmodified (especially for pool metadata). The changed block size is
> only in effect when the recordsize dataset property is modified by
> explicit action of the administrator. Existing datasets, and new
> datasets created post-patch, default to the backwards-compatible 128k
> block size.
> 
> SPA_BIGBLOCKSIZE is used for asserts on the size of read/written
> blocks, the ARC cache, recordsize property bounds checks, and block
> size calculation logic.
> 
> The names of the constants could probably be changed:
> current SPA_MAXBLOCKSIZE to SPA_DEFAULTBLOCKSIZE
> and the new SPA_BIGBLOCKSIZE to SPA_MAXBLOCKSIZE.
>
Changing the value of SPA_MAXBLOCKSIZE while defining
SPA_DEFAULTBLOCKSIZE (or SPA_MAXCOMPATBLOCKSIZE, as I almost
suggested) with the prior value would be clearer in terms of both
naming and patch readability. I will venture to say that the number of
references to SPA_DEFAULTBLOCKSIZE will be fewer than the references
to SPA_MAXBLOCKSIZE.

> > Specifically, I call into question the effect this has on the
> > definition of SPA_BLOCKSIZES. The reference to SPA_MAXBLOCKSIZE was
> > not replaced by SPA_BIGBLOCKSIZE and thus SPA_BLOCKSIZES is
> > insufficiently sized to represent all the possible block sizes that
> > could be used.
> 
> The SPA_BLOCKSIZES define is never used in the code and should
> probably be removed.
> 
In that case, please kill it now.

> > That one jumped out at me when I skimmed over the patch. I have not
> > reviewed all the ZFS code to look for other unchanged references
> > that are not part of the patch context.
> 
> Keep in mind that I have been using this for two months now on 3
> systems and 5 zpools, with a total of over 50TB of data written
> post-patch at varying record sizes (128k, 1MB, 4MB, 8MB). All systems
> boot directly from the big pools using the unmodified (128k-limited)
> bootloader.
Thank you for your work in this area and the extensive testing you have
done. Do you have any performance data from these tests that you can
share? Do you have some reason for not going beyond an 8MB record size
(diminishing returns, etc.)?

I have put some thought into additional compression algorithms for ZFS,
e.g. LZMA, which would require a large record size to see significant
gains over the existing gzip support. High-strength compression on
large data chunks would be slow for frequently written data, but for an
archival filesystem, where data is written once and then read
periodically, it would be quite useful.
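
Such an archival dataset could already be expressed with existing knobs
plus the patch - a sketch (tank/archive is a hypothetical dataset;
gzip-9 stands in for the imagined LZMA support):

    # hypothetical archival dataset: large records, strong compression
    zfs create -o recordsize=8M -o compression=gzip-9 tank/archive
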
Comment 4 Mark Linimon freebsd_committer 2013-05-08 22:34:09 UTC
Responsible Changed
From-To: freebsd-bugs->freebsd-fs

Over to maintainer(s).
Comment 5 Steven Hartland freebsd_committer 2013-05-08 23:12:04 UTC
Seems interesting but it's really something that needs to be
reviewed and submitted upstream (illumos).
Comment 6 Fabian Keil 2017-08-29 14:08:07 UTC
Looks like this can be closed.

On pools with feature@large_blocks enabled, OpenZFS supports block
sizes up to 16MB "now" (if one sets vfs.zfs.max_recordsize accordingly
and ignores the dragon warning in the sysctl description).
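
A rough sketch of what that looks like on a recent FreeBSD (tank is a
hypothetical pool name; 16777216 bytes = 16MB):

    # raise the kernel cap (this is the sysctl with the dragon warning)
    sysctl vfs.zfs.max_recordsize=16777216

    # enable the pool feature, then use a large recordsize per dataset
    zpool set feature@large_blocks=enabled tank
    zfs set recordsize=16M tank/dataset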