Bug 254988 - make fetchindex should switch to use xz compressed format
Summary: make fetchindex should switch to use xz compressed format
Status: New
Alias: None
Product: Ports & Packages
Classification: Unclassified
Component: Package Infrastructure
Version: Latest
Hardware: Any Any
Importance: --- Affects Only Me
Assignee: Port Management Team
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-12 01:12 UTC by Xin LI
Modified: 2021-04-12 14:45 UTC
CC List: 2 users

See Also:


Attachments
Proposed patch for ports/Makefile after server have switched (1.17 KB, patch)
2021-04-12 01:12 UTC, Xin LI

Description Xin LI freebsd_committer 2021-04-12 01:12:21 UTC
Created attachment 224038
Proposed patch for ports/Makefile after server have switched

Currently, 'make fetchindex' fetches from https://www.FreeBSD.org/ports/INDEX-${OSREL}.bz2.

An xz-compressed file would be more than 20% smaller than today's INDEX-14.bz2 (1840916 vs 2331970 bytes).  With xz -9e, the output would be smaller still (1603536 vs 2331970 bytes, about 31% smaller).
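
As a quick check of the figures above, the relative savings can be recomputed from the quoted sizes. This is only a sketch of the arithmetic: the helper name percent_smaller is made up for illustration, the sizes are copied from the description and assumed to be bytes, and the plain "xz" figure is assumed to come from default settings.

```python
# Sizes quoted in the description, assumed to be in bytes.
BZ2_SIZE = 2331970     # today's INDEX-14.bz2
XZ_SIZE = 1840916      # xz (assumed default settings)
XZ_9E_SIZE = 1603536   # xz -9e


def percent_smaller(new_size, old_size):
    """Return how much smaller new_size is than old_size, in percent."""
    return (old_size - new_size) / old_size * 100.0


if __name__ == "__main__":
    print(f"xz     vs bzip2: {percent_smaller(XZ_SIZE, BZ2_SIZE):.1f}% smaller")
    print(f"xz -9e vs bzip2: {percent_smaller(XZ_9E_SIZE, BZ2_SIZE):.1f}% smaller")
```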

Decompression of bzip2 is also significantly slower:

x bzcat
+ xzdec
+------------------------------------------------------------+
|+                                                           |
|+                                                           |
|+                                                 x  xx  x x|
|A                                                  |__MA__| |
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5          2.29          2.59          2.42         2.444    0.11545562
+   3          0.58          0.59          0.58    0.58333333  0.0057735027
Difference at 95.0% confidence
	-1.86067 +/- 0.168568
	-76.132% +/- 1.66315%
	(Student's t, pooled s = 0.094328)
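
The output above has the shape produced by FreeBSD's ministat(1). For readers who want to see where the "Difference" and "pooled s" lines come from, here is a minimal sketch that recomputes them from the summary rows. The function names are invented for this example, and the critical value 2.447 is the standard two-sided 95% Student's t value for 6 degrees of freedom, hard-coded rather than computed. The error bound on the percentage line is not reproduced here.

```python
import math

# Summary statistics copied from the table above.
N_BZCAT, AVG_BZCAT, SD_BZCAT = 5, 2.444, 0.11545562
N_XZDEC, AVG_XZDEC, SD_XZDEC = 3, 0.58333333, 0.0057735027

# Two-sided 95% Student's t critical value for 5 + 3 - 2 = 6 degrees of
# freedom, taken from a standard t table (hard-coded assumption).
T_CRIT_95_DF6 = 2.447


def pooled_stddev(n1, s1, n2, s2):
    """Pooled standard deviation of two samples."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def difference_ci(n1, avg1, s1, n2, avg2, s2, t_crit):
    """Return (difference, half_width, pooled_s) for avg2 - avg1."""
    pooled = pooled_stddev(n1, s1, n2, s2)
    half_width = t_crit * pooled * math.sqrt(1.0 / n1 + 1.0 / n2)
    return avg2 - avg1, half_width, pooled


if __name__ == "__main__":
    diff, half, pooled = difference_ci(
        N_BZCAT, AVG_BZCAT, SD_BZCAT,
        N_XZDEC, AVG_XZDEC, SD_XZDEC,
        T_CRIT_95_DF6,
    )
    print(f"difference: {diff:.5f} +/- {half:.6f}")
    print(f"relative:   {diff / AVG_BZCAT * 100:.3f}%")
    print(f"pooled s:   {pooled:.6f}")
```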

Note that this would increase the memory footprint of INDEX generation, but I think it's worth the cost based on the data above.
Comment 1 Adam Weinberger freebsd_committer 2021-04-12 02:24:17 UTC
This seems like a really simple and effective change. Memory footprint doesn't strike me as so big of an issue these days. Resource-constrained systems aren't as likely to be building their own ports on that system, and INDEX has no particular role on pkg-only systems.

We seem to be switching more and more things to zstd these days. Would there be any benefit to choosing zstd --ultra -22 over xz -9?
Comment 2 Xin LI freebsd_committer 2021-04-12 03:23:09 UTC
(In reply to Adam Weinberger from comment #1)
That's a good question.

I have done some quick experiments comparing xz -9e vs zstd --ultra -22:

xz -9e wins on the compression side: it is faster and generates smaller output (1603536 vs 1641687, or 2.3% smaller).

x xz-compress
+ zstd-compress
+------------------------------------------------------------+
| x                                                       +  |
| x                                                       +  |
|xx x                                                     + +|
||A|                                                     |A| |
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5         12.63          13.4         12.77         12.88    0.30116441
+   4         28.29         28.96         28.37       28.4975    0.31063108
Difference at 95.0% confidence
	15.6175 +/- 0.484288
	121.254% +/- 6.69091%
	(Student's t, pooled s = 0.305258)

On the decompression side, zstd is significantly faster:

x xz-decompress
+ zstd-decompress
+------------------------------------------------------------+
| +                                                         x|
| +                                                         x|
| +                                                         x|
| +    +                                                    x|
||MA_|                                                      A|
+------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5          0.18          0.18          0.18          0.18             0
+   5          0.05          0.06          0.05         0.052   0.004472136
Difference at 95.0% confidence
	-0.128 +/- 0.004612
	-71.1111% +/- 2.56222%
	(Student's t, pooled s = 0.00316228)

Note that because the file is small, the difference in decompression here is not really that big in real life.

I think either would be a fine solution compared to bzip2; personally, I'm inclined to use xz instead of zstd because:

1) xz has been bundled with the base system for 10 years, so we can expect it to be there, while zstd is only available in 12.0 and later.

We could, however, work around this, e.g. publish xz for 11.x and zstd for newer releases, or have fetchindex install zstd from ports.

2) xz would be faster at compression and generate a slightly smaller file.  For decompression, even though zstd would be more than 3x faster, the INDEX files are generally small and the difference would be small in the real world (about 0.2 second vs 0.05 second).
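
To put the two points above in numbers, the ratios can be recomputed from the averages in the tables earlier in this comment. This is a sketch of the arithmetic only: the constant and function names are invented here, and the timings are taken to be seconds, as the "0.2 second vs 0.05 second" wording suggests.

```python
# Averages copied from the ministat tables in this comment (assumed seconds).
XZ_COMPRESS, ZSTD_COMPRESS = 12.88, 28.4975
XZ_DECOMPRESS, ZSTD_DECOMPRESS = 0.18, 0.052


def speedup(slower, faster):
    """Return how many times faster the second timing is than the first."""
    return slower / faster


if __name__ == "__main__":
    # Compression: xz finishes first.
    print(f"xz compresses {speedup(ZSTD_COMPRESS, XZ_COMPRESS):.2f}x faster, "
          f"saving {ZSTD_COMPRESS - XZ_COMPRESS:.2f} s per run")
    # Decompression: zstd finishes first, but the absolute gap is tiny.
    print(f"zstd decompresses {speedup(XZ_DECOMPRESS, ZSTD_DECOMPRESS):.2f}x faster, "
          f"saving {XZ_DECOMPRESS - ZSTD_DECOMPRESS:.3f} s per run")
```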

BTW, it appears that -T helped neither xz nor zstd in a meaningful way, possibly because the INDEX-14 file was too small.  I haven't looked into the details, however.
Comment 3 Adam Weinberger freebsd_committer 2021-04-12 14:40:12 UTC
Makes sense. Thanks for the breakdown!

Do we have a feel for what tooling would need to be updated for this?
Comment 4 Baptiste Daroussin freebsd_committer 2021-04-12 14:45:39 UTC
I think we should keep the INDEX.bz2 that some tools rely on and introduce a new INDEX.xz, with a knob to be able to list the architecture we want to produce.

That way we will be able to deal with the cluster side of things more easily and handle the transition.
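
One way to picture the client side of such a transition is a fetcher that prefers the new format and falls back to the old one. The sketch below is hypothetical: only the .bz2 URL appears in this report, so the .xz URL is an assumption, and candidate_urls and decompress_index are names invented here. It uses Python's standard bz2 and lzma modules rather than the ports Makefile machinery, and it does no network access.

```python
import bz2
import lzma

BASE_URL = "https://www.FreeBSD.org/ports"  # from the description

# Supported formats mapped to their decompressors. The .xz file is
# hypothetical; only the .bz2 file is named in this report.
DECOMPRESSORS = {
    "xz": lzma.decompress,
    "bz2": bz2.decompress,
}


def candidate_urls(osrel, formats=("xz", "bz2")):
    """Return INDEX URLs in preference order: new format first, legacy last."""
    return [f"{BASE_URL}/INDEX-{osrel}.{ext}" for ext in formats]


def decompress_index(url, payload):
    """Decompress a fetched INDEX payload based on the URL's extension."""
    ext = url.rsplit(".", 1)[-1]
    if ext not in DECOMPRESSORS:
        raise ValueError(f"unsupported INDEX format: {ext}")
    return DECOMPRESSORS[ext](payload)


if __name__ == "__main__":
    sample = b"example-port-1.0|/usr/ports/misc/example|...\n"  # placeholder data
    for url in candidate_urls("14"):
        # Simulate the server's payload locally instead of fetching.
        packed = lzma.compress(sample) if url.endswith(".xz") else bz2.compress(sample)
        assert decompress_index(url, packed) == sample
        print("ok:", url)
```

A real client would try each URL in order and move on to the next one when the fetch fails, which lets the server publish both files during the transition.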