Bug 258623 - cxgbe(4): Slow routing performance: 2 numa domains vs single numa domain
Summary: cxgbe(4): Slow routing performance: 2 numa domains vs single numa domain
Status: Open
Alias: None
Product: Base System
Classification: Unclassified
Component: kern
Version: 13.0-STABLE
Hardware: amd64 Any
Importance: --- Affects Only Me
Assignee: freebsd-net (Nobody)
URL:
Keywords: needs-qa, performance
Depends on:
Blocks:
 
Reported: 2021-09-20 10:24 UTC by Konrad
Modified: 2021-09-27 11:52 UTC (History)
4 users

See Also:
koobs: maintainer-feedback? (np)
koobs: maintainer-feedback? (jhb)
koobs: mfc-stable13?


Attachments
lagg0_16Mpps.svg (73.62 KB, image/svg+xml)
2021-09-20 10:46 UTC, Kubilay Kocak
no flags Details
single_16Mpps.svg (59.94 KB, image/svg+xml)
2021-09-20 10:46 UTC, Kubilay Kocak
no flags Details
dmesg.boot (16.85 KB, text/plain)
2021-09-20 11:22 UTC, Konrad
no flags Details
/boot/loader.conf (838 bytes, text/plain)
2021-09-20 11:23 UTC, Konrad
no flags Details
custom kernel configuration (14.66 KB, text/plain)
2021-09-20 11:24 UTC, Konrad
no flags Details

Description Konrad 2021-09-20 10:24:43 UTC
Server: Dell R630, 2x CPU E5-2667 v4 - 2 NUMA domains, 64GB RAM
NIC: 2x T62100-SO-CR - each connected to a separate NUMA domain


* 2 NUMA domain test

I use chelsio_affinity to assign each IRQ to the correct CPU.
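
The equivalent pinning can also be done by hand with cpuset(1). As a hedged sketch (the IRQ numbers 98-105 for t6nex0 queues 0-7 are taken from the top(1) output later in this report; the `pin_cmd` helper is mine, not part of chelsio_affinity), this prints the commands that would bind each queue's IRQ to the matching CPU:

```shell
#!/bin/sh
# Print the cpuset(1) command that pins one cxgbe queue IRQ to a CPU.
# Assumption: t6nex0 queue N uses IRQ (98 + N), as in the top output below.
pin_cmd() {
    echo "cpuset -l $1 -x $((98 + $1))"
}

# Queues 0-7 live on NUMA domain 0 (CPUs 0-7) in this setup.
for q in 0 1 2 3 4 5 6 7; do
    pin_cmd "$q"
done
```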

cfg:

ifconfig_cc0="up"
ifconfig_cc1="up"
ifconfig_cc2="up"
ifconfig_cc3="up"


#LAGG LACP
ifconfig_lagg0="laggproto lacp laggport cc0 laggport cc2 -wol -vlanhwtso -tso -lro -hwrxtstmp -txtls use_flowid use_numa up"

ifconfig_vlan2020="vlan 2020 vlandev lagg0"
ifconfig_vlan2002="vlan 2002 vlandev lagg0"


+--------+         +--------+      +---------+
|        +---------+        +------+         |
| Router |  lagg0  | switch |      |  gen    |
|        +---------+        +------+         |
+--------+         +--------+      +---------+


I can achieve around 14Mpps without drops. Above this level, drops appear on the ccX/lagg0 interfaces. It looks like the CPUs still have some free resources:

# netstat -i -I lagg0 1
            input          lagg0           output
   packets  errs idrops      bytes    packets  errs      bytes colls
  15939431     0 555822 2246265134   15381955     0 2167675870     0
  16600413     0 612946 2339414686   15978803     0 2253137798     0
  15259699     0 575481 2150765886   14693013     0 2070319352     0
  15935269     0 512558 2245569909   15382551     0 2167518240     0
  16159627     0 616404 2277463695   15563046     0 2195364136     0
  14841125     0 322695 1605926868   14540305     0 1562096456     0
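
As a quick sanity check on the sample above (this is just my arithmetic on the first netstat line, not part of the report), the drop rate at ~16Mpps works out to roughly 3.5%:

```python
# Drop rate implied by the first netstat line above:
# 15,939,431 input packets, 555,822 idrops in one second.
input_pkts = 15_939_431
idrops = 555_822

drop_pct = 100 * idrops / input_pkts
print(f"{drop_pct:.1f}% of input packets dropped")  # ~3.5%
```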

# top -PSH
last pid:  9745;  load averages:  6.46,  2.02,  0.76    up 0+00:02:06  20:25:17
580 threads:   25 running, 471 sleeping, 84 waiting
CPU 0:   0.0% user,  0.0% nice,  0.0% system, 59.2% interrupt, 40.8% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 57.7% interrupt, 42.3% idle
CPU 2:   0.0% user,  0.0% nice,  0.0% system, 57.7% interrupt, 42.3% idle
CPU 3:   0.0% user,  0.0% nice,  0.0% system, 60.6% interrupt, 39.4% idle
CPU 4:   0.0% user,  0.0% nice,  0.0% system, 56.3% interrupt, 43.7% idle
CPU 5:   0.0% user,  0.0% nice,  0.0% system, 62.0% interrupt, 38.0% idle
CPU 6:   0.0% user,  0.0% nice,  0.0% system, 59.2% interrupt, 40.8% idle
CPU 7:   0.0% user,  0.0% nice,  0.0% system, 53.5% interrupt, 46.5% idle
CPU 8:   0.0% user,  0.0% nice,  1.4% system, 62.0% interrupt, 36.6% idle
CPU 9:   0.0% user,  0.0% nice,  0.0% system, 67.6% interrupt, 32.4% idle
CPU 10:  0.0% user,  0.0% nice,  0.0% system, 69.0% interrupt, 31.0% idle
CPU 11:  0.0% user,  0.0% nice,  0.0% system, 66.2% interrupt, 33.8% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system, 62.0% interrupt, 38.0% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system, 63.4% interrupt, 36.6% idle
Mem: 536M Active, 29M Inact, 1528M Wired, 60G Free
ARC: 114M Total, 22M MFU, 88M MRU, 693K Header, 3231K Other
     30M Compressed, 92M Uncompressed, 3.12:1 Ratio
Swap: 32G Total, 32G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -92    -     0B  1472K CPU8     8   0:37  60.50% intr{irq152: t6nex1:0a0}
   12 root        -92    -     0B  1472K CPU10   10   0:36  60.42% intr{irq154: t6nex1:0a2}
   12 root        -92    -     0B  1472K CPU11   11   0:36  60.27% intr{irq155: t6nex1:0a3}
   12 root        -92    -     0B  1472K CPU14   14   0:36  60.26% intr{irq158: t6nex1:0a6}
   12 root        -92    -     0B  1472K CPU12   12   0:36  60.24% intr{irq156: t6nex1:0a4}
   12 root        -92    -     0B  1472K CPU9     9   0:36  60.15% intr{irq153: t6nex1:0a1}
   12 root        -92    -     0B  1472K CPU13   13   0:36  59.88% intr{irq157: t6nex1:0a5}
   12 root        -92    -     0B  1472K CPU15   15   0:36  59.41% intr{irq159: t6nex1:0a7}
   12 root        -92    -     0B  1472K WAIT     0   0:37  58.49% intr{irq98: t6nex0:0a0}
   12 root        -92    -     0B  1472K WAIT     1   0:37  57.89% intr{irq99: t6nex0:0a1}
   12 root        -92    -     0B  1472K WAIT     4   0:37  57.39% intr{irq102: t6nex0:0a4}
   12 root        -92    -     0B  1472K WAIT     5   0:36  57.35% intr{irq103: t6nex0:0a5}
   12 root        -92    -     0B  1472K WAIT     3   0:36  57.32% intr{irq101: t6nex0:0a3}
   12 root        -92    -     0B  1472K WAIT     6   0:36  57.12% intr{irq104: t6nex0:0a6}
   12 root        -92    -     0B  1472K WAIT     2   0:36  56.98% intr{irq100: t6nex0:0a2}
   12 root        -92    -     0B  1472K WAIT     7   0:36  56.85% intr{irq105: t6nex0:0a7}


# pcm-numa.x

Time elapsed: 1064 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   1.31       4195 M     3203 M      3382 K                48 K
   1   1.32       4211 M     3199 M      3241 K                27 K
   2   1.33       4238 M     3196 M      3146 K                48 K
   3   1.33       4238 M     3197 M      3143 K                26 K
   4   1.32       4228 M     3197 M      3241 K                47 K
   5   1.33       4243 M     3198 M      3046 K                29 K
   6   1.33       4247 M     3195 M      3169 K                47 K
   7   1.33       4264 M     3196 M      3180 K                20 K
   8   1.29       4159 M     3224 M      2948 K                77 K
   9   1.29       4172 M     3224 M      2865 K                92 K
  10   1.29       4199 M     3247 M      3263 K                76 K
  11   1.30       4237 M     3259 M      2892 K                91 K
  12   1.30       4261 M     3274 M      3069 K                73 K
  13   1.30       4231 M     3246 M      2959 K               104 K
  14   1.30       4291 M     3291 M      3353 K                74 K
  15   1.31       4221 M     3227 M      3008 K                85 K


pmcstat -S cpu_clk_unhalted.thread flamegraph - https://files.fm/u/enhy23ffr

--------------------

* single NUMA domain test

In this scenario I create VLANs on a single cc0 interface (using one NUMA domain)

ifconfig_vlan2020="vlan 2020 vlandev cc0"
ifconfig_vlan2002="vlan 2002 vlandev cc0"



+--------+         +--------+      +---------+
|        +---------+        +------+         |
| Router |   cc0   | switch |      |  gen    |
|        |         |        +------+         |
+--------+         +--------+      +---------+


Using cc0 I can achieve 16Mpps without drops:

# netstat -i -I cc0 1
            input            cc0           output
   packets  errs idrops      bytes    packets  errs      bytes colls
  15934346     0     0 2245565269   15933728     0 2245477291     0
  15927621     0     0 2244617740   15928235     0 2244704202     0
  15934688     0     0 2245613662   15934213     0 2245546449     0
  15931155     0     0 2245115588   15931208     0 2245120654     0
  15926995     0     0 2244529583   15927391     0 2244585093     0
  15931114     0     0 2245109534   15931145     0 2245115823     0

# top -PSH
last pid:  9976;  load averages:  6.57,  2.51,  1.00    up 0+00:03:23  20:16:17
579 threads:   25 running, 470 sleeping, 84 waiting
CPU 0:   0.0% user,  0.0% nice,  0.0% system, 95.4% interrupt,  4.6% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 95.4% interrupt,  4.6% idle
CPU 2:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 3:   0.0% user,  0.0% nice,  0.0% system, 93.9% interrupt,  6.1% idle
CPU 4:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 5:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 6:   0.0% user,  0.0% nice,  0.0% system, 94.7% interrupt,  5.3% idle
CPU 7:   0.0% user,  0.0% nice,  0.0% system, 93.1% interrupt,  6.9% idle
CPU 8:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 9:   0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 10:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 11:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 12:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 13:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 14:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 537M Active, 30M Inact, 1260M Wired, 60G Free
ARC: 115M Total, 22M MFU, 89M MRU, 695K Header, 3260K Other
     30M Compressed, 93M Uncompressed, 3.10:1 Ratio
Swap: 32G Total, 32G Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   12 root        -92    -     0B  1472K CPU3     3   1:50  94.86% intr{irq101: t6nex0:0a3}
   12 root        -92    -     0B  1472K CPU1     1   1:49  94.68% intr{irq99: t6nex0:0a1}
   12 root        -92    -     0B  1472K CPU5     5   1:49  94.40% intr{irq103: t6nex0:0a5}
   12 root        -92    -     0B  1472K CPU7     7   1:49  94.18% intr{irq105: t6nex0:0a7}
   12 root        -92    -     0B  1472K CPU0     0   1:49  94.13% intr{irq98: t6nex0:0a0}
   12 root        -92    -     0B  1472K CPU6     6   1:49  94.11% intr{irq104: t6nex0:0a6}
   12 root        -92    -     0B  1472K CPU4     4   1:49  93.81% intr{irq102: t6nex0:0a4}
   12 root        -92    -     0B  1472K CPU2     2   1:48  93.56% intr{irq100: t6nex0:0a2}


# pcm-numa.x

Time elapsed: 1002 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
   0   1.93       6513 M     3374 M      4179 K                34 K
   1   1.93       6516 M     3374 M      4153 K              3655
   2   1.94       6518 M     3352 M      4122 K                33 K
   3   1.94       6516 M     3367 M      4118 K              8574
   4   1.94       6517 M     3361 M      4142 K                37 K
   5   1.93       6516 M     3376 M      4147 K                10 K
   6   1.93       6515 M     3371 M      4154 K                39 K
   7   1.94       6514 M     3360 M      4173 K                12 K
   8   0.24       1833 K     7596 K      1805                1378
   9   0.20        728 K     3726 K       467                 502
  10   0.11        312 K     2779 K       227                 234
  11   0.14        486 K     3407 K       291                 361
  12   0.12        357 K     2956 K       183                 132
  13   0.07        195 K     2664 K        46                 119
  14   0.13        381 K     3047 K       455                 212
  15   0.23        765 K     3310 K       325                 346
---------------------------------------------------------------------


pmcstat -S cpu_clk_unhalted.thread flamegraph - https://files.fm/u/3njfz2r3g
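
The two pcm-numa tables also quantify the locality difference. As an illustration (my own arithmetic on one representative busy core from each table; the `remote_share` helper is mine), the lagg test shows a much larger fraction of remote DRAM accesses:

```python
# Share of DRAM accesses that hit the remote NUMA node, computed from
# the pcm-numa tables above (raw counts; K-suffixed values expanded).
def remote_share(local, remote):
    """Fraction of DRAM accesses served by the remote node."""
    return remote / (local + remote)

lagg   = remote_share(2_959_000, 104_000)  # lagg test, core 13
single = remote_share(4_153_000, 3_655)    # single-domain test, core 1

print(f"lagg: {lagg:.1%}, single: {single:.1%}")  # lagg: 3.4%, single: 0.1%
```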


* Summary

I know lagg adds a certain amount of overhead, but based on my testing a single card performs better than two cards in lagg0.
Comment 1 Kubilay Kocak freebsd_committer freebsd_triage 2021-09-20 10:44:16 UTC
Thank you for the report Konrad. Could you please include the following additional information:

  - uname -a output
  - custom kernel configuration if not GENERIC (as an attachment)
  - /var/run/dmesg.boot output (as an attachment)
  - /etc/sysctl.conf and /boot/loader.conf configuration (as an attachment, if not empty)
Comment 2 Kubilay Kocak freebsd_committer freebsd_triage 2021-09-20 10:46:00 UTC
Created attachment 228044 [details]
lagg0_16Mpps.svg

Attached flamegraph from comment #0
Comment 3 Kubilay Kocak freebsd_committer freebsd_triage 2021-09-20 10:46:35 UTC
Created attachment 228045 [details]
single_16Mpps.svg

Attached flamegraph (#2) from comment #0
Comment 4 Konrad 2021-09-20 11:22:36 UTC
Created attachment 228046 [details]
dmesg.boot
Comment 5 Konrad 2021-09-20 11:23:09 UTC
Created attachment 228047 [details]
/boot/loader.conf
Comment 6 Konrad 2021-09-20 11:24:29 UTC
Created attachment 228048 [details]
custom kernel configuration

# uname -a
FreeBSD Thunder 13.0-STABLE FreeBSD 13.0-STABLE #5 stable/13-6d8f2277d-dirty: Wed Sep 15 12:09:27 CEST 2021     root@Thunder:/usr/obj/usr/src/amd64.amd64/sys/ROUTER  amd64
Comment 7 Konrad 2021-09-20 12:09:52 UTC
I think this is not related to cxgbe directly; I have achieved similar results with mlx5en(4).
Comment 8 Kubilay Kocak freebsd_committer freebsd_triage 2021-09-22 23:21:32 UTC
(In reply to Konrad from comment #7)

Are you able to boot a 14-CURRENT snapshot to attempt reproduction on that version?
Comment 9 Konrad 2021-09-27 11:52:31 UTC
Currently I am not able to boot 14-CURRENT, but I did in June. The results were similar.