pfSense 2.4.4 throughput benchmark for APU2, BIOS v4.9.0.2

In 2017, we published a throughput test for pfSense 2.3.3 on APU2C0 that showed a maximum throughput of about 620Mbit/s.

Things have changed. pfSense has released several OS updates (we are now on 2.4.4), and PC Engines has released several BIOS updates for the APU. The latest BIOS, v4.9.0.2, is supposed to enable CPU boost to 1.4GHz.

Let's see if we get better performance today (February 2019).

All tests were performed on an APU2C2, but the same results should be achieved with any other APU2, APU3 or APU4 router.

Network topology for the test

Router specification:

  • APU2C2 (2GB RAM)
  • pfSense 2.4.4
  • powerd tested both enabled and disabled (it doesn't seem to make a difference)
  • router configured to use multiple NIC queues, and route on multiple CPU cores, as described here.

 

 

Test 1: one TCP connection

The first test is using a single TCP connection (-P 1). This forces all traffic to be routed and processed by a single CPU core.

Old BIOS v4.0.23

root@homedesktop:/home/sniku# iperf3 -c 192.168.5.50 -P 1
Connecting to host 192.168.5.50, port 5201
[  5] local 192.168.1.111 port 50932 connected to 192.168.5.50 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  94.1 MBytes   790 Mbits/sec    0   3.07 MBytes       
[  5]   1.00-2.00   sec  90.0 MBytes   755 Mbits/sec    0   3.07 MBytes       
[  5]   2.00-3.00   sec  91.2 MBytes   765 Mbits/sec    0   3.07 MBytes       
[  5]   3.00-4.00   sec  90.0 MBytes   755 Mbits/sec    0   3.07 MBytes       
[  5]   4.00-5.00   sec  90.0 MBytes   754 Mbits/sec    0   3.07 MBytes       
[  5]   5.00-6.00   sec  88.8 MBytes   746 Mbits/sec    0   3.07 MBytes       
[  5]   6.00-7.00   sec  90.0 MBytes   755 Mbits/sec    0   3.07 MBytes       
[  5]   7.00-8.00   sec  88.8 MBytes   744 Mbits/sec    0   3.07 MBytes       
[  5]   8.00-9.00   sec  88.8 MBytes   744 Mbits/sec    0   3.07 MBytes       
[  5]   9.00-10.00  sec  88.8 MBytes   744 Mbits/sec    0   3.07 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   900 MBytes   755 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   899 MBytes   754 Mbits/sec                  receiver

iperf Done.

I executed 10 runs, and all results were very similar: 771, 755, 763, 756, 763, 755, 761, 758, 767.
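Copying the final bitrate out of every run by hand gets tedious. Here is a minimal Python sketch that pulls the receiver-side summary out of iperf3 text output like the one above. The function name `receiver_mbits` is my own invention, and the regular expression only targets the summary format shown in this article, so treat it as a starting point rather than a general parser.

```python
import re

# Matches summary lines such as:
# [  5]   0.00-10.00  sec   899 MBytes   754 Mbits/sec                  receiver
# [SUM]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver
SUMMARY = re.compile(
    r"^\[\s*(?P<id>SUM|\d+)\]\s+\S+\s+sec\s+\S+\s+\w+\s+"
    r"(?P<rate>[\d.]+)\s+Mbits/sec\s+receiver\s*$",
    re.MULTILINE,
)


def receiver_mbits(output: str) -> float:
    """Return the receiver bitrate in Mbit/s from iperf3 text output.

    With several parallel streams (-P 2 and up) the [SUM] line is used;
    with a single stream the only receiver line is used.
    """
    rates = {m.group("id"): float(m.group("rate")) for m in SUMMARY.finditer(output)}
    if not rates:
        raise ValueError("no receiver summary line found")
    if "SUM" in rates:
        return rates["SUM"]
    return next(iter(rates.values()))


# Summary lines copied from the single-connection run above.
sample = """\
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   900 MBytes   755 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   899 MBytes   754 Mbits/sec                  receiver
"""
print(receiver_mbits(sample))  # 754.0
```

Feeding it the saved output of each run gives one number per run, which is easier to compare than full transcripts.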

This is a much better result than in the 2017 test, where the throughput was limited to about 620Mbit/s.

As can be seen above, the throughput is also very consistent within a single test. Across runs it varies only between 755Mbit/s and 771Mbit/s.

Throughput is very consistent.

New BIOS v4.9.0.2

root@homedesktop:/home/sniku# iperf3 -c 192.168.5.50 -P 1
Connecting to host 192.168.5.50, port 5201
[  5] local 192.168.1.111 port 51070 connected to 192.168.5.50 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   106 MBytes   890 Mbits/sec    0   2.99 MBytes       
[  5]   1.00-2.00   sec   102 MBytes   860 Mbits/sec    0   2.99 MBytes       
[  5]   2.00-3.00   sec   104 MBytes   870 Mbits/sec    0   2.99 MBytes       
[  5]   3.00-4.00   sec   101 MBytes   849 Mbits/sec    0   2.99 MBytes       
[  5]   4.00-5.00   sec   102 MBytes   860 Mbits/sec    0   2.99 MBytes       
[  5]   5.00-6.00   sec  98.8 MBytes   828 Mbits/sec    0   2.99 MBytes       
[  5]   6.00-7.00   sec  96.2 MBytes   807 Mbits/sec    0   3.14 MBytes       
[  5]   7.00-8.00   sec  96.2 MBytes   807 Mbits/sec    0   3.14 MBytes       
[  5]   8.00-9.00   sec  97.5 MBytes   818 Mbits/sec    0   3.14 MBytes       
[  5]   9.00-10.00  sec   100 MBytes   839 Mbits/sec    0   3.14 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1005 MBytes   843 Mbits/sec    0             sender
[  5]   0.00-10.00  sec  1004 MBytes   843 Mbits/sec                  receiver

iperf Done.

I executed 10 runs; results were: 843, 797, 807, 799, 795, 785, 819, 816, 795, 835, 785, 774.

As shown, throughput is higher, but there is quite a lot of variation. Across my runs it varies between 774Mbit/s and 843Mbit/s. It also varies during a single run, which suggests that CPU boost may be turning on and off. Nevertheless, it's a nice improvement!
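To put a number on "higher but less consistent", here is a small Python sketch that summarizes both result lists. The values are copied exactly as printed in this article (9 values for the old BIOS, 12 for the new one); the helper name `summarize` is just my choice.

```python
from statistics import mean, pstdev

# Final bitrates in Mbit/s, copied from the run lists in this article.
old_bios = [771, 755, 763, 756, 763, 755, 761, 758, 767]
new_bios = [843, 797, 807, 799, 795, 785, 819, 816, 795, 835, 785, 774]


def summarize(runs):
    """Return mean, min, max and population standard deviation."""
    return {
        "mean": round(mean(runs), 1),
        "min": min(runs),
        "max": max(runs),
        "stdev": round(pstdev(runs), 1),
    }


for label, runs in (("v4.0.23", old_bios), ("v4.9.0.2", new_bios)):
    print(label, summarize(runs))
```

The gap between the slowest and fastest run is 16Mbit/s on the old BIOS and 69Mbit/s on the new one, while the mean goes up from 761 to about 804Mbit/s.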

Throughput is higher, but less consistent.

 

 

Test 2: two TCP connections

This test is similar to the first one, but we are using 2 TCP connections (-P 2). This allows the traffic to be processed by 2 CPU cores.

Old BIOS v4.0.23

root@homedesktop:/home/sniku# iperf3 -c 192.168.5.50 -P 2 
Connecting to host 192.168.5.50, port 5201
[  5] local 192.168.1.111 port 50880 connected to 192.168.5.50 port 5201
[  7] local 192.168.1.111 port 50882 connected to 192.168.5.50 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  71.5 MBytes   600 Mbits/sec    2    407 KBytes       
[  7]   0.00-1.00   sec  43.3 MBytes   364 Mbits/sec    6    228 KBytes       
[SUM]   0.00-1.00   sec   115 MBytes   964 Mbits/sec    8             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec  67.5 MBytes   566 Mbits/sec    1    417 KBytes       
[  7]   1.00-2.00   sec  45.1 MBytes   378 Mbits/sec    0    311 KBytes       
[SUM]   1.00-2.00   sec   113 MBytes   944 Mbits/sec    1             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec  67.6 MBytes   567 Mbits/sec    0    419 KBytes       
[  7]   2.00-3.00   sec  44.7 MBytes   375 Mbits/sec    0    314 KBytes       
[SUM]   2.00-3.00   sec   112 MBytes   942 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec  67.4 MBytes   565 Mbits/sec    0    420 KBytes       
[  7]   3.00-4.00   sec  45.1 MBytes   378 Mbits/sec    0    314 KBytes       
[SUM]   3.00-4.00   sec   112 MBytes   944 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec  66.7 MBytes   559 Mbits/sec    0    423 KBytes       
[  7]   4.00-5.00   sec  44.4 MBytes   373 Mbits/sec    0    317 KBytes       
[SUM]   4.00-5.00   sec   111 MBytes   932 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec  67.7 MBytes   568 Mbits/sec    1    423 KBytes       
[  7]   5.00-6.00   sec  45.4 MBytes   381 Mbits/sec    0    318 KBytes       
[SUM]   5.00-6.00   sec   113 MBytes   949 Mbits/sec    1             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-7.00   sec  67.7 MBytes   568 Mbits/sec    0    423 KBytes       
[  7]   6.00-7.00   sec  44.3 MBytes   372 Mbits/sec    0    318 KBytes       
[SUM]   6.00-7.00   sec   112 MBytes   939 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   7.00-8.00   sec  67.0 MBytes   562 Mbits/sec    0    424 KBytes       
[  7]   7.00-8.00   sec  45.3 MBytes   380 Mbits/sec    0    318 KBytes       
[SUM]   7.00-8.00   sec   112 MBytes   942 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec  67.2 MBytes   564 Mbits/sec    0    424 KBytes       
[  7]   8.00-9.00   sec  44.7 MBytes   375 Mbits/sec    0    318 KBytes       
[SUM]   8.00-9.00   sec   112 MBytes   938 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  56.7 MBytes   476 Mbits/sec    0    424 KBytes       
[  7]   9.00-10.00  sec  56.1 MBytes   470 Mbits/sec    0    406 KBytes       
[SUM]   9.00-10.00  sec   113 MBytes   946 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   667 MBytes   559 Mbits/sec    4             sender
[  5]   0.00-10.00  sec   665 MBytes   558 Mbits/sec                  receiver
[  7]   0.00-10.00  sec   458 MBytes   385 Mbits/sec    6             sender
[  7]   0.00-10.00  sec   457 MBytes   383 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  1.10 GBytes   944 Mbits/sec   10             sender
[SUM]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.

I have executed 10 runs, results were: 942, 944, 944, 944, 764, 944, 944, 770, 769, 944.

Sometimes both connections get assigned to the same CPU core, so processing is limited to one core. This results in a throughput of about 760Mbit/s, just like in Test 1.
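The two outcomes are easy to tell apart in the numbers. As a rough sketch, the Python snippet below splits the ten results listed above into "looks single-core-bound" and "full speed". The 850Mbit/s cut-off is an arbitrary value I picked between the two clusters, not something measured.

```python
# Final [SUM] bitrates in Mbit/s for the ten -P 2 runs on BIOS v4.0.23,
# copied from the list above.
runs = [942, 944, 944, 944, 764, 944, 944, 770, 769, 944]

# Assumption: anything below this cut-off was limited to one core.
# 850 is an arbitrary value between the two clusters (~765 and ~944).
SINGLE_CORE_CUTOFF = 850


def split_runs(results, cutoff=SINGLE_CORE_CUTOFF):
    """Separate runs that look single-core-bound from full-speed runs."""
    slow = [r for r in results if r < cutoff]
    fast = [r for r in results if r >= cutoff]
    return slow, fast


slow, fast = split_runs(runs)
print(f"{len(slow)} of {len(runs)} runs were limited to one core: {slow}")
# 3 of 10 runs were limited to one core: [764, 770, 769]
```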

Throughput is close to 1Gbit/s most of the time when using more than 1 connection.

New BIOS v4.9.0.2

root@homedesktop:/home/sniku# iperf3 -c 192.168.5.50 -P 2 
Connecting to host 192.168.5.50, port 5201
[  5] local 192.168.1.112 port 56446 connected to 192.168.5.50 port 5201
[  7] local 192.168.1.112 port 56448 connected to 192.168.5.50 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  67.7 MBytes   568 Mbits/sec    0    434 KBytes       
[  7]   0.00-1.00   sec  47.4 MBytes   398 Mbits/sec    0    325 KBytes       
[SUM]   0.00-1.00   sec   115 MBytes   965 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   1.00-2.00   sec  67.7 MBytes   568 Mbits/sec    0    434 KBytes       
[  7]   1.00-2.00   sec  44.8 MBytes   376 Mbits/sec    0    325 KBytes       
[SUM]   1.00-2.00   sec   113 MBytes   944 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   2.00-3.00   sec  67.6 MBytes   567 Mbits/sec    0    434 KBytes       
[  7]   2.00-3.00   sec  44.7 MBytes   375 Mbits/sec    0    325 KBytes       
[SUM]   2.00-3.00   sec   112 MBytes   942 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   3.00-4.00   sec  66.6 MBytes   558 Mbits/sec    0    434 KBytes       
[  7]   3.00-4.00   sec  44.9 MBytes   377 Mbits/sec    0    325 KBytes       
[SUM]   3.00-4.00   sec   111 MBytes   935 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   4.00-5.00   sec  67.8 MBytes   569 Mbits/sec    0    434 KBytes       
[  7]   4.00-5.00   sec  44.8 MBytes   376 Mbits/sec    0    325 KBytes       
[SUM]   4.00-5.00   sec   113 MBytes   945 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   5.00-6.00   sec  66.6 MBytes   559 Mbits/sec    0    434 KBytes       
[  7]   5.00-6.00   sec  44.8 MBytes   376 Mbits/sec    0    325 KBytes       
[SUM]   5.00-6.00   sec   111 MBytes   935 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   6.00-7.00   sec  67.6 MBytes   567 Mbits/sec    0    434 KBytes       
[  7]   6.00-7.00   sec  44.9 MBytes   376 Mbits/sec    0    325 KBytes       
[SUM]   6.00-7.00   sec   112 MBytes   943 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   7.00-8.00   sec  67.5 MBytes   567 Mbits/sec    0    434 KBytes       
[  7]   7.00-8.00   sec  44.7 MBytes   375 Mbits/sec    0    325 KBytes       
[SUM]   7.00-8.00   sec   112 MBytes   942 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   8.00-9.00   sec  67.4 MBytes   566 Mbits/sec    0    434 KBytes       
[  7]   8.00-9.00   sec  45.1 MBytes   378 Mbits/sec    0    325 KBytes       
[SUM]   8.00-9.00   sec   113 MBytes   944 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[  5]   9.00-10.00  sec  66.9 MBytes   561 Mbits/sec    0    434 KBytes       
[  7]   9.00-10.00  sec  45.1 MBytes   378 Mbits/sec    0    325 KBytes       
[SUM]   9.00-10.00  sec   112 MBytes   940 Mbits/sec    0             
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   674 MBytes   565 Mbits/sec    0             sender
[  5]   0.00-10.00  sec   672 MBytes   564 Mbits/sec                  receiver
[  7]   0.00-10.00  sec   451 MBytes   379 Mbits/sec    0             sender
[  7]   0.00-10.00  sec   450 MBytes   377 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  1.10 GBytes   944 Mbits/sec    0             sender
[SUM]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver

iperf Done.

I executed 10 runs; all results were ~944Mbit/s.

In my previous tests I experienced the same problem of both connections being assigned to 1 core, as in the old BIOS results above, but it did not happen this time. It looks like the new BIOS mitigates this problem to some degree, or perhaps it's just a coincidence.
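How plausible is the "coincidence" explanation? Here is a back-of-the-envelope sketch in Python. It rests on an assumption I have not verified: that each of the two connections lands on one of the four cores uniformly and independently, so both share a core 1 time in 4. Whether the kernel and NIC really behave like that is not established by these tests.

```python
from math import comb


def prob_collisions(k: int, runs: int = 10, p: float = 0.25) -> float:
    """Binomial probability of exactly k single-core runs out of `runs`.

    p = 0.25 is an assumption: two connections, each placed on one of
    four cores uniformly and independently, share a core 1 time in 4.
    """
    return comb(runs, k) * p**k * (1 - p) ** (runs - k)


print(f"no collision in 10 runs: {prob_collisions(0):.3f}")
print(f"3 collisions in 10 runs: {prob_collisions(3):.3f}")
```

Under that assumption, ten clean runs in a row would show up roughly 6% of the time, so ten runs are not enough to rule out coincidence.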

Throughput is 1Gbit/s when using more than 1 connection.

 

Test 3: forty TCP connections

This test is similar to the first one, but we are using 40 TCP connections (-P 40). This allows the traffic to be evenly distributed between the 4 CPU cores.
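A quick sketch of why 40 connections should be safe from the single-core bottleneck. It uses a simplified model that is my own assumption, not something confirmed by these tests: every connection is placed on one of the cores uniformly and independently.

```python
def prob_all_on_one_core(connections: int, cores: int = 4) -> float:
    """Chance that every connection ends up on the same core.

    Assumes each connection is assigned to a core uniformly and
    independently, which is a simplification.
    """
    return cores * (1 / cores) ** connections


for n in (2, 40):
    print(n, prob_all_on_one_core(n))
```

In this model two connections share a core 25% of the time, while for 40 connections the chance of all of them piling onto one core is vanishingly small.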

Old BIOS v4.0.23

I executed 10 runs; results were always 944Mbit/s.

Throughput is 1Gbit/s when using 40 connections.

New BIOS v4.9.0.2

I executed 10 runs; results were always 944Mbit/s.

Throughput is 1Gbit/s when using 40 connections.

 

Kernel sometimes assigns processing of 2 heavy tasks to the same CPU core

I don't know too much about CPU scheduling in BSD, but it looks like it's sometimes not ideal. In a few cases the kernel schedules processing of multiple connections on the same CPU core, even when the 3 other cores are not doing anything. This limits throughput to about 760Mbit/s.

When this happens, one can execute "top -P CC" to see what is going on. Here's what it looks like when 1 CPU core is processing 2 TCP connections.

last pid: 59123;  load averages:  1.16,  0.85,  0.60    up 0+00:15:26  20:17:26
50 processes:  1 running, 49 sleeping
CPU 0:  0.0% user,  0.0% nice,  0.8% system, 13.8% interrupt, 85.4% idle
CPU 1:  0.0% user,  0.0% nice,  0.4% system, 14.2% interrupt, 85.4% idle
CPU 2:  0.0% user,  0.0% nice,  1.2% system,  5.5% interrupt, 93.3% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
Mem: 66M Active, 46M Inact, 440M Wired, 228K Buf, 1270M Free
ARC: 84M Total, 18M MFU, 64M MRU, 168K Anon, 357K Header, 1507K Other
     37M Compressed, 93M Uncompressed, 2.49:1 Ratio
Swap: 2048M Total, 2048M Free

 

When things go as planned, 2 CPU cores are sharing the load, and we get full gigabit throughput.

last pid: 82545;  load averages:  1.25,  0.88,  0.41    up 0+00:05:17  20:28:48
50 processes:  1 running, 49 sleeping
CPU 0:  0.0% user,  0.0% nice,  0.0% system, 31.4% interrupt, 68.6% idle
CPU 1:  0.4% user,  0.0% nice,  1.6% system,  0.0% interrupt, 98.0% idle
CPU 2:  0.0% user,  0.0% nice,  0.8% system, 76.5% interrupt, 22.7% idle
CPU 3:  0.0% user,  0.0% nice,  0.8% system, 51.4% interrupt, 47.8% idle
Mem: 65M Active, 46M Inact, 433M Wired, 264K Buf, 1277M Free
ARC: 84M Total, 18M MFU, 65M MRU, 168K Anon, 356K Header, 1506K Other
     37M Compressed, 92M Uncompressed, 2.47:1 Ratio
Swap: 2048M Total, 2048M Free

This happens rarely, so in practice it won't be a problem for most users. Still, it's interesting to learn something new :-)
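If you want to spot the saturated core without reading the top output by eye, a small Python sketch like the one below can do it. It only parses the per-CPU lines in the format shown above. The function name and the 5% idle threshold are my own choices.

```python
import re

# Per-CPU lines as printed by top in this article.
CPU_LINE = re.compile(
    r"^CPU\s+(?P<cpu>\d+):.*?(?P<intr>[\d.]+)% interrupt,\s+(?P<idle>[\d.]+)% idle",
    re.MULTILINE,
)


def busy_cores(top_output: str, idle_below: float = 5.0):
    """Return {cpu: interrupt %} for cores with less than idle_below % idle."""
    return {
        int(m.group("cpu")): float(m.group("intr"))
        for m in CPU_LINE.finditer(top_output)
        if float(m.group("idle")) < idle_below
    }


# CPU lines copied from the "1 core handles both connections" output above.
sample = """\
CPU 0:  0.0% user,  0.0% nice,  0.8% system, 13.8% interrupt, 85.4% idle
CPU 1:  0.0% user,  0.0% nice,  0.4% system, 14.2% interrupt, 85.4% idle
CPU 2:  0.0% user,  0.0% nice,  1.2% system,  5.5% interrupt, 93.3% idle
CPU 3:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
"""
print(busy_cores(sample))  # {3: 100.0}
```

For the healthy case, where the load is shared, the same function returns an empty dict because no core drops below 5% idle.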

 

Data

If you want to analyze the data yourself, see the raw data files from all test runs here

 

CPU stress test benchmark

While we are at it, let's stress-test CPU and see what the CPU-only improvement is. 

Old BIOS v4.0.23

Single core CPU stress test:

root@debian:~# sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          72.9001s
    total number of events:              10000
    total time taken by event execution: 72.8936
    per-request statistics:
         min:                                  6.62ms
         avg:                                  7.29ms
         max:                                  9.20ms
         approx.  95 percentile:               7.59ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   72.8936/0.00

 

A single CPU core takes close to 73 seconds to complete the benchmark (10000 events, each checking the numbers up to 20000 for primes).

New BIOS v4.9.0.2

Single core CPU stress test

root@debian:~# sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=1
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 20000

Test execution summary:
    total time:                          54.6983s
    total number of events:              10000
    total time taken by event execution: 54.6941
    per-request statistics:
         min:                                  5.44ms
         avg:                                  5.47ms
         max:                                  8.00ms
         approx.  95 percentile:               5.48ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   54.6941/0.00

 

A single CPU core takes close to 55 seconds to complete the same benchmark.

Wow, the difference here is very significant. The run time dropped by 25% (from 72.9s to 54.7s), which is equivalent to about 33% more work per second!
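The percentage depends on which way you look at it, so here is the arithmetic as a tiny Python sketch, using the two "total time" values from the sysbench output above.

```python
old_s = 72.9001   # total time on BIOS v4.0.23, from the output above
new_s = 54.6983   # total time on BIOS v4.9.0.2

time_saved = (old_s - new_s) / old_s   # fraction of the run time saved
speedup = old_s / new_s                # work per second, relative to old

print(f"run time reduced by {time_saved:.1%}")
print(f"throughput increased by {speedup - 1:.1%}")
```

Both numbers describe the same result: the job finishes in about three quarters of the time, so the core gets about a third more done per second.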

Crypto benchmark. Will VPN also be faster on the new BIOS?

Many customers use OpenVPN on the APU boxes. OpenVPN always uses a single core for encryption/decryption, so let's see if the new BIOS improves the VPN speed.

All these crypto tests have been executed on Debian. More tests are needed on pfSense/BSD platform.

Old BIOS v4.0.23

 

aes-128-cbc speed test

root@debian:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 14383771 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 6768663 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 2212471 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 622353 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 80451 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 39933 aes-128-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/x86_64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      76713.45k   144398.14k   188797.53k   212429.82k   219684.86k   218087.42k

This test was performed on Debian 9. It will have to be repeated on pfSense/BSD.


aes-256-cbc speed test

root@debian:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 12723922 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 5423190 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 1677103 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 457506 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 58599 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 29178 aes-256-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/x86_64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      67860.92k   115694.72k   143112.79k   156162.05k   160014.34k   159350.78k

New BIOS v4.9.0.2

 

aes-128-cbc speed test

root@debian:~# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 17630245 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 8256436 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 2779482 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 781873 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 101212 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 16384 size blocks: 50068 aes-128-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/x86_64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-128-cbc      94027.97k   176137.30k   237182.46k   266879.32k   276376.23k   273438.04k

Performance improved by a little over 20% (22% to 26% in the tables above, depending on block size). This should translate to roughly 20% more throughput on OpenVPN when using aes-128-cbc.

aes-256-cbc speed test

root@debian:~# openssl speed -elapsed -evp aes-256-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 15604939 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 6746434 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 2104621 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 574371 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 73630 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 36881 aes-256-cbc's in 3.00s
OpenSSL 1.1.0f  25 May 2017
built on: reproducible build, date unspecified
options:bn(64,64) rc4(8x,int) des(int) aes(partial) blowfish(ptr) 
compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/lib/ssl\"" -DENGINESDIR="\"/usr/lib/x86_64-linux-gnu/engines-1.1\"" 
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
aes-256-cbc      83226.34k   143923.93k   179594.33k   196051.97k   201058.99k   201419.43k

The same improvement of a little over 20% is observed with aes-256-cbc.
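For anyone who wants the exact per-block-size numbers, here is a short Python sketch that compares the old and new result rows. The throughput values are copied from the four openssl tables above; the helper name `gains` is mine.

```python
# Throughput in 1000s of bytes per second, copied from the openssl
# "speed" tables above. Columns: 16, 64, 256, 1024, 8192, 16384 bytes.
BLOCK_SIZES = [16, 64, 256, 1024, 8192, 16384]
results = {
    "aes-128-cbc": {
        "old": [76713.45, 144398.14, 188797.53, 212429.82, 219684.86, 218087.42],
        "new": [94027.97, 176137.30, 237182.46, 266879.32, 276376.23, 273438.04],
    },
    "aes-256-cbc": {
        "old": [67860.92, 115694.72, 143112.79, 156162.05, 160014.34, 159350.78],
        "new": [83226.34, 143923.93, 179594.33, 196051.97, 201058.99, 201419.43],
    },
}


def gains(old, new):
    """Percentage improvement of new over old, per block size."""
    return [round((n / o - 1) * 100, 1) for o, n in zip(old, new)]


for cipher, data in results.items():
    for size, gain in zip(BLOCK_SIZES, gains(data["old"], data["new"])):
        print(f"{cipher} {size:>6} bytes: +{gain}%")
```

All twelve values land between 22% and 26%, with the smallest block sizes gaining slightly less than the larger ones.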

Conclusion

The new BIOS (v4.9.0.2) noticeably improves CPU performance. It looks like CPU boost is working. However, it's hard to say what the actual CPU frequency is during the boost, since the regular tools (cpufreq, powerd) are not able to detect it.

AMD claims that this CPU can boost up to 1400MHz, which should translate to a performance improvement of roughly 40%. In practice, we see a little less.

With the new BIOS, the single-core boost helps routing performance when using 1 network connection. The improvement is roughly 12% (from 750Mbit/s to 840Mbit/s). This is great, but it's not the 40% we were hoping for.
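Here is where the "40% hoped for" and the observed routing gain come from, as a tiny Python sketch. Note that the 1000MHz base clock is my assumption; it is not stated in this article. The routing figures are the rounded ones from the paragraph above.

```python
BASE_MHZ = 1000    # assumption: base clock, not stated in this article
BOOST_MHZ = 1400   # boost clock quoted above

expected = BOOST_MHZ / BASE_MHZ - 1   # ideal gain if everything scaled with clock
observed = 840 / 750 - 1              # single-connection routing, rounded figures

print(f"expected from clock: {expected:.0%}")
print(f"observed in routing: {observed:.0%}")
```

With these rounded figures the routing gain works out to about 12%, well short of what the clock ratio alone would suggest.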

When using more than one network connection, APU is able to route at 1 Gbit/s on pfSense regardless of the BIOS version.

The CPU stress test showed a very significant 25% improvement. CPU-heavy applications, such as Snort or other IDS/IPS applications, should see a significant performance improvement.

Crypto performance improved by 20%. This should translate to roughly 20% faster OpenVPN.

You should upgrade your BIOS to v4.9.0.2 or later :-)