Quoted:
... At a minimum I would carve out your etcd and master nodes to baremetal. The worker nodes can be vms. That might help with stability and response times.
I've seen clusters that are full virt with just horrible response times. It's usually due to the control plane sitting on exhausted hypervisors.
The comments about etcd, and noticing that k3s doesn't use etcd by default, got me curious, so I started digging.
It looks like etcd requires fdatasync latency of 10ms or less at the 99th percentile:
IBM - Using fio to tell if your storage is fast enough for etcd

I ran their fio test with the same parameters on my bare metal boxes and on a VM on each bare metal box. On all but one of the servers, the 99th percentile latency was abysmal: 30ms to 50ms. One server came in at 2ms or less, and on that fast server the latency was also 2ms inside the VM. On the two slow servers, running the test inside a VM made things an order of magnitude worse at the 99th percentile - one hit 350ms, over 35x the recommended maximum.
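(As an aside: if you already have an etcd cluster running, etcdctl can run a rough benchmark itself, though it exercises the whole write path rather than just fdatasync. A sketch, assuming the v3 API:

# rough built-in benchmark against a running etcd cluster (v3 API);
# measures end-to-end put throughput/latency, not raw disk fsync
ETCDCTL_API=3 etcdctl check perf
)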
On my recent desktop, which has a late-model consumer SSD (it might even be one of the models from the IBM test; I can't remember exactly), the latency was around 1-2ms at the 99th percentile.
All tests were done with LUKS full disk encryption enabled on the bare metal OS but not in the VMs (the unencrypted VM disks sit on a LUKS-encrypted host disk).
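(If you want to double-check which layers are encrypted when reproducing this, lsblk shows the crypt mapping; a quick sketch, column names per util-linux lsblk:

# LUKS containers show FSTYPE=crypto_LUKS; the opened mapper shows TYPE=crypt
lsblk -o NAME,TYPE,FSTYPE,MOUNTPOINT
)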
Here's one of the VMs on a slow server:
redacted@redacted:~$ mkdir test-data
redacted@redacted:~$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=2300 --name=mytest
mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.16
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][100.0%][w=76KiB/s][w=34 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=1350762: Sun Mar 20 17:24:54 2022
write: IOPS=32, BW=72.7KiB/s (74.4kB/s)(21.0MiB/310060msec); 0 zone resets
clat (usec): min=16, max=2166, avg=47.97, stdev=45.68
lat (usec): min=16, max=2167, avg=48.86, stdev=45.75
clat percentiles (usec):
| 1.00th=[ 26], 5.00th=[ 33], 10.00th=[ 37], 20.00th=[ 39],
| 30.00th=[ 40], 40.00th=[ 42], 50.00th=[ 44], 60.00th=[ 50],
| 70.00th=[ 52], 80.00th=[ 53], 90.00th=[ 57], 95.00th=[ 63],
| 99.00th=[ 99], 99.50th=[ 141], 99.90th=[ 388], 99.95th=[ 1237],
| 99.99th=[ 2040]
bw ( KiB/s): min= 22, max= 130, per=100.00%, avg=72.00, stdev=16.49, samples=619
iops : min= 10, max= 58, avg=32.13, stdev= 7.37, samples=619
lat (usec) : 20=0.16%, 50=60.50%, 100=38.35%, 250=0.84%, 500=0.06%
lat (usec) : 750=0.03%
lat (msec) : 2=0.03%, 4=0.03%
fsync/fdatasync/sync_file_range:
sync (usec): min=1298, max=590465, avg=30856.35, stdev=32447.06
sync percentiles (usec):
| 1.00th=[ 1745], 5.00th=[ 2540], 10.00th=[ 2868], 20.00th=[ 3490],
| 30.00th=[ 6587], 40.00th=[ 11469], 50.00th=[ 21890], 60.00th=[ 30802],
| 70.00th=[ 41157], 80.00th=[ 53740], 90.00th=[ 73925], 95.00th=[ 91751],
| 99.00th=[130548], 99.50th=[158335], 99.90th=[242222], 99.95th=[274727],
| 99.99th=[362808]
cpu : usr=0.07%, sys=0.75%, ctx=28939, majf=0, minf=13
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=72.7KiB/s (74.4kB/s), 72.7KiB/s-72.7KiB/s (74.4kB/s-74.4kB/s), io=21.0MiB (23.1MB), run=310060-310060msec
Disk stats (read/write):
sda: ios=0/27055, merge=0/18520, ticks=0/338597, in_queue=308636, util=99.02%
That's 130ms at the 99th percentile for fdatasync - 13x the recommended maximum.
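Rather than eyeballing the percentile table, fio can emit JSON and you can pull the number out directly. A sketch, assuming fio 3.x's JSON layout (sync latency under jobs[].sync.lat_ns, reported in nanoseconds):

# same workload as above, but machine-readable output
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
    --size=22m --bs=2300 --name=mytest --output-format=json > fio.json
# 99th percentile fdatasync latency, converted to milliseconds
jq '.jobs[0].sync.lat_ns.percentile."99.000000" / 1000000' fio.json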
So I suspect the etcd -> sqlite datastore change that came with moving from microk8s to k3s is what made the difference.
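(For reference: k3s in single-server mode defaults to sqlite via its kine shim, and etcd is opt-in. A sketch of the relevant server flags, per the k3s docs:

# default single-server datastore: embedded sqlite (no etcd fdatasync pressure)
k3s server
# opt in to embedded etcd instead (this is where the 10ms p99 target matters)
k3s server --cluster-init
# or use an external datastore entirely
k3s server --datastore-endpoint='mysql://user:pass@tcp(dbhost:3306)/k3s'
)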
I haven't deployed the cluster across the servers yet; I'm going to try that next, and deliberately put the k3s master node on a VM on the worst bare metal server.
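In case anyone wants to follow along, the plan is roughly the standard k3s quick-start flow; a sketch with placeholder hostnames:

# on the VM chosen as the server/master
curl -sfL https://get.k3s.io | sh -
# the join token is written to /var/lib/rancher/k3s/server/node-token
# on each worker, join as an agent
curl -sfL https://get.k3s.io | K3S_URL=https://<server-vm>:6443 K3S_TOKEN=<token> sh -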