Hello all,

This post explains vSAN congestion and provides some details about it. Congestion is a flow control mechanism used by vSAN. It occurs when the I/O rate of the lower layers of the storage subsystem fails to keep up with the I/O rate of the higher layers.

Congestion is a feedback mechanism that reduces the rate of incoming IO requests from the vSAN DOM client layer to a level that the vSAN disk groups can service. The incoming rate is reduced by introducing an IO delay equivalent to the delay the IO would have incurred because of the bottleneck at the lower layer. It is therefore an effective way to shift latency from the lower layers to the ingress without changing the overall throughput of the system. This avoids unnecessary queuing and tail-dropped queues in the vSAN LSOM layer, and so avoids wasting CPU cycles on IO requests that might eventually be dropped. Hence, regardless of the type of congestion, small and temporary congestion values are usually OK and not detrimental to system performance. However, large and sustained congestion values may lead to higher latency and lower throughput than desired, and therefore warrant attention and resolution in order to get better benchmark performance.

To check whether an ESXi host is experiencing vSAN congestion, you can run the following script (per host):
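
A minimal sketch of such a per-host check, written for the ESXi shell. It assumes that `esxcli vsan storage list` reports a "VSAN Disk Group UUID" line for each disk and that the LSOM info node in `vsish` exposes one counter per congestion type. The helper name `vsan_congestion` is hypothetical, and this is not necessarily the exact script the post originally showed.

```shell
# Sketch: print the congestion counters for every vSAN disk group on this host.
# Assumes the ESXi shell tools esxcli and vsish are available.
# vsan_congestion is a hypothetical helper name, not part of ESXi.
vsan_congestion() {
    # Every disk reports the UUID of its disk group; sort -u keeps one
    # entry per disk group.
    for group in $(esxcli vsan storage list | grep "Group UUID" | awk '{print $5}' | sort -u); do
        echo "Disk group: $group"
        # Keep only the congestion counters from the LSOM info node.
        vsish -e get "/vmkModules/lsom/disks/$group/info" | grep Congestion
    done
}
```

Calling `vsan_congestion` over SSH on each host prints one block per disk group. Values that stay at 0 or rise only briefly are usually nothing to worry about; values that stay high are the ones worth investigating.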


 

It helps to understand the following vSAN congestion types:

  • Slab Congestion: This originates in vSAN internal operation slabs. It occurs when the number of in-flight operations exceeds the capacity of the operation slabs.
  • Comp Congestion: This occurs when the size of an internal table used for vSAN object components exceeds its threshold.
  • SSD Congestion: This occurs when the write buffer space on the cache tier disk runs out.
  • Log Congestion: This occurs when the vSAN internal log space on the cache tier disk runs out.
  • Mem Congestion: This occurs when the memory heap used by vSAN internal components exceeds its threshold.
  • IOPS Congestion: IOPS reservations/limits can be applied to vSAN object components. This occurs when component IOPS exceed their reservations and disk IOPS utilization is at 100%.

Use the following commands at your own risk. If your vSAN reports LSOM errors in the vSAN logs, these advanced settings can be changed to reduce vSAN congestion.

esxcfg-advcfg -s 16 /LSOM/lsomLogCongestionLowLimitGB  – (default 8).
esxcfg-advcfg -s 24 /LSOM/lsomLogCongestionHighLimitGB  – (default 16).

esxcfg-advcfg -s 10000 /LSOM/diskIoTimeout
esxcfg-advcfg -s 4 /LSOM/diskIoRetryFactor

esxcfg-advcfg -s 32768 /LSOM/initheapsize
esxcfg-advcfg -s 2048 /LSOM/heapsize
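
Before changing any of these values, it can be useful to record what the host currently has, so the change can be reverted. The sketch below reads the two log congestion limits with the `-g` (get) flag of `esxcfg-advcfg`. The helper name `show_lsom_settings` is hypothetical, and the option list can be extended with the other paths used above.

```shell
# Sketch: print the current value of the log congestion limits before
# changing them. show_lsom_settings is a hypothetical helper name.
show_lsom_settings() {
    for option in \
        /LSOM/lsomLogCongestionLowLimitGB \
        /LSOM/lsomLogCongestionHighLimitGB; do
        # -g reads an advanced option; -s (used above) sets it.
        esxcfg-advcfg -g "$option"
    done
}
```

Run `show_lsom_settings` once before and once after applying the `-s` commands to confirm that the new values took effect on that host.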

Official VMware KB: https://kb.vmware.com/s/article/2150260

—– Edited March 5th —–

 

1 Comment

  1. E.F.

    Thanks for the script – VMWare’s own KB doesn’t have any info on how to look it up, only that their technicians will have to get involved to discern the cause of it. Love this blog.
