I’ve seen this kind of problem with burstable resources in AWS. This is how things usually proceed:
- There is a sudden spike of incoming work, which is unusual for the cluster.
- Performance keeps up for some time, but the incoming work keeps coming in.
- The disk/CPU runs out of burst credit due to the sustained level of incoming work.
- AWS restricts the disk/CPU performance since the instance has run out of burst credit.
- The node stays at its “baseline” performance. At this point everything stalls, the node looks underutilized, and very little of the work backlog can be processed since node performance is restricted by AWS.
- The incoming work stops, the node slowly catches up with the backlog and accumulates burst credit again, and eventually things go back to normal.
- A new batch of incoming work arrives, and we’re back at the first step.
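To make the cycle above concrete, here is a rough token-bucket sketch in Python. It is only an illustration of the general idea, not AWS's actual accounting: the function name `simulate`, the tick-based model, and all the numbers are made up for the example.

```python
def simulate(load, baseline, burst, capacity):
    """Simplified burst-credit model (an assumption, not AWS's exact rules).

    load:     units of work arriving at each tick
    baseline: units the node may serve per tick without credits
    burst:    units the node may serve per tick while it has credits
    capacity: size of the credit bucket (assumed to start full)
    Returns a list of (served, credits, backlog) tuples, one per tick.
    """
    credits = capacity
    backlog = 0
    history = []
    for demand in load:
        backlog += demand
        # The burst ceiling applies only while credits remain.
        ceiling = burst if credits > 0 else baseline
        served = min(backlog, ceiling)
        backlog -= served
        # Credits refill at the baseline rate and drain by the work served,
        # clamped to the range [0, capacity].
        credits = max(0, min(capacity, credits + baseline - served))
        history.append((served, credits, backlog))
    return history


if __name__ == "__main__":
    # Hypothetical workload: five busy ticks followed by five idle ticks.
    for tick, row in enumerate(simulate([10] * 5 + [0] * 5, 2, 10, 16), 1):
        print(tick, row)
```

In this toy run the node serves at the burst rate until the bucket is empty, then drops to the baseline while the backlog keeps growing. Credits only start to recover once the work served falls below the baseline, which is why the node looks idle but stays slow.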
Since you’re using gp2 disks, I think you’re running into this burst credit issue. It is explained in the gp2 section of the Amazon EBS volume types documentation.
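As a back-of-the-envelope check, this sketch estimates how long a gp2 volume can sustain its burst rate. The constants are the gp2 figures as I understand them from the AWS documentation (3 IOPS per GiB baseline with a 100 IOPS floor, 3,000 IOPS burst, 5.4 million I/O credits in a full bucket), so please verify them against the current docs. The function names and the 100 GiB example size are hypothetical.

```python
# gp2 figures assumed from the AWS documentation; verify before relying on them.
GP2_IOPS_PER_GIB = 3           # baseline IOPS scale with volume size
GP2_MIN_BASELINE = 100         # floor for small volumes
GP2_MAX_BASELINE = 16_000      # ceiling for large volumes
GP2_BURST_IOPS = 3_000         # burst rate while credits remain
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size in GiB."""
    return max(GP2_MIN_BASELINE,
               min(GP2_MAX_BASELINE, GP2_IOPS_PER_GIB * size_gib))


def gp2_burst_seconds(size_gib):
    """Seconds a full credit bucket lasts at the full burst rate.

    Returns None when the baseline already meets or exceeds the burst
    rate, in which case burst credits no longer limit the volume.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None
    # Credits drain at (burst - baseline) per second while bursting.
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


if __name__ == "__main__":
    size = 100  # hypothetical volume size in GiB
    print(gp2_baseline_iops(size), gp2_burst_seconds(size))
```

Under these assumptions a 100 GiB volume has a 300 IOPS baseline and can burst for roughly half an hour from a full bucket, so a load test that runs longer than that at full speed would be expected to hit the throttle.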
Note that in most cases, this burst credit system is actually beneficial. It allows you to handle a sudden spike in work without needing to provision larger hardware for a once-in-a-while event. However, the idea is that the work is a spike, not a sustained load.
If you expect the normal state of the cluster to be consistently as busy as your load tests, then provisioned IOPS disks are probably a better choice than gp2 volumes.
Best regards,
Kevin