Resume of change stream was not possible

Do you have logs from before the failure? That might help pinpoint a resume token.

Are you seeing oplog warnings before the crash? If so, the oplog is probably too small, and the resume point rolled off it before the stream could pick up again.
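If it helps, here is a rough sketch of how you could measure the oplog window (the time span between the oldest and newest oplog entries) from Python. It assumes you pass in a pymongo `MongoClient` connected to a replica set member with read access to the `local` database; the function names are my own, not part of any library.

```python
def oplog_window_hours(first_ts_seconds: int, last_ts_seconds: int) -> float:
    """Hours covered between the oldest and newest oplog entries."""
    return (last_ts_seconds - first_ts_seconds) / 3600.0


def fetch_oplog_window_hours(client) -> float:
    """Read the oldest and newest entries of local.oplog.rs.

    `client` is assumed to be a pymongo MongoClient (hypothetical setup:
    connected to a replica set member, with read access to `local`).
    """
    oplog = client["local"]["oplog.rs"]
    first = oplog.find_one(sort=[("$natural", 1)])   # oldest entry
    last = oplog.find_one(sort=[("$natural", -1)])   # newest entry
    # `ts` is a BSON Timestamp; `.time` is seconds since the Unix epoch.
    return oplog_window_hours(first["ts"].time, last["ts"].time)
```

If the window is shorter than the longest outage or pause you expect on the consumer side, a resume will probably fail the same way again.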

Also, is the EC2 instance provisioned with high-IOPS disks? That might help with sync performance. If you're using an EBS volume, it probably needs provisioned IOPS.

  • Run iostat -x or iotop to check disk usage.
  • If the await time (I/O wait time) is high, the disk is likely too slow.
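If you would rather not eyeball the output, something like this could flag slow devices from saved `iostat -x` text. Treat it as a sketch: column names differ between sysstat versions (`await`, `r_await`, `w_await`, ...), so it looks at every column whose name ends in `await`, and the 20 ms threshold is just my assumption, not an official limit.

```python
def slow_devices(iostat_text: str, threshold_ms: float = 20.0) -> dict:
    """Map device name -> worst await value (ms) above the threshold."""
    slow = {}
    await_cols = []  # indexes of the *await columns in the current device table
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("avg-cpu"):
            await_cols = []  # CPU section: not a device table
            continue
        if fields[0].rstrip(":") == "Device":
            await_cols = [i for i, name in enumerate(fields)
                          if name.endswith("await")]
            continue
        if await_cols and len(fields) > max(await_cols):
            try:
                worst = max(float(fields[i]) for i in await_cols)
            except ValueError:
                continue  # not a numeric device row
            if worst > threshold_ms:
                slow[fields[0]] = worst
    return slow
```

You could feed it the output of `iostat -x` captured while the sync is running, since an idle disk will not show the problem.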

How about the source cluster… is it possible to use secondary nodes to avoid additional pressure on the primary?
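One way to try that, assuming your sync tool takes a standard MongoDB connection string and honors its options (I don't know that it does), is to set `readPreference=secondaryPreferred` on the source URI. A small helper for that, with a made-up name:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_secondary_reads(uri: str) -> str:
    """Return the URI with readPreference=secondaryPreferred set."""
    parts = urlsplit(uri)
    # Drop any existing readPreference so the option is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() != "readpreference"]
    query.append(("readPreference", "secondaryPreferred"))
    path = parts.path or "/"  # options must follow a "/" in MongoDB URIs
    return urlunsplit((parts.scheme, parts.netloc, path,
                       urlencode(query), parts.fragment))
```

`secondaryPreferred` falls back to the primary when no secondary is available, so it should not break the connection outright. Keep in mind that a lagging secondary would add its own delay.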