Performance Issues

The following is a list of possible performance issues you may encounter.

When I run tens of jobs simultaneously, OpCenter performance is fine, but when I run hundreds of jobs simultaneously, OpCenter performance slows to unacceptable levels. Why?

This is an indication of insufficient resources. Check the following items.

  • Enter float config get scheduler.jobExecutorLimit to make sure the value is large enough to serve the requests (see the example after this list)
  • Use the "OpCenter Watcher" in the web interface to determine where the bottlenecks are (CPU, memory, IOPS, network, and so on)
  • Inspect opcenter.log to see whether any quota limits are reached
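For example, the job executor limit can be checked, and raised if needed, from the float CLI. This is a sketch: the set command and the value 400 are illustrative, so confirm the option name and an appropriate value for your OpCenter release.

    # Check the current job executor limit
    float config get scheduler.jobExecutorLimit

    # Raise the limit if it is too low for the number of concurrent jobs
    # (the value 400 is illustrative, not a recommendation)
    float config set scheduler.jobExecutorLimit 400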

My Nextflow pipeline kicks off a large number of jobs that use the same container image. The jobs are slow to start. Why?

There may be several causes. If the container image is large (1 GB or more), it can take several minutes to load from the image repository. Consider using a volume snapshot to cache a copy of the image. If the jobs run on popular instance types in a busy CSP region, new jobs can be slow to start because the OpCenter must go through an extended process to find an available instance that meets the requirements. Consider changing the instance requirements in the pipeline or moving the pipeline to a different region.
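To gauge whether image size is the issue, you can pull the image on any host that has Docker and check its size. The registry and image name below are placeholders for your own image.

    # Pull the container image and report its size; images of roughly 1 GB
    # or more are good candidates for caching in a volume snapshot
    docker pull my-registry.example.com/my-pipeline/tool:1.0
    docker images --format '{{.Repository}}:{{.Tag}}  {{.Size}}' \
        my-registry.example.com/my-pipeline/tool:1.0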

My job uses a large data set as input and it takes a long time to load. How can I speed things up?

Cloud-based object storage services, such as AWS S3, offer large capacity but relatively slow access speeds. Consider pre-staging the data on a high-performance local resource, such as an existing block storage device or an NFS server.
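As a sketch, the data set can be copied from S3 to an NFS mount (or an attached block storage volume) before the jobs are submitted. The bucket name and paths below are placeholders.

    # Pre-stage the input data set on a local, high-performance file system
    aws s3 sync s3://my-input-bucket/dataset/ /mnt/nfs/dataset/

    # Confirm the size of the staged copy
    du -sh /mnt/nfs/dataset/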

My pipeline jobs incur frequent reads and writes of data on AWS S3, and s3fs becomes extremely slow. How can I improve I/O performance?

Consider a high-performance distributed file system such as JuiceFS or Lustre, which can use S3 as the underlying data store.
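A minimal JuiceFS sketch is shown below. The S3 bucket, the Redis metadata database, the file system name, and the mount point are placeholders, and S3 credentials are assumed to come from the instance's IAM role or environment variables.

    # Create a JuiceFS file system that stores its data blocks in S3
    juicefs format \
        --storage s3 \
        --bucket https://my-juicefs-bucket.s3.us-east-1.amazonaws.com \
        redis://my-redis-host:6379/1 \
        pipelinefs

    # Mount it where the pipeline expects its work directory
    juicefs mount redis://my-redis-host:6379/1 /mnt/jfs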

My job incurs a lot of small file reads from, and writes to, a distributed file system. Performance is way below the advertised rates. How can I improve file system performance?

High-performance distributed file systems, such as JuiceFS, use caching to improve read-write performance. Consider tuning the caching parameters for your environment, for example, how large file writes are handled. For high volumes of small file writes, consider turning on write-back caching.
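For example, with JuiceFS, write-back caching and a larger local cache can be enabled at mount time. The cache directory, cache size, and metadata URL below are placeholders for your environment.

    # Mount with write-back caching so small writes land in the local cache
    # first and are uploaded to S3 asynchronously
    # (--cache-size is in MiB; 102400 MiB = 100 GiB)
    juicefs mount \
        --writeback \
        --cache-dir /mnt/nvme/jfscache \
        --cache-size 102400 \
        redis://my-redis-host:6379/1 /mnt/jfs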

When I scale my computational pipeline from a hundred to several hundred simultaneous jobs, the pipeline sometimes fails. Why?

AWS imposes Service Quotas on your account that limit the resources you can use at any given time. If the pipeline fails only at higher levels of concurrency, one of these quotas has likely been reached. Request an increase to your account's Service Quotas.
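The current quotas can be inspected, and an increase requested, with the AWS CLI. The quota codes shown are believed to correspond to the EC2 On-Demand and Spot vCPU limits but should be verified in the Service Quotas console, and the desired value is illustrative.

    # Check the current limit on running On-Demand Standard instance vCPUs
    aws service-quotas get-service-quota \
        --service-code ec2 \
        --quota-code L-1216C47A

    # Request a higher limit for Standard Spot instance vCPUs
    aws service-quotas request-service-quota-increase \
        --service-code ec2 \
        --quota-code L-34B43A08 \
        --desired-value 640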

I use an allow list to constrain the VM instance types, but sometimes jobs are slow to start. Why?

If the allow list overly constrains the allowed instance types and the OpCenter runs in a busy region, the availability of Spot instances that meet the constraints can be low. Loosen the constraints of the allow list or use an OpCenter in a different region.