When will my job start?

Artemis’s job scheduler determines when and where jobs will be run. The scheduler is a live system and regularly re-prioritises work based on the following considerations:

Job Size - CPUs and memory requested

  • If you submit a job asking for more than 288 cores, it will never run.
  • If you have currently running jobs, your queued jobs cannot start unless the resulting total number of cores used would still remain below the 288 core limit (e.g. your 90 core job can’t start if you are already using 200 cores).
  • If you submit a job asking for more resources than are available (e.g., memory), it will never run. Asking for a relatively large resource allocation (e.g., lots of CPUs instead of just a few, or all CPUs on a single node) means the scheduler must wait for current jobs to complete and schedule future jobs so as to leave a “hole” for your job. This may result in a wait even though resources appear to be free.
    • For example, asking for 240 cores on 10 nodes will require the scheduler to wait for any and all jobs to finish on 10 nodes (approx. 1/5th of the total capacity) before your job can run, even though there may already be 240 cores free across the entire cluster.
    • Freeing this level of contiguous resource can take time, as there may be a mixture of long running and short running jobs previously scheduled and running.
    • If you “batch up” jobs that can each run individually but collectively consume significant resources, the system will run them in a suitable manner (i.e., keeping you below the 288 core limit) and may also lower the priority of later jobs due to fair share. This means that while some of your jobs are running, others may end up with long wait times.
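
The core-limit rule in the list above comes down to simple arithmetic. The following is a minimal Python sketch, not part of Artemis or its scheduler. The function name `can_start` is hypothetical, and the sketch assumes that a total of exactly 288 cores is permitted, which this section does not state explicitly.

```python
CORE_LIMIT = 288  # per-user core limit quoted in this section


def can_start(running_cores: int, requested_cores: int,
              limit: int = CORE_LIMIT) -> bool:
    """Return True if the queued job fits under the per-user core limit.

    Assumption: a total of exactly `limit` cores is allowed.
    """
    return running_cores + requested_cores <= limit


# The example from the list: 200 cores in use, 90 more requested.
print(can_start(200, 90))  # False, since 200 + 90 = 290 exceeds 288
print(can_start(200, 80))  # True, since 200 + 80 = 280
print(can_start(0, 300))   # False, a single job over the limit never runs
```

This models only the per-user core limit. Whether a job actually starts also depends on free nodes, the wall time requested and fair share.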

Time (or walltime) requested

  • Jobs that request longer wall times (that is, run times measured in real time, by the clock on the wall) will generally take longer to start than jobs that request less wall time.
  • In particular, jobs that request more than 1 week of wall time will run in the “large” queue and take a long time to start. Jobs that request less than 24 hours of wall time will typically start very quickly, unless Artemis is very busy.
  • Some jobs may finish sooner than their requested wall time. This means that your estimated start time may move earlier (if other jobs finish early, are cancelled, fail, etc.) or later (if newly scheduled jobs are placed ahead of yours under fair share).

Capacity limits

  • Be aware of the core and memory limits for each node; asking for more than is available may mean your job will never run.
  • Be aware of system overheads when requesting memory - 128GB nodes have closer to 123GB available.
  • There are limits on the number of concurrent jobs per user.
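
As a rough pre-submission check on the memory overhead mentioned above, the sketch below compares a per-node memory request with the usable figure. It is illustrative Python only: the lookup table and the function name `fits_on_node` are hypothetical, and the only figure taken from this section is that 128GB nodes have about 123GB available.

```python
# Approximate usable memory (GB) keyed by nominal node size (GB).
# Only the 128 -> 123 figure comes from this section.
USABLE_MEM_GB = {128: 123}


def fits_on_node(request_gb: float, nominal_gb: int = 128) -> bool:
    """Return True if a per-node memory request fits in usable memory."""
    return request_gb <= USABLE_MEM_GB[nominal_gb]


print(fits_on_node(128))  # False: the full nominal 128GB is not available
print(fits_on_node(120))  # True
```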

Fair Share

“Fair Share” assigns priority to jobs based on each project’s recent usage of the system. If a project has recently used a lot of CPU time, the priority of its future jobs, relative to other projects, will be reduced. Once a job starts running, it is allowed to complete and is unaffected by fair share.

Fair share only has an impact when there is contention for resources. Fair Share is calculated at a project level, so if one member of a project uses a lot of CPU time, future jobs submitted by that project will have lower priority.

Different queues (see the Job Queues section for a description of each queue) have different fair share weightings. The small, normal and large queues have a fair share weight of 10, which is considered to be the “standard” fair share weighting. The high memory and GPU queues, however, have a fair share weight of 50. If you request excessive resources (for example, too much memory), your job may be placed in a queue with a higher fair share weighting.

In addition to accumulating with usage, fair share decays with a “half-life” of 2 weeks. If you were to stop or reduce your use of Artemis, your accumulated fair share would decrease and the priority of your future jobs would increase.
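
A 2-week half-life means a usage figure halves for every 14 days without new usage. A minimal Python sketch follows, assuming simple exponential decay of a single usage figure. The function name `decayed_usage` is hypothetical, and the scheduler's actual calculation may differ in detail.

```python
HALF_LIFE_DAYS = 14.0  # the 2-week half-life quoted in this section


def decayed_usage(usage: float, days_elapsed: float,
                  half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Return the usage figure remaining after exponential decay."""
    return usage * 0.5 ** (days_elapsed / half_life_days)


# 1000 CPU hours of recorded usage, with no further jobs submitted:
print(decayed_usage(1000, 14))  # 500.0 after one half-life
print(decayed_usage(1000, 28))  # 250.0 after two half-lives
```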

Note

Fair Share likely won’t affect your job priority unless you’re submitting more than 30,000 CPU hours of work every month. Generally, the sooner you submit your jobs, the sooner they will run.