Job monitoring/management commands

You can use the “jobstat” command to monitor your jobs and storage usage on Artemis. For example:

[abcd1234@login3 ~]$ jobstat
Job Summary for user abcd1234
                                                    Requested -------------------------               Elapsed  ------------------------------
Job ID---- Queue--- Job Name---- Project--- State-- Chunks Cores GPU       RAM Walltime  Start Time-- CPU Hours   CPU% Progress End Time
2269135    large    job1         PANDORA    Running      1    24   -     1.0Gb  20d 20h  11-Jun 04:45    1297.2  99.9%    10.8% 02-Jul 00:45
2281499    large    job2         PANDORA    Running      1    24   -     1.0Gb  20d 20h  01-Jun 20:46    6654.6  99.7%    55.6% 22-Jun 16:46
2281502    large    job3         PANDORA    Running      1    24   -     1.0Gb  20d 20h  01-Jun 20:48    6649.7  99.7%    55.6% 22-Jun 16:48
 * Times with an asterix are estimates only
 * End time is start time + walltime so job may finish earlier
 * Progress is accumulated walltime vs specified walltime - so see above

System Status --------------------------------------------------------------------------------------------------------
CPU hours for jobs currently executing: 1033302.8
CPU hours for jobs queued:              176983.7
Storage Quota Usage ------------------------------------------------
/home                             abcd1234       9.018G          10G
/project               RDS-TEST-PANDORA-RW       184.2G           1T
Storage Usage (Filesystems totals) ---------------------------------
Filesystem Used     Free
/scratch   378.1Tb  6.6%
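
The notes under the job table can be checked by hand. The sketch below recomputes the End Time and Progress columns for job 2269135 from the row shown above. The year is not printed by jobstat, so 2024 is an arbitrary assumption, and estimating elapsed walltime as CPU hours / (cores x CPU%) is an inference from the sample numbers rather than documented jobstat behaviour. The helper name parse_walltime is hypothetical.

```python
from datetime import datetime, timedelta


def parse_walltime(text):
    """Convert a jobstat walltime such as '20d 20h' into a timedelta."""
    days = hours = 0
    for part in text.split():
        if part.endswith("d"):
            days = int(part[:-1])
        elif part.endswith("h"):
            hours = int(part[:-1])
    return timedelta(days=days, hours=hours)


# Values copied from the jobstat row for job 2269135.
walltime = parse_walltime("20d 20h")
start = datetime(2024, 6, 11, 4, 45)  # year is an assumption; jobstat omits it
cpu_hours = 1297.2
cores = 24
cpu_fraction = 0.999  # the CPU% column, 99.9%

# End time is start time + requested walltime.
end = start + walltime

# Assumed relation: elapsed walltime = CPU hours / (cores * CPU%).
elapsed_hours = cpu_hours / (cores * cpu_fraction)
walltime_hours = walltime.total_seconds() / 3600
progress = elapsed_hours / walltime_hours

print(end.strftime("%d-%b %H:%M"))
print(f"{progress:.1%}")
```

Under these assumptions the script reproduces the 02-Jul 00:45 end time and the 10.8% progress figure from the table. Because the end time is derived from the requested walltime, a job that finishes early will end before the time shown.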

Alternatively, you can use standard PBS Professional “qstat” commands to monitor jobs. A brief set of useful commands is shown below. For more commands, see the PBS Professional user manual.

Command               Description
qstat -u abcd1234     show status of abcd1234’s jobs
qdel 1234567          delete job 1234567 from queue
qstat                 show status of all jobs
qstat -f 1234567      show detailed stats for job 1234567
qstat -xf 1234567     show detailed stats for job 1234567, even after it has finished

When a job finishes, it produces three output files: one for standard output, one for standard error, and a resource usage file. The files are named as follows:

<JobName>.o<JobID>       – Standard output file
<JobName>.e<JobID>       – Standard error file
<JobName>.o<JobID>_usage – Resource usage file
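
If you post-process results in a script, the naming pattern above can be used to locate a job's files. This is a minimal sketch: the function name output_file_names is hypothetical, and it assumes that <JobID> is the numeric part of the job ID (for example 1050977), which you should confirm against your own output files.

```python
def output_file_names(job_name, job_id):
    """Build the three expected output file names for a finished job."""
    stdout_file = f"{job_name}.o{job_id}"
    return {
        "stdout": stdout_file,
        "stderr": f"{job_name}.e{job_id}",
        # The usage file is the standard output name plus a "_usage" suffix.
        "usage": f"{stdout_file}_usage",
    }


names = output_file_names("TestJob", 1050977)
for kind, name in names.items():
    print(f"{kind}: {name}")
```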

If you don’t redirect standard output or standard error to a file, they are written to the .o and .e files respectively, and these files only appear after your job finishes. They may contain useful information about why a job terminated early.

The resource usage file records how long your job ran and how much memory it used. You can use this information to optimise your walltime and memory requests for future jobs. An example resource usage file is shown below:

Job Id: 1050977.pbsserver for user abcd1234 in queue small
Job Name: TestJob
Exit Status: 0
Walltime requested:   00:03:00 :      Walltime used:   00:01:36
    Cpus requested:         48 :
          Cpu Time:   00:36:38 :        Cpu percent:       3102
     Mem requested:        8gb :           Mem used:  2342348kb
    VMem requested:       None :          VMem used:  2342348kb
    PMem requested:       None :          PMem used:       None
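
To turn a usage file like the one above into numbers you can act on, a short script can compare what was requested with what was used. The following is a sketch only: the function names are hypothetical, the parser assumes the "key: value" layout shown in the example, and the memory conversion assumes PBS units are binary (1gb = 1024 x 1024 kb).

```python
import re

# Sample text copied from the example usage file.
USAGE_TEXT = """\
Job Id: 1050977.pbsserver for user abcd1234 in queue small
Job Name: TestJob
Exit Status: 0
Walltime requested:   00:03:00 :      Walltime used:   00:01:36
    Cpus requested:         48 :
          Cpu Time:   00:36:38 :        Cpu percent:       3102
     Mem requested:        8gb :           Mem used:  2342348kb
"""


def parse_usage(text):
    """Return a dict of the 'key: value' pairs found in a usage file."""
    fields = {}
    for line in text.splitlines():
        # A colon with whitespace before it separates two columns.
        for chunk in re.split(r"\s:(?:\s|$)", line):
            key, sep, value = chunk.partition(":")
            if sep and key.strip():
                fields[key.strip()] = value.strip()
    return fields


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def size_to_kb(text):
    """Convert a size such as '8gb' to kb, assuming binary units."""
    factors = {"kb": 1, "mb": 1024, "gb": 1024 ** 2, "tb": 1024 ** 3}
    number, unit = re.fullmatch(r"(\d+)(kb|mb|gb|tb)", text.lower()).groups()
    return int(number) * factors[unit]


fields = parse_usage(USAGE_TEXT)
wall_used = hms_to_seconds(fields["Walltime used"])
wall_requested = hms_to_seconds(fields["Walltime requested"])
cpus = int(fields["Cpus requested"])

# Fraction of the requested walltime that was actually used.
walltime_fraction = wall_used / wall_requested
# CPU time consumed relative to what the requested cores could deliver.
cpu_efficiency = hms_to_seconds(fields["Cpu Time"]) / (wall_used * cpus)
# Fraction of the requested memory that was actually used.
mem_fraction = size_to_kb(fields["Mem used"]) / size_to_kb(fields["Mem requested"])

print(f"walltime used:  {walltime_fraction:.1%}")
print(f"cpu efficiency: {cpu_efficiency:.1%}")
print(f"memory used:    {mem_fraction:.1%}")
```

For the example job, roughly half of the requested walltime and under a third of the requested memory were used, which suggests that both requests could be reduced for similar future jobs while still leaving a safety margin.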