QTM Research Computing
The QTM department purchased a server with eight NVIDIA RTX A6000 (Ampere) GPUs in the summer of 2021. This website provides information about the "qtm.es.emory.edu" server and research computing.
Technical Specifications
The Exxact server has two Intel Xeon Gold 6248R processors, each with 24 cores and 48 threads. The system has one terabyte of memory and over 50TB of storage, made up of several sets of SSD drives.
This server has eight NVIDIA RTX A6000 GPUs, each with 48GB of RAM.
Login
Only authorized QTM users may log in to the "qtm" server. If you cannot log in to this server, please email sandeep.soni@emory.edu and request access.
You must use a valid NetID and connect from campus or via the Emory VPN in order to reach the "qtm" server. You can find information about the Emory VPN here: https://it.emory.edu/security/vpn.html
Use ssh to log in to the server (ssh -l NetID qtm.es.emory.edu). If you are using macOS or Linux, use Terminal. If you are using Windows, use PowerShell or an SSH client such as MobaXterm.
If you would like to keep the ssh connection alive, please use the following command:
ssh -o ServerAliveInterval=900 -l NetID qtm.es.emory.edu
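If you connect frequently, you can also put this setting in your SSH configuration file so that every connection to the server stays alive. A minimal sketch of a ~/.ssh/config entry (the "qtm" alias is just an example name):
Host qtm
    HostName qtm.es.emory.edu
    User NetID
    ServerAliveInterval 900
With this entry in place, "ssh qtm" connects with the keep-alive setting applied.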
Home Directory
The location of your home directory is "/home/" followed by your NetID. For example, Sandeep's NetID is ssoni26, so the location of Sandeep's home directory is: /home/ssoni26
Your home directory has an initial disk quota of 200GB. This quota can be increased if needed.
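To check how much space your files are currently using, you can run the standard du command against your home directory, for example:
du -sh /home/NetID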
The permissions of your home directory are closed by default so that other people cannot view your files. If you need to change the permissions, use the "chmod" command; run "man chmod" for information about Linux permissions.
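For example, if you wanted to let other users on the server read (but not modify) the files in one subdirectory of your home directory, one possible approach is sketched below; "shared" is a hypothetical directory name:
chmod o+x /home/NetID
chmod -R o+rX /home/NetID/shared
The first command lets others traverse your home directory without listing it; the second makes the "shared" directory and its contents readable.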
Daily online backups (using snapshots) are available in your home directory in this location:
/home/NetID/.zfs/snapshot
For example, you can "cd /home/NetID/.zfs/snapshot/backuphomedir20210928" and look at the state of your files and directories as of midnight on September 28, 2021.
You can then copy files from the snapshot directory to your home directory if you need to recover a file.
There will be a limited number of snapshots available on the server. Please run the "ls" command in the /home/NetID/.zfs/snapshot directory to see which daily backups (using snapshots) are available.
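For example, to list the available snapshots and then restore a single file from one of them (the snapshot name, project directory, and file name below are placeholders):
ls /home/NetID/.zfs/snapshot
cp /home/NetID/.zfs/snapshot/backuphomedir20210928/project/results.csv /home/NetID/project/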
Scratch Directory
The location of your scratch directory is "/local/scratch/NetID".
Use the scratch directory, rather than your home directory, for computations. The scratch directory has an initial disk quota of 200GB, which can be increased if needed. The permissions of your scratch directory are closed.
Your scratch directory is not backed up. Please remove the contents of your scratch directory once your sequence of experiments is complete.
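A typical workflow, sketched below with a hypothetical experiment directory, is to create a working directory in scratch, copy the results you want to keep back to your home directory, and then clean up:
mkdir -p /local/scratch/NetID/experiment1
# run your computations inside /local/scratch/NetID/experiment1
cp -r /local/scratch/NetID/experiment1/results /home/NetID/
rm -rf /local/scratch/NetID/experiment1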
SLURM
The QTM server uses a batch system called SLURM to queue jobs. SLURM is a workload manager that schedules jobs based on available resources.
To use SLURM, you will need a few commands to submit, cancel, and view jobs. These commands and more details can be found in the SLURM documentation (https://slurm.schedmd.com/documentation.html), and a quick SLURM cheat sheet is available here: https://slurm.schedmd.com/pdfs/summary.pdf
sbatch
sbatch takes a shell script file as an argument and submits it to SLURM as a job. The script defines the resource parameters, job name, output file, etc. After the parameter lines, you can run commands as in any shell script.
Ex:
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --output=out_slurmtest
#SBATCH --gres=gpu:1
cd /path/to/working/directory/
source venv/bin/activate
python gpu.py
In the example above, --job-name defines the name of the job you are submitting, --output defines the file name for all output of the job, and --gres=gpu:1 defines how many GPUs you wish to use for the job.
If you need to use a virtual environment, you need to create it before running sbatch. Please see the Example Job Submission section below for instructions.
Please use 1 GPU per job at this time unless given permission from the technical staff.
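If your job also needs a run-time limit, CPU cores, or memory specified explicitly, SLURM provides additional #SBATCH directives such as --time, --cpus-per-task, and --mem. A minimal sketch extending the example above (the values shown are placeholders; please check with the technical staff for appropriate limits on this server):
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --output=out_slurmtest
#SBATCH --gres=gpu:1
#SBATCH --time=02:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
cd /path/to/working/directory/
source venv/bin/activate
python gpu.py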
squeue
The squeue command lists the jobs that are running or queued in the batch system. You can see the JOBID of the job you have submitted, the job state under the ST column, and, for pending jobs, the reason the job is waiting under NODELIST(REASON).
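For example, to list only your own jobs or to check on a specific job (the -u and -j options are standard squeue flags):
squeue -u NetID
squeue -j JOBID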
scancel
The scancel command takes a JOBID as an argument and cancels the job you have submitted. You can use squeue to get the JOBID.
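For example, to cancel a single job (12345 is a placeholder JOBID) or all of your own jobs:
scancel 12345
scancel -u NetID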
srun
By using the srun command, you can run SLURM with an interactive shell. This allows you to request GPU resources through SLURM and to interact with and debug your code. To do this, run the following command:
srun --pty --gres=gpu:1 -p debug bash
You can specify the number of GPUs with the --gres=gpu:# flag.
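For example, a short interactive session might look like the following: start the shell, confirm which GPU was assigned with nvidia-smi, and type exit when you are done so the resources are released.
srun --pty --gres=gpu:1 -p debug bash
nvidia-smi
exit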
Example Job Submission
To run a sample job submission, copy /usr/local/SLURM/gpu.py and /usr/local/SLURM/template.sh to a folder.
- Edit template.sh and change the first path to your current working directory. Set the number of GPUs that you would like to use in the gres=gpu line, specifying at most 4 GPUs (a sketch of the edited script appears after this list).
- Create a virtual env: "virtualenv -p python3 venv"
- Activate the virtualenv: "source venv/bin/activate"
- Install tensorflow: "pip install tensorflow-gpu" (please allow several minutes for the installation to complete)
- Deactivate the virtualenv: "deactivate"
- Submit the job: "sbatch template.sh"
- Check the job queue: "squeue"
- Check the output after the job completes: "cat out_slurmtest"
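Because template.sh lives on the server and its exact contents may differ, the following is only a sketch of what the edited script might look like, assuming it follows the sbatch example above; the working directory path is a placeholder:
#!/bin/bash
#SBATCH --job-name=slurmtest
#SBATCH --output=out_slurmtest
#SBATCH --gres=gpu:1
cd /home/NetID/slurm-example/
source venv/bin/activate
python gpu.py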
After the #SBATCH parameters, you can still activate and use a Python virtual environment. The output of the batch test should report the number of GPUs available to the job, as defined by the --gres=gpu:1 parameter. You can script all the commands you need to run your job.
Please do not use the CUDA_VISIBLE_DEVICES environment variable when using SLURM.