Installing PyTorch on the NYU Greene High Performance Computing Cluster

This guide provides an introduction to NYU’s High Performance Computing resources and demonstrates how to configure a PyTorch environment on the Greene cluster. For comprehensive documentation, please refer to the official NYU HPC Documentation.

Accessing the Compute Resources

Users with HPC access must connect to the cluster shell remotely. A connection to the NYU network is required, either through VPN or by being physically on-campus.

To log in to the HPC cluster, execute the following command. When prompted, enter the password associated with your NetID. If a fingerprint verification prompt appears, type yes to add the cluster to your list of trusted hosts.

ssh <netid>@greene.hpc.nyu.edu

Upon successful authentication, the shell prompt will indicate connection to the Greene cluster. The default working directory is /home/<netid>/ (or ~). The prompt displays the login node name, such as [<netid>@log-2 ~]$. There are three available login nodes: log-1, log-2, and log-3.

Next, connect to NYU HPC’s Google Cloud Platform (GCP) Burst nodes, which are designated for coursework. Note that the main Greene cluster should be reserved for research purposes only.

ssh burst

This establishes a connection to log-burst, the login node for the GCP Burst cluster. The prompt will update to [<netid>@log-burst ~]$. All subsequent steps should be executed on this cluster.

Installing Conda on HPC

While the official HPC documentation recommends Singularity for managing conda environments, the following alternative method provides a more straightforward setup process.

First, request a compute node on the GCP Burst platform. The parameters for this command are explained in detail in a later section.

srun \
  --account=rob_gy_6203-2022fa \
  --cpus-per-task=8 \
  --partition=interactive \
  --mem=16GB \
  --time=04:00:00 \
  --pty /bin/bash
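
Each flag in the command above names one part of the resource request. As a rough illustration, the sketch below assembles the same command from named values. `build_srun_command` and its parameter names are hypothetical helpers introduced here, not part of SLURM; only the flags themselves come from `srun`.

```python
import shlex


def build_srun_command(account, partition, cpus, mem, time_limit, gres=None):
    """Assemble an interactive srun command as a list of arguments.

    Hypothetical helper for illustration; only the flags come from srun.
    """
    args = [
        "srun",
        f"--account={account}",      # allocation the job is charged to
        f"--cpus-per-task={cpus}",   # CPU cores for the task
        f"--partition={partition}",  # pool of nodes to draw from
        f"--mem={mem}",              # memory requested per node
        f"--time={time_limit}",      # wall-clock limit, HH:MM:SS
    ]
    if gres:
        args.append(f"--gres={gres}")  # generic resources such as GPUs
    args += ["--pty", "/bin/bash"]     # attach an interactive shell
    return args


if __name__ == "__main__":
    cmd = build_srun_command(
        account="rob_gy_6203-2022fa",
        partition="interactive",
        cpus=8,
        mem="16GB",
        time_limit="04:00:00",
    )
    # Print the command rather than running it, so it can be inspected first.
    print(shlex.join(cmd))
```

Passing `gres="gpu:v100:1"` together with a GPU partition name yields the GPU variant of the request used later in this guide.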

Step 1: Create the Conda Installation Directory

To stay within the 50GB home directory quota, install conda in the scratch space.

mkdir /scratch/<NetID>/miniconda3
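
To see how close a directory tree is to the 50GB figure mentioned above, one option is to add up file sizes directly. This is only a sketch: the function names are hypothetical, the quota is assumed to be counted in binary gigabytes, and the cluster's own quota accounting may differ from a simple sum of file sizes.

```python
import os

# 50GB home quota stated in this guide; binary units are an assumption.
HOME_QUOTA_BYTES = 50 * 1024**3


def directory_size_bytes(root):
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; leave it out of the estimate.
                continue
    return total


def quota_fraction(used_bytes, quota_bytes=HOME_QUOTA_BYTES):
    """Return the fraction of the quota consumed by used_bytes."""
    return used_bytes / quota_bytes
```

For example, `quota_fraction(directory_size_bytes(os.path.expanduser("~")))` estimates the share of the home quota in use. Walking a large tree can take a while.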

Step 2: Download and Install Miniconda

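wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# -b: run non-interactively; -u: allow installing into the existing directory from Step 1; -p: install prefix
bash Miniconda3-latest-Linux-x86_64.sh -b -u -p /scratch/<NetID>/miniconda3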

Step 3: Create an Activation Script

Create a script named env.sh in /scratch/<NetID>/:

touch /scratch/<NetID>/env.sh

Populate env.sh with the following content using a text editor such as vim, nano, or emacs:

#!/bin/bash

source /scratch/<NetID>/miniconda3/etc/profile.d/conda.sh
export PATH=/scratch/<NetID>/miniconda3/bin:$PATH
export PYTHONPATH=/scratch/<NetID>/miniconda3/bin:$PYTHONPATH

To activate the conda package manager, execute:

source /scratch/<NetID>/env.sh
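
A quick way to confirm that sourcing the script worked is to check that `conda` and `python` now resolve to files inside the Miniconda directory. The sketch below does this from Python. `resolves_under` is a hypothetical helper, and the prefix is the placeholder path from this guide, so substitute your own NetID.

```python
import os
import shutil


def resolves_under(executable, prefix):
    """Return True if `executable` on PATH resolves to a file inside `prefix`."""
    found = shutil.which(executable)
    if found is None:
        return False
    # Resolve symlinks so both paths are absolute and comparable.
    found = os.path.realpath(found)
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([found, prefix]) == prefix


if __name__ == "__main__":
    # Placeholder path from this guide; replace <NetID> with your own.
    conda_prefix = "/scratch/<NetID>/miniconda3"
    for tool in ("conda", "python"):
        ok = resolves_under(tool, conda_prefix)
        print(f"{tool}: {'ok' if ok else 'not from miniconda3'}")
```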

By default, conda environments and packages are stored in /scratch/<NetID>/miniconda3.

To enable the conda activate command, initialize conda for shell integration, then start a new shell (or run source ~/.bashrc) so that the change takes effect:

conda init

Creating a PyTorch Environment

This section demonstrates the installation of PyTorch v1.13.0 within a conda environment.

Connect to Greene via ssh <netid>@greene.hpc.nyu.edu, then to the Burst cluster via ssh burst. Request a GPU-enabled compute node:

srun \
  --account=rob_gy_6203-2022fa \
  --cpus-per-task=8 \
  --partition=n1s8-v100-1 \
  --mem=16GB \
  --gres=gpu:v100:1 \
  --time=04:00:00 \
  --pty /bin/bash

Once connected to the compute node, create and configure the conda environment:

conda create -n test python=3.9 -y
conda activate test
conda install pytorch==1.13.0 torchvision==0.14.0 pytorch-cuda=11.7 \
  -c pytorch \
  -c nvidia
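
After the install finishes, it can be worth confirming that the environment holds the intended build. The sketch below compares the installed PyTorch version and its CUDA build against the values this guide targets. `check_install` and `version_matches` are hypothetical helper names, and the script returns a message instead of failing if `torch` is not importable.

```python
import importlib.util

EXPECTED_TORCH = "1.13.0"  # version this guide targets
EXPECTED_CUDA = "11.7"     # matches pytorch-cuda=11.7 in the install command


def version_matches(installed, expected):
    """Compare versions, ignoring local build suffixes such as '+cu117'."""
    return installed.split("+")[0] == expected


def check_install():
    """Return a list of human-readable problems; an empty list means all good."""
    if importlib.util.find_spec("torch") is None:
        return ["torch is not importable in this environment"]
    import torch

    problems = []
    if not version_matches(torch.__version__, EXPECTED_TORCH):
        problems.append(f"torch {torch.__version__}, expected {EXPECTED_TORCH}")
    if torch.version.cuda != EXPECTED_CUDA:
        problems.append(f"CUDA build {torch.version.cuda}, expected {EXPECTED_CUDA}")
    return problems


if __name__ == "__main__":
    issues = check_install()
    print("OK" if not issues else "\n".join(issues))
```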

Requesting GPU Nodes and Executing Code

There are two methods for running code on the cluster: interactive and non-interactive.

Interactive Mode

Interactive mode provides a terminal shell for direct command execution. To request a Tesla V100 GPU node with 8 CPUs for a 4-hour session:

srun \
  --account=rob_gy_6203-2022fa \
  --cpus-per-task=8 \
  --partition=n1s8-v100-1 \
  --gres=gpu:v100:1 \
  --time=04:00:00 \
  --pty /bin/bash
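
Once the interactive shell starts, SLURM exposes the allocation through environment variables, which gives a simple way to confirm that the session received what was requested. The sketch below reads a few of them. `describe_allocation` is a hypothetical helper, and whether `CUDA_VISIBLE_DEVICES` is set depends on how the cluster configures GPU resources, so treat that field as an assumption.

```python
import os


def describe_allocation(env=None):
    """Summarize the current SLURM allocation from environment variables.

    Pass a dict to inspect something other than os.environ (useful for testing).
    """
    env = os.environ if env is None else env
    return {
        "job_id": env.get("SLURM_JOB_ID"),                # None outside a job
        "partition": env.get("SLURM_JOB_PARTITION"),
        "cpus_per_task": env.get("SLURM_CPUS_PER_TASK"),
        "gpus_visible": env.get("CUDA_VISIBLE_DEVICES"),  # e.g. "0" for one GPU
    }


if __name__ == "__main__":
    for key, value in describe_allocation().items():
        print(f"{key}: {value if value is not None else '(not set)'}")
```

Run on a login node, every field shows as not set, which is a quick hint that the shell is not inside a job.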

Non-Interactive Mode

Non-interactive mode submits jobs to a queue managed by the SLURM workload manager:

sbatch test.sbatch

Example test.sbatch configuration:

#!/bin/bash
#SBATCH --account=rob_gy_6203-2022fa    # Account allocation
#SBATCH --partition=n1s8-v100-1         # GPU partition
#SBATCH --nodes=1                       # Number of compute nodes
#SBATCH --ntasks-per-node=1             # Tasks per node
#SBATCH --cpus-per-task=2               # CPU cores per task
#SBATCH --time=1:00:00                  # Maximum runtime
#SBATCH --mem=2GB                       # Memory allocation
#SBATCH --job-name=torch-test           # Job identifier
#SBATCH --output=result.out             # Output file
#SBATCH --gres=gpu:v100:1               # GPU resource request

# Initialize conda (replace <NetID> with your own)
source /scratch/<NetID>/env.sh

# Activate the environment created earlier
conda deactivate
conda activate test

# Execute script
python test.py
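
When a job is accepted, sbatch prints a confirmation line of the form "Submitted batch job <id>". If submissions are driven from a script, that ID can be captured for later use with squeue or scancel. The following is a minimal sketch. `parse_job_id` and `submit` are hypothetical helper names, the job ID in the demo is made up, and `submit` only works where the `sbatch` command is available.

```python
import re
import subprocess

_SUBMITTED = re.compile(r"Submitted batch job (\d+)")


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation line, or None."""
    match = _SUBMITTED.search(sbatch_output)
    return int(match.group(1)) if match else None


def submit(script_path):
    """Run sbatch on script_path and return the job ID (requires SLURM)."""
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return parse_job_id(result.stdout)


if __name__ == "__main__":
    # Illustrative confirmation line; the job ID is made up.
    print(parse_job_id("Submitted batch job 123456"))
```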

Example test.py for verifying the PyTorch installation:

#!/usr/bin/env python

import torch

print(torch.__file__)
print(torch.__version__)

# GPU availability status
print(torch.cuda.is_available())

# Number of available GPUs
print(torch.cuda.device_count())

# Current GPU name (queried only when a GPU is present, to avoid an error)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(torch.cuda.current_device()))
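
Printing device information shows that a GPU is visible, but not that computation on it works. A small follow-up check, sketched below, runs a matrix product on whichever device is available and verifies the result. The helper names are hypothetical, and the function returns None instead of failing when `torch` is not installed.

```python
import importlib.util


def pick_device_name(cuda_available):
    """Choose the device string to run on."""
    return "cuda" if cuda_available else "cpu"


def smoke_test():
    """Multiply two small matrices on the best available device.

    Returns the device name used, or None if torch is not installed.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    device = pick_device_name(torch.cuda.is_available())
    a = torch.ones(4, 4, device=device)
    identity = torch.eye(4, device=device)
    # Multiplying by the identity must return the original matrix.
    assert torch.equal(a @ identity, a)
    return device


if __name__ == "__main__":
    print(f"smoke test ran on: {smoke_test()}")
```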

Common SLURM Commands

| Command | Description |
| --- | --- |
| `squeue -u <NetID>` | View submitted jobs |
| `squeue --me` | View submitted jobs (alternative) |
| `scancel <JobID>` | Cancel a specific job |
| `scancel {StartJobId..EndJobId}` | Cancel a range of jobs |
| `squeue -u $USER \| awk '{print $1}' \| tail -n+2 \| xargs scancel` | Cancel all jobs |
| `squeue --me --start` | View estimated job start time |
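
The "cancel all jobs" pipeline in the table works by taking the first column of squeue's output and dropping the header row. The same extraction is sketched below in Python, which can be handy when job IDs are needed inside a script. `job_ids_from_squeue` is a hypothetical helper, and the sample rows (job IDs, user, node name) are made up for illustration.

```python
def job_ids_from_squeue(output):
    """Return job IDs from squeue's default tabular output.

    Mirrors the awk/tail pipeline: take the first column, skip the header.
    """
    lines = [line for line in output.splitlines() if line.strip()]
    return [line.split()[0] for line in lines[1:]]


# Illustrative sample in squeue's default layout; the job rows are made up.
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               101 n1s8-v100 torch-te    netid  R       0:42      1 b-1-1
               102 n1s8-v100 torch-te    netid PD       0:00      1 (Priority)
"""

if __name__ == "__main__":
    print(job_ids_from_squeue(SAMPLE))  # prints ['101', '102']
```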

Additional Resources

For teams utilizing Habitat-Sim, refer to this tutorial by Irving Fang for installation instructions. Ensure compatibility with the GCP Burst Platform configuration described above.