As part of my research work at Stanford, I have been training object detection deep neural networks. Training these networks is extremely compute-intensive, and while we have a variety of powerful compute options at Stanford, I was looking for a faster way to train them. This article is a detailed guide to training the popular RetinaNet object detection network on TPU.

Google's Tensor Processing Units (TPUs)

After finding a TensorFlow Tensor Processing Unit (TPU) enabled version of the network I was training, I reached out to Google through their TensorFlow Research Cloud program, asking for access to TPUs via Google Cloud. Google quickly responded and graciously allowed us the use of several TPUv2 compute units.

The TPUv2 is composed of 8 processors, each with 8GB of High Bandwidth Memory (HBM), and is quoted at 180 teraflops in total, or 22.5 TFLOPs per processor.

We were later granted access to several TPUv3s, which are also composed of 8 processors, but each with 16GB of High Bandwidth Memory (HBM), quoted at a total of 420 teraflops, or 52.5 TFLOPs per processor!

To put these specs in perspective, the Stanford DGX-1 (from NVIDIA) has 8x P100 GPUs, each with 16GB of memory, and is quoted at a total of 85 teraflops single precision, or 10.6 TFLOPs per P100 GPU.

Moreover, utilizing all 8 GPUs in parallel can be challenging as most networks are not designed to handle this type of hardware parallelism.

In the TensorFlow TPU git repo, we find examples for training popular architectures on TPU.

There is also the TensorFlow Models repo, which contains a similar config for RetinaNet. For some reason I had trouble with the configuration in the TPU repository, specifically with the evaluation. I'm sure it works fine, but since I was working off the Models repository before RetinaNet was added to the TPU repo, I'll be working off the Models repo in my examples below.

TPU Usage Mental Model

It took a bit to wrap my head around what a "Cloud TPU" actually is and how one would use it. It turns out a "Cloud TPU" is the combination of a virtual machine (VM) and a TPU hardware accelerator, both of which get booted when a TPU is created using the ctpu CLI tool or the Google Cloud Web Interface. When working on the TensorFlow Research Cloud, the TPU and its associated VM are both free for the duration of your TPU allocation. From now on, I'll refer to the TPU and the VM that controls it (and has an associated IP, etc.) as just the TPU.
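For example, once a Cloud TPU exists, both halves of the pairing are visible from the command line (a quick check, assuming gcloud is installed and configured):

# The TPU accelerator itself, managed by the TPU service
gcloud compute tpus list --zone us-central1-f
# Any VMs show up as ordinary Compute Engine instances
gcloud compute instances list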

The TPU vs. VM distinction is slightly confused by the fact that the ctpu tool will automatically create another VM you can use to interact with the TPU. Since you cannot SSH directly into the TPU, you'll want to use one VM of reasonable size to interact with all your TPUs. This VM is not free.
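If you want just that driver VM without allocating a TPU, a minimal sketch (assuming ctpu's --vm-only flag, the counterpart of the --tpu-only flag used later; the VM name here is hypothetical):

ctpu up --zone us-central1-f --name my-driver-vm --vm-only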

Also, for your TPUs to read from and write to disk, you'll need to use Google Cloud Storage. Though this resource is inexpensive, it is not free.

When creating a new Google Cloud account, you'll be granted $300 in credit, which should be plenty if you keep the above in mind. Non-TPU VMs and Google Cloud Storage are inexpensive, but not free. Limit your use of these resources as much as possible: turn off the VM when not in use and use only 1 if possible. Limit duplicate data stored on Google Cloud Storage and minimize transfers between off-Google-services and on-Google-services networks.
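For example, stopping the driver VM between sessions keeps it from eating into your credit (the instance name here is hypothetical):

gcloud compute instances stop my-driver-vm --zone us-central1-f
# ...and bring it back when needed
gcloud compute instances start my-driver-vm --zone us-central1-f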

Training RetinaNet

Now, let's train RetinaNet!

Setup

On a new Ubuntu machine, run:

sudo apt-get -y install python-pil python-lxml python-tk

Python

We'll use pipenv to isolate this project from the global python environment:

pip install --user --upgrade pipenv
# CUDA 9 + cuDNN 7 (installed below) pair with TensorFlow 1.x, e.g. tensorflow-gpu==1.12
pipenv install tensorflow-gpu

Get the TensorFlow models repository: cd to your source directory and run:

export TF_ROOT=$(pwd)
git clone https://github.com/tensorflow/models

Build

Cocoapi

git clone https://github.com/cocodataset/cocoapi.git
pushd cocoapi/PythonAPI
# Build the extension with the pipenv python; running `make` here would rebuild with the system python instead
pipenv run python setup.py build_ext --inplace
cp -r pycocotools ${TF_ROOT}/models/research/
popd

Protobuf

pushd models/research
wget -O protobuf.zip https://github.com/google/protobuf/releases/download/v3.0.0/protoc-3.0.0-linux-x86_64.zip
unzip protobuf.zip
./bin/protoc object_detection/protos/*.proto --python_out=.
popd

Environment

Set the path exports in your pipenv .env file:

echo "PYTHONPATH=${PYTHONPATH}:${PWD}:${TF_ROOT}/models/research:${TF_ROOT}/models/research/slim" | tee .env
echo "LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64/" | tee -a .env

Note that if paths in your environment change, you will need to regenerate the .env file.
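A quick sanity check that pipenv picks up the .env file:

pipenv run python -c 'import os; print(os.environ["PYTHONPATH"])'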

Cuda

You'll need CUDA 9 and cuDNN 7.

Untar the cuDNN 7 archive and copy its contents into your local CUDA dir:

tar xzf cudnn-9.0-linux-x64-v7.4.2.24.tgz
sudo cp -r cuda/* /usr/local/cuda-9.0/
rm -rf cuda
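To verify the copy, you can grep the version out of the header (assuming the standard cudnn.h layout):

grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda-9.0/include/cudnn.h
# expect CUDNN_MAJOR 7, plus the MINOR and PATCHLEVEL lines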

Test

Test your setup with:

pipenv run python models/research/object_detection/builders/model_builder_test.py

Data Prep

Download the COCO dataset if you don't have it; cd into your data directory first:

(script is roughly https://github.com/tensorflow/tpu/blob/master/tools/datasets/download_and_preprocess_coco.sh)

UNZIP="unzip -nq"

# Helper function to download and unpack a .zip file.
function download_and_unzip() {
  local BASE_URL=${1}
  local FILENAME=${2}

  if [ ! -f ${FILENAME} ]; then
    echo "Downloading ${FILENAME} to $(pwd)"
    wget -nd -c "${BASE_URL}/${FILENAME}"
  else
    echo "Skipping download of ${FILENAME}"
  fi
  echo "Unzipping ${FILENAME}"
  ${UNZIP} ${FILENAME}
}

export TF_DATA=$(pwd)

# Download the images.
BASE_IMAGE_URL="http://images.cocodataset.org/zips"

TRAIN_IMAGE_FILE="train2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TRAIN_IMAGE_FILE}
TRAIN_IMAGE_DIR="${TF_DATA}/train2017"

VAL_IMAGE_FILE="val2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${VAL_IMAGE_FILE}
VAL_IMAGE_DIR="${TF_DATA}/val2017"

TEST_IMAGE_FILE="test2017.zip"
download_and_unzip ${BASE_IMAGE_URL} ${TEST_IMAGE_FILE}
TEST_IMAGE_DIR="${TF_DATA}/test2017"

# Download the annotations.
BASE_INSTANCES_URL="http://images.cocodataset.org/annotations"
INSTANCES_FILE="annotations_trainval2017.zip"
download_and_unzip ${BASE_INSTANCES_URL} ${INSTANCES_FILE}

TRAIN_OBJ_ANNOTATIONS_FILE="${TF_DATA}/annotations/instances_train2017.json"
VAL_OBJ_ANNOTATIONS_FILE="${TF_DATA}/annotations/instances_val2017.json"

TRAIN_CAPTION_ANNOTATIONS_FILE="${TF_DATA}/annotations/captions_train2017.json"
VAL_CAPTION_ANNOTATIONS_FILE="${TF_DATA}/annotations/captions_val2017.json"

# Download the test image info.
BASE_IMAGE_INFO_URL="http://images.cocodataset.org/annotations"
IMAGE_INFO_FILE="image_info_test2017.zip"
download_and_unzip ${BASE_IMAGE_INFO_URL} ${IMAGE_INFO_FILE}

TESTDEV_ANNOTATIONS_FILE="${TF_DATA}/annotations/image_info_test-dev2017.json"
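Once everything is downloaded and unpacked, a quick sanity check against the standard COCO 2017 image counts:

ls ${TF_DATA}/train2017 | wc -l   # expect 118287
ls ${TF_DATA}/val2017 | wc -l     # expect 5000
ls ${TF_DATA}/test2017 | wc -l    # expect 40670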

Get the pretrained checkpoint. cd into your data directory:

cd ${TF_DATA}
mkdir checkpoints && pushd checkpoints
wget http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
tar xzf ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
popd
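The tarball should unpack to something like the following (based on how the detection model zoo packages checkpoints):

ls checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/
# checkpoint  frozen_inference_graph.pb  model.ckpt.data-00000-of-00001
# model.ckpt.index  model.ckpt.meta  pipeline.config  saved_model/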

Make the 2017 (a.k.a. trainval35k / minival2014) splits for COCO:

pipenv run python models/research/object_detection/dataset_tools/create_coco_tf_record.py \
      --train_image_dir="${TF_DATA}/train2017" \
      --val_image_dir="${TF_DATA}/val2017" \
      --test_image_dir="${TF_DATA}/test2017" \
      --train_annotations_file="${TF_DATA}/annotations/instances_train2017.json" \
      --val_annotations_file="${TF_DATA}/annotations/instances_val2017.json" \
      --testdev_annotations_file="${TF_DATA}/annotations/image_info_test-dev2017.json" \
      --output_dir="${TF_DATA}/tf2017/"
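If the conversion succeeds, ${TF_DATA}/tf2017 should contain sharded TFRecord files named roughly as follows (shard counts are the script's defaults):

ls ${TF_DATA}/tf2017
# coco_train.record-00000-of-00100, coco_val.record-*, coco_testdev.record-*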

COCO Labels are in: models/research/object_detection/data/mscoco_label_map.pbtxt

Copy them into your data directory with something like:

cp ${TF_ROOT}/models/research/object_detection/data/mscoco_label_map.pbtxt ${TF_DATA}/

Google Cloud

In your environment, set:

export GCP_PROJECT="some-project-name-1337"
export GCP_BUCKET="pick-a-bucket-name"

Install gcloud and run gcloud auth login to log in.

Run gcloud config set project some-project-name-1337 to set the default project.

Then make sure that:

  • The TPU and ML APIs are enabled
  • There is a regional Cloud Storage bucket (us-central1, where the TPUs will run) named pick-a-bucket-name

Get your service account (response should be in the format ..."tpuServiceAccount": "[email protected]"):

curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
    https://ml.googleapis.com/v1/projects/${GCP_PROJECT}:getConfig

and add it to the environment:

export GCP_TPU_ACCOUNT=service-xxx@cloud-ml.google.com.iam.gserviceaccount.com

Grant the account permission:

gcloud projects add-iam-policy-binding $GCP_PROJECT  \
    --member serviceAccount:$GCP_TPU_ACCOUNT --role roles/ml.serviceAgent
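You can confirm the binding took effect with gcloud's policy filtering (a sketch; the flags below are standard gcloud formatting options):

gcloud projects get-iam-policy ${GCP_PROJECT} \
    --flatten="bindings[].members" \
    --filter="bindings.members:${GCP_TPU_ACCOUNT}" \
    --format="table(bindings.role)"
# expect roles/ml.serviceAgent in the output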

Set your default region/zone to the location with the free TPUs:

gcloud config set compute/zone us-central1-f

A note from Google on free TPUs ("Activating Allocations"):

5 v2-8 TPU(s) in zone us-central1-f
100 preemptible v2-8 TPU(s) in zone us-central1-f
IMPORTANT: This free 30-day trial is only available for Cloud TPUs you create in the zones listed above. To avoid charges, please be sure to create your Cloud TPUs in the appropriate zone.

Verify you have the right zone:

gcloud config list

Create a new service account so Python can access GCP resources:

Open https://console.cloud.google.com/apis/credentials/serviceaccountkey?project=some-project-name-1337&folder&organizationId, set the role to Editor, hit Create, then copy the downloaded file into the project's credentials folder and export it in the shell:

mv ~/Downloads/my-creds-2395ytqh3.json credentials/service.json
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/credentials/service.json

Copy the coco dataset to the bucket:

# data is in ${TF_DATA}/tf2017
gsutil -m cp -r ${TF_DATA}/tf2017 gs://${GCP_BUCKET}/datasets/coco/tf2017

Copy labels:

gsutil cp ${TF_DATA}/mscoco_label_map.pbtxt gs://${GCP_BUCKET}/datasets/coco/labels/mscoco_label_map.pbtxt

Copy the checkpoint:

# data is ${TF_DATA}/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/
gsutil cp ${TF_DATA}/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/model.ckpt.* gs://${GCP_BUCKET}/datasets/coco/checkpoints/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03/
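A quick check that everything landed in the bucket:

gsutil ls -r gs://${GCP_BUCKET}/datasets/coco | head -n 20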

Running on TPU

Setup

Only needs to be done once per project.

Grant the TPU service account bucket permission: open https://console.cloud.google.com/iam-admin/iam?project=some-project-name-1337 and add the TPU service account to:

  • Storage Admin
  • Viewer

On each Run

Bring up a TPU (v3)

ctpu up --zone us-central1-a --name=tpu-1--v3 --tpu-size=v3-8 --tpu-only

Or v2

ctpu up --zone us-central1-f --name tpu-1--v2 --tpu-size=v2-8 --tpu-only

(Leave off --tpu-only if you also need a VM.) Try to run only 1 VM, though, and as many TPUs as you need: we're billed for VMs, but up to 5 TPUs and 100 preemptible TPUs are free.

This will print a TPU name; you'll pass it as --tpu_name when training below.

If keys are created, you'll see a message like:

Your identification has been saved in $HOME/.ssh/google_compute_engine.
Your public key has been saved in $HOME/.ssh/google_compute_engine.pub.
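Before connecting, you can verify the TPU is up (the describe output should report state READY):

gcloud compute tpus describe tpu-1--v2 --zone us-central1-f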

Connect to the TPU

# From the project root, connect with port forwarding. This is only possible if you've left the --tpu-only flag off the ctpu up command
gcloud compute ssh tpu-1--v2 -- -L 6006:localhost:6006

Set up the TPU. On the VM you've created, run:

wget https://gist.githubusercontent.com/nathantsoi/8a422a19d08335f52dc49657058da251/raw/83e2a374224723715ebae529ab52668ac5d71a9c/setup_tpu.sh
chmod +x setup_tpu.sh
./setup_tpu.sh

You'll want to edit ~/.bashrc and set GCP_BUCKET to your bucket name.

Local training on the TPU (once ssh'd in):

# Start a tmux session, to keep the session running after disconnecting, then:
pushd ${RUNNER}  # ${RUNNER} should point at the project checkout (set by setup_tpu.sh)
export JOB_NAME=retinanet
export MODEL_DIR=gs://${GCP_BUCKET}/train/${JOB_NAME}
python models/research/object_detection/model_tpu_main.py \
    --gcp_project=some-project-name-1337 \
    --tpu_name=tpu-1--v2 \
    --tpu_zone=us-central1-f \
    --model_dir=${MODEL_DIR} \
    --pipeline_config_path=models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config \
    --alsologtostderr
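Since we forwarded port 6006 when connecting, you can watch training from your local browser by running TensorBoard on the VM (in another tmux pane, for example), then opening http://localhost:6006 locally:

tensorboard --logdir=${MODEL_DIR}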

Run the eval on a local GPU box; TPU inference isn't supported yet, so you won't be able to run the eval on a TPU.

export JOB_NAME=retinanet  # must match the training job above so eval reads its checkpoints
export MODEL_DIR=gs://${GCP_BUCKET}/train/${JOB_NAME}
python models/research/object_detection/model_main.py \
    --model_dir=${MODEL_DIR} \
    --pipeline_config_path=models/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config \
    --checkpoint_dir=${MODEL_DIR} \
    --alsologtostderr