Measuring Caffe Model Inference Speed on Jetson TX2
When deploying Caffe models onto an embedded platform such as the Jetson TX2, inference speed is an essential factor to consider. I think the best way to verify whether a Caffe model runs fast enough is to measure it on the target platform itself.
Here's how I measure Caffe model inference time on the Jetson TX2.
Prerequisite:
- Build and install Caffe on the target Jetson TX2. Reference: How to Install Caffe and PyCaffe on Jetson TX2
- Prepare `deploy.prototxt` for the Caffe models to be measured
In the following examples I used my own fork of ssd-caffe.
Reference:
- Check out the official Caffe ‘Interfaces’ documentation for a description of the `caffe time` command.
Step-by-step:
Assuming Caffe has been built at ~/project/ssd-caffe, we use the built `caffe` executable to measure inference time of the models.
Important: During measurement, `caffe` uses whatever input batch size is specified in the `deploy.prototxt`. If two models are compared with different input batch sizes, the comparison is not a fair one.
For practical purposes I care most about inference time at batch size 1 (inferencing a single image at a time). So when measuring, I set the input batch size to 1 for all models being compared.
Take AlexNet for example. First make a copy of its deploy.prototxt.
$ cp ~/project/ssd-caffe/models/bvlc_alexnet/deploy.prototxt /tmp/alexnet_deploy.prototxt
### Set TX2 to max performance mode before measuring
$ sudo nvpmodel -m 0
$ sudo ~/jetson_clocks.sh
### Modify input batch size as described below
$ vim /tmp/alexnet_deploy.prototxt
Then modify line #6 of the prototxt to set the batch size to 1.
- input_param { shape: { dim: 10 dim: 3 dim: 227 dim: 227 } }
+ input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
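Instead of editing the prototxt by hand, the batch dimension could also be rewritten with a small script. Below is a minimal sketch (not from the original post) using pycaffe's protobuf definitions; it assumes the deploy file declares its input via an `Input` layer with an `input_param`, as in the AlexNet example above.

```python
# set_batch1.py: force the input batch size to 1 in a deploy.prototxt
# (a sketch; the path is the AlexNet copy made above)
from caffe.proto import caffe_pb2
from google.protobuf import text_format

path = '/tmp/alexnet_deploy.prototxt'

net = caffe_pb2.NetParameter()
with open(path) as f:
    text_format.Merge(f.read(), net)

# The "Input" layer carries input_param { shape { dim: ... } };
# dim[0] is the batch dimension (10 in the stock AlexNet prototxt).
for layer in net.layer:
    if layer.type == 'Input':
        for shape in layer.input_param.shape:
            shape.dim[0] = 1

with open(path, 'w') as f:
    f.write(text_format.MessageToString(net))
```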
Run the `caffe time` command.
$ cd ~/project/ssd-caffe
$ ./build/tools/caffe time -gpu 0 -model /tmp/alexnet_deploy.prototxt
I0228 11:53:37.071836 7979 caffe.cpp:343] Use GPU with device ID 0
I0228 11:53:37.616500 7979 net.cpp:58] Initializing net from parameters:
......
I0228 11:53:41.861127 7979 caffe.cpp:412] Average Forward pass: 12.9396 ms.
I0228 11:53:41.861150 7979 caffe.cpp:414] Average Backward pass: 35.2972 ms.
I0228 11:53:41.861168 7979 caffe.cpp:416] Average Forward-Backward: 48.4081 ms.
I0228 11:53:41.861196 7979 caffe.cpp:418] Total Time: 2420.4 ms.
So the inference time (forward pass only) of bvlc_alexnet on the JTX2 is about 12.9396 ms.
Next, repeat the measurement for bvlc_googlenet (with its input batch size also set to 1). The result is 24.6415 ms.
$ ./build/tools/caffe time -gpu 0 -model /tmp/googlenet_deploy.prototxt
I0228 12:00:19.444232 8129 caffe.cpp:343] Use GPU with device ID 0
I0228 12:00:19.983999 8129 net.cpp:58] Initializing net from parameters:
......
I0228 12:00:25.924129 8129 caffe.cpp:412] Average Forward pass: 24.6415 ms.
I0228 12:00:25.924151 8129 caffe.cpp:414] Average Backward pass: 41.9625 ms.
I0228 12:00:25.924170 8129 caffe.cpp:416] Average Forward-Backward: 66.9036 ms.
I0228 12:00:25.924201 8129 caffe.cpp:418] Total Time: 3345.18 ms.
I also downloaded VGG16 and ResNet-50 from links on Caffe Model Zoo and did the measurements. Here are all the results.
| Model          | Inference Time (batch size 1) |
|----------------|-------------------------------|
| bvlc_alexnet   | 12.9396 ms                    |
| bvlc_googlenet | 24.6415 ms                    |
| VGG16          | 91.82 ms                      |
| ResNet-50      | 64.0829 ms                    |
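As a rough cross-check on the `caffe time` numbers, the forward pass can also be timed from PyCaffe (installed in the prerequisite step). Below is a minimal sketch (not from the original post); the prototxt path, the `data` blob name, and the 50-iteration average are illustrative assumptions, and the result will not match `caffe time` exactly.

```python
# time_forward.py: rough forward-pass timing via PyCaffe (a sketch)
import time
import numpy as np
import caffe

caffe.set_device(0)   # GPU 0, same as `caffe time -gpu 0`
caffe.set_mode_gpu()

# No weights are loaded; like `caffe time`, this times the net with
# randomly initialized parameters.
net = caffe.Net('/tmp/alexnet_deploy.prototxt', caffe.TEST)

# Feed one random input matching the (1, 3, 227, 227) shape set earlier.
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)

net.forward()         # warm-up pass
iters = 50
start = time.time()
for _ in range(iters):
    net.forward()
elapsed_ms = (time.time() - start) * 1000.0 / iters
print('Average forward pass: %.2f ms' % elapsed_ms)
```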
A big takeaway from these measurements is that bvlc_googlenet, while offering classification accuracy similar to VGG16's, runs much faster than VGG16 on the JTX2. So it could be a better (speedier) CNN feature extractor for object detection models such as Faster R-CNN, YOLO, and SSD.