How to Capture Camera Video and Do Caffe Inferencing with Python on Jetson TX2

2018-06-14 update: I’ve extended the TX2 camera caffe inferencing code with a (better) multi-threaded design. Check out this newer post for details: Multi-threaded Camera Caffe Inferencing.

Last week I shared a python script which could be used to capture and display live video from a camera (IP, USB or onboard) on Jetson TX2. Here I extend that script and show how to run Caffe image classification (inferencing) on the captured camera images, all in python code. This sample should be good for quickly verifying a newly trained Caffe image classification model, for prototyping, or for building Caffe demo programs with live camera input.

I mainly tested the script with python3 on Jetson TX2. But I think the code also works with python2, as well as on Jetson TX1.


Before running the sample, download the pretrained bvlc_reference_caffenet model and the ILSVRC12 auxiliary files (class labels and image mean) from within your Caffe directory:

$ cd /home/nvidia/caffe
$ python3 ./scripts/download_model_binary.py ./models/bvlc_reference_caffenet
$ ./data/ilsvrc12/get_ilsvrc_aux.sh


How to run the Tegra camera Caffe sample code:

$ python3 tegra-cam-caffe.py --help
  • To do Caffe image classification with the default bvlc_reference_caffenet model using the Jetson onboard camera (the default behavior of the python program):
$ python3 tegra-cam-caffe.py
  • To use USB webcam /dev/video1 instead, while setting video resolution to 1280x720:
$ python3 tegra-cam-caffe.py --usb --vid 1 --width 1280 --height 720
  • Or, to use an IP CAM:
$ python3 tegra-cam-caffe.py --rtsp --uri rtsp://admin:XXXXXX@
  • To do image classification with a different Caffe model using the onboard camera:
$ python3 tegra-cam-caffe.py --prototxt XXX.prototxt \
                             --model YYY.caffemodel \
                             --labels ZZZ.txt \
                             --mean UUU.binaryproto
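
For reference, the command-line options above could be parsed with a minimal argparse setup along the following lines. This is only a sketch of the interface implied by the examples, not the actual script: the option names come from the commands above, but the default values (resolution, model paths, output blob name) are my assumptions.

```python
import argparse

def parse_args():
    # Hypothetical reconstruction of the flags shown in the usage examples.
    # All default values below are assumptions, not taken from the real script.
    parser = argparse.ArgumentParser(
        description="Capture camera video and run Caffe image classification")
    parser.add_argument("--usb", action="store_true",
                        help="use a USB webcam instead of the onboard camera")
    parser.add_argument("--vid", type=int, default=1,
                        help="USB video device number, e.g. 1 for /dev/video1")
    parser.add_argument("--rtsp", action="store_true",
                        help="use an IP CAM via RTSP")
    parser.add_argument("--uri", type=str, default=None,
                        help="RTSP URI of the IP CAM")
    parser.add_argument("--width", type=int, default=1920,
                        help="video capture width")
    parser.add_argument("--height", type=int, default=1080,
                        help="video capture height")
    parser.add_argument("--crop", action="store_true",
                        help="center-crop the camera image before inferencing")
    parser.add_argument("--prototxt", type=str,
                        default="models/bvlc_reference_caffenet/deploy.prototxt")
    parser.add_argument("--model", type=str,
                        default="models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel")
    parser.add_argument("--labels", type=str,
                        default="data/ilsvrc12/synset_words.txt")
    parser.add_argument("--mean", type=str,
                        default="data/ilsvrc12/imagenet_mean.binaryproto")
    parser.add_argument("--output", type=str, default="prob",
                        help="name of the network's output (softmax) blob")
    return parser

if __name__ == "__main__":
    args = parse_args().parse_args(
        ["--usb", "--vid", "1", "--width", "1280", "--height", "720"])
    print(args.usb, args.vid, args.width, args.height)
```

Parsing the flags this way makes it easy to swap in a different model (second-to-last example above) without touching the code.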

When I tested the code with a USB camera and a picture of a pineapple, the default bvlc_reference_caffenet model said it was 100% sure (probability ~1.0) the image was a pineapple!
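
That probability is just the highest value in the network's softmax output vector. Picking the top prediction from such an output can be sketched as below (pure Python for illustration; in the real script the probabilities would come from the Caffe net's forward pass, and the toy numbers here are made up):

```python
def top_prediction(probs, labels):
    """Return (label, probability) for the highest-scoring class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Toy example with made-up softmax probabilities over three classes:
probs = [0.002, 0.997, 0.001]
labels = ["orange", "pineapple", "banana"]
print(top_prediction(probs, labels))  # -> ('pineapple', 0.997)
```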

A pineapple picture shown to tegra-cam-caffe.py

Next, I tried to test with a Caffe model trained with NVIDIA DIGITS. More specifically, I trained an AlexNet on the ‘Caltech 101’ dataset, as covered in the NVIDIA Qwiklabs course: Image Classification with DIGITS. One very nice thing about this free course is that you get 2 hours of access to a K520 GPU based cloud server with NVIDIA DIGITS, at no charge at all. After successfully training the AlexNet model (I just trained it for 30 epochs with plain SGD and the default learning rate, 0.01), I downloaded the model snapshot of the last training epoch from DIGITS: 20171022-025612-7b04_epoch_30.0.tar.gz. Among other files, the snapshot tarball contains deploy.prototxt, snapshot_iter_1620.caffemodel, labels.txt and mean.binaryproto.


I then verified this trained Caffe model with the following command. During training, the logs indicated the model reached an accuracy of only around 67.5% (for classifying 101 classes of objects), and indeed it performed poorly on many test images. But anyway I managed to get this model to classify a ‘pigeon’ picture correctly.

$ python3 ./tegra-cam-caffe.py --usb --vid 1 --crop \
                               --prototxt alexnet/deploy.prototxt \
                               --model alexnet/snapshot_iter_1620.caffemodel \
                               --labels alexnet/labels.txt \
                               --mean alexnet/mean.binaryproto \
                               --output softmax

By the way, in case you’d like to run the code with a Caffe model trained for grayscale image inputs (e.g. LeNet), you’ll have to modify the python code to convert the input camera images to grayscale before feeding them to the Caffe transformer for processing. This could be done by, say, gray = cv2.cvtColor(img_crop, cv2.COLOR_BGR2GRAY) and then net.blobs["data"].data[...] = transformer.preprocess("data", gray).

I have not done much testing of this code with various cameras or Caffe models. Feel free to let me know if you find any issue with the code, and I’ll look into it as soon as I can.
