Quick link: jkjung-avt/hand-detection-tutorial
I came accross this very nicely presented post, How to Build a Real-time Hand-Detector using Neural Networks (SSD) on Tensorflow, written by Victor Dibia a while ago. Now that I’d like to train an TensorFlow object detector by myself, optimize it with TensorRT, and deploy it on Jetson TX2, I immediately thought about following Victor’s example and train a hand detector. However, when I started to follow Victor’s post and do the training, I immediately found not only were there quite a few missing dots, but some of the information was also out-dated.
So I decided to work on the hand-detector problem by myself. And I wanted to create a tutorial which is up-to-date and easy to follow. After quite some hard work, I was finally able to put together all necessary code/scripts and create this tutorial.
Note that, not like TensorFlow’s Quick Start documentation (which starts by describing how to train an object detection model on Google Cloud Platform), my goal was to train the model locally using my own PC/server. Let’s dive in.
- Tensorflow Object Detection API
Training an object detector is more demanding than training an image classifier. Ideally, you should have a decent NVIDIA GPU for this task. As stated in my jkjung-avt/hand-detection-tutorial/README.md, I used a good desktop PC with an NVIDIA GeForce GTX-1080Ti, running Ubuntu Linux 16.04, to do the training.
Make sure you have your training PC/server ready and a recent version of TensorFlow is properly installed on it. (Reference: Install TensorFlow)
Please clone my GitHub repository: jkjung-avt/hand-detection-tutorial. And refer to the README.md within. All steps required to train the hand detector are listed there already. I’d just add a few words about some of the steps here.
The ‘models/’ submodule. I added tensorflow/models as a submodule of this project. And I used the same SHA version as in my tf_trt_models repository (which was forked from NVIDIA-Jetson/tf_trt_models). This is to ensure I would train and deploy the model onto TX2 using the same version of Object Detection API.
install.shscript implements what’s been specified in the official ‘Installation’ document. I also patched a few lines of python code in the script. Those are fixes required to run the code with python3. (The original code only works for python2.) Note that some python code under the
models/directory still contains python2-specific code like
<dict>.iteritems(). I did not fix all occurrences of those yet.
prepare_egohands.pyscript downloads the egohands dataset. Annotations in this dataset are ‘polygons’ stored in MATLAB files. The script also converts the annotations into ‘KITTI’ format. Note that NVIDIA’s DIGITS/DetectNet also uses KITTI formatted data for object detection models. You can refer to this document of the format. I chose this format since its annotation is relatively simple and could be easily reviewed/modified by hand.
create_kitti_tf_record.py. And the
create_kitti_tf_record.pyscript was modified from the same script from tensorflow’s object_detection repository. I changed the code to fit file paths in this project. I also shuffled train/val images randomly before creating the TFRecord files.
configs/ssd_mobilenet_v1_egohands.configwas modified from tensorflow object_detection’s sample ssd_mobilenet_v1_coco.config. I modified
num_classesto 1, put in the correct file paths, and adjusted a few hyper-parameters in this file. I plan to discuss more about this file in a later post.
train.shimplements what’s been specified in the official ‘Running Locally’ document.
So far I have implemented and tested
ssd_mobilenet_v1_egohands, set to train for 20,000 steps, took a little bit over 2 hours to train on my desktop PC (GTX-1080Ti). Its loss was around 2.5 at the end of training, and the ‘coco_detection_metrics’ evaluation result was as follows. The IoU 0.50 mAP value was 0.968, which was very good. It basically means, if we use the trained model to inference on images coming from the same distribution, the model could detect hands at both very high precision and very high recall!
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.680 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.968 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.813 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.329 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.253 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.739 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.744 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.250 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.471 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.773
The high mAP result could be verified if you use TensorBoard to check the output of the
I shall deploy my trained hand detector (SSD) models onto Jetson TX2, and verify the accuracy and inference speed.
I shall write something about how to adapt code in this tutorial to other datasets.
I wanted to test other object detection models, including Faster R-CNN and Mask R-CNN, from Tensorflow detection model zoo. Hopefully, I would be able to do that and share more soon.
I made every effort in coding and writing this tutorial, so that it could be very easy to follow. It should be straightforward to adapt this tutorial/code to other object detection models and other datasets.
After all, I really spent a lot of time reading/writing code and developing this tutorial. If you find it useful, please help to share it with more people who might be interested. Meanwhile, I do appreciate people giving me stars on GitHub. That motivates me to write more posts/sharing.