Using cuDNN to Speed Up DQN Training on Jetson TX1
I had an idea about speeding up training of DeepMind’s DQN with NVIDIA’s cuDNN. Then I found out it was really easy to do with Torch7!
All I needed to do was convert the neural network to ‘cudnn’ after it had been created/loaded and cuda()’ed. More specifically, I added the corresponding code into dqn/NeuralQLearner.lua. For an explanation of cudnn.benchmark and cudnn.fastest, please refer to the official cudnn.torch page.
```lua
if self.gpu and self.gpu >= 0 then
    self.network:cuda()
    -- I added this part: convert the network to use cuDNN modules
    if self.cudnn then
        cudnn.benchmark = true   -- let cuDNN benchmark and pick the fastest algorithms
        cudnn.fastest = true     -- prefer the fastest algorithms regardless of workspace size
        cudnn.convert(self.network, cudnn)
        print('*** Using cudnn ***')
    end
else
    self.network:float()
end
```
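If you’d like to see what cudnn.convert does before touching the DQN code, here is a minimal, self-contained sketch. The network below is only an example loosely resembling the DQN conv-net (4x84x84 input); it is not the actual code in dqn/NeuralQLearner.lua.

```lua
require 'nn'
require 'cunn'
require 'cudnn'

-- Example conv-net, roughly shaped like the DQN's first two layers.
local net = nn.Sequential()
net:add(nn.SpatialConvolution(4, 32, 8, 8, 4, 4))
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(32, 64, 4, 4, 2, 2))
net:add(nn.ReLU())
net:cuda()

cudnn.benchmark = true
cudnn.fastest = true
cudnn.convert(net, cudnn)   -- replaces nn.SpatialConvolution/nn.ReLU with cudnn counterparts

-- Time a few forward passes to compare against the plain 'cunn' version.
local input = torch.rand(1, 4, 84, 84):cuda()
net:forward(input)          -- warm-up pass (lets cudnn.benchmark pick algorithms)
cutorch.synchronize()
local timer = torch.Timer()
for i = 1, 100 do net:forward(input) end
cutorch.synchronize()
print(string.format('100 forward passes took %.3f s', timer:time().real))
```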
In addition to cudnn, I also thought about reducing the time the DQN trainer spent on displaying game images, so that it could spend more of its time doing useful training work. So I implemented that in the code too. After some trial, I picked 3 as the ‘display_freq’. That is, I let the DQN trainer display only 1 out of every 3 frames during training. This way I effectively reduced the CPU time spent on image display while still being able to clearly see the progress of the game.
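The change itself is simple: just wrap the existing display call with a frame counter check. Below is a minimal standalone sketch of the idea; the variable names are illustrative and the random tensor stands in for the emulator’s game image (in dqn/train_agent.lua the same check wraps the existing image.display() call inside the training loop). Note that image.display() needs a display and is typically run under qlua.

```lua
require 'image'

local display_freq = 3   -- show only 1 out of every 3 frames
local win = nil

for step = 1, 12 do
    local screen = torch.rand(3, 84, 84)   -- stand-in for the emulator's game image
    if step % display_freq == 0 then
        win = image.display({image = screen, win = win})
    end
end
```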
Here’s the result on Jetson TX1 after I implemented both cudnn and display_freq=3. (Note that DQN training does not really start until it has run for ‘learn_start’ (5000) steps.) The numbers in the table below were all obtained while the DQN was being trained on the Atari ‘pong’ game with the TX1 CPU running at max clock frequency (sudo ~/jetson_clocks.sh).
| Test Case | Time per 1000 training steps | % of baseline |
|---|---|---|
| Baseline: display all frames, no cudnn | 56 s | 100 |
| Improvement #1: display only 1/3 of frames | 46 s | 82 |
| Improvement #2: 1/3 of frames, with cudnn | 37 s | 66 |
When I looked deeper at the ‘Improvement #2’ case (as shown in the screenshot below), I saw that both the CPU (‘cpu’ in ~/tegrastats output) and the GPU (‘GR3D’) of TX1 were far from fully loaded overall, while one of the 4 CPU cores was constantly at ~90% load. I think this indicated that the bottleneck lay in the Atari emulator (‘xitari’ and ‘alewrap’ in Lua). In other words, I think the DQN on TX1 could train much faster if the Atari emulator were able to generate game images at a higher rate. Hopefully this will work out better when I train the DQN with the Nintendo Famicom Mini console.
If you’d also like to run this cudnn-accelerated DQN, you can refer to my earlier post, Training DeepMind’s DQN to Play Pong. The repository is here:
https://github.com/jkjung-avt/DeepMind-Atari-Deep-Q-Learner.