Python, OpenGL and CUDA/OpenCL

Python is a nice language for prototyping things. It can wrap C++ libraries (required for performance sensitive parts) quite well, as evident e.g. by the fast numpy (array manipulation) library. Additionally, writing programs in CUDA or OpenCL also has awesome performance. In this article I describe how to setup a nice Python environment.

1 Python basics

To start with Python [tutorial], I start with ipython, a nice command line interface that works a bit like a super-calculator. The numpy library [docs] (partially replaces scipy) for array manipulation is a must, as are the boost library [docs] for C++ interoperability, and pyopengl [docs] with SDL (pygame [docs]) for visualization.

apt-get install build-essential python-dev python-ctypeslib
ipython python-setuptools python-pip

apt-get install python-scipy-dbg python-numpy-dbg python-h5py
python-opencv

apt-get install libboost-all-dev libboost-python-dev
libboost-thread-dev

To work with "float images" (not the usual byte images), which are required for optical flow fields, I install OpenEXR [docs] (note: apt-get install python-openimageio might also be a solution).

apt-get install openexr openexr-viewers libopenexr-dev
libilmbase-dev

easy_install -U openexr

For nicer development, install Eclipse for C++ Developers [download], and from the update manager, pydev [url: http://pydev.org/updates] [info]. This allows code completion, syntax checking etc. within an IDE, which is maybe preferable to command line. If you are debugging in ipython, any file can be included via %run -i command.

2 PyOpenGL

Since we want OpenGL [intro1] [intro2] [api] [mistakes] interoperability with CUDA and OpenCL, we have to install the pyopengl (and pygame) first. We can either use the Ubuntu version (more stable) or the one from pypi [site] (faster, but somewhat likely to give unwanted exceptions in Python):

apt-get install python-opengl easy_install -U pyopengl easy_install -U pyopengl-accelerate

from OpenGL.GL import *; from OpenGL.GLUT import *; from
OpenGL.GLU import *; import random



def display():

    glClear(GL_COLOR_BUFFER_BIT);
glBegin(GL_POINTS)

    x = [0.0,640.0,320.0]; y = [0.0,0.0,480.0];

    curx = 0; cury = 320; glVertex2f(curx,cury)

    for i in range(0,500000):

       
idx=random.randint(0,2); curx=(curx+x[idx])/2.0;
cury=(cury+y[idx])/2.0

        glVertex2f(curx,cury)

    glEnd(); glFlush()



if __name__ == '__main__':

    glutInit()

    glutInitWindowSize(640, 480);
glutCreateWindow("Sierpinski"); glutInitDisplayMode(GLUT_SINGLE
| GLUT_RGB)

    glutDisplayFunc(display)

    glClearColor(1.0,1.0,1.0,0.0);
glColor3f(0.0,0.0, 0.0); glPointSize(1.0)

    glMatrixMode(GL_PROJECTION);
glLoadIdentity(); gluOrtho2D(0.0,640.0,0.0,480.0)

    glutMainLoop()

Note that GLUT is the most horrible of OpenGL frameworks; QT/PySide [docs] would be the most full-blown. A middle ground and more compact framework is SDL/pygame, tested as follows (with ball texture):

import pygame

from pygame.locals import KEYDOWN, K_ESCAPE



if __name__ == "__main__":

    pygame.init(); keep_running = True

    size = width, height = 320, 240; speed
= [2, 2]; black = 0, 0, 0

    screen = pygame.display.set_mode(size,
pygame.HWSURFACE)

    ball = pygame.image.load("ball.gif");
ballrect = ball.get_rect()

    while keep_running:

        for event in
pygame.event.get():

      
    
if event.type == pygame.QUIT or (event.type == KEYDOWN and
event.key == K_ESCAPE):

          
    
pygame.display.set_caption("Shutting down...")

          
    
keep_running = False

        # Move the
ball, bounce back on borders.

        ballrect =
ballrect.move(speed)

        speed[0] =
-speed[0] if ballrect.left < 0 or ballrect.right > width
else speed[0]

        speed[1] =
-speed[1] if ballrect.top < 0 or ballrect.bottom > height
else speed[1]

        # Fill the
background screen buffer, then flip.

       
screen.fill(black); screen.blit(ball, ballrect)

       
pygame.display.flip(); pygame.time.wait(10)

Having that setup, we can continue to compile PyCUDA and PyOpenCL with OpenGL interop.

3 CUDA (the C part)

Using PyCUDA [docs] requires an appropriate Nvidia graphics card [check] and installation of the Nvidia CUDA software: A recent driver with CUDA toolkit [download] (install as root) and the GPU computing SDK (install as user).

In my standard installation, I install the CUDA toolkit either to /opt/cuda or use the x-swat [home] Ubuntu Nvidia drivers; then I install the GPU computing SDK [ubuntu guide] (a set of CUDA examples) to somewhere in my home. Note: CUDA uses the gcc of one generation behind, this requires some configuration. Nouveau Linux graphics card driver may conflict with Nvidia.

nano -w /etc/modprobe.d/nouveau.conf

    blacklist nouveau



# these are necessary to build.

apt-get install cutils freeglut3-dev libxi-dev libxmu-dev

./devdriver_4.0_linux_64_270.41.19.run && export CUDA_ROOT=/opt/cuda apt-get install gcc-4.4 cpp-4.4 mkdir /opt/cuda/gcc-4.4 && cd /opt/cuda/gcc-4.4 ln -s /usr/bin/cpp-4.4 cpp ln -s /usr/bin/gcc-4.4 gcc ln -s /usr/bin/g++-4.4 g++ nano -w /opt/cuda/bin/nvcc.profile compiler-bindir = /opt/cuda/gcc-4.4 nano -w /etc/ld.so.conf.d/cudatoolkit.conf /opt/cuda/lib64 add-apt-repository ppa:ubuntu-x-swat/x-updates apt-get update && apt-get upgrade apt-get install nvidia-cuda-toolkit && export CUDA_ROOT=/usr nano -w /etc/nvcc.profile INCLUDES += -I/usr/lib/nvidia-cuda-toolkit/include nano -w /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf /usr/lib/nvidia-current

su <username> &&
./gpucomputingsdk_4.0.17_linux.run

   
# path: /local/ruhl/devel/cuda-sdk-4.0

After this somewhat painful setup, it is time to test the installation. For this, compile the GPU computing examples and start e.g. the nbody app.

nano -w common/common.mk CUDA_INSTALL_PATH ?= /opt/cuda nano -w common/common.mk CUDA_INSTALL_PATH ?= /usr

4 PyCUDA (the Python part)

After having CUDA up and running, download PyCUDA [install] [ubuntu install] and install (as user):

# only if using Ubuntu nvidia drivers:

ln -s /usr/lib/nvidia-current/libcuda.so /usr/lib/libcuda.so

git clone http://git.tiker.net/trees/pycuda.git tar -xvzpf pycuda-2012.1.tar.gz

./configure.py --cuda-root=/opt/cuda --cuda-enable-gl
--boost-compiler=gcc44



python setup.py build

sudo python setup.py install

For some reason, OpenGL interoperability is not enabled by default. Also, libcuda.so must reside in /usr/lib, regardless of what you tell configure.py. After installation, you may have to edit, in Eclipse,

Window ->
Preferences -> Pydev -> Interpreter - Python

to include the appropriate path (usually /usr/local/lib/python2.7/dist-packages). then test the installation with a minimal example of PyCUDA [script], not using OpenGL yet [tutorial]:

from numpy import *; import pycuda.autoinit;

import pycuda.driver as cuda; from pycuda.compiler import
SourceModule



cuda_lib = SourceModule("""

__global__ void doublify(float *a_ram)

{

  int idx = threadIdx.x + threadIdx.y*4;

  a_ram[idx] *= 2;

}

""")



a_ram = random.randn(4, 4).astype(float32)
        
# create 4x4 image.

a_gpu = cuda.to_device(a_ram);
          
         
# push 4x4 image to GPU.

func = cuda_lib.get_function("doublify")
          
# get CUDA function.

func(a_gpu, block=(4, 4, 1))
          
           
# call 4x4x1 threads on CUDA function.

a_res = cuda.from_device(a_gpu, (4, 4),
float32)   
# poll result from GPU.

passed = (a_ram * 2 == a_res).all()
          
    
# compare to CPU calculation.

print "Test passed: %s" % (passed)

All the initialisation pain is taken away by importing pycuda.autoinit, and all cleanup is done by the Python garbage collector. Functions can be defined inline, as e.g. in shader programming. A block (either 1D, 2D or 3D) defines the number of concurrent threads, a grid the number of concurrent/sequential blocks. Arrays in RAM have to be copied to GPU ("device") memory, and after computation back again.

The interoperability of PyCUDA and OpenGL works over buffer objects. In particular, you create an OpenGL buffer object with glGenBuffers(), then bind the same thing to PyCUDA with the pycuda.gl.BufferObject.

Details can be found in the quick_pygl_sdl.py example, which accesses a RGBA char texture buffer from both OpenGL and CUDA. The example is a bit low on performance (render into framebuffer, copy into texture buffer, apply CUDA operation, copy back to screen), but gives a good general idea.

5 PyOpenCL

CUDA (or the ATI equivalent) is required for PyOpenCL, so install CUDA first, and the pertinent OpenCL implementation.

git clone http://git.tiker.net/trees/pyopencl.git tar -xvzpf pyopencl-2012.1.tar.gz

./configure.py --cl-enable-gl --boost-compiler=gcc44



python setup.py build

sudo python setup.py install

Again, GL interop must be enabled manually with the --cl-enable-gl flag. If on Ubuntu 13.04 (raring), specify OpenCL 1.1 as target by adding CL_PRETEND_VERSION="1.1" to siteconf.py [forum]. To test pure OpenCL [api], take a test tour with a minimal example [script]:

import pyopencl as cl; import numpy as np; import
numpy.linalg as la



ctx = cl.create_some_context();
          
          
          
     
# create OpenCL context

prg = cl.Program(ctx, """

    __kernel void sum(__global const float *a,

    __global const float *b, __global float *c)

    {

      int gid = get_global_id(0);

      c[gid] = a[gid] + b[gid];

    }

    """).build()



queue = cl.CommandQueue(ctx); mf =
cl.mem_flags        
          
   
# create OpenCL queue

a_ram = np.random.rand(50000).astype(np.float32)
          
          
# create buffers

b_ram = np.random.rand(50000).astype(np.float32)

c_ram = np.empty_like(a_ram)

a_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=a_ram)
# upload/allocate buffers on GPU

b_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=b_ram)

c_gpu = cl.Buffer(ctx, mf.WRITE_ONLY, b_ram.nbytes)



prg.sum(queue, a_ram.shape, None, a_gpu, b_gpu,
c_gpu)         
      
# call GPU function

cl.enqueue_copy(queue, c_ram, c_gpu)
        
          
          
  
# download result to RAM



passed = 0.0 == la.norm(c_ram - (a_ram + b_ram))
        
          
 
# compare to CPU calculation.

print "Test passed: %s" % (passed,)

You need to explicitely create a context to compile a program; then a queue to run that program; and up/download the RAM data to GPU similar to CUDA. And that would be it.

6 C++ within Python

Of course, some tasks require sequential memory access and are not suitable to GPU computing. For these cases, scipy.weave is your friend: It allows C++ access to numpy arrays.

import numpy as np



cc_src = """

float v;

for (int i = 0; i < wid; i++) {

    for (int j = 0; j < hei; j++) {

        v = src(j, i);

        dst(j, i) = v;

    }

}

"""



# Prepare Python variables to use.

src = np.random.random((3, 4)).astype(np.float32)

dst = np.zeros_like(src)

hei, wid = src.shape



# Use variables in weave Blitz C++.

from scipy import weave

weave.inline(cc_src, ['src', 'dst', 'wid', 'hei'],
type_converters=weave.converters.blitz, compiler='gcc')



print "Test passed: %s" % (np.all(src == dst),)

Two notable things are that you access numpy arrays with () not [], and that you cannot define functions. So this is really more for ad-hoc acceleration of slow Python source code. I use it often.

7 Debugging

When using ipython, I use ipdb [pypi] for debugging. Depending on your tastes, the CLI pudb [pypi] might also be good choice, and works exactly the same way.

import PyQt4.QtCore; PyQt4.QtCore.pyqtRemoveInputHook();
import ipdb; ipdb.set_trace();

When the breakpoint is encountered, an ipdb> prompt will allow you to investigate the current state. h is help, s is single step, c is continue program.