Python is a nice language for prototyping things. It can wrap C++ libraries (required for performance sensitive parts) quite well, as evident e.g. by the fast numpy (array manipulation) library. Additionally, writing programs in CUDA or OpenCL also has awesome performance. In this article I describe how to setup a nice Python environment.
To start with Python [tutorial],
I start with ipython
, a nice command line interface that
works a bit like a super-calculator. The numpy
library [docs] (partially replaces scipy
)
for array manipulation is a must, as are the boost
library [docs]
for C++ interoperability, and pyopengl
[docs] with SDL (pygame
[docs]) for
visualization.
apt-get install build-essential python-dev python-ctypeslib
ipython python-setuptools python-pip
apt-get install python-scipy-dbg python-numpy-dbg python-h5py
python-opencv
apt-get install libboost-all-dev libboost-python-dev
libboost-thread-dev
To work with "float images" (not the usual byte images), which
are required for optical flow fields, I install OpenEXR [docs]
(note: apt-get install python-openimageio
might also be
a solution).
apt-get install openexr openexr-viewers libopenexr-dev
libilmbase-dev
easy_install -U openexr
For nicer development, install Eclipse for C++ Developers [download], and from
the update manager, pydev [url: http://pydev.org/updates
]
[info]. This allows code
completion, syntax checking etc. within an IDE, which is maybe
preferable to command line. If you are debugging in ipython
,
any file can be included via %run -i
command.
Since we want OpenGL [intro1]
[intro2] [api] [mistakes]
interoperability with CUDA and OpenCL, we have to install the pyopengl
(and pygame
) first. We can either use the Ubuntu version
(more stable) or the one from pypi [site] (faster, but
somewhat likely to give unwanted exceptions in Python):
apt-get install python-opengl |
easy_install -U pyopengl |
apt-get install python-pygame
To test the OpenGL, we execute a minimal example [src]:
from OpenGL.GL import *; from OpenGL.GLUT import *; from
OpenGL.GLU import *; import random
def display():
glClear(GL_COLOR_BUFFER_BIT);
glBegin(GL_POINTS)
x = [0.0,640.0,320.0]; y = [0.0,0.0,480.0];
curx = 0; cury = 320; glVertex2f(curx,cury)
for i in range(0,500000):
idx=random.randint(0,2); curx=(curx+x[idx])/2.0;
cury=(cury+y[idx])/2.0
glVertex2f(curx,cury)
glEnd(); glFlush()
if __name__ == '__main__':
glutInit()
glutInitWindowSize(640, 480);
glutCreateWindow("Sierpinski"); glutInitDisplayMode(GLUT_SINGLE
| GLUT_RGB)
glutDisplayFunc(display)
glClearColor(1.0,1.0,1.0,0.0);
glColor3f(0.0,0.0, 0.0); glPointSize(1.0)
glMatrixMode(GL_PROJECTION);
glLoadIdentity(); gluOrtho2D(0.0,640.0,0.0,480.0)
glutMainLoop()
Note that GLUT is the most horrible of OpenGL frameworks; QT/PySide [docs] would be the most full-blown. A middle ground and more compact framework is SDL/pygame, tested as follows (with ball texture):
import pygame
from pygame.locals import KEYDOWN, K_ESCAPE
if __name__ == "__main__":
pygame.init(); keep_running = True
size = width, height = 320, 240; speed
= [2, 2]; black = 0, 0, 0
screen = pygame.display.set_mode(size,
pygame.HWSURFACE)
ball = pygame.image.load("ball.gif");
ballrect = ball.get_rect()
while keep_running:
for event in
pygame.event.get():
if event.type == pygame.QUIT or (event.type == KEYDOWN and
event.key == K_ESCAPE):
pygame.display.set_caption("Shutting down...")
keep_running = False
# Move the
ball, bounce back on borders.
ballrect =
ballrect.move(speed)
speed[0] =
-speed[0] if ballrect.left < 0 or ballrect.right > width
else speed[0]
speed[1] =
-speed[1] if ballrect.top < 0 or ballrect.bottom > height
else speed[1]
# Fill the
background screen buffer, then flip.
screen.fill(black); screen.blit(ball, ballrect)
pygame.display.flip(); pygame.time.wait(10)
Having that setup, we can continue to compile PyCUDA and PyOpenCL with OpenGL interop.
Using PyCUDA
[docs] requires an
appropriate Nvidia graphics card [check] and
installation of the Nvidia CUDA software: A recent driver with
CUDA toolkit [download]
(install as root) and the GPU computing SDK (install as user).
In my standard installation, I install the CUDA toolkit either to
/opt/cuda
or use the x-swat [home]
Ubuntu Nvidia drivers; then I install the GPU computing SDK [ubuntu
guide] (a set of CUDA examples) to somewhere in my home
.
Note: CUDA uses the gcc
of one generation behind, this
requires some configuration. Nouveau Linux graphics card driver
may conflict with Nvidia.
nano -w /etc/modprobe.d/nouveau.conf
blacklist nouveau
# these are necessary to build.
apt-get install cutils freeglut3-dev libxi-dev libxmu-dev
./devdriver_4.0_linux_64_270.41.19.run
&& export CUDA_ROOT=/opt/cuda |
add-apt-repository
ppa:ubuntu-x-swat/x-updates |
su <username> &&
./gpucomputingsdk_4.0.17_linux.run
# path: /local/ruhl/
After this somewhat painful setup, it is time to test the
installation. For this, compile the GPU computing examples and
start e.g. the nbody
app.
cd /local/ruhl/devel/cuda-sdk-4.0/C
nano -w common/common.mk |
nano -w common/common.mk |
make
cd bin/linux/release/
./deviceQuery
./nbody
And voila: Your first step towards parallel computing with CUDA [docs].
After having CUDA up and running, download PyCUDA [install] [ubuntu install] and install (as user):
# only if using Ubuntu nvidia drivers:
ln -s /usr/lib/nvidia-current/
git clone http://git.tiker.net/trees/pycuda.git |
tar -xvzpf pycuda-2012.1.tar.gz |
./configure.py --cuda-root=/opt/cuda --cuda-enable-gl
--boost-compiler=gcc44
python setup.py build
sudo python setup.py install
For some reason, OpenGL interoperability is not enabled by
default. Also, libcuda.so
must reside in /usr/lib
,
regardless of what you tell configure.py
. After
installation, you may have to edit, in Eclipse, Window ->
Preferences -> Pydev -> Interpreter - Python
to
include the appropriate path
(usually /usr/local/lib/python2.7/dist-packages
).
then test the installation with a minimal example of PyCUDA [script], not using OpenGL yet [tutorial]:
from numpy import *; import pycuda.autoinit;
import pycuda.driver as cuda; from pycuda.compiler import
SourceModule
cuda_lib = SourceModule("""
__global__ void doublify(float *a_ram)
{
int idx = threadIdx.x + threadIdx.y*4;
a_ram[idx] *= 2;
}
""")
a_ram = random.randn(4, 4).astype(float32)
# create 4x4 image.
a_gpu = cuda.to_device(a_ram);
# push 4x4 image to GPU.
func = cuda_lib.get_function("doublify")
# get CUDA function.
func(a_gpu, block=(4, 4, 1))
# call 4x4x1 threads on CUDA function.
a_res = cuda.from_device(a_gpu, (4, 4),
float32)
# poll result from GPU.
passed = (a_ram * 2 == a_res).all()
# compare to CPU calculation.
print "Test passed: %s" % (passed)
All the initialisation pain is taken away by importing pycuda.autoinit
,
and all cleanup is done by the Python garbage collector. Functions
can be defined inline, as e.g. in shader programming. A block
(either 1D, 2D or 3D) defines the number of concurrent threads, a
grid the number of concurrent/sequential blocks. Arrays in RAM
have to be copied to GPU ("device") memory, and after computation
back again.
The interoperability of PyCUDA and OpenGL works over buffer
objects. In particular, you create an OpenGL buffer object with glGenBuffers()
,
then bind the same thing to PyCUDA with the pycuda.gl.BufferObject
.
Details can be found in the quick_pygl_sdl.py example, which accesses a RGBA char texture buffer from both OpenGL and CUDA. The example is a bit low on performance (render into framebuffer, copy into texture buffer, apply CUDA operation, copy back to screen), but gives a good general idea.
CUDA (or the ATI equivalent) is required for PyOpenCL, so install CUDA first, and the pertinent OpenCL implementation.
apt-get install opencl-headers nvidia-opencl-dev
ocl-icd-opencl-dev
After that, download or checkout PyOpenCL [install] [ubuntu install] and install:
# Depending on whether you trust releases:
git clone http://git.tiker.net/trees/pyopencl.git |
tar -xvzpf pyopencl-2012.1.tar.gz |
./configure.py --cl-enable-gl --boost-compiler=gcc44
python setup.py build
sudo python setup.py install
Again, GL interop must be enabled manually with the --cl-enable-gl
flag. If on Ubuntu 13.04 (raring), specify OpenCL 1.1 as target by
adding CL_PRETEND_VERSION="1.1"
to siteconf.py
[forum].
To test pure OpenCL [api],
take a test tour with a minimal example [script]:
import pyopencl as cl; import numpy as np; import
numpy.linalg as la
ctx = cl.create_some_context();
# create OpenCL context
prg = cl.Program(ctx, """
__kernel void sum(__global const float *a,
__global const float *b, __global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
""").build()
queue = cl.CommandQueue(ctx); mf =
cl.mem_flags
# create OpenCL queue
a_ram = np.random.rand(50000).astype(np.float32)
# create buffers
b_ram = np.random.rand(50000).astype(np.float32)
c_ram = np.empty_like(a_ram)
a_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=a_ram)
# upload/allocate buffers on GPU
b_gpu = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=b_ram)
c_gpu = cl.Buffer(ctx, mf.WRITE_ONLY, b_ram.nbytes)
prg.sum(queue, a_ram.shape, None, a_gpu, b_gpu,
c_gpu)
# call GPU function
cl.enqueue_copy(queue, c_ram, c_gpu)
# download result to RAM
passed = 0.0 == la.norm(c_ram - (a_ram + b_ram))
# compare to CPU calculation.
print "Test passed: %s" % (passed,)
You need to explicitely create a context to compile a program; then a queue to run that program; and up/download the RAM data to GPU similar to CUDA. And that would be it.
Of course, some tasks require sequential memory access and are
not suitable to GPU computing. For these cases, scipy.weave
is your friend: It allows C++ access to numpy arrays.
import numpy as np
cc_src = """
float v;
for (int i = 0; i < wid; i++) {
for (int j = 0; j < hei; j++) {
v = src(j, i);
dst(j, i) = v;
}
}
"""
# Prepare Python variables to use.
src = np.random.random((3, 4)).astype(np.float32)
dst = np.zeros_like(src)
hei, wid = src.shape
# Use variables in weave Blitz C++.
from scipy import weave
weave.inline(cc_src, ['src', 'dst', 'wid', 'hei'],
type_converters=weave.converters.blitz, compiler='gcc')
print "Test passed: %s" % (np.all(src == dst),)
Two notable things are that you access numpy arrays with ()
not []
, and that you cannot define functions. So this is
really more for ad-hoc acceleration of slow Python source code. I
use it often.
When using ipython
, I use ipdb
[pypi] for debugging.
Depending on your tastes, the CLI pudb
[pypi] might also
be good choice, and works exactly the same way.
apt-get install python-ipdb
In the python script, set a breakpoint with the following line:
import ipdb; ipdb.set_trace();
If QT is used, the QT event queue must be massaged beforehand.
import PyQt4.QtCore; PyQt4.QtCore.pyqtRemoveInputHook();
import ipdb; ipdb.set_trace();
And then, within ipython
, run the script in interactive
mode.
%run -i main.py
When the breakpoint is encountered, an ipdb>
prompt will allow you to investigate the current state. h
is help, s
is single step, c
is continue
program.