OpenCL and the future of FPGA and DSP development

Over the past 25 years FPGAs and DSPs have enjoyed steady growth and have been deployed in a wide array of applications. Their high performance and programmability/reconfigurability have been an irresistible draw and have allowed system designers to replace custom ASIC designs with flexible alternatives. The massive parallelism of FPGAs in particular has enabled countless new signal and image processing applications. However, there is a dark side to this story – development pain.

Both technologies, but FPGAs in particular, have a well-earned reputation for being difficult to develop for. In a world where software development costs far outstrip hardware costs and reducing those costs is a primary concern, DSP and FPGA development stubbornly remains a world of wizards who are both difficult to find and expensive to engage. Costly project overruns and poor code maintainability and reuse are the norm for this community.

Over the years there have been numerous attempts to address this, with greater or lesser success, but nothing has met with wide industry approval. In the DSP space there have been many attempts to create languages and libraries to address this problem, such as Mercury’s PAS, MIT’s StreamIt, and Stanford’s Brook, to name a few. In the FPGA space much of this work has, surprisingly, revolved around either C/C++ to RTL translation like SystemC or Impulse Accelerated Technologies’ Impulse-C (see tinyurl.com/kfk3qqs for an incomplete list), or graphical programming environments such as National Instruments’ LabVIEW and the MathWorks Simulink and Xilinx System Generator tools. Despite the appeal of magical translators and box-and-arrow programming models, these tools have found only limited acceptance in the marketplace.

There is a new language available that has the potential to change that. Since the early 2000s people have been looking for ways to harness the parallelism of modern graphics processors (GPUs) for non-graphics applications. In 2008 Apple introduced OpenCL as a framework for writing applications for GPUs and later that year teamed with The Khronos Group to release it as an open, royalty-free standard.

Since then OpenCL has undergone three major revisions, has numerous extensions, and since late 2013 has supported Altera FPGAs and TI DSPs. While on the surface this seems to be yet another attempt to address the development difficulties of DSPs and (most especially) FPGAs, there are a number of important reasons why this one may succeed where others failed. Unfortunately, there are still a few roadblocks ahead as well.

OpenCL is essentially a subset of C (specifically, C99) with some additional extensions. As a result, it is a very familiar working environment for developers comfortable with programming in C. It uses a host/device model in which a C program running on a host distributes tasks to processing elements on one or more devices and manages memory, data movements, and error handling. The calculations to be performed on the target device are written in OpenCL in what is called a kernel, which can be thought of as similar to a C function. At runtime many instances of the kernel, called work-items, are launched on the target device, and each work-item performs its calculations on a subset of data defined by the host program. The work-items are indexed over a 1D, 2D, or 3D grid and are collected into work-groups, which operate independently from one another. OpenCL also has a hierarchical memory model consisting of global, local, constant, and private memories.
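The relationship between kernel, work-items, and work-groups can be sketched in plain C. What follows is a host-only emulation, not the OpenCL API: `vec_add_kernel` and `launch_1d` are hypothetical names, the loops run sequentially where a real device would run work-items in parallel, and the sketch assumes the total number of work-items divides evenly into work-groups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel: one call does the work of one work-item.
   In real OpenCL C the index would come from get_global_id(0) rather than
   from a function argument. */
static void vec_add_kernel(size_t global_id,
                           const float *a, const float *b, float *out)
{
    out[global_id] = a[global_id] + b[global_id];
}

/* Emulates a 1D launch of global_size work-items, split into work-groups of
   local_size work-items each. Returns the number of work-groups, or 0 if the
   sizes do not divide evenly (an assumption of this sketch). */
size_t launch_1d(size_t global_size, size_t local_size,
                 const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);

    if (local_size == 0 || global_size % local_size != 0) {
        return 0;
    }

    size_t num_groups = global_size / local_size;

    /* Outer loop: work-groups, which are independent of one another. */
    for (size_t group = 0; group < num_groups; ++group) {
        /* Inner loop: the work-items belonging to this work-group. */
        for (size_t local_id = 0; local_id < local_size; ++local_id) {
            size_t global_id = group * local_size + local_id;
            vec_add_kernel(global_id, a, b, out);
        }
    }
    return num_groups;
}
```

Calling `launch_1d(8, 4, a, b, out)` on three 8-element arrays would process the data as two work-groups of four work-items each. On a real device the host program would make the equivalent request through the OpenCL runtime, and the two nested loops would be replaced by parallel hardware.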

This model of memory hierarchies and executable kernels is what makes OpenCL a powerful tool for expressing both data- and task-parallelism. It is also what makes OpenCL useful for heterogeneous system programming as different types of targets can be grouped as individual work-groups. This power comes at a cost, however.

One argument made against OpenCL is that it is too verbose – that is, many functions need to be called to set up a device execution and clean up afterward. This is arguably not a serious criticism – parallel computing is, by its nature, a complex task. Different types of applications require different approaches to parallelism, different synchronization, etc. To capture these differences one needs a rich language. Besides, OpenCL code is inherently easier to create, read, and understand (and thus maintain) than any HDL or DSP assembly language code.

OpenCL is not a perfect solution, however. There are significant shortcomings that have the potential to either limit its adoption for FPGAs and DSPs, or completely kill it. The first problem is the host/device nature of OpenCL. Many applications that DSPs and FPGAs target do not lend themselves to host/device task acceleration development models – they require dataflow models that work on streams of data, continuously processing the data as it arrives and pushing the results out for later-stage processing or storage. OpenCL does not handle this model well, although Altera is working with The Khronos Group to address this in a future release.
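To make the contrast concrete, here is a minimal sketch in plain C of what a streaming stage looks like. It is illustrative only and is not OpenCL code: `stream_stage`, `stage_init`, `stage_push`, and the filter length `TAPS` are hypothetical names chosen for this example. The point is that the stage keeps its own state and produces one result per incoming sample, rather than waiting for a host to hand it a complete buffer.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 4  /* hypothetical filter length, chosen for illustration */

/* State that persists between samples: the last TAPS inputs. */
typedef struct {
    float history[TAPS];  /* circular buffer of recent samples */
    size_t next;          /* slot that the next sample overwrites */
} stream_stage;

/* Start with an empty history (all zeros). */
void stage_init(stream_stage *s)
{
    assert(s != NULL);
    for (size_t i = 0; i < TAPS; ++i) {
        s->history[i] = 0.0f;
    }
    s->next = 0;
}

/* Consume one sample as it arrives and emit one result immediately:
   the average of the last TAPS samples (zeros before the stream began). */
float stage_push(stream_stage *s, float sample)
{
    assert(s != NULL);

    s->history[s->next] = sample;
    s->next = (s->next + 1) % TAPS;

    float sum = 0.0f;
    for (size_t i = 0; i < TAPS; ++i) {
        sum += s->history[i];
    }
    return sum / TAPS;
}
```

A downstream stage would consume each value returned by `stage_push` as soon as it is produced. In a host/device model the same filter would typically be run over a whole buffer at a time, with the host copying data to the device, launching a kernel, and copying the results back.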

OpenCL takes a lot of the work of task distribution and data movements off the shoulders of developers. It does not, however, manage everything. As was mentioned previously, OpenCL’s very explicit memory hierarchy is one of its strengths. It allows the developer to specify precisely which memories are to be used for which data. Unfortunately, it may not always be obvious which of the four memory types (global, local, constant, or private) will give the application the best performance. In addition, care must be taken to ensure that access to shared memories is properly synchronized, as race conditions are still a very real possibility (and, as in legacy development environments, often very hard to detect). What this means is that the developer must have a good understanding of the architecture of the target platform and must code explicitly to that architecture in order to get good performance. This also means that OpenCL is not a write-once-run-anywhere sort of language – porting to different platforms and new architectures could be challenging.
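The sketch below shows, in plain C, how the memory types typically divide the work in a per-work-group sum, and where synchronization is needed. It is a sequential emulation under stated assumptions, not device code: `group_sums` and `GROUP_SIZE` are hypothetical, the array `local_mem` stands in for local memory shared by one work-group, and the end of each inner loop stands in for a work-group barrier. On a real device, omitting the barrier between steps would let one work-item read a slot that another has not yet written, which is exactly the kind of race described above.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4  /* hypothetical work-group size; must be a power of two */

/* Sums n floats from global_in (n must be a multiple of GROUP_SIZE) and
   writes one partial sum per work-group to global_out.
   Returns the number of work-groups processed. */
size_t group_sums(const float *global_in, size_t n, float *global_out)
{
    assert(global_in != NULL && global_out != NULL);
    assert(n % GROUP_SIZE == 0);

    size_t num_groups = n / GROUP_SIZE;

    for (size_t group = 0; group < num_groups; ++group) {
        /* Stands in for local memory: visible to every work-item in this
           work-group, invisible to other work-groups. */
        float local_mem[GROUP_SIZE];

        /* Step 1: each work-item copies one element from global memory
           into local memory via a private temporary. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            float private_val = global_in[group * GROUP_SIZE + lid];
            local_mem[lid] = private_val;
        }
        /* Barrier point: all copies must finish before anyone reads. */

        /* Step 2: tree reduction; the active work-items halve each pass. */
        for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
            for (size_t lid = 0; lid < stride; ++lid) {
                local_mem[lid] += local_mem[lid + stride];
            }
            /* Barrier point: this pass must finish before the next begins. */
        }

        /* Step 3: one work-item writes the group's result to global memory. */
        global_out[group] = local_mem[0];
    }
    return num_groups;
}
```

Whether staging data in local memory like this actually pays off depends on the target: the relative cost of global and local accesses differs between GPUs, DSPs, and FPGAs, which is one reason the same kernel may need restructuring when it is ported.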

Lastly, OpenCL’s latest release is version 2.0 (following 1.0, 1.1, and 1.2) while TI only supports 1.1 and Altera only 1.0 (albeit with some limited later feature support in both cases). This means that many of the latest features are not available for developers on these platforms. While many of these features may not make sense for DSP or FPGA platform development, there are undoubtedly many important and useful features that do. In addition, it can be frustrating working with a tool with incomplete support. Given the general popularity of OpenCL and the rapid pace of its development, both TI and Altera will probably be playing catch-up for some time.

Probably the biggest factor that could make OpenCL a widely adopted platform for DSP and FPGA system development is, surprisingly, the community excitement around GPUs for DSP programming. As more and more people learn OpenCL and use it for real-world problems, the strengths and weaknesses of the language will become more apparent and new techniques for solving problems will be developed. In time people will need to turn from GPUs to DSPs and FPGAs to gain performance/watt advantages and will rely on OpenCL to make that possible.