Floating-point multiprocessing with C66x DSPs from Texas Instruments

Fixed-point DSPs store and manipulate integers, while floating-point DSPs use a mantissa and exponent to represent rational numbers. Multiple factors determine whether fixed- or floating-point is the right choice for a particular application, including cost, development time, and performance.

Fixed-point DSPs have tended to be lower priced, while floating-point DSPs have been easier to develop for and have offered higher accuracy and precision. As a result, fixed-point DSPs have been used as high-volume, general-purpose processors, while floating-point DSPs have been chosen for specialized, processing-intensive tasks where dynamic range and precision are important.

Until now, it was often necessary to implement algorithms on fixed-point DSPs because floating-point DSPs were not fast enough, but porting algorithms created with MATLAB or other floating-point tools was time consuming. Texas Instruments Incorporated (TI) changed the game with processors based on its TMS320C66x core, which is capable of both fixed- and floating-point arithmetic.

According to TI, having a DSP core that integrates both fixed- and floating-point capability enables a fundamental change in the way algorithms for embedded systems are developed and deployed.

The C66x DSP core provides floating-point capabilities without sacrificing the speed of fixed-point. It achieves this by merging the C67x floating-point and C64x fixed-point instruction sets into a single C66x instruction set. The instruction set architecture is fully IEEE 754 compliant and supports both single- and double-precision floating-point operations. Software can choose between floating-point and fixed-point on an instruction-by-instruction basis, which enables developers to optimize their code for the needs of their particular application.

Applications for floating-point DSPs

Many applications use algorithms that can be accelerated by floating-point DSPs and benefit from their improved precision. The increased affordability of DSPs based on the C66x core makes these benefits more widely available, while the devices retain fast fixed-point processing for computation that does not require floating-point arithmetic.

For example, in wireless and radio systems, MIMO and beam-forming algorithms rely on matrix inversion techniques inherently susceptible to quantization and scaling errors associated with fixed-point processing. Using floating-point DSPs to implement these algorithms improves both the speed and the accuracy of the system, resulting in higher performance. The C66x DSP core runs MIMO and other key multi-antenna signal processing algorithms four times faster than the same algorithms running in fixed-point on the C64x DSP.
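The sensitivity of matrix inversion to quantization can be illustrated with a small sketch. It assumes a Q15 fixed-point format (15 fractional bits), which is a common DSP convention but is not taken from the article, and the helper names are hypothetical. A nearly singular 2x2 matrix has a small but non-zero determinant in floating-point, while the same computation in Q15 rounds to exactly zero, so the matrix would look non-invertible.

```python
Q = 15  # fractional bits of the assumed Q15 format

def to_q15(x):
    """Quantize x to a Q15 integer: round to nearest, then saturate."""
    n = round(x * (1 << Q))
    return max(-(1 << Q), min((1 << Q) - 1, n))

def q15_mul(a, b):
    """Multiply two Q15 integers and round the product back to Q15."""
    return (a * b + (1 << (Q - 1))) >> Q

def det2_q15(a, b, c, d):
    """Determinant of [[a, b], [c, d]] using Q15 arithmetic only."""
    return q15_mul(a, d) - q15_mul(b, c)

def det2_float(a, b, c, d):
    """Determinant of [[a, b], [c, d]] in floating-point."""
    return a * d - b * c

# Nearly singular matrix: the second row is almost half of the first.
A = (0.5, 0.25, 0.25, 0.12501)

det_fixed = det2_q15(*[to_q15(x) for x in A])  # 0: the 1e-5 offset is below Q15 resolution
det_float = det2_float(*A)                     # about 5e-06

print("Q15 determinant:  ", det_fixed)
print("float determinant:", det_float)
```

Careful scaling can recover some of this precision in fixed-point, but that scaling is exactly the porting effort that a floating-point implementation avoids.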

Radar, navigation, and guidance systems process data that is acquired using arrays of sensors. To extract information on the location and movement of the target, the data must be processed as a set of linear equations. With a floating-point engine like the C66x, a greater precision of output can be achieved, as well as a larger dynamic range. A 32-bit fixed-point DSP has a dynamic range of 0 to 4.3 x 10^9 with integer resolution. A 32-bit floating-point DSP has a range of 1.2 x 10^-38 to 3.4 x 10^38, so these functions can handle both very small and very large values that a fixed-point DSP cannot represent without rescaling.
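The range figures quoted above can be checked with a few lines of code. This sketch assumes an unsigned 32-bit integer and IEEE 754 single precision, and it takes the smallest normalized value as the lower bound of the floating-point range. The helper name is hypothetical.

```python
import math
import struct

def float32_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

UINT32_MAX = 2**32 - 1                    # largest unsigned 32-bit integer, about 4.3e9
FLT_MIN = float32_from_bits(0x00800000)   # smallest normalized float32, about 1.2e-38
FLT_MAX = float32_from_bits(0x7F7FFFFF)   # largest finite float32, about 3.4e38

# Dynamic range expressed in decibels: 20 * log10(largest / smallest step).
fixed_db = 20 * math.log10(2**32)
float_db = 20 * math.log10(FLT_MAX / FLT_MIN)

print(f"fixed-point: 0 to {UINT32_MAX:.1e} ({fixed_db:.0f} dB)")
print(f"float32:     {FLT_MIN:.1e} to {FLT_MAX:.1e} ({float_db:.0f} dB)")
```

The floating-point range is wider by many orders of magnitude, although each single-precision value carries only a 24-bit significand, so the gain is in range rather than in the number of significant bits.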

Another application that benefits from the accuracy of floating-point is image processing. For example, in ultrasound, the greater precision given by the C66x enables imaging systems to achieve a much higher level of definition and faster recognition due to lower quantization noise, thus improving the diagnostic process.

The new TI DSPs - a closer look

TI offers multiple processors, based on its KeyStone architecture, that incorporate its C66x core. Let's take a closer look at two examples.

First, the TMS320C6678 is based on TI's KeyStone I architecture and includes eight C66x cores, each running at up to 1.25 GHz, for the equivalent of up to 10 G cycles per second of DSP processing. Executing eight instructions per cycle, with two-way single-instruction, multiple-data (SIMD) support, gives the device its headline figure of 160 giga floating-point operations per second (GFLOPS). The C66x core is fully backward compatible with the C6000 family of fixed- and floating-point DSPs.
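The headline figure follows directly from the numbers in the paragraph above. The short sketch below multiplies them out. It treats every issued instruction as a floating-point operation, so the result is a theoretical peak and not a sustained rate. The function name is hypothetical.

```python
CORES = 8             # C66x cores in one TMS320C6678
CLOCK_GHZ = 1.25      # maximum core clock
INSTR_PER_CYCLE = 8   # instructions issued per cycle per core
SIMD_WAYS = 2         # two-way SIMD: two operations per instruction

def peak_gflops(cores, clock_ghz, instr_per_cycle, simd_ways):
    """Theoretical peak in GFLOPS, assuming every issue slot does floating-point work."""
    return cores * clock_ghz * instr_per_cycle * simd_ways

total_gcycles = CORES * CLOCK_GHZ                                   # 10.0
device_peak = peak_gflops(CORES, CLOCK_GHZ, INSTR_PER_CYCLE, SIMD_WAYS)  # 160.0

print(f"{total_gcycles} G cycles/s, {device_peak} GFLOPS peak")
```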

The TMS320C6678 also provides a comprehensive set of I/O. This includes four lanes of low-latency Serial RapidIO (SRIO) 2.1 at 5 Gbaud per lane full duplex and two lanes of PCIe Gen2, similarly at 5 Gbaud per lane full duplex. An Ethernet MAC subsystem, two telecom serial ports, UART and I2C interfaces complete the conventional interfaces required by today's embedded devices.
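Gbaud figures count line symbols, not payload bits. As a rough sketch, assuming the 8b/10b line coding that SRIO 2.x and PCIe Gen2 links use at 5 Gbaud, and ignoring packet and protocol overhead, the usable bit rate in each direction can be estimated as follows. The function name is hypothetical.

```python
def payload_gbps_8b10b(lanes, gbaud_per_lane):
    """Per-direction data rate after 8b/10b coding: 8 data bits per 10 line bits."""
    return lanes * gbaud_per_lane * 8 / 10

srio_gbps = payload_gbps_8b10b(lanes=4, gbaud_per_lane=5.0)   # 16.0
pcie_gbps = payload_gbps_8b10b(lanes=2, gbaud_per_lane=5.0)   # 8.0

print(f"SRIO x4:      {srio_gbps} Gbps per direction")
print(f"PCIe Gen2 x2: {pcie_gbps} Gbps per direction")
```

Real throughput is lower still once headers, acknowledgements, and flow control are included.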

In addition to the conventional I/O, the C6678 has a HyperLink interface. This low pin count, point-to-point communication interface is designed to extend internal chip transactions between two KeyStone devices. Supporting high speeds of up to 12.5 Gbaud per lane over four lanes, it can be used to aggregate the device resources of two C6678 DSPs, which can be viewed as a single 16-core system capable of 320 GFLOPS.

The SmartReflex technology of TI's KeyStone devices can also decrease the dynamic power consumption while maintaining performance. Using a dedicated voltage regulator for each device, the device's core voltage is optimized based on the process corner of the device.

Second, the 66AK2H12 processor based on TI's second generation KeyStone architecture (KeyStone II) combines quad ARM Cortex-A15 MPCore processors with eight TMS320C66x DSP cores. With core clocks raised to 1.4 GHz, the 66AK2H12 provides up to 5.6 GHz of ARM and 11.2 GHz of DSP processing power. Its security, packet processing, and Ethernet switching engines make the device a much more powerful wireless embedded processor than ARM Cortex-A15 devices designed for consumer products.

The ARM Cortex-A15 MPCore processor combines leading processing capabilities with a very low power-to-performance ratio, multicore hardware-based cache coherency, and broad industry software support. Because the ARM processors are integrated, most applications do not need an additional high-end, general-purpose processor, which greatly reduces system cost and design complexity.

Figure 1: 66AK2H12 System-on-Chip (SoC). (Image courtesy of Texas Instruments)

With wide industry adoption of the ARM processors, developers can quickly and easily migrate existing software to the new KeyStone II-based devices. Full ARM-based Linux systems can be created, while offloading real-time processing to the high-performance C66x cores.

For more complex systems that need additional processing capacity, the 66AK2H12 has two HyperLink ports. These can be used to connect multiple KeyStone devices and thus add more C66x DSP cores, more ARM Cortex-A15 processors, or both. HyperLink allows the devices to work in tandem transparently with tasks executed as if running on local devices.

The 66AK2H12 is well suited to many applications that need high DSP performance and the control features of the ARM cores, such as high-quality video processing.

Harnessing the power of the C66x DSP core

Let's take the example of the CommAgility AMC-V7-2C6678, a high-performance signal processing AMC card with two 1.25 GHz TMS320C6678 DSPs, giving a total of 16 C66x cores. The two DSPs are linked with HyperLink, providing a connection at up to 50 Gbaud. Each DSP can access up to 2 GB of 64-bit-wide DDR3-1333 SDRAM.

Figure 2: CommAgility AMC-V7-2C6678.

Flexible, high bandwidth off-board communications are provided by Gen2 SRIO at up to 20 Gbps per port. As standard, the board provides a single front panel SFP+ optical interface that links directly to the on-board Xilinx Virtex-7 FPGA, plus a mini-SAS connector linked to the SRIO switch. Should applications require timing and synchronization, this is achieved via the front panel or backplane clock I/O.

The card is well suited to a range of high-performance DSP/FPGA processing applications, including telecoms infrastructure and image processing. By providing a high-performance FPGA, large shared memory and fast, flexible I/O, the horsepower of the TI DSPs can be harnessed and used effectively, while keeping power consumption, physical size, and cost within tight limits.

Application example: machine vision

As demand for higher resolutions and new algorithms rises in machine vision, a large increase in processing performance is needed. In one example, the CommAgility AMC-2C6678L was chosen as the main processing board, with the board's two C66x DSPs undertaking sorting and analyzing tasks.

Previously, all processing in the application used fixed-point arithmetic. The C66x DSPs' integration of floating-point and fixed-point capabilities now makes it easier to quickly undertake test implementations of new algorithms, or to use algorithms that need a high dynamic range of input and output data.

Also, TI now supports the OpenMP and OpenCL specifications on the KeyStone family. This helps in distributing the load on the various cores when using algorithms that can be easily parallelized. These technologies are useful for rapid implementations, but in the long term it can be more effective to distribute the workload across the cores using different tasks in the DSPs' SYS/BIOS real-time kernel, and with the hardware facilities provided by TI's Multicore Navigator.
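The idea of spreading an easily parallelized workload across cores can be sketched in a few lines. The example below is illustrative only: Python threads stand in for DSP cores, the per-row sum stands in for a real vision kernel, and none of the names come from TI's OpenMP, OpenCL, or SYS/BIOS software. It splits the rows of an image into one contiguous block per worker, which is similar in spirit to a static loop schedule.

```python
from concurrent.futures import ThreadPoolExecutor

N_CORES = 8  # eight C66x cores per C6678, as described earlier in the article

def static_partition(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def row_sums(image, rows):
    """Placeholder per-row kernel: sum the pixel values of each assigned row."""
    return [sum(image[r]) for r in rows]

def parallel_row_sums(image, n_workers=N_CORES):
    """Process each chunk of rows on its own worker and return results in row order."""
    chunks = static_partition(len(image), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda rows: row_sums(image, rows), chunks))
    return [s for part in parts for s in part]

# Small synthetic "image": 10 rows of 4 pixels.
image = [[r + c for c in range(4)] for r in range(10)]
print([len(c) for c in static_partition(len(image), N_CORES)])
print(parallel_row_sums(image))
```

A static split like this works well when every row costs about the same. When the cost per item varies, explicit tasks and hardware queues, as mentioned above, give finer control over load balancing.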

For more information on Texas Instruments DSPs, see: www.ti.com/lsds/ti/dsp/overview.page

Paul Moakes Ph.D. MIET is Technical Director at CommAgility. He has previously been employed by Motorola, Blue Wave Systems, and Marconi Instruments. He holds two patents in the field of MicroTCA and AdvancedMC. He received his Ph.D. in Electrical and Electronic Engineering from Sheffield University, and a First Class Honours degree in Electronic Communications and Computer Systems Engineering from Bradford University.


paul.moakes@commagility.com www.commagility.com