Embedded applications using integrated DSP and microprocessor cores

The trend toward unified microprocessor/DSP cores

Embedded applications are composed of a mix of signal processing and control algorithms that work together to perform the necessary functions inside many real-time embedded systems, so it is essential to understand how the control and digital signal processing (DSP) algorithms interoperate. In applications such as cell phones and MP3 players, this challenge has traditionally been solved by partitioning the control algorithms onto a RISC processor and the signal processing onto a DSP. In a cell phone, for example, the signal processing functions may include echo cancellation and audio/video encode and decode; these algorithms run efficiently on DSPs, whose architectures are designed to execute exactly this kind of workload. The control software in the same phone implements the state machines that manage the user interface, the keypad, and other non-signal-processing functions.
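
To make this split concrete, control code tends to be branch-heavy and data-light. The fragment below is a purely illustrative keypad state machine in C; the states and keys are assumptions rather than details from any particular handset, but logic of this shape maps naturally onto a RISC core instead of a DSP.

    /* Illustrative control-style code: a tiny keypad state machine.
     * Branch-heavy, data-light logic like this favors a RISC core. */
    #include <stdint.h>

    typedef enum { IDLE, DIALING, CONNECTED } call_state_t;

    static call_state_t state = IDLE;

    void on_key_press(uint8_t key)
    {
        switch (state) {
        case IDLE:
            if (key != 0)   state = DIALING;    /* first digit starts dialing */
            break;
        case DIALING:
            if (key == '#') state = CONNECTED;  /* '#' places the call */
            break;
        case CONNECTED:
            if (key == '*') state = IDLE;       /* '*' hangs up */
            break;
        }
    }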

Developing embedded applications in which signal processing and control functions must interoperate poses several challenges. When porting a desktop or other complex application onto an embedded device, getting the cores to operate in real time with an appropriate code partitioning can be very difficult because of the effort required to manage synchronization between the two cores. Many advanced workloads, such as video or protocol processing, are hard to partition across multiple cores, and most of that partitioning falls to the programmer. For example, in a two-core architecture that pairs a Texas Instruments TMS320C55x DSP with an ARM RISC processor, the DSP performs the signal processing tasks while the ARM9 executes the control functions (see Figure 1).
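
When the two functions live on separate cores, the programmer usually has to build and maintain the hand-off mechanism between them. The sketch below shows one minimal form such a hand-off could take, a shared-memory mailbox polled by both sides; the structure layout, the shared RAM address, and the frame size are assumptions for illustration, not any vendor's inter-processor communication API.

    /* Hypothetical shared-memory mailbox between a RISC core and a DSP.
     * The RISC side copies a frame in, raises a flag, and waits for the
     * DSP to signal completion; every such handshake is overhead that a
     * single-core design avoids. */
    #include <stdint.h>

    typedef struct {
        volatile uint32_t ready;        /* set by the ARM, cleared by the DSP */
        volatile uint32_t done;         /* set by the DSP, cleared by the ARM */
        int16_t           samples[256]; /* one audio frame (assumed size) */
    } mailbox_t;

    #define MAILBOX ((mailbox_t *)0x80000000u)  /* assumed shared RAM address */

    void arm_send_frame(const int16_t *frame)
    {
        while (MAILBOX->ready) { }      /* wait for the DSP to take the last frame */
        for (int i = 0; i < 256; i++)
            MAILBOX->samples[i] = frame[i];
        MAILBOX->done  = 0;
        MAILBOX->ready = 1;             /* signal the DSP */
        while (!MAILBOX->done) { }      /* block until the DSP finishes */
    }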

Figure 1

An alternative approach is to unify the DSP and microprocessor onto a single device. This can be accomplished by adding DSP-like instructions, such as multiply-accumulate and specialized addressing instructions, to the RISC core, or by adding control-like instructions to the DSP core. With the support of available tools, this “unified” technique offers several advantages: only one application runs natively on one operating system, which allows simpler design techniques, easier integration, and faster time-to-market.

Embedded applications have historically been designed by partitioning general-purpose functions onto a general-purpose microprocessor or microcontroller and the signal processing algorithms onto a DSP core. This approach makes sense for a number of reasons:

  • The DSP core is specialized to run signal processing algorithms efficiently.
  • DSP architectures share a number of common features such as parallel compute/moves, fast Multiply-and-Accumulate (MAC) operations and Harvard architectures, which allow the simultaneous fetching of multiple operands.
  • DSP processors are usually not based on RISC design principles.
  • DSP architectures are driven by applications such as video, image, and voice processing and by data compression and decompression, workloads that are common in telecommunication and multimedia systems.
  • DSP instruction sets are memory oriented and optimized for performing signal processing algorithms such as filters and transforms; the FIR sketch after this list shows why. To support these operations, DSPs include dedicated registers, address units, multiply-accumulate units, and on-chip memory.
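
A short example makes these points concrete. The plain-C 32-tap FIR filter below is the canonical DSP kernel: each tap costs one multiply-accumulate plus two operand fetches, which is exactly what single-cycle MAC units and dual Harvard memory buses are built to sustain.

    /* 32-tap FIR filter inner loop: the loop body maps to one MAC
     * instruction per tap on a DSP, with both operands fetched in the
     * same cycle over separate (Harvard) memory buses. */
    #include <stdint.h>

    int32_t fir32(const int16_t *x, const int16_t *h)
    {
        int32_t acc = 0;
        for (int k = 0; k < 32; k++)
            acc += (int32_t)x[k] * h[k];   /* multiply-accumulate */
        return acc;
    }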

A key challenge with system application partitioning is that each core requires its own external memory subsystem, which consumes more power. In addition to the power required to control these independent memory subsystems, oftentimes each core has to control its own set of peripherals in order to get data into and out of the processing core (this is shown in Figure 1 where both the DSP and the ARM interface to a set of potentially different peripherals and memory subsystems). This also consumes more power and adds to the overall system communication overhead.

Adding DSP instructions to the RISC core

Integrating a DSP unit into a RISC architecture adds parallelism and allows more efficient sharing of resources such as peripherals and memory. DSP algorithms can then also benefit from the high clock speeds typical of RISC architectures.

However, RISC architectures are based on a load/store principle and have a more general-purpose instruction set, which can negatively impact signal processing performance. These devices rely on sophisticated cache strategies and deep pipelining wherever possible, which leads to higher clock frequencies. To support DSP algorithms, such RISC-based microprocessors have been augmented with DSP enhancements, including multiply-accumulate instructions and dedicated units for graphics or image processing.
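
As a rough illustration, the portable C routine below performs the kind of saturating multiply-accumulate these DSP enhancements target; on a RISC core with DSP extensions the whole body can typically collapse into a single instruction. The function name and the fixed-point format are assumptions made for the sketch.

    /* Saturating multiply-accumulate written in portable C. A DSP-enhanced
     * RISC core can do this in one instruction; without the enhancement it
     * takes a multiply, a widening add, and two compare/select steps. */
    #include <stdint.h>

    int32_t mac_q31_sat(int32_t acc, int16_t a, int16_t b)
    {
        int64_t r = (int64_t)acc + (int32_t)a * b;
        if (r > INT32_MAX) r = INT32_MAX;   /* saturate instead of wrapping, */
        if (r < INT32_MIN) r = INT32_MIN;   /* as DSP hardware typically does */
        return (int32_t)r;
    }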

Since one integrated RISC/DSP core does the work of two processors, there is no need for inter-processor communication. With one integrated core, DSP and controller code can be assigned dynamically in response to changes in system requirements or environment. This model also provides faster context switching and requires fewer resources, because peripherals and memory are not duplicated. The result is a more integrated system with lower power consumption, along with the cost, performance, and die-size benefits that come from integrating more functional units on one device.
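
A minimal sketch of what this looks like in practice is shown below; the driver hooks it calls are hypothetical. On an integrated core the DSP work and the control work are simply two functions in one instruction stream, so moving between them is an ordinary function call rather than an inter-processor handshake.

    /* Single-core main loop interleaving DSP-style and control-style work.
     * frame_available(), read_frame(), process_frame(), and poll_keypad()
     * are assumed application hooks, named only for illustration. */
    #include <stdbool.h>
    #include <stdint.h>

    extern bool frame_available(void);
    extern void read_frame(int16_t *buf);
    extern void process_frame(int16_t *buf);   /* DSP task: filters, codec, etc. */
    extern void poll_keypad(void);             /* control task: state machines   */

    void main_loop(void)
    {
        static int16_t frame[256];
        for (;;) {
            if (frame_available()) {           /* run the DSP task when data arrives */
                read_frame(frame);
                process_frame(frame);
            }
            poll_keypad();                     /* control task runs in the gaps */
        }
    }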

As mentioned earlier, there are several common features inherent in modern DSP processors. These include specialized data paths configured for DSP operations, specialized instruction sets for DSP-centric operations, multiple memory banks and buses for multiple sequential memory accesses, and specialized peripherals for DSP.

On the other hand, designers of general-purpose processors are building DSP-like extensions into their own cores, and several approaches have been used. Designers can add dedicated single-instruction, multiple-data (SIMD) instructions, such as the Pentium's Multi-Media Extension (MMX) instructions, or they can integrate a fixed-point, DSP-like data path with multi-operand fetch and other related resources into an existing CPU core, as in the Hitachi SH-DSP. A DSP-style co-processor can also be added to the CPU core, as with ARM Ltd.'s NEON architecture, or developers can create hybrid architectures such as the TriCore processor.

Architectures with integrated DSP features

The NEON SIMD instructions allow up to 16 elements to be processed in parallel, which accelerates media and DSP applications. They are tightly coupled to the core (see Figure 2), and this integration provides a view of memory that is unified with, and shared by, the ARM core. The result is a single instruction stream and a single platform target, which speeds overall application development.
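
The sketch below, which assumes an ARM toolchain providing the arm_neon.h intrinsics, scales a block of 16-bit audio samples eight at a time. Because the samples stay in memory that the ARM core itself addresses, there is no separate DSP image to build and no buffer to hand across to another processor.

    /* Minimal NEON sketch: apply a Q15 gain to a buffer of samples,
     * eight lanes per iteration. Assumes n is a multiple of 8. */
    #include <arm_neon.h>
    #include <stdint.h>

    void scale_q15(int16_t *buf, int16_t gain, int n)
    {
        int16x8_t g = vdupq_n_s16(gain);        /* broadcast gain to 8 lanes */
        for (int i = 0; i < n; i += 8) {
            int16x8_t v = vld1q_s16(&buf[i]);   /* load 8 samples            */
            v = vqdmulhq_s16(v, g);             /* saturating Q15 multiply   */
            vst1q_s16(&buf[i], v);              /* store 8 results           */
        }
    }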

Architectures with this kind of integration work well for applications such as 3G cell phones. In such a system, a dedicated DSP data engine could handle processing such as video encoding, the ARM core with NEON DSP extensions could perform the audio and video decode, and the RISC processing engine could run the user interface and protocol stack.

Figure 2

The TriCore architecture (see Figure 3) couples an MCU-like, RISC-based load/store architecture with a DSP-like Harvard memory architecture. The address buses are each 32 bits wide, while the program and data memory buses are 64 bits wide. The core itself does not contain any memory but can be customized by the designer. The superscalar architecture contains a 32-bit fixed-point data path, a load/store unit, and a program control unit, and it can execute up to three instructions per cycle: a data-path instruction, a load/store instruction, and a loop instruction. This level of parallelism is needed for high-performance DSP applications.

Figure 3

This device also supports DSP addressing modes, including register-indirect with pre- and post-increment, indexed addressing, circular (modulo) addressing, and bit-reversed addressing. Bit-reversed addressing is useful for unscrambling the inputs or outputs of FFT algorithms, a common DSP operation. Zero-overhead hardware looping is also supported.
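
For comparison, here is what two of these addressing patterns cost when written in portable C without hardware support; with circular and bit-reversed address modes the address-generation unit performs the same updates as a free side effect of each load or store.

    /* Software equivalents of two DSP addressing modes. In hardware the
     * address unit does this work implicitly during the memory access. */
    #include <stdint.h>

    /* Circular (modulo) addressing: step an index around a delay line. */
    static inline uint32_t circ_next(uint32_t i, uint32_t len)
    {
        return (i + 1u) % len;            /* hardware: post-increment with wrap */
    }

    /* Bit-reversed addressing: reorder FFT data for a transform of size 2^bits. */
    static uint32_t bit_reverse(uint32_t i, uint32_t bits)
    {
        uint32_t r = 0;
        for (uint32_t b = 0; b < bits; b++) {
            r = (r << 1) | (i & 1u);      /* peel one bit at a time */
            i >>= 1;
        }
        return r;                         /* hardware: reverse-carry increment */
    }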

In summary, the key benefits to using an integrated RISC/DSP processor for real-time embedded systems are:

  • A single architecture merges both DSP and microcontroller features without sacrificing the performance of either.
  • Fast task switching allows the integrated core to act more like a virtual processor and switch between DSP and microcontroller tasks very quickly, sometimes as fast as a couple of clock cycles.
  • Larger on-chip memory blocks (RAM, ROM) lead to higher performance and lower system power.
  • An integrated architecture provides direct control of on-chip peripherals without the need for additional glue logic.