Software programming techniques for embedded DSP software

Due to real-time constraints, today's DSP applications eventually grow to the point where they stress the available CPU and memory resources. Several techniques can be used to improve the performance of an embedded DSP software application. However, knowing whether to optimize for performance, power or memory can be difficult simply because of the many positive and negative tradeoffs among them. Understanding the workings of the DSP architecture, compiler and algorithms can significantly accelerate the application.

Embedded applications are an exercise in optimization, which is a procedure that seeks to maximize or minimize one or more performance indices including:

  • Throughput (or execution speed)
  • Memory usage
  • I/O bandwidth and
  • Power dissipation

Specifically, when optimizing the application, there are three key tactics developers need to bear in mind:

  • Optimizing the DSP architecture
  • Optimizing the algorithms
  • Optimizing the compilers

What and where to optimize

Generally, DSP optimization follows the 80/20 rule, which states that 20 percent of the software in a typical application uses 80 percent of the processing time. This is especially true for DSP applications, which spend a large portion of their time in the tight inner loops of DSP algorithms. The real issue in optimization is not how to optimize, but what and where to optimize.

Understanding where the execution cycles are being spent makes it possible to optimize the DSP application efficiently and achieve the best results. The best way to determine which parts of the code should be optimized is to profile the application and find the modules that take the longest time to execute. These are the best candidates for performance-based optimization, and similar questions can be asked about memory usage and power consumption.
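As a minimal sketch of this profiling step, the following host-side harness times a candidate routine; `profile_seconds` and `workload` are hypothetical names, and on a real DSP target a hardware cycle counter or the vendor's profiler would replace the C library `clock()`:

```c
#include <assert.h>
#include <time.h>

/* Time a candidate routine over `iterations` runs and return the elapsed
 * CPU time in seconds. Running each module through this harness ranks
 * them as optimization candidates. */
double profile_seconds(void (*routine)(void), int iterations)
{
    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        routine();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* A stand-in workload: a tight loop like those in DSP inner kernels.
 * The volatile sink keeps the compiler from deleting the loop entirely. */
static volatile int sink;
static void workload(void)
{
    int acc = 0;
    for (int i = 0; i < 1000; i++)
        acc += i;
    sink = acc;
}
```

Comparing the measured times of each module then answers the "what and where" question with data rather than guesswork.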

Since many DSP systems are real-time systems, at least one (and probably more) of the performance indices must be optimized. Given that it is very difficult and usually impossible to optimize all these indices at the same time without having some sort of tradeoff between the various indices, it is very important to understand the system goals before attempting any form of optimization. For example, optimizing an application for speed often means a corresponding decrease in power consumption but an increase in memory usage. Optimizing for memory may also result in a decrease in power consumption due to less memory accessed but an offsetting decrease in code performance.

Determining which indices to optimize depends on the goals of the developer. If performance is optimized, a slower and cheaper DSP can be used to do the same amount of work, and this cost saving can have a significant impact on the success of the application.

The developer can also choose to optimize the application to allow for the addition of more functionality. This can be very important if the additional functionality improves the overall performance of the system, or if the developer can add more capability to the system such as an additional channel of a base station system. Optimizing memory by reducing the application size can reduce overall system cost. Finally, optimizing for power means that the application can run longer on the same amount of power, which is very important for battery powered applications. This type of optimization also reduces the overall system cost with respect to power supply requirements and other cooling functionality required.

Understand the application before you start

Make certain you have a thorough understanding of the architecture, compiler and the application before you start, because each target processor and compiler has different strengths and weaknesses. Since DSP application optimization requires a disciplined approach to achieve the best results, the basic guidelines should be understood and implemented before proceeding.

Many of today's advanced DSP compilers enable the developer to use a higher-level language such as C with very little assembly language, which can lead to faster code development, easier debugging and more reusable code.

Significant improvements can be found early in the process with relatively little effort, such as accessing data from fast on-chip memory using the DMA and pipelining inner loops. However, as the optimization process continues, the effort expended will increase dramatically while further improvements and results will fall dramatically (see Figure 1).

Figure 1

Making several optimization changes at the same time makes it difficult to determine which change led to which improvement. Change one parameter at a time and see how each change affects the entire program.

Even for the most advanced developer, it can be very difficult to know what is actually optimized and what performs best, especially since today's optimizing DSP compilers are becoming increasingly complex. Performance issues, however major or minor, can go unnoticed if the proper tools, such as a profiler, are not utilized.

Optimizing is not an easy feat. Difficult optimizations can introduce subtle changes to the software that lead to more problems. It is important to develop a test plan to compare the expected results with the actual results, and to run the tests several times throughout the process to catch any problems as early as possible. The programmer must verify that optimizations have not broken the application, because it is extremely difficult to back optimization changes out of a program after it breaks.

Code optimization process

The code optimization process consists of a series of steps, and at each step the compiler-generated code and the available optimization opportunities need to be examined (Figure 2). For example, an abundance of NOPs caused by delays in accessing memory or other processor resources is a key area for improvement. Some techniques that can be applied to reduce the processor cycle count include:

  • Software pipelining
  • Loop unrolling
  • DMA resource utilization
  • Hand-tuning the algorithms in assembly code, but only as a last resort

Many times, the C code can be modified slightly to achieve the desired efficiency but finding the right “tweak” for the optimal solution can take time and several iterations.
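Loop unrolling, the second technique in the list above, can be sketched in C as follows; the function names are illustrative, and the unrolled version assumes the trip count is a multiple of four to keep the example short:

```c
#include <assert.h>

/* Straightforward reference version: one multiply-add per iteration. */
int dot_ref(const short *a, const short *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by four: fewer loop-count checks and branches per element,
 * and four independent accumulators give the compiler more operations
 * to schedule in parallel. Assumes n is a multiple of 4. */
int dot_unrolled(const short *a, const short *b, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Both functions compute the same result; the tradeoff is the extra program memory the unrolled body consumes.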

Figure 2

There have been substantial improvements in industrial DSP compilers with respect to advanced optimization techniques, and many of the aggressive optimizing compilers have grown to be quite complex. But for all their complexity, they still have their limits.

Make the common case fast

The "common case" refers to the area in the code that consumes the most resources, such as cycle count, memory or power consumption. Three strategies can be utilized: optimize the DSP architecture, algorithms or compilers.

The fundamental rule in programming real-time DSP-based systems is "make the common case fast and favor the frequent case." This is really just Amdahl's Law, which states that the performance improvement to be gained from some faster mode of execution is limited by how often that faster mode is used. Thus, do not spend time trying to optimize a piece of code that will hardly ever run. Instead, eliminating even one cycle from a loop that executes thousands of times has a far bigger impact on the bottom line.
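Amdahl's Law can be written as a one-line function (the name `amdahl_speedup` is illustrative):

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when a fraction `f` of the runtime is
 * accelerated by a factor `s` and the remaining (1 - f) is unchanged. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

Applied to the 80/20 rule: speeding up the hot 80 percent by 4x yields 1/(0.2 + 0.8/4) = 2.5x overall, while a 10x speedup of code that runs only 5 percent of the time yields barely 1.05x, which is why rarely executed code is a poor optimization target.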

DSP architectures

DSP chip designers have developed hardware architectures to efficiently implement algorithms built on sums of products (SOP): standard DSP algorithms such as filters, Fourier transforms and convolutions all perform multiplies and adds over and over again. This is achieved with specialized instruction sets such as single-cycle multiply and accumulate (MAC), architectures that allow multiple memory accesses in a single cycle, and special hardware that handles loop counting with very little overhead.
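The sum-of-products pattern is easiest to see in an FIR filter kernel; this sketch uses an illustrative helper name, and on a DSP each loop iteration maps onto one single-cycle MAC instruction with the operand fetches and loop bookkeeping handled in hardware:

```c
#include <assert.h>

/* One FIR filter output sample is a sum of products:
 *   y = sum over k of h[k] * x[n - k]
 * `h` holds the filter coefficients; `x` holds the most recent `taps`
 * input samples, newest last. */
int fir_sample(const short *h, const short *x, int taps)
{
    int acc = 0;                           /* the MAC unit's accumulator */
    for (int k = 0; k < taps; k++)
        acc += h[k] * x[taps - 1 - k];     /* one multiply-accumulate */
    return acc;
}
```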

DSP algorithms

Algorithmic transformation can notably speed up DSP algorithms, and the Fourier transform is a good example of how the choice of implementation can have a significant impact on overall performance. This common algorithm is a mathematical method of breaking a signal in the time domain into all of its individual frequency components1.

There are different ways to characterize Fourier transforms:

  • The Fourier Transform is a mathematical formula using integrals:
    Figure 3: The Fourier Transform
  • The Discrete Fourier Transform (DFT) is a discrete numerical equivalent using sums instead of integrals, which maps well to a digital processor like a DSP:
    Figure 4: Discrete Fourier Transform (DFT)
  • The Fast Fourier Transform (FFT) is just a computationally fast way to calculate the DFT which reduces many of the redundant computations of the DFT. The FFT makes use of periodicities in the sines that are multiplied to perform the transform. This significantly reduces the amount of calculations required.

DSP compilers

Optimizing compilers perform sophisticated program analysis, including intraprocedural and interprocedural analysis. These compilers also perform data-flow, control-flow and dependence analysis, and apply a wide range of methods for modifying or transforming code. Much of this analysis is needed to prove that a transformation is correct in the general sense.

A general optimization strategy is to write DSP application code that can be pipelined efficiently by the compiler. Software pipelining is an optimization strategy that schedules loops and functional units efficiently: multiple iterations of a loop are scheduled to execute in parallel, and the loop is reorganized so that each iteration in the pipelined code is built from instruction sequences selected from different iterations of the original loop. For example, on a Texas Instruments TMS320C6000™ DSP, there are eight functional units that can be used at the same time (Figure 5). Sometimes a subtle change in the way the C code is structured makes all the difference. The simple addition of the "restrict" keyword to indicate that two arrays do not overlap in memory allows the compiler to perform simultaneous reads and writes on the two arrays without having to assume data dependencies.
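The "restrict" change is small enough to show in full; this is a generic C99 sketch (the function name is illustrative), not code taken from the TMS320C6000 documentation:

```c
#include <assert.h>

/* Without `restrict`, the compiler must assume `out` may alias `in1`
 * or `in2`, forcing each store to complete before the next loads and
 * breaking the software pipeline. With `restrict`, the programmer
 * promises the arrays do not overlap, so loads and stores from
 * different iterations can be scheduled to overlap. */
void vector_add(int * restrict out,
                const int * restrict in1,
                const int * restrict in2,
                int n)
{
    for (int i = 0; i < n; i++)
        out[i] = in1[i] + in2[i];
}
```

Note that `restrict` is a promise, not a check: if the arrays actually do overlap, the behavior is undefined, so the claim must be verified by the programmer.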

Figure 5

Power optimization

Balancing the right amount of performance, power and memory can be very tricky, as there is almost always some sort of tradeoff among the three. To achieve higher performance, memory is usually sacrificed. For example, unrolling a loop can lead to significant performance gains, but it also needs more program memory. Software pipelining can be even more memory intensive and can take 10 to 20 times the instructions of a non-pipelined loop.

Power optimization, on the other hand, is more directly proportional to performance than it is to memory. Developers often optimize their code for execution speed, and typically the performance is good enough that further optimizations are not needed. When considering power consumption, faster code will usually allow more time in idle or sleep modes, or a greater reduction in the CPU frequency requirements. However, be aware that in very specific situations faster code can actually increase power consumption because of increased parallelism and the resulting circuit activity. To find the best tradeoff, the engineer needs to benchmark and experiment first.

There are a myriad of peripherals and techniques that can be used to achieve low power consumption, each in isolation or in conjunction with others. For instance, with intelligent architecting of the software, stopping or idle points in the application can become trigger points for turning off parts of the device or peripherals, including the DMA. This makes the power efficiency of an application easier to manage globally. In addition, place code and data as close to the CPU as possible, since nearby memory takes less time to access and less hardware to use. This also minimizes off-chip accesses that turn buses and other peripherals on and off. Use hierarchical memory models as much as possible to leverage the cache and instruction buffers, which significantly lowers off-chip memory accesses and power consumption. More techniques can be found in Table 1.



  • Perform necessary code and data size optimizations: reduces the application footprint, the memory required, and the leakage power of that memory

  • Reduce the critical path in the software application: identifying the worst-case performance scenario allows the system supply voltage to be lowered while still keeping system throughput fixed

  • Use the DMA instead of the CPU: transfers data more efficiently

  • Use co-processors: co-processors are optimized to perform certain computationally intensive tasks efficiently, so offloading frequent or specialized processing to them saves power

  • Increase buffering and batch processing: performs more computation at once, allowing more time in low-power modes

  • Use interrupt-driven programming instead of polling: polling requires software to check an interface periodically to detect events, while with interrupts a DSP operating system can block the application as necessary. For example, a keypad interface routine might need to spin or periodically wake to detect and resolve a keypad input. Designing the interface to generate an interrupt on keypad input not only simplifies the software, but also enables event-driven processing and activation of processor idle and sleep modes while waiting for interrupts

  • Add a boot-time power saving module: most processors boot up fully powered at a maximum clock rate, so unnecessary resources may be powered during boot-up; the application can traverse the system, turning off or idling unnecessary power consumers

  • Accept less accuracy: if the application is over-calculating results, such as using long integers when short integers will suffice, it is perfectly okay to accept less accuracy in some calculations to cut down on processing requirements. For example, certain signal processing applications may be able to tolerate more noise in the results, which enables reduced processing and reduced power consumption

  • Use low-power code sequences and data patterns: different processor instructions exercise different functional units and data paths, so they have different power requirements. Analyzing the effects of individual instructions and data patterns is an extreme technique that is sometimes used to maximize power efficiency

Table 1
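The "accept less accuracy" entry in Table 1 can be sketched with Q15 fixed-point arithmetic, a common DSP convention; the `q15_mul` helper below is an illustrative name, not code from the article:

```c
#include <assert.h>
#include <stdint.h>

/* Q15 fixed point: a 16-bit integer represents a value in [-1, 1)
 * scaled by 2^15, so 16384 stands for 0.5. Multiplying two Q15 numbers
 * in a 32-bit intermediate and shifting right by 15 keeps the result
 * in Q15, trading a little accuracy for 16-bit data paths and half the
 * memory traffic of 32-bit or floating-point math. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b) >> 15);
}
```

For example, 0.5 × 0.5 becomes q15_mul(16384, 16384), which returns 8192, the Q15 encoding of 0.25.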

Points to remember

Embedded real-time applications are an exercise in optimization, whether for performance, power or memory, and developers must thoroughly understand the potential problems and the tools that supply the necessary improvements. In short, the three main optimization strategies involve the DSP architecture, algorithms and compilers:

  • DSP architecture optimization: DSPs are optimized microprocessors that perform signal processing functions very efficiently by providing hardware support for common DSP functions
  • DSP algorithm optimization: Choosing the right implementation technique for standard and often-used DSP algorithms can have a significant impact on system performance
  • DSP compiler optimization: DSP compilers are tools that help the embedded programmer exploit the DSP architecture by mapping code onto the processor's resources in such a way as to utilize as much of the processing capacity as possible, gaining the highest level of architectural entitlement possible

Used with the permission of the publisher, Elsevier Science/Newnes, this article is based on chapter six of "DSP Software Development Techniques for Embedded and Real-Time Systems" by Robert Oshana. For more information, please visit

1 Brigham, E. Oran, The Fast Fourier Transform and Its Applications, Englewood Cliffs, NJ: Prentice-Hall, Inc., 1988, 448 pp.

Texas Instruments
Robert Oshana is an engineering manager in the Software Development Organization of TI’s DSP Systems business. He is responsible for the development of hardware and software debug technology for many of TI’s programmable devices. He has 25 years of real-time embedded development experience.