Migrating PowerPC to FPGAs: New C-to-hardware tools to renew PowerPC applications
This article explores the considerations that make legacy embedded code a better or worse candidate for FPGA deployment.
There is a large and established body of intellectual property developed in C for embedded systems. Much of this code resides in legacy embedded processors, and for cost reduction or for increased performance must be migrated to more modern processing elements. Engineers tasked with this migration have many choices including the use of newer, more powerful processors and/or the redeployment of applications into FPGA devices.
Porting a legacy application to a new type of embedded computing platform is in large part determined by how much redesign work is needed. The full cost of redeployment in these cases goes beyond the bill of materials difference in hardware to include the cost of refactoring the embedded code for a different processor or, in the case of FPGA conversion, rewriting some substantial portions of the application in a hardware description language. FPGAs with embedded standard processors can greatly simplify this migration path.
In the shortest FPGA conversion path, it is possible to move legacy PowerPC code to an embedded PowerPC core on an FPGA. In more sophisticated deployments, application developers also take the opportunity to consider moving high bandwidth processing elements of their code out of the processor and into the actual hardware logic on the FPGA. This has value because the FPGA hardware can exploit parallelism to run algorithms at a slower clock speed, using less power, but with higher performance than a software-only version.
This article explores the considerations that make legacy embedded applications better or worse candidates for FPGA deployment, including algorithm refactoring and decisions about software/hardware partitioning, as well as re-deployment involving both migration of existing processor code and consolidation of peripheral logic to create "systems on programmable devices". We'll also explore how software-to-hardware tools can be used to greatly speed the redesign and refactoring process, and discuss some new architectural options this configuration opens up.
The PowerPC has been a solid workhorse processor for many years. There are tens of thousands of PowerPC-based embedded applications that range from satellites to set-top-boxes. Many of these legacy applications and their underlying algorithms remain valid and useful. Moving such applications to FPGAs is a low-cost way to value-engineer a product so it can run faster, include new features, be cheaper to produce, and use less power. FPGA manufacturers offer well-realized versions of existing processors like the PowerPC and ARM. Royalty-free "soft" processors such as Altera NIOS and Xilinx MicroBlaze are powerful targets for legacy application conversion.
Design types that will benefit from migration to FPGA
Today, the most common applications being moved to FPGAs from discrete processors include image and signal processing, sonar, radar, data security, and automotive. Key shared characteristics of these applications include: significant non-sequential logic that can be accelerated through parallelizing, I/O requirements that are outside the scope of a traditional processor, and the need for flexible memories. These projects also tend to have low to medium production volumes (too low to amortize the cost of a full ASIC implementation) or require frequent hardware field updates or customization.
The goals and constraints of an FPGA migration can include performance, power, cost, heat, physical size, and reliability. Additional benefits are the classic FPGA attributes of shorter time to market and last-minute design flexibility, as well as reduced risk of component obsolescence.
Obviously, if the current system is stable, with a non-obsolete PowerPC, migration to an FPGA-based equivalent may only pay off if there is significant consolidation of peripheral logic such that the overall device count, power consumption, and board size is reduced. In these situations the system team weighs the total cost difference, including board savings, to make a payback decision. If the system is actually scheduled for revision and it already has a PowerPC and an FPGA, there is a higher probability of payback through slightly upsizing the FPGA in order to eliminate a discrete processor. Examples of this might include:
- Trying to reduce device count and/or physically shrink a PowerPC based system. In this scenario, the PowerPC code moves to the embedded processor core, and as much of the peripheral and processing logic as possible moves to the remaining logic elements on the FPGA.
- Increasing throughput. The more sophisticated use of this layout comes with higher-speed applications. In these cases it is possible to use the PowerPC as the control element in the system, and to configure logic processing engines in the FPGA fabric that can run C code in multiple threads or streams. This can enable the FPGA to outrun higher clock speed processors, at significantly lower power.
Central to this is hardware/software partitioning. The FPGA embedded CPU core can handle both control and processing logic. This is the mode primarily under discussion here. In this mode, the original PowerPC code runs "native" on an embedded processor core. The porting effort is minimal; the original application can often be brought up in the FPGA in hours. But simply porting the application from a discrete PowerPC processor to the FPGA does not carry with it substantial benefits; it may in fact run slower. The key to leveraging the FPGA's parallel processing capability is moving processing code to dedicated hardware accelerators. Prime candidates for moving are the processing bottlenecks, those subroutines and inner code loops that consume the most processor cycles. In a coprocessing implementation, on one FPGA device there can be one or more PowerPC processors running legacy C code, closely coupled to hardware accelerators that offload the performance-critical computations.
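As a concrete illustration of the kind of inner loop that makes a prime offload candidate, consider a small FIR filter accumulate loop. This is a hypothetical sketch, not code from the application discussed later: the fixed trip count and absence of data-dependent control flow are exactly the properties that let a C-to-hardware compiler unroll or pipeline the loop.

```c
#include <assert.h>

/* Hypothetical hotspot: a FIR accumulate loop of the kind profiling
 * typically flags as an offload candidate. Names and sizes are
 * illustrative only. */
#define NTAPS 4

int fir_sample(const int *history, const int *coeffs)
{
    int acc = 0;
    /* Fixed trip count, no data-dependent branches: a C-to-hardware
     * compiler can unroll this loop into parallel multipliers or
     * pipeline it to one multiply-accumulate per cycle. */
    for (int t = 0; t < NTAPS; t++)
        acc += history[t] * coeffs[t];
    return acc;
}
```

On a sequential processor this loop costs NTAPS multiply-accumulate iterations per sample; in FPGA fabric the four multiplies can execute concurrently, which is the source of the speedup the coprocessing model targets.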
Using HDL simulators, FPGA hardware designers often create test benches that will exercise specific modules by providing stimulus (test vectors or their equivalents) and verifying the resulting outputs. For algorithms that process large quantities of data, such testing methods can result in very long simulation times, or may not adequately emulate real-world conditions. Adding an in-system prototype test environment bolsters simulation-based verification and inserts more complex real-world testing scenarios.
Unit testing is most effective when it focuses on unexpected or boundary conditions that might be difficult to generate when testing at the system level. For example, in an image processing application that performs multiple convolutions in sequence, you may want to focus your efforts on one specific filter by testing pixel combinations that are outside the scope of what the filter would normally encounter in a typical image.
It may be impossible to test all permutations from the system perspective, so the unit test lets you build a suite to test specific areas of interest or test only the boundary/corner cases. Performing these tests with actual hardware (which may for testing purposes be running at slower than usual clock rates) obtains real, quantifiable performance numbers for specific application components. Introducing C-to-hardware compilation into the testing strategy can be an effective way to increase testing productivity. For example, to quickly generate mixed software/hardware test routines that run both on the embedded processor and in dedicated hardware, you can use the Impulse C compiler to create prototype hardware and custom test generation hardware that operates within the FPGA to generate sample inputs and validate test outputs.
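A minimal sketch of such a boundary-focused unit test, written against a hypothetical filter stage (a saturating brightness adjustment standing in for one step of the convolution pipeline; the function names are illustrative, not from the project):

```c
#include <assert.h>

/* Hypothetical filter stage: add an offset to a pixel, saturating to
 * the 8-bit range. Stands in for one convolution step under test. */
static unsigned char brighten(unsigned char pixel, int offset)
{
    int v = pixel + offset;
    if (v < 0)   v = 0;     /* clamp below */
    if (v > 255) v = 255;   /* clamp above */
    return (unsigned char)v;
}

/* Unit test aimed at the boundary conditions a typical image would
 * rarely exercise, per the testing strategy described above. */
static void test_brighten_corners(void)
{
    assert(brighten(0, -10) == 0);     /* underflow must saturate */
    assert(brighten(255, 10) == 255);  /* overflow must saturate  */
    assert(brighten(100, 27) == 127);  /* nominal case passes through */
}
```

Because the same C source can run on a development PC, on the embedded PowerPC, or (after compilation to hardware) in the FPGA fabric, one test suite covers all three targets.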
Impulse C generates FPGA hardware from the C-language software processes and automatically generates software-to-hardware and hardware-to-software interfaces. You can optimize these generated interfaces for the MicroBlaze processor and its fast simplex link (FSL) interface or the PowerPC and its processor local bus (PLB) interface. Other approaches to data movement, including shared memories and the use of Xilinx auxiliary processing unit (APU), are also supported.
In-system testing using embedded processors is a viable complement to simulation-based testing methods, allowing you to test hardware elements using actual hardware interfaces, resulting in more accurate real-world input stimulus. This helps to augment simulation, because even at reduced clock rates the hardware under test will operate substantially faster than is possible in RTL simulation. By combining this approach with C-to-hardware compilation tools, you can model large parts of the system (including the hardware test bench) in C language.
Image processing migration example
As an example, one design team recently moved a JPEG-based video image filtering application to an embedded/FPGA platform. They took an existing image processing algorithm that had been written in C, using common tools for embedded systems design. Using the Xilinx Embedded Development Kit and its cross-compiler environment, they were then able to port the legacy algorithm to an embedded PowerPC residing inside a Xilinx Virtex-5 FXT device. The Base System Builder tools (part of the EDK tools) were used to assemble the PowerPC-based system and select the needed I/O devices for a specific Xilinx reference board, in this case the Xilinx ML510 development kit.
Two versions of this application were created, one using a single embedded PowerPC device for algorithm testing purposes (see sidebar) and one using two embedded PowerPC processors, custom C-language coprocessors, and an embedded web server to create a complete system-on-FPGA as shown in Figure 1.
This initial porting of the application created a baseline for validation. The team was able to run the application against the original discrete processor, against the embedded PowerPC processor, and provide a software test bench for use on a development PC - all from a single code source.
To identify opportunities for acceleration, they ran gprof (a common, publicly available profiler) and identified computational hotspots. For example, a discrete cosine transform (DCT) algorithm was identified as one computational bottleneck. Using the Impulse CoDeveloper tools, the team was able to offload the DCT, image filters, and other bottleneck functions into dedicated FPGA hardware. Impulse C API functions were used to describe how the data moved from the PowerPC processor to the FPGA accelerators and back. From these API function calls, the Impulse compiler generated appropriate software/hardware interfaces, using the Xilinx Auxiliary Processing Unit (APU) streaming interface.
The Impulse software automatically generated the code to connect through this bus so the team did not need to learn how to write the VHDL code that this interconnection would otherwise have required. This is a key aspect of modern tools for hardware/software codesign: by abstracting away the details of hardware/software interfaces, the tools allow software developers to focus their energy on the application, and on optimizing the actual calculations being performed rather than on the details of the hardware-level systems interfaces.
Interface development, in fact, is one of the hurdles that many software teams report keeps them from trying an FPGA acceleration approach. In this case, abstract streaming interfaces provided in the Impulse APIs allowed for automatic connections between the main algorithm running on the PowerPC and hardware accelerated subroutines running in the FPGA. The modified software algorithm, which now includes three independently synchronized processes and two embedded PowerPC processors, was simulated again with standard C tools to verify that its behavior remained equivalent to the original.
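Before hardware generation, a partition like this can be exercised entirely in software, which is how the team re-verified equivalence with standard C tools. The sketch below models that idea with plain arrays standing in for the streams; in the real flow the Impulse tools generate the FIFOs and bus interfaces, and all names here are illustrative.

```c
#include <assert.h>

/* Desktop C model of a three-process stream partition: producer ->
 * filter -> consumer. Plain arrays stand in for the hardware streams;
 * the point is that the partitioned algorithm can be validated as
 * ordinary software before any hardware is generated. */
#define NSAMP 8

static void producer(int *stream_out)
{
    for (int i = 0; i < NSAMP; i++)
        stream_out[i] = i;              /* stand-in for image/file input */
}

static void filter(const int *in, int *out)
{
    for (int i = 0; i < NSAMP; i++)
        out[i] = in[i] * 2 + 1;         /* stand-in for the accelerated stage */
}

static int consumer_checksum(const int *in)
{
    int sum = 0;
    for (int i = 0; i < NSAMP; i++)
        sum += in[i];                   /* stand-in for output validation */
    return sum;
}
```

Running the chain end to end and comparing checksums against the unpartitioned original gives a fast regression check that survives each repartitioning experiment.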
The C-language subroutines destined for hardware acceleration were analyzed and optimized by the Impulse C compiler, resulting in VHDL files. These files were exported to Xilinx ISE for a first-pass synthesis. Synthesis took about 15 minutes for this project. The team looked at the FPGA usage post-synthesis and determined that there were some improvements to be made.
Optimization reports generated by the Impulse C compiler helped the team determine that a relatively low clock rate (50 MHz) in the FPGA fabric, combined with increased cycle-by-cycle throughput through automated pipelining, produced the best overall results. This works because the overhead of software-to-hardware data communication via APU is nominal, and the pipelining of the inner code loop results in high throughput even at a low clock speed. Note that during this compilation process, additional compiler outputs were generated that represented hardware-to-software interfaces, including the necessary Virtex-5 APU interface logic. Software runtime libraries were automatically generated at this point, corresponding to the abstract stream and shared memory interfaces specified on the processor side of the application. Finally, the generated hardware and software files were exported from the Impulse tools (as a PCORE peripheral) and imported directly into the Xilinx Platform Studio environment for connection to the embedded PowerPC.
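The arithmetic behind the low-clock-rate decision is worth making explicit. The figures below are illustrative assumptions, not measurements from the project: a fully pipelined accelerator retires one result per cycle once the pipeline fills, while a processor typically spends many cycles per result.

```c
#include <assert.h>

/* Back-of-envelope throughput model: results per second given a clock
 * rate and the steady-state cycles needed per result. Illustrative
 * numbers only. */
static long results_per_second(long clock_hz, long cycles_per_result)
{
    return clock_hz / cycles_per_result;
}
```

With these assumptions, a pipelined 50 MHz datapath delivering one result per cycle sustains 50 million results per second, while a 400 MHz processor needing, say, 20 cycles per result sustains only 20 million; the slower clock wins on throughput while also drawing less power.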
Different ways to allocate resources
Combining a PowerPC processor and programmable logic on a single chip opens up new possibilities for acceleration of software applications, and acceleration of the design process as well. A certain amount of effort is required to make the switch to FPGA-accelerated embedded processors, but the use of software-to-hardware methods allows software programmers to create the entire embedded system, including the hardware accelerators, in C, experiment with partitioning strategies, and create effective, deployment-ready results.
Libraries are playing an increasingly important role in such applications. In one configuration, the system designer may rely on pre-optimized libraries (such as those found in the Xilinx CORE Generator) as well as writing customized hardware modules specific to a particular algorithm.
The field reconfigurable nature of FPGAs also plays a strong role in many applications. Configurations can be updated remotely, which is a key capability for space applications as well as other domains, including Software Defined Radio (SDR), where remote frequency changes must be performed.
Another more theoretical use of this reconfiguration capability is dynamic reconfiguration, whereby FPGA logic is dynamically "recycled" into other configurations during operation. In this mode of use, the same device could act as an image convolution filter one moment, and an object recognition algorithm the next.
Setting proper expectations
In this article we have summarized the process of moving applications from legacy C and legacy processors into mixed software/hardware implementations on FPGAs. We do not want to suggest, however, that this is a trivial, push-button process. Software programmers targeting FPGA devices need to learn some new concepts, and think about how to optimize their applications for greater levels of parallelism. Important considerations include:
- The need to refactor critical algorithms for increased performance, for example by enabling loop-level pipelining or re-partitioning to take better advantage of multiprocess parallelism
- Coding styles or coding errors that are of little concern in a traditional processor, but have great impact in an FPGA
- Hardware-centric, difficult to understand errors and warning messages generated by FPGA synthesis and routing tools
- Long place-and-route times that limit the ability for software programmers to use code-compile-debug methods of development
- Algorithm partitioning that makes sense in a microprocessor implementation (for example, using threads) but maps differently onto an FPGA
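The loop-refactoring point in the list above can be made concrete with a small sketch. A single accumulator creates a loop-carried dependency, so at best one addition completes per cycle; interleaving several independent partial sums lets a pipelined datapath overlap them. The names and the four-way split are illustrative choices, and both functions compute the same result.

```c
#include <assert.h>

/* Loop-carried dependency: each iteration must wait for the previous
 * add to finish, limiting a hardware pipeline to one add per cycle. */
static int sum_serial(const int *a, int n)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Refactored with four independent accumulators (assumes n % 4 == 0):
 * the four adds per iteration have no dependency on each other, so a
 * pipelined datapath can issue them concurrently and merge at the end. */
static int sum_partial(const int *a, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

This is precisely the kind of transformation that is irrelevant on a sequential processor (the compiler may even undo it) but determines whether an FPGA implementation pipelines well.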
These considerations and issues can be overcome by software engineers, but do require some new learning. Software developers who are familiar with threading libraries or multi-process parallelism have less difficulty taking advantage of FPGA parallelism than those who don't.
With these skills, legacy PowerPC applications can be moved to FPGAs with minimum modification. Once implemented in the FPGA and validated, their processing-intensive components can also be partitioned and compiled directly from C software descriptions to efficient, high-performance hardware that can be mapped directly into FPGA logic. Tools available today can parallelize C code for hardware implementation and generate the required software/hardware interfaces. The generated interfaces and other support files are exported into the FPGA mapping tools. This process renews existing PowerPC applications by moving them to easily upgraded hardware with minimal software modification. This process also typically improves performance, reduces device count, and reduces development time.
The authors wish to thank Glenn Steiner of Xilinx, and Brian Durwood of Impulse for their assistance with this article.