Dynamically reconfigurable Massively Parallel Processor Arrays in high-performance embedded military systems

The performance requirements of high-performance embedded military systems are outstripping the capabilities of ordinary CPU and DSP processors. These embedded systems are also required to become increasingly flexible, multimodal, and even dynamically reconfigurable in field operation. Reliable system development and operation remains essential. A new architecture, the Massively Parallel Processor Array (MPPA), has been developed specifically for meeting the challenges of designing these embedded systems.

Many advanced surveillance, intelligence, and reconnaissance sensors support multi-mode operation to respond to unique situational demands. For example, the Lockheed Martin AN/SPY-2 is a long-range, 3D, multifunction radar designed for theater ballistic-missile defense and full-area anti-air warfare, in addition to short- to medium-range search and multi-target tracking. Most advanced Electro-Optic (EO) systems integrate multiple types of EO/InfraRed (IR) detectors in one sensor package. For example, the Thales SIRIUS long-range IR search-and-track system provides bi-spectral panoramic surveillance with simultaneous Mid-Wavelength InfraRed (MWIR) and Long-Wavelength InfraRed (LWIR) band operation for automatic detection, track initiation, target priority ranking, and tracking.

In the past, separate hardware and software modules were dedicated to each of these functions. Today, however, this is too expensive and inefficient for a flexible, multi-modal system. A common set of hardware should be reconfigurable on demand. In fact, the Joint Tactical Radio System (JTRS) initiative of the Department of Defense has made reconfigurability a requirement for Software Defined Radio (SDR). SDR requires a system architecture that supports multiple, evolving protocol standards. Radio and protocol modules must be parameterized and components must be exchangeable. Sensor interface systems and other high-performance systems face similar requirements.

Existing platforms for reconfigurable embedded computing

Hardware implementations with ASIC chips cannot be reconfigured. FPGAs are hardware-programmable devices that are widely used in embedded systems today, but are rarely, if ever, dynamically reconfigured in the field. Other programmable solutions are based on software-programmable processors. Since a single CPU or DSP usually cannot deliver enough performance, multiple processors executing in parallel must be used.

Multi-core DSPs for embedded systems

Multi-core CPUs with a shared memory architecture have become common in general-purpose computing, and some DSP vendors have begun adopting this architecture for embedded systems. General-purpose systems are naturally capable of runtime reconfiguration. However, the relative ease of adopting today's dual-core and quad-core chips is misleading: in the long run, performance demands will require more and more cores.

Since each CPU core must share memory and communicate with every other core, expensive interconnect and complex cache-coherency systems are required, and their cost grows faster than the number of cores. Multi-threaded shared-memory multicores are nondeterministic, due to the interplay of caches and runtime thread behavior, and become more so as they increase in size. Debugging massively parallel multithreaded applications promises to be difficult.

In the long run, massively parallel multicore platforms are not likely to be well suited to the development, reliability, cost, and power constraints of embedded systems.

FPGAs and reconfiguration

FPGA architecture was never conceived to support runtime reconfiguration, which is why it has been complex, limited, and relatively slow, despite some recent improvements. The latest work on runtime reconfiguration of FPGAs still requires complex bitstream location at compile time and at runtime. This is to account for arbitrary FPGA constraints such as the physical layout of logic and specialized blocks. Register-to-register timing closure must be achieved or reconfiguration fails. Hardware interfaces between regions of the application are non-standard and hardware-specific. In addition, application development requires hardware expertise. Partial bitstream sizes are hundreds of kilobytes and reconfiguration takes tens of milliseconds. Finally, the majority of recently published work in reconfigurable computing on FPGAs does not discuss runtime reconfiguration.

Massively Parallel Processor Arrays

In embedded computing, a new platform can be adopted because specialized, implementation-specific designs are not bound by the enormous application-compatibility constraints inherent in general-purpose platforms. A new parallel platform, the Massively Parallel Processor Array, has been developed specifically for embedded systems. Designers use an MPPA platform to optimize performance and performance-per-watt for reconfigurable embedded system applications. They also use MPPAs for the reasonable and reliable application development process, and so that their hardware architecture and development effort will scale with Moore's Law many years into the future.

MPPA architecture

An MPPA is a massively parallel array of CPUs and memories, interconnected by a 2D-mesh configurable interconnect of word-wide buses, as shown in a general way in Figure 1. One or more CPUs and RAMs are combined with a configurably-switched interconnect of channels to make a tile. Many tiles are stacked to form the array.

Figure 1

The MPPA is a Multiple Instruction streams, Multiple Data streams (MIMD) architecture in which memory is distributed and accessed locally, not shared globally. Each processor is strictly encapsulated, accessing only its own code and memory. Point-to-point communication between processors is directly realized in the configurable interconnect. Since a RISC CPU is now so small and inexpensive (< 0.5 mm2), processors are not multithreaded or multitasked. Likewise, memory is not cached or virtualized, and the interconnect is not shared. The MPPA is a physical "what you see is what you get" platform aimed at deterministic and reliable real-time performance with straightforward debugging.

Previous parallel systems were very hard to program largely because hardware was architected and chips were built without regard for how they would be programmed. Ambric first developed the Structured Object Programming Model (SOPM) for its MPPAs and then designed its architecture, the Am2045 chip, and software tools to realize the model. The SOPM is illustrated in Figure 2.

Figure 2

Each application is designed from the top down as a structure of processor and memory objects and the data and control tokens they exchange. This structure is often similar to the block diagrams used to initially define the application. As in a block diagram, there is hierarchy, and the contents of a block may be another block diagram.

Objects communicate through a parallel structure of dedicated self-synchronizing channels in the chip’s configurable interconnect. One processor’s program only sends a word down a channel when the second processor is ready to accept it; otherwise, the first processor stalls until the second processor is ready. In the same way, a program trying to receive a word stalls if the channel is empty. This way, sending a word from one processor to another is also an event, which keeps them directly in step with each other. Unlike a multicore system, this feature is built into the MPPA programming model, so it is not an option or a nondeterministic debugging and testing problem for the developer.
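The blocking-channel behavior just described can be modeled with Go's unbuffered channels. This is only an illustrative sketch of the rendezvous semantics, not Ambric's actual API: a send stalls until the receiver is ready, and a receive stalls while the channel is empty, keeping the two objects in step.

```go
package main

import "fmt"

// Model of MPPA-style self-synchronizing channels, using Go's
// unbuffered channels to stand in for the chip's hardware channels.

func producer(out chan<- int32) {
	for i := int32(0); i < 4; i++ {
		out <- i * i // stalls here until the consumer accepts the word
	}
	close(out)
}

func consume(in <-chan int32) []int32 {
	var got []int32
	for w := range in { // stalls whenever the channel is empty
		got = append(got, w)
	}
	return got
}

func main() {
	ch := make(chan int32) // unbuffered: every transfer is a sync event
	go producer(ch)
	fmt.Println(consume(ch)) // [0 1 4 9]
}
```

Because every word transfer is itself a synchronization event, no additional locks or flags are needed to coordinate the two sides.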

Commonly used building blocks such as filters and transforms are checked into libraries and reused. Once tested, these objects can be reused with confidence because they are encapsulated and have standard interfaces. Reuse is more effective than on other platforms, especially hardware platforms, which helps keep development effort scalable.

The MPPA chip described here has 336 32-bit CPUs and 336 2 KB memory blocks in a hierarchical 2D-mesh interconnect running at 300 to 350 MHz. At full speed, all processors together are capable of 1.2 trillion operations per second (more than one teraOPS) and 60 GMACs of DSP performance. This is supported by the interconnect's 792 Gbps bisection bandwidth, 26 Gbps of off-chip DDR2 memory bandwidth, PCI Express at an effective 8 Gbps each way, and up to 13 Gbps of parallel general-purpose I/O.

An integrated development environment based on Eclipse includes the compiler and assembler, simulator, automatic realization with placement and routing onto the chip, and runtime source-level interactive parallel debugging on the actual system. Development and debugging are rapid: Compiling an entire application takes less than two minutes.

Dynamic MPPA reconfiguration

Dynamic Objects (DOs) may be used in the MPPA programming model just described. DOs reconfigure themselves in an orderly manner when a reconfiguration packet arrives on a specific input channel.

To control execution and reconfiguration, the local memory of each processor includes a persistent tiny kernel. Also persistent is a loop of control channels, used only for reconfiguration, that links all the processors in the DOs in a closed daisy chain, as shown in Figure 3. When the kernel at the input receives the header of a reconfiguration packet, it sends a reconfigure token around the control-channel loop to the other kernels. It then reconfigures its own processor by loading the new code into local memory and sends the remaining processor configuration down the control channel. The next kernel reconfigures itself, sends the remainder on, and so forth. Finally, each kernel waits for another token to circulate through the control channel, presumably a work token, which will start the new active object.
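The daisy-chain handoff can be sketched as follows, with goroutines standing in for the per-processor kernels. The token value and length-prefixed packet layout here are hypothetical, chosen only to illustrate the pass-it-on protocol, not Ambric's actual packet format.

```go
package main

import "fmt"

const reconfigToken int32 = -1 // sentinel announcing a reconfiguration packet

// kernel forwards the token, loads its own code segment (length-prefixed)
// from the front of the packet, then streams the remainder onward.
func kernel(id int, in <-chan int32, out chan<- int32, loaded chan<- int) {
	if <-in == reconfigToken {
		out <- reconfigToken // circulate the token first
		n := <-in            // words of new code for this processor
		code := make([]int32, n)
		for i := range code {
			code[i] = <-in // load into local memory
		}
		loaded <- id
		for w := range in { // pass the rest of the packet on
			out <- w
		}
	}
	close(out)
}

func main() {
	const nKernels = 3
	chans := make([]chan int32, nKernels+1)
	for i := range chans {
		chans[i] = make(chan int32)
	}
	loaded := make(chan int, nKernels)
	for k := 0; k < nKernels; k++ {
		go kernel(k, chans[k], chans[k+1], loaded)
	}
	go func() { // drain the tail of the control loop
		for range chans[nKernels] {
		}
	}()
	go func() { // packet: token, then (length, code words...) per kernel
		chans[0] <- reconfigToken
		for k := int32(0); k < nKernels; k++ {
			chans[0] <- 2
			chans[0] <- 10 * k
			chans[0] <- 10*k + 1
		}
		close(chans[0])
	}()
	for i := 0; i < nKernels; i++ {
		fmt.Println("kernel", <-loaded, "reconfigured")
	}
}
```

Note that the kernels necessarily finish loading in chain order, since each one's code segment arrives only after the previous kernel has consumed its own and begun forwarding.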

Figure 3

Figure 4 shows a work farm featuring dynamic reconfiguration. The Boss object has a set of precompiled reconfiguration packets stored in its local memory. In normal operation, the Boss sends work packets to selected worker objects in the work farm. New reconfiguration packets may be sent to the work farm at any time. When a Dynamic Object completes its reconfiguration, it sends an ID back to the Boss to indicate completion. Upon receipt of the ID, the Boss begins distributing new work packets to the newly reconfigured DO until its configuration is changed again.
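The Boss/worker exchange can be sketched in miniature. This is a hypothetical model, with a Go function value standing in for a precompiled reconfiguration packet and a single worker for determinism; the `packet` layout and names are illustrative, not Ambric's API.

```go
package main

import "fmt"

// packet is either a unit of work or a reconfiguration carrying the
// worker's "new code" (modeled here as a function value).
type packet struct {
	reconfig bool
	fn       func(int32) int32
	work     int32
}

func worker(id int, in <-chan packet, results chan<- int32, ready chan<- int) {
	fn := func(x int32) int32 { return x } // initial configuration
	ready <- id                            // report ready to the Boss
	for p := range in {
		if p.reconfig {
			fn = p.fn   // load the new behavior
			ready <- id // send ID back: reconfiguration complete
			continue
		}
		results <- fn(p.work)
	}
}

func main() {
	in := make(chan packet)
	results := make(chan int32)
	ready := make(chan int)
	go worker(0, in, results, ready)
	<-ready // worker configured and idle

	in <- packet{work: 5}
	fmt.Println("identity:", <-results) // identity: 5

	// the Boss reconfigures the worker on the fly
	in <- packet{reconfig: true, fn: func(x int32) int32 { return x * x }}
	<-ready // Boss waits for the completion ID before sending new work

	in <- packet{work: 5}
	fmt.Println("squared:", <-results) // squared: 25
	close(in)
}
```

Waiting for the completion ID before dispatching new work is what guarantees the Boss never sends a work packet to a worker that is mid-reconfiguration.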

Figure 4

Reconfiguration time

Table 1 shows the time it takes to reconfigure objects. The Configuration column lists the total reconfiguration time in microseconds, from receipt of the first reconfiguration packet to the time the Boss receives the ID indicating completion of the reconfiguration process.

Table 1

Full R & R

By using a Massively Parallel Processor Array and its straightforward programming model, engineers can develop very high-performance embedded military systems with full reconfigurability and runtime reliability. The MPPA platform is scalable for a long lifetime of investment in software methodology.

Mike Butts, an Ambric Fellow, has an extensive background in computer architecture, especially large-scale reconfigurable hardware, and is the coinventor of hardware logic emulation using reconfigurable hardware. He has developed several processor architectures, reconfigurable chips, and systems at Mentor Graphics, Quickturn, Synopsys, Cadence, Tabula, which he cofounded, and Ambric. Mike has 38 U.S. patents issued and BSEE and MSEE/CS degrees from MIT.

Paul Chen is Director of Strategic Business Development at Ambric and has extensive knowledge of sensor and image processing, graphics, computer system architecture, and real-time operating systems. Prior to Ambric, Paul was with Tektronix, Intel, Barco, and Metheus, where he served as Senior VP. Paul published several papers on medical imaging, radar processing, embedded X Window Systems, and real-time UNIX operating systems on multiprocessor platforms. He also was a major contributor in developing the world’s first graphics controller that drives a 2,048 x 2,048 pixel 20-inch square monitor for air traffic control applications. Paul holds a BS in EE from National Taiwan University and an MS in CS from the University of Oregon.

Ambric, Inc.