Datasheet
EDN'S ANNUAL DSP DIRECTORY HIGHLIGHTS THE ARCHITECTURES AVAILABLE FOR YOUR HOTTEST DESIGNS. HERE'S HELP IN SORTING THROUGH THE MYRIAD DSP DEVICES. YOU CAN ALSO ACCESS OUR FREQUENTLY UPDATED, FEATURE-TUNED DATABASE USING OUR SEARCH ENGINE TO FIND THE RIGHT DEVICE FOR YOUR DESIGN NEEDS.
DSP-architecture directory
By Markus Levy, Technical Editor
I'm beginning to sound like a broken record. (Remember those vinyl platters?) Every year I begin the introduction to EDN's DSP Directory by remarking on the tremendous growth in DSP technology, and it's no different this year. You can judge this growth from the number of new DSP companies and the number of new DSPs. And you'll find descriptions of all the above right here in this directory. What's the cause of all this excitement? Cellular phones, broadband communications, medical-imaging equipment, modems, audio equipment, motor control, and tons more. You've probably seen the commercials from Texas Instruments about DSP technology, another testimony that DSPs are penetrating our lives.

In addition, plenty of RISC processors keep popping up with DSP instruction sets. (The appearance of RISC processors with DSP instruction sets, however, raises the question: Are they really still RISCs?) These processors include those from ARC, ARM, Improv Systems, Lexra, MIPS, Sandcraft, STMicro, Tensilica, and more (www.arccores.com, www.arm.com, www.improvsys.com, www.lexra.com, www.mips.com, www.sandcraft.com, www.st.com, www.tensilica.com). Although this directory does not cover these processors, you'll find them in the upcoming Microprocessor Directory. There, you'll also find the traditional RISC-DSP combos, such as Hitachi's SH-DSP, Infineon's TriCore, and Hyperstone's E1 (www.hitachi.com/semiconductor, www.hyperstone.de).

I may also be repeating myself when I talk about the new benchmarks from the EDN Embedded Microprocessor Benchmark Consortium (EEMBC). The consortium has been actively working on these industry-standard benchmarks for three years, and now the benchmarks are ready to go. Starting on April 11, you can go to the EEMBC Web site at www.eembc.org and get free DSP and other processor benchmark-certified scores. These benchmarks include cascaded biquad filters, the Viterbi add-compare-select function, autocorrelation, bit allocation, and FFTs.
If you don't find benchmark scores for your favorite DSP on EEMBC's Web site, urge the corresponding vendor to provide its EEMBC scores. Let them tie some quantitative performance information to those multiply-accumulate units and pipeline stages. Next year, we'll include those scores in this directory for direct comparisons.

The Motorola and Lucent Technologies joint venture has yielded the fruits of the two companies' labors in the StarCore scalable SC140 DSP core and the MSC8101. However, we're still anxiously awaiting a DSP architecture from the Intel/Analog Devices partnership. An introduction should happen soon. We're also waiting to see production quantities of TI's new 750-MHz C64x: the fastest RISC (er, DSP) on the planet.

To help you sort through the myriad DSP devices, access our frequently updated database using our feature-tuned search engine to find the right device for your design needs (www.ednmag.com/ednmag/reg/micro.asp).
www.ednmag.com
16 BITS
Analog Devices ADSP-21xx
Analog Devices TigerSharc DSP
BOPS ManArray
DSP Group cores
Equator Technologies MAP-CA
Infineon Carmel DSP 10XX and 20XX cores
LSI Logic ZSP DSPs
Lucent/Motorola StarCore SC100
Lucent Technologies DSP16xx
Lucent Technologies DSP16000
Massana FILU-200 DSP coprocessor core
Motorola DSP56800
NEC SPRX DSP
Philips REAL DSP
Texas Instruments TMS320C2000
Texas Instruments TMS320C5000
Texas Instruments TMS320C6000
3DSP SP-3 and SP-5 DSP cores
Zilog Z893x1/Z893x3

24 BITS
Motorola DSP563xx

32 BITS
Analog Devices SHARC DSP
Texas Instruments TMS320C3x
Texas Instruments TMS320C4x

64 BITS
Module Research Center's NeuroMatrix NM6403 DSP
dspdirectory 16
bits
BOPS ManArray
Highlights:
- Device provides instruction-level parallelism with indirect VLIW.
- DSP supports 8-, 16-, and 32-bit fixed and floating point.
- Architecture features soft-macro DSP cores.

The Billions of Operations Per Second (BOPS) Inc ManArray is a family of reusable and scalable DSP cores. A developer can configure each of the family members into 16- and 32-bit, fixed-point formats; 32-bit, single-precision, floating-point formats; or both. The ManArray achieves a high level of parallelism by combining an indirect-very-long-instruction-word (iVLIW) architecture with single-instruction-multiple-data (SIMD) instructions and inherent multiprocessing capability. The SIMD instructions support 8-, 16-, and 32-bit packed data and 32-bit, single-precision, floating-point formats. A single-cycle, zero-latency, interprocessor-communications fabric and direct DMA access to all processing elements enhance the ManArray's parallelism.

The ManArray architecture comprises a sequence processor (SP) and a processing element (PE). The various product configurations combine one SP and multiple PEs. The SP handles program control and combines with a PE to form the smallest increment of the ManArray architecture: a single-SP, single-PE unit. Each SP/PE unit also includes the instruction and data-address-generation units.

Each PE contains a multiported, 32×32-bit register file, an iVLIW-memory (VIM) unit, local data memory, and three bus interfaces. The bus interfaces include a 32-bit instruction bus; a 32-bit data bus; and a cluster switch (CS), a single-cycle PE-interconnect bus for moving data between the SP and PEs. The five execution units comprise a multiply-accumulate unit (MAU), an ALU, a data-select unit (DSU), a 64-bit load unit, and a 64-bit store unit. The DSU supports data-manipulation instructions, such as shift, rotate, floating-point conversions, and SP-to-PE and PE-to-PE communications through the CS.
The register file logically performs as 32 32-bit registers or 16 64-bit registers supporting packed-data operations on an instruction-by-instruction basis.

The BOPS2010 core, a 1×1-element array, comprises an SP/PE combination. The BOPS2020 uses the 1×1-element array and adds a PE element through the CS to form a 1×2-element array. Likewise, the BOPS2040 is a 2×2-element array comprising one SP/PE combination and three PEs. The SP uses a 32-bit instruction set that supports both 1×1-element and N×M-element arrays and allows you to use one tool set for all combinations of the cores.

The topology of the BOPS architecture allows the devices to interconnect and organize a set of PEs into standard ring, mesh, torus, hypercube, and other organizations. The organization depends on algorithmic data-flow requirements. The topology type matters because the performance of any parallel algorithm depends on the efficiency of data movement on the processor and the cost of the interconnection mechanism.

The term iVLIW refers to the ManArray's ability to indirectly access an encapsulated instruction sequence in a horizontal VLIW format that can simultaneously execute operations. You create iVLIWs with a load-VLIW (LV) instruction, which identifies as many as five programmer-defined 32-bit instructions that comprise the VLIW, as well as the VIM address in which to store the instructions. After executing the LV instruction, you issue a sequence of simple instructions that form the iVLIW. Once the SP and PEs store the iVLIW in VIM, your program can dispatch, or broadcast, an execute-iVLIW (XV) instruction to the SP and all PEs. The XV instruction contains the offset of the VIM address for the VLIW to execute in each PE. This nontraditional use of VLIWs effectively creates instructions for applications using 32-bit instruction paths. Using the iVLIW architecture, BOPS requires no large VLIW buses around the chip, as is common with VLIW machines.
The VIMs allow BOPS to use a single 32-bit instruction bus in the array of PEs; this approach promotes scaling in both the number of PEs and the width of the iVLIWs, reducing the amount of program memory. The iVLIW architecture allows you to overlap the communications operations with the computation operations, thereby providing zero-latency data transfers between PEs. The architecture accomplishes this task by placing the communications instructions in the DSU and using software pipelining to transfer a result that an arithmetic-execution unit calculates in the previous machine cycle to any of the directly connected ManArray PEs. The load and store units provide independent datapaths between the SP data memory and the PEs and between each PE and its local data memory.

Addressing modes: The processors support array-parallel memory-addressing modes, including direct, base plus displacement, register indirect, and modulo indexed.

Special instructions: The MAU and ALU support floating-point and packed-data operations with saturation, and the DSU provides a complement of bit-manipulation, shift, rotate, and PE-to-PE-communications operations. ManArray fixed-point processors support a number of special DSP instructions, including packed-data multiply-accumulate and multiply complex data.

Support: The BOPS software-development kit comprises an integrated development environment, a visual system simulator, a cycle-accurate instruction-set simulator, a GNU C compiler with assembler/preprocessor/linker, and a basic DSP library. BOPS also has a compiler for Matlab. The instruction-set simulator provides views of all core resources, including the disassembly of iVLIWs in VIM and pipeline stages. You can find some demonstrations of this architecture at http://bopsnet.com/training/demos.shtml.
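
The LV/XV mechanism described in this entry can be pictured with a small software model. The following Python sketch is only an illustration under stated assumptions: the class and function names, the slot names, the VIM depth of 64 entries, and the instruction words are all hypothetical and do not come from BOPS documentation. It shows the one idea the text describes: each PE stores a VLIW of as many as five 32-bit instructions in its local VIM, and a single broadcast XV word then triggers that stored VLIW in every PE.

```python
# Toy model of indirect-VLIW dispatch. All names and encodings here are
# illustrative assumptions, not the actual ManArray instruction formats.
SLOTS = ("store", "load", "alu", "mau", "dsu")  # the five execution units


class ProcessingElement:
    def __init__(self, vim_size=64):  # VIM depth is an assumed value
        # Each VIM entry holds one VLIW: a mapping of slot -> 32-bit word.
        self.vim = [dict() for _ in range(vim_size)]

    def load_vliw(self, vim_addr, instructions):
        """Model of LV: store as many as five instructions at a VIM address."""
        for slot, word in instructions.items():
            if slot not in SLOTS:
                raise ValueError(f"unknown slot {slot!r}")
            if not 0 <= word < (1 << 32):
                raise ValueError("instruction words are 32 bits wide")
        self.vim[vim_addr] = dict(instructions)

    def execute_vliw(self, vim_addr):
        """Model of XV: issue every instruction stored at vim_addr at once."""
        entry = self.vim[vim_addr]
        return [entry[slot] for slot in SLOTS if slot in entry]


def broadcast_xv(pes, vim_addr):
    """One 32-bit XV on the shared bus starts a full VLIW in every PE."""
    return [pe.execute_vliw(vim_addr) for pe in pes]


# Four PEs, each loaded with the same three-instruction VLIW at VIM address 3.
pes = [ProcessingElement() for _ in range(4)]
for pe in pes:
    pe.load_vliw(3, {"load": 0x10000001, "mau": 0x20000002, "store": 0x30000003})
results = broadcast_xv(pes, 3)
```

In this model the instruction bus only ever carries one 32-bit word per dispatch, which is the property the entry credits for avoiding wide VLIW buses around the chip.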
PalmDSPCore split the 64k-word data space into X, Y (ROM/RAM), and Z space for peripherals. Teak uses linear program memory as large as 256k words and paging as large as 4M words. PalmDSPCore offers as much as 1M word of linear memory and 16M words of paged memory. All cores have a 16-bit loop counter for repeating instructions or instruction blocks as many as 65,536 times. OakDSPCore allows you to nest a repeat instruction in a loop block with as many as four levels of block nesting. Teak and PalmDSPCores allow you to use an infinite number of repeats and block repeats with a special instruction. All cores can operate from 1.8 to 2.7V and have power-management features to cut power dissipation. Internal control can automatically shut off unused functional units and memory.

Addressing modes: All cores support circular buffering, direct, register, indirect, relative, and short and long immediate-addressing modes. OakDSPCore, Teak, and PalmDSPCores also support index and long direct-addressing modes. Teak and Palm can have a quadruple-indirect-addressing mode to simultaneously feed the four inputs of the two multipliers and a bit-reversal addressing mode.

Special instructions: The cores support conditional subroutine call/return from a subroutine and interruptible- and block-repeat instructions. (PineDSPCore has one repeat level and one block-repeat; OakDSPCore has four levels of nesting.) They also support division step; bit-field test, set, reset, and change (except for the PineDSPCore); compare; square; accumulate/subtract previous product; move data/program memory; conditionally modify accumulator; double-precision calculations; bit-field operations; exponent evaluation; normalization; context switching; minimum/maximum calculation with pointer latching and modification; delayed return; and automatic boot. Teak and Palm include special instructions to support coprocessors and built-in accelerators for Viterbi acceleration, FFTs, and other functions.
Palm adds support for delayed branches, many conditional instructions, and special instructions for least mean square, vector quantization, and other algorithms.

Support: DSP Group provides software-development tools along with evaluation/development boards and on-chip-emulation capabilities through the JTAG interface. The software tools include an assembler/linker, a preprocessor, a loader, a debugger, a C/C++ compiler, and the Assyst simulator. The debugger works in simulation or emulation mode and includes an application profiler. It contains support that allows you to extend the simulator and add logic to customize the debugger's hardware interface and to perform multicore debugging. All tools run under Windows and Solaris. Check out DSP Group's third-party developers at www.dspg.com/prodtech/core/3rdparty.htm.
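
The bit-reversal addressing mode mentioned for Teak and Palm exists mainly to reorder FFT data without spending cycles on index arithmetic. As a rough illustration of what the address generator computes, here is a minimal Python sketch; the function names are hypothetical, and the code models only the index permutation, not the cores' actual address-register behavior.

```python
def bit_reverse(index, bits):
    """Return index with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(samples):
    """Reorder a power-of-two-length buffer the way a radix-2 FFT expects."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("buffer length must be a power of two")
    return [samples[bit_reverse(i, bits)] for i in range(n)]


order = bit_reversed_order(list(range(8)))
```

For an eight-point buffer the access order becomes 0, 4, 2, 6, 1, 5, 3, 7; a hardware bit-reversed mode produces that sequence as a side effect of the pointer update.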
ference with new partition shift-in for efficient block-matching operations.

Support: Equator offers its iMMediaTools software developer's kit, which includes a trace-scheduling C-language compiler; the FIRtree media-intrinsic C-language extensions; an assembler; a linker; a source-level debugger; an assembly-level debugger; a profiling CPU simulator; a virtual-machine, cycle-accurate simulator; and assorted libraries. MAP-CA supports the Microsoft Windows NT and Linux host-development environments.
debugging. Infineon provides an evaluation/development board that supports JTAG-based emulation using Carmel's on-chip debugging-support capability. These tools allow you to run programs in real time and within the application's hardware system. You can also check out Infineon's partner section at www.infineon.com/dsp.
Special instructions: The SC140 multipliers support all combinations of signed and unsigned operands and both fractional and integer formats. The MAC units support add, subtract, negate, absolute value, and clear. The MAC units also support division iteration, comparison, maximum/minimum operations, transfers between registers, arithmetic-shift operations, and rounding. The SC140 supports a single-instruction-multiple-data version of maximum/minimum, additions, and subtractions (MAX2, ADD2, SUB2) by treating values in registers as packed pairs of 16-bit data operands. Using these instructions, the SC140 can perform eight 16-bit additions or maximum/minimum operations per cycle. The SC140 includes MAX2VIT, a special version of the maximum/minimum operation that works with the Viterbi shift-left instruction, a specialized move instruction that supports efficient implementation of Viterbi decoding algorithms.

Support: Tools include an assembler, an optimizer, a linker, a simulator, and an ANSI C- and C++-compliant C/C++ compiler. The compiler provides intrinsic support for International Telecommunications Union/European Telecommunications Standards Institute primitives. Green Hills Software (www.ghs.com) will also provide C/C++ support with its Multi development environment.
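
To make the packed-pair idea behind ADD2 and MAX2 concrete, the following Python sketch treats a 32-bit value as two independent signed 16-bit lanes. It is a behavioral illustration only: the function names are hypothetical, and the sketch assumes wraparound (non-saturating) addition, which may not match the SC140's actual overflow behavior.

```python
MASK16 = 0xFFFF


def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def pack2(high, low):
    """Pack two signed 16-bit lanes into one 32-bit register value."""
    return ((high & MASK16) << 16) | (low & MASK16)


def unpack2(register):
    """Split a 32-bit register value into its two signed 16-bit lanes."""
    return to_signed16(register >> 16), to_signed16(register)


def add2(a, b):
    """Two independent 16-bit additions; no carry crosses between lanes.
    Wraparound on overflow is an assumption made for this sketch."""
    a_high, a_low = unpack2(a)
    b_high, b_low = unpack2(b)
    return pack2(a_high + b_high, a_low + b_low)


def max2(a, b):
    """Lane-wise maximum of two packed pairs."""
    a_high, a_low = unpack2(a)
    b_high, b_low = unpack2(b)
    return pack2(max(a_high, b_high), max(a_low, b_low))


# One Viterbi-style add-compare-select step on two path metrics at once:
# add branch metrics to two candidate metric pairs, then keep the larger.
survivors = max2(add2(pack2(10, 20), pack2(3, -4)),
                 add2(pack2(12, 15), pack2(-1, 2)))
```

With four ALUs each handling one packed pair, operations of this kind would account for the eight 16-bit additions or comparisons per cycle that the entry cites.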
emits information to enable C source debugging, and allows mixed C and assembly code. The assembler supports the ANSI C preprocessor to allow file inclusion, macro substitution, conditional assembly, and various numeric-constant formats. The assembler also allows expressions to include multiple user-defined labels and supports the Tcl preprocessor directives to allow the assembler to share macro operations with the debugger.

The debugger supports integrated debugging of one processor or multiple homogeneous or heterogeneous processors. It supports data and instruction breakpoints, software simulation with near real-time visibility, mixed assembly and C debugging, extensive code profiling, stand-alone or networked hardware emulation through the JTAG with the TargetView Communication System, hardware trace, and on-chip cycle count. Extensive on-chip debugging hardware lets you monitor many DSP16000s in real time. Another feature of the debugger is an architectural view, which provides a block-diagram view of the DSP's multiple processing units. As you step through the instructions of your application code in the debugger, the architectural-view utility graphically displays the data flow through the DSP. This feature enables you to view underused parts of the architecture and make code changes to increase code efficiency.

Third-party tools, such as Synopsys COSSAP, Cadence SPW, and Mathworks Matlab, support the DSP16000 simulator. The software tools cost $1500, and the hardware tools cost $5000 to $7000.
to check the status of FILU-200 execution. During host-code development, the C calls within the API initiate functions that perform operations that simulate the behavior of the FILU-200 hardware. Massana's instruction-set simulator provides bit-true, cycle-accurate simulation of the FILU-200 in C.

Massana provides a FILU-DMT, which is a FILU-200 core with preprogrammed G.Lite DSP functions and a synchronous serial interface for the analog front end (AFE). The AFE provides the ADC, DAC, interpolation, decimation, and front-end analog filtering in a single chip. The FILU-DMT is available on a PCI card, which enables a user to develop a soft G.Lite implementation on real-time hardware. The card includes the FILU-DMT implemented in a 20-MHz FPGA; a 33-MHz, 32-bit PCI interface; a G.Lite AFE using TI's TLFD500 codec; line drivers; and a data-access arrangement.
Motorola DSP56800
The DSP family's parallel-instruction set controls three concurrent execution units: the data ALU, the address-generation unit (AGU), and the program controller. The general-purpose C-style instruction set with its flexible addressing modes and bit-manipulation instructions enables you to write control code without worrying about DSP complexities.

The data ALU provides single-cycle multiplies and multiply-accumulate (MAC) instructions with 36-bit accumulation (4 guard bits), as well as a set of logical and arithmetic operations. The ALU contains X0, Y0, and Y1 input registers; two accumulators, which can also serve as input registers; a MAC unit; a 16-bit barrel shifter; and automatic saturation logic. You can write ALU results back to either of the accumulators. Additionally, if you don't expect the ALU result to be 36 bits, then the result can go directly back to one of the three input registers without corrupting an accumulator value.

The AGU can provide two data-memory addresses with address updates in one cycle. It contains five 16-bit pointer registers (one functioning as a stack pointer), an offset register, a modifier register for circular-buffer support, and two address ALUs (one supporting modulo arithmetic) to fetch two data items from memory every instruction cycle. The stack pointer has several addressing modes, improving compiler performance and supporting structured programming techniques, such as parameter passing and local variables.

The 56800 supports an interruptible hardware do loop on a block of instructions of any size. In a set of nested loops, a programmer generally uses hardware looping for the innermost loop. Then, you can perform the outer loops using software looping and the 56800's data ALU register, AGU register, or a memory location to store the loop counter. To improve the performance of software looping, the 56800 supports a decrement instruction that operates directly on X memory and uses a conditional branch operation.
Furthermore, Motorola added an addressing mode that requires no address calculation and allows direct access to the first 64 locations in X memory; this approach makes the access faster than a long immediate access.

Addressing modes: The 56800 supports register-direct, short and long memory-direct, seven memory-indirect, and immediate addressing modes. It also supports short-branch offset and modulo arithmetic for circular buffers.

Special instructions: The 56800 performs hardware-do and -repeat looping on one instruction or a block of instructions. Single and dual parallel-move instructions perform memory accesses in parallel with ALU operation, allowing two data-memory accesses while fetching an instruction. The 56800 can perform bit-manipulation operations on any register or memory location, and it can perform single-cycle multiply and MAC with optional rounding, addition, subtraction, and squaring. Using a conditional transfer instruction with a compare instruction implements searching and sorting algorithms. If the specified condition is true, then the DSP performs a transfer from one register to another (for example, to store the array index of the maximum value in an array).

Support: The 56800 uses Motorola's OnCE port for on-chip emulation through a standard JTAG interface. Metrowerks CodeWarrior, which Motorola now owns, offers an integrated development environment, a C compiler, an assembler, a linker, a code simulator, and a graphical source and assembly-level debugger for PCs. This package includes a 30-day evaluation license for CodeWarrior in the evaluation module (DSP56824EVM) and development system (DSP56824ADS).

Highlights:
- Device mixes DSP with control functions.
- Architecture features an interruptible hardware do loop.
- Device operates at 2.7V and 70 MHz.
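
The value of the 4 guard bits in the 36-bit accumulator is easiest to see numerically. The Python sketch below is a simplified model, not the 56800's exact arithmetic: the function names are hypothetical, and it assumes 16-bit signed fractional operands whose product is shifted left by one bit before accumulation. It shows a running sum that temporarily exceeds the 32-bit range, survives in the guard bits, and comes back into range; saturation applies only when the value is limited to 32 bits.

```python
ACC_BITS = 36  # 32-bit result plus 4 guard bits, as the entry describes


def wrap36(value):
    """Keep value within a 36-bit two's-complement accumulator."""
    value &= (1 << ACC_BITS) - 1
    return value - (1 << ACC_BITS) if value & (1 << (ACC_BITS - 1)) else value


def mac(acc, x, y):
    """acc + x*y for 16-bit signed fractional operands.
    The one-bit left shift of the product is an assumed fractional mode."""
    return wrap36(acc + ((x * y) << 1))


def saturate32(acc):
    """Limit an accumulator value to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, acc))


acc = 0
for _ in range(4):            # four near-full-scale products
    acc = mac(acc, 0x7FFF, 0x7FFF)
peak = acc                    # exceeds 32 bits but fits in 36
for _ in range(3):            # three negative products bring it back
    acc = mac(acc, -0x7FFF, 0x7FFF)
```

Had the accumulator been limited to 32 bits after every step, the intermediate overflow would have clipped and the final value would be wrong; with guard bits, only `saturate32(peak)` clips while the final sum is exact.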
assembler handles mapping of the AXU commands.

Special instructions: The REAL DSP's instruction set uses 16-bit operation codes, but you can extend the instruction set with 96-bit ASIs. An on-chip RAM or ROM look-up table contains these very-long-instruction-word-like ASIs, which contribute to a high level of parallelism. A 16-bit instruction entering the core contains an index of this table, which activates the corresponding ASI operations. If the silicon implementation of the DSP's look-up table is in RAM, your application can download sets of ASIs to the chip while the application is running and dynamically customize the DSP core. The assembler, linker, and instruction-set simulator account for the ASIs that you specify when writing your application program. You have to specify the keyword asi followed by all the operations that the core executes in parallel. The assembler/linker then checks for duplicate ASI instructions, translates the instructions to an ASI look-up table and, if necessary, downloads them to the DSP. You may use as many as 256 ASIs.

Support: Philips Semiconductor supports DSP-C, a proposed extension of ISO/ANSI C to better handle DSP-specific capabilities, such as fixed-point data types and dual Harvard architectures. The company is developing a C compiler for the REAL architecture using the Associated Compiler Experts (www.ace.nl) CoSy compiler-development platform.

Highlights:
- DSP has dual Harvard architecture.
- Data-computation unit comprises two 16×16-bit signed multipliers and two 40-bit ALUs.
- Device is customizable using application-specific units and instructions.
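
The ASI look-up scheme in this entry amounts to a small indirection table, which the following Python sketch models. Everything about the encoding is an assumption made for illustration: the class name, the `ASI_OPCODE` prefix value, and the choice to put the table index in the low byte of the 16-bit instruction are hypothetical, since the entry does not give the real instruction format. The sketch keeps only what the text states: 96-bit entries, at most 256 of them, duplicate detection at assembly time, and a 16-bit instruction that selects an entry.

```python
class AsiTable:
    """Model of the ASI look-up table: as many as 256 entries of 96 bits."""
    MAX_ENTRIES = 256
    WIDTH_BITS = 96

    def __init__(self):
        self.entries = []

    def add(self, asi_word):
        """Return the index for asi_word, reusing an existing duplicate."""
        if not 0 <= asi_word < (1 << self.WIDTH_BITS):
            raise ValueError("an ASI is 96 bits wide")
        if asi_word in self.entries:          # the duplicate check
            return self.entries.index(asi_word)
        if len(self.entries) >= self.MAX_ENTRIES:
            raise OverflowError("the table holds at most 256 ASIs")
        self.entries.append(asi_word)
        return len(self.entries) - 1


ASI_OPCODE = 0xA5  # hypothetical 8-bit prefix marking "execute ASI"


def encode(index):
    """Build a 16-bit instruction carrying a table index (assumed layout)."""
    if not 0 <= index < AsiTable.MAX_ENTRIES:
        raise ValueError("index must fit in 8 bits")
    return (ASI_OPCODE << 8) | index


def decode(instruction, table):
    """Expand a 16-bit instruction into the 96-bit ASI it refers to."""
    if instruction >> 8 != ASI_OPCODE:
        raise ValueError("not an ASI instruction in this model")
    return table.entries[instruction & 0xFF]


table = AsiTable()
first = table.add(0x0123456789ABCDEF01234567)
second = table.add(0x00000000000000000000BEEF)
repeat = table.add(0x0123456789ABCDEF01234567)   # reuses index 0
```

The 256-entry limit is consistent with an 8-bit index, although the entry itself does not say how the index is encoded.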
company also supplies a C compiler, a source-level C assembler/debugger, a linker, a simulator, a profiler, and an application library. Evaluation modules, prototype cards, emulators, and application algorithms are also available from third parties. TI also offers analog devices, such as data converters and a power-management supply, which you can combine with the C2000 DSPs. See www.ti.com/sc/4123 for more details.
The C55x has a PFU that tracks a program's execution point and generates the 24-bit addresses for instruction fetches for as much as 16 Mbytes of program memory. This unit includes hardware for looping and for speculative branching, conditional execution, and pipeline protection. A separate program counter is dedicated to fast returns from subroutines or interrupt service routines. The PFU also includes the logic for managing the instruction pipeline and the four CPU status registers.

The PFU provides three levels of hardware loops by nesting a block-repeat operation within another block-repeat operation and including a single repeat in either or both of the repeated blocks. It also includes hardware to support conditional repeats. The PFU handles pipeline-control hazards and provides protection against write-after-read and read-after-write data hazards. When such data hazards occur in a C55x instruction stream, the pipeline-protection logic inserts cycles to maintain the intended order of operations and correct execution of the program.

An integrated software wait-state generator in both DSPs allows you to use slower external memories. All devices support on-chip dual-access RAM (DARAM) that you can configure as data or program memory. The C55x has expanded options for synchronous-burst RAM, synchronous DRAM, and asynchronous SRAM and DRAM.

A PLL allows you to throttle the clock, but the C55x core can also actively and automatically manage power consumption of on-chip peripherals and memory arrays. When a program is not accessing individual on-chip memory arrays, they switch into a low-power mode. The processor provides similar control for on-chip peripherals. Peripherals can enter low-power states when they are inactive and respond to processor requests and exit their low-power states without latency. The C55x also implements user-controllable, low-power Idle domains.
These domains are sections of the device that you can selectively enable or disable under software control. When you disable a domain, it enters the Idle state, maintaining register or memory contents. When you enable the domain, it returns to normal operation. On initial C55x devices, the Idle domains are the CPU, the DMA, the peripherals, the
external memory interface, the instruction queue, and the clock-generation circuitry.

Addressing modes: The C54x supports single data-memory-operand addressing that also supports 32-bit operands. It also supports dual-data-memory-operand addressing, which parallel instructions use. It provides immediate, memory-mapped, circular, and bit-reversed addressing. In addition to the C54x modes, the C55x supports absolute addressing, register-indirect addressing, and the direct-addressing, or displacement, mode. The C55x's ADFU includes dedicated registers to support circular addressing for instructions that use indirect addressing. Your program can simultaneously use as many as five independent circular buffer locations with as many as three independent buffer lengths. These circular buffers have no address-alignment constraints. The C54x supports two circular buffers of arbitrary lengths and locations.

Special instructions: The C54x performs dedicated-function instructions, such as FIR filters, single and block repeat, eight parallel instructions (for example, parallel store and multiply accumulate), multiply and accumulate and subtract (10 multiply instructions), and eight dual-operand memory moves. The C55x also has special instructions that take advantage of the additional functional units as well as increase parallelism capabilities. User-defined parallelism allows you to combine certain instructions to perform two operations. You can also combine a built-in parallel instruction with a user-defined parallel instruction.

Support: The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm).
Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C5000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details.
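
The circular addressing described under addressing modes for this family can be summarized as a pointer that wraps within a buffer of arbitrary base address and length. The Python sketch below is a generic model under stated assumptions: the class and method names are hypothetical, and the code does not reproduce the C55x's actual register set. It illustrates the property the entry highlights, namely that the base address needs no particular alignment.

```python
class CircularPointer:
    """Pointer that wraps inside [base, base + length); base is unaligned."""

    def __init__(self, base, length, index=0):
        if length <= 0:
            raise ValueError("buffer length must be positive")
        self.base = base
        self.length = length
        self.index = index % length

    @property
    def address(self):
        return self.base + self.index

    def post_modify(self, step=1):
        """Return the current address, then advance by step with wraparound."""
        current = self.address
        self.index = (self.index + step) % self.length
        return current


# A five-word delay line that starts at an odd, unaligned address.
pointer = CircularPointer(base=0x1003, length=5)
addresses = [pointer.post_modify() for _ in range(7)]

# Negative steps wrap as well, as a filter walking backward would need.
backward = CircularPointer(base=0x2000, length=4)
reverse_addresses = [backward.post_modify(-1) for _ in range(5)]
```

Several such pointers with different lengths can coexist, which corresponds to the entry's statement about multiple independent circular buffers.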
unit set's register bank; the functional-unit set performs this procedure through one data bus; on the C62x, all units except the two data units have a data cross-path to the other set of units. The C64x data-cross-path accesses allow multiple units per side to simultaneously read the same cross-path source. Thus, one, multiple, or all the functional units on a side in a VLIW-execute packet may use the cross-path operand for that side. In the C62x, only one functional unit per datapath per execute packet could access an operand from the opposite register file.

The C62x register files support packed 16-bit data through 40-bit, fixed-point and 64-bit, floating-point data. You can store values larger than 32 bits in register pairs. The C64x register file supports all the C62x data types, packed 8-bit types, and 64-bit fixed-point data types. Packed data types store four 8-bit values or two 16-bit values in a single 32-bit register or four 16-bit values in a 64-bit register pair. Each C64x multiplier can return a result as large as 64 bits, so an extra write port is available from the multipliers to the register file.

The C6000 families support no separate X- and Y-memory spaces. Instead, they provide a single data memory with two paths, 64 bits wide on the C64x and 32 bits wide on the C62x, for loading data from memory to the register banks. Two other 32-bit paths (64 bits for C64x) store register values to memory. A 32-bit address bus supports these datapaths. The C64x can also access words and double words at any byte boundary using nonaligned loads and stores; the C62x requires alignment on 32- or 64-bit boundaries.

A 32-bit address bus addresses the program memory, but the single datapath is 256 bits wide. This width allows the C62x to fetch, but not necessarily execute, eight 32-bit instructions per cycle. TI calls this group of eight instructions a fetch packet.
The C62x architecture does not allow execute packets to cross fetch-packet boundaries, resulting in compiler-generated no-operation (NOP) instructions to pad fetch packets. The C64x architecture resolves this code-bloat issue with instruction packing in the instruction-dispatch unit. This approach removes execute-packet-boundary restrictions and eliminates all filler NOP instructions.

The CPU can execute one to eight instructions per cycle, but data dependencies, instruction latencies, and resource conflicts limit optimal performance. Multiple execute packets allow fully parallel, fully serial, or parallel/serial combinations; therefore, eight serial instructions require the same code size as eight parallel instructions. The compiler and assembly optimizer play big roles in establishing the sequence of instructions for the C6000 to execute. The programming
tools link instructions in a fetch packet by the least significant bit of an instruction. If the bit is set, the C6000 executes the instruction in parallel with the subsequent instruction. The assembly optimizer performs dependency checking and parallelism among instructions. Therefore, the code executes as programmed on independent functional units and eliminates the need for core features, such as out-of-order execution or dependency-checking hardware.

Two devices from these families, the C6211 and C6711, are the industry's first DSPs with L1 and L2 on-chip cache memory. The C6211 incorporates a two-level cache structure with 4-kbyte Level 1 program and data caches. The internal Level 2 cache memory is a unified 64-kbyte data and instruction RAM. The C6211 also includes a 16-channel DMA controller that tracks 16 independent transfers and allows you to link each channel to a subsequent transfer.

The C6202, C6203, and C6204 have a 32-bit expansion bus that replaces the 16-bit host-port interface and complements the external memory interface (EMIF). The second bus for I/O devices reduces the loading on the EMIF and increases data throughput. The EMIF and the expansion bus are independent of each other, allowing the CPU to perform concurrent accesses to both ports.

Addressing modes: The C6000 performs linear and circular addressing. However, unlike most other DSPs that have dedicated address-generation units, the C6000 calculates addresses using one or more of its functional units.

Special instructions: All C6000 processors conditionally execute all instructions, a method of reducing branching and, therefore, keeping the pipeline flowing. On the C64x, the MPYU4 instruction performs four 8×8-bit unsigned multiplies. The ADD4 instruction performs four 8-bit additions. All functional units can perform dual 16-bit addition/subtraction, compare, shift, minimum/maximum, and absolute-value operations.
The M units, and four of the six remaining functional units, support quad 8-bit addition/subtraction, compare, average, minimum/maximum, and bit-expansion operations. TI also added instructions that operate directly on packed 8- and 16-bit data. Bit-count and rotate hardware on the M unit extends support for bit-level algorithms, such as binary morphology, image-metric calculations, and encryption algorithms. The C64x's branch-and-decrement (BDEC) and branch-on-positive (BPOS) instructions combine a branch instruction with the decrement and test positive of a destination register, respectively. Another instruction helps reduce the number of instructions needed to set up the return address for a function call. The dual 16-bit arithmetic combines with six of the eight functional units and a bit-reverse (BITR) instruction to improve FFT cycle counts by a factor of two. The Galois field-multiply instruction (GMPY4) provides a performance boost over the C62x for Reed-Solomon decoding using the Chien search. Special average instructions improve the performance of motion compensation by a factor of seven on a per-clock-cycle basis versus the C62x. The quad-absolute-difference instruction bolsters motion-estimation performance by a factor of 7.6 on a per-clock-cycle basis for an 8×8-bit minimum-absolute-difference (MAD) computation. The C64x provides data packing and unpacking operations to allow sustained high performance for the quad 8-bit and dual 16-bit hardware extensions. Unpack instructions prepare 8-bit data for parallel 16-bit operations. Pack instructions return parallel results to output precision, including saturation support.
Support: The eXpressDSP software-technology strategy includes DSP integrated development tools; a scalable, real-time software foundation; standards for application interoperability and reuse; and a growing base of TI DSP-based software modules from third parties (www.ti.com/sc/docs/general/dsp/expresssp/index.htm). Code Composer Studio, an integrated suite of DSP-software-development tools, incorporates TI's C6000 C compiler with the Code Composer integrated development environment, DSP/BIOS, and Real-Time Data Exchange technologies. The assembly optimizer simplifies assembly-language programming and automatically schedules and parallelizes instructions from serial, inline assembly code. The assembler reads straight-line code without regard to registers or functional units and does the resource assignment.
Deterministic operation allows the debugger to lock-step through the code. The debugger performs code profiling to determine the amount of time the processor spends in various portions of the code. Free tools are available for a 30-day trial on the Web at www.ti.com/sc/docs/tools/dsp/6ccsfreetool.htm. Third-party tools and application algorithms are also available. See www.ti.com/sc/4123 for more details. TI offers hardware-emulation boards and starter kits.
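The way the C6000 tools chain instructions through the least significant bit of each instruction word can be sketched as a small grouping routine. This is a hedged illustration, not TI's dispatch logic: `split_execute_packets` is a hypothetical name, and the instruction words in the usage note are made-up values chosen only for their low bit.

```python
def split_execute_packets(fetch_packet):
    """Group instruction words into execute packets using bit 0.

    A set bit 0 means "run in parallel with the next instruction";
    a clear bit 0 closes the current execute packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # chain ends here
            packets.append(current)
            current = []
    if current:                    # chain left open by the last word
        packets.append(current)
    return packets
```

For example, `split_execute_packets([0x11, 0x21, 0x30, 0x40])` groups the first three words into one execute packet and leaves the fourth on its own. Eight words with every low bit clear yield eight serial packets, which is why serial and parallel code occupy the same space.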
JTAG-compatible debugging port. The company also provides software libraries, including support for digital-still-camera applications; H.263 image-compression algorithms; standard voice codecs; FFTs; and a 2-D, discrete-cosine transform. An RTOS kernel supports multitasking applications and handles multiple priorities for tasks, semaphore queues for synchronizing events between tasks, system functions to handle interrupts, state preservation, and DMA requests.
Zilog Z893x1/Z893x3
Zilog built the Z893xx family's architecture around a single-cycle multiply-accumulate (MAC) unit, which includes a 24-bit product register and a 24-bit accumulator and arithmetic-logic unit with no guard bits. The DSP runs from a 4k- or 8k-word ROM or one-time-programmable (OTP) program ROM. Two internal bus sets, a program-address/data-bus set and a data-address/data-bus set, allow the processor to access program and data concurrently with a MAC operation. The architecture's two RAM blocks can hold coefficients and data samples, which automatically feed directly into the MAC's input registers each cycle. RAM-block addressing automatically increments or decrements the address, which eliminates the need for data-address-generation code for each MAC cycle. Results of the MAC operation land in the product register and the 24-bit accumulator during each cycle. You can treat the product register as a general-purpose register when it is not performing multiplies. Although the Z893xx lacks a barrel shifter, a shifter between the product register and ALU allows you to shift the product result right by 3 bits before adding it to the accumulator. You can use the external I/O bus to access external peripheral devices, such as an ADC. An external read/write takes one cycle. You can insert one wait state using software control to access slow peripherals; you can use the Wait pin for additional wait states. Running code from external memory takes one additional cycle for each instruction; the chip reads the data in one cycle, but the data is unavailable for processing until the next instruction cycle.
Device has an accumulator-based, modified Harvard architecture. The MAC unit includes a 16×16- to 16×24-bit multiplier with automatic truncation. Features include an external I/O bus. Device does no hardware-repeat looping or bit manipulation.
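The MAC data path described above (a truncated product, an optional 3-bit right shift, and a 24-bit accumulator with no guard bits) can be modeled in a few lines. This is a sketch under stated assumptions: the text does not say which product bits the truncation keeps or how the accumulator behaves on overflow, so the model assumes the upper 24 bits of a 32-bit product and two's-complement wraparound. The names `wrap_signed`, `mac`, and `fir_mac` are hypothetical.

```python
ACC_BITS = 24  # accumulator width from the text; no guard bits


def wrap_signed(value, bits=ACC_BITS):
    """Wrap an integer into a signed two's-complement range (assumed overflow behavior)."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac(acc, coeff, sample, shift3=False):
    """One multiply-accumulate step on 16-bit signed operands."""
    product = (coeff * sample) >> 8      # assumption: keep upper 24 of 32 product bits
    if shift3:
        product >>= 3                    # optional shift between product register and ALU
    return wrap_signed(acc + product)


def fir_mac(coeffs, samples, shift3=False):
    """Accumulate one output from paired coefficient and sample blocks."""
    acc = 0
    for c, s in zip(coeffs, samples):
        acc = mac(acc, c, s, shift3)
    return acc
```

With no guard bits, the 3-bit pre-shift is the main tool for keeping a long sum of products inside the 24-bit range, at the cost of three bits of precision per product.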
Z893x1 devices have a codec interface that is compatible with 8-bit PCMs, 16-bit codecs, and 64-bit stereo sigma-delta codecs. You can adapt many general-purpose peripherals, including 8- and 16-bit ADCs and DACs, to this interface. You can also use the interface as a high-speed serial port or general-purpose counter. The Z893x1 chips also have one 13-bit timer for the codec interface and one 13-bit timer for general-purpose uses. You can concatenate these timers for extended timing. The Z893x3 has an 8-bit half-flash ADC, a high-speed SPI, three counter/timers, and as many as three 8-bit ports. It also has a PLL-driven system clock that drives the DSP to operate as fast as 20 MHz from a 32-kHz watch crystal.
Addressing modes: The Z893xx supports memory-direct addressing for as many as 512 RAM-based words; it also supports register-indirect addressing to RAM or ROM with pointer registers and immediate, short-form direct addressing using 16-bit data registers in RAM. It provides one-cycle, external-peripheral addressing, treating the peripheral as a register. Modulo-addressing options include Modulo 2 to 256 for data access.
Special instructions: The Z893xx performs compare register to accumulator, conditional execution of certain instructions, and conditional branching and subroutine calls.
Support: Zilog offers its Zilog Developer Studio, which comprises a macroassembler, a linker/loader, and a source-level debugger. Also available is the 3xx Compass/IDE, which comprises a C compiler, an assembler, a linker, a simulator, and application libraries. Zilog offers emulators and evaluation boards, OTP programming adapters, and target emulation pods supporting a design-in environment. Check out www.zilog.com/products/dspapp.html for additional information.
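Modulo addressing, as listed for the Z893xx, keeps a data pointer inside a fixed-size buffer without explicit bounds-checking code. The sketch below shows the general idea only; `modulo_increment` is a hypothetical helper, and the buffer-alignment rules of the real hardware are not modeled.

```python
def modulo_increment(pointer, base, modulus, step=1):
    """Advance a pointer by step, wrapping inside [base, base + modulus)."""
    if not 2 <= modulus <= 256:          # range quoted in the text
        raise ValueError("modulus must be between 2 and 256")
    return base + (pointer - base + step) % modulus
```

Stepping from address 19 in a four-word buffer based at 16 wraps back to 16, and stepping backward from 16 wraps to 19, which is the behavior a delay line for a filter needs.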
dspdirectory 24
bits
Motorola DSP563xx
The 563xx is Motorola's highest-performance fixed-point DSP architecture, achieving single-cycle instruction execution. Although the branch penalty is three cycles, the 563xx supports conditional ALU instructions, which often avoid the need to change program flow. When the processor executes a single-cycle multiply-accumulate (MAC) operation, the first execution stage does the multiply, and the second stage does the accumulate. The 563xx uses an interlocking mechanism that automatically inserts a nonoperation (NOP) instruction into the pipe to avoid stalls. This approach permits execution to catch up with data dependencies. The 563xx is binary-code-compatible with the 56000, but the 563xx also supports addressing modes that include address-register program-counter (PC) relative. This mode is useful for multitasking and position-independent code, which lets a programmer deliver and relocate object modules without relinking to the original code. Motorola expanded addressing on the 563xx to the full 24 bits, up from 16 bits on the 56000 family. Unlike the DSP56000, which has a 16-location stack limit, the DSP563xx implements an overflow mechanism that extends the on-chip hardware stack into off-chip data memory. Although the mechanism prevents unrecoverable stack overflows, the chip takes a two-clock penalty when externally dumping stack entries. The DMA has separate address and data buses. The DMA transfers data among memories (P, X, and Y) or between memory and peripherals or the external host buses (PCI or ISA). You can program the size of the program RAM, instruction cache, X-data RAM, and Y-data RAM. The static core operates from dc to 80 MHz and uses a PLL with a built-in prescaler that allows dynamic clock throttling. For additional power savings, the core automatically powers down unused memories, peripherals, and core logic on every instruction. The newest members of the 56300 family are the 56307 and 56311.
These devices include an on-chip enhanced filter coprocessor (EFCOP) that processes filter algorithms in parallel with the DSP's core operation. The EFCOP provides performance improvements for tasks such as voice coding and echo cancellation. It operates in modes to perform real- and complex-FIR filtering, infinite-impulse-response filtering, adaptive-FIR filtering, and multichannel-FIR filtering. The EFCOP has its own access to memory, as well as a port into DMA.
Device has a seven-stage pipeline, comprising two fetches, one decode, two address generations, and two executions. Device has conditional-ALU instructions. Architecture is register-based. Six-channel DMA operates concurrently with the core's execution units. Most devices operate at 3.3V and have 5V-tolerant I/O; some operate at 1.8V and have 3.3V-tolerant I/O. Filter coprocessor operates in parallel with the core.
Addressing modes: The 563xx supports register-direct, address-register-indirect, PC-relative, immediate, and absolute addressing.
Special instructions: The 563xx's barrel shifter supports multibit-shift instructions in both directions and by any number of bits. The shifter also supports instructions for bit-stream parsing and generation. The device can conditionally execute parallel ALU instructions based on all possible condition codes. If the test condition is false, the processor executes an NOP instruction. The 563xx performs 16-bit arithmetic that is useful for handling various compression algorithms, such as LD-CELP (low-delay code-excited linear prediction). Normally, when using a 24-bit architecture for 16-bit arithmetic, performance degrades, because you have to round the 24-bit numbers in software.
Support: Motorola backs the 563xx family with a host of development tools. You can use an application-development system to evaluate the chip and debug target systems. The system comes with an application-development module, a host-interface card, a command converter, an assembler, simulator software, and a C compiler. The 563xx's JTAG-based OnCE port allows you to examine all internal buses in real time and record the last 12 change-of-flow instructions. Motorola provides the Suite56 hardware- and software-development tools for the DSP563xx family. A range of third-party tools complements these tools. Third-party tools include a compiler and debugger from Tasking (www.tasking.com) and a debugger from Domain Technologies (www.domaitec.com). The Motorola software tools are available on CD-ROM, or you can download them from www.motorola.com/SPS/WIRELESS/dsptools/index.htm. Metrowerks, now part of Motorola, will unify the look and feel of development tools for new processors under the Metrowerks Code Warrior style.
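The software rounding step that the text says a 24-bit architecture normally needs for 16-bit arithmetic looks roughly like the sketch below. It is an illustration of the general cost, not Motorola's code: the function name is hypothetical, and round-half-up with saturation is an assumed rounding rule.

```python
def round_24_to_16(x):
    """Round a signed 24-bit value to signed 16 bits.

    Assumes round-half-up on the discarded low byte, then saturation
    to the 16-bit range; both choices are assumptions of this sketch.
    """
    rounded = (x + 0x80) >> 8            # add half an LSB, drop the low 8 bits
    return max(-0x8000, min(0x7FFF, rounded))
```

Doing this add, shift, and clamp after every arithmetic result is the per-operation overhead that native 16-bit arithmetic support removes.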
dspdirectory 32
bits
a combination of 16-, 32-, or 40-bit data and 48-bit instructions and perform as many as four accesses per cycle: program memory for code and data, data memory for data, and an off-chip load using the chip's I/O controller. SHARC's I/O controller executes I/O transfers in parallel with CPU execution. The I/O controller offloads reads and writes between on- and off-chip memory. The dual-ported, dual-banked nature of the memory, combined with the I/O processor, allows the core and the DMA to simultaneously access internal SRAM. The I/O controller manages all DMA channels, transferring data among internal and external memory and all peripherals, such as the host port, as many as eight serial ports, and six link ports. DMA operations generally do not interrupt or delay core thread execution. The DMA controller allows you to dynamically control the external-memory-bus width. The synchronous serial ports support time-division-multiplexed serial streams and hardware companding and can transfer data as fast as 40 Mbps. In all but the ADSP-21065L, the six communication ports move data in 4-bit nibbles, transferring as much as 1 byte/clock cycle. With six links operating simultaneously, maximum throughput is 600 Mbytes/sec. The CPU, I/O controller, and peripherals interconnect and perform flexible, nonintrusive transfers through a multibus-crossbar-interconnection unit. To reduce bottlenecks, the interconnect crossbar permits unlimited data and instruction movement from external or internal memory or cache and permits I/O from on- or off-chip peripherals, all in one cycle. The 21160, 21060, and 21062 provide six communication ports for array multiprocessing. These ports feed through the I/O controller and let you create meshes of DSPs that can access each other's memory spaces. (Point-to-point connections between DSP ports define each processor in the mesh.) The on-chip I/O controller sets up, runs, and responds to these ports.
Transfers pass through the I/O ports to and from internal memory. The I/O controller separates these transfers from mainstream DSP. A parallel port serves as a direct interface to off-chip memory, peripherals, or a host processor. As many as six SHARCs can share this interface with a host processor. SHARCs offer a unified address space using a 32-bit address bus and a 32- or 48-bit data bus. For a 100-MHz clock, the chip supports a 10-nsec access time with zero-wait-state memory. The special host interface supports both 16- and 32-bit µPs, as well as system buses, such as ISA and PCI. The host treats the SHARC as a memory-mapped device with direct writes or reads to internal memory.
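The link-port and memory-timing figures quoted for the SHARC follow from simple arithmetic, sketched below. The 100-MHz clock is taken from the access-time sentence and is assumed to apply to the link ports as well; two nibbles per clock is inferred from "4-bit nibbles" and "1 byte/clock cycle". The function names are hypothetical.

```python
def link_port_throughput(clock_hz, links=6, nibble_bits=4, nibbles_per_cycle=2):
    """Aggregate link-port throughput in bytes per second."""
    bits_per_cycle = nibble_bits * nibbles_per_cycle   # 8 bits = 1 byte per clock
    return links * bits_per_cycle * clock_hz // 8


def zero_wait_access_time_ns(clock_hz):
    """Access time a memory must meet to run with zero wait states, in nsec."""
    return 1e9 / clock_hz
```

At 100 MHz, six links at one byte per clock give the 600 Mbytes/sec quoted above, and one clock period gives the 10-nsec access time.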
The lowest-priced SHARC DSP, the ADSP-21065L, also provides a synchronous DRAM (SDRAM) interface that transfers data to and from SDRAM as fast as 240 Mbytes/sec, or twice the clock frequency. The glueless SDRAM interface can access 16- or 64-Mbyte SDRAMs and enables you to connect to any one of four external memory banks.
Addressing modes: SHARC offers immediate, indexed, bit-reversed, circular-modulo, and register-direct and -indirect addressing. (It must use indirect addressing for off-chip memory access.)
Special instructions: SHARC provides bit manipulation, division iteration, reciprocal of square-root seed, conditional subroutine call, single and block repeat with zero-overhead looping, fixed- and floating-point compare, and conditional execution of most instructions. SHARC supports the IEEE-754 single-precision floating-point format (23-bit data, 8-bit exponent, and sign bit) and a 40-bit extended IEEE format for additional accuracy (32-bit data).
Support: Analog Devices' software- and hardware-development tools include the company's VisualDSP integrated development environment, in-circuit emulators, and a development kit. VisualDSP provides the interface to an optimizing C compiler, an assembler, a linker, a simulator, and a debugger. Analog Devices' emulators are available for Universal Serial Bus, PCI, and Ethernet host platforms. An EZ-Kit Lite consists of an evaluation board and a limited but full-featured version of VisualDSP. Analog Devices based the SHARC assembly language on an algebraic syntax.
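The IEEE-754 single-precision layout mentioned above (sign bit, 8-bit exponent, 23-bit data field) can be inspected on any host with a short routine. This sketch covers only the standard 32-bit format; the 40-bit extended format is specific to the SHARC and is not modeled. `decode_float32` is a hypothetical helper name.

```python
import struct


def decode_float32(value):
    """Split a number's IEEE-754 single-precision encoding into its three fields."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                    # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF           # 23 bits, implicit leading 1 not stored
    return sign, exponent, fraction
```

For instance, 1.0 decodes to a zero sign, a biased exponent of 127, and an all-zero fraction.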
memories. The core registers, eight 40-bit extended-precision registers, auxiliary registers, and key-control registers reside in a central multiported register file of 32 registers. The C3x uses a software stack to support context switching. The third C3x subsystem, the I/O, comprises a single-channel DMA controller (dual channel in the C32) and a collection of peripherals that interlink with the peripheral-address and data-bus set. The memory-subsystem buses pass through a multiplexer and link to the peripheral bus, which serves the DMA controller and peripherals.
Device features 32-bit, floating-point architecture. Harvard-like architecture has a von Neumann-like programming environment.
Addressing modes: The C3x supports register-direct, paged-memory-direct, register-indirect, and immediate addressing. A single circular buffer supports circular addressing and bit-reversed addressing for FFTs. The circular buffer requires block-size and base-pointer registers plus an auxiliary register that the buffer shares with X and Y memories.
Special instructions: The C3x performs single- or block-instruction hardware looping (supports nestable block repeats but lacks automatic save and restore of status); standard branches, which empty the pipe; delayed branches, which wait three program cycles before changing the program counter; interlocked access instructions for multiprocessing (load/store integer or floating-point value and signal interlocked); computed gotos (dynamic subroutine calls); and conversion of floating-point to integer and vice versa. You can also specify instructions to execute in parallel.
Support: TI supplies full-speed in-circuit emulators, evaluation modules, and starter kits. The C3x, except for the C33, lacks JTAG support but has a proprietary five-pin emulation interface. TI sells a tool set that includes a C compiler, an assembler/linker, a source-level debugger, a code profiler, a simulator, and an application library.
Third-party tools include C and Ada compilers, multiple OS products, filter-design packages, advanced graphical-design tools, and hardware tools. Check out www.ti.com/sc/docs/tools/dsp/index.html for more information.
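Bit-reversed addressing, which the C3x provides for FFTs, visits a power-of-two buffer in the order obtained by reversing the bits of each index. The sketch below computes that order in software to show what the addressing hardware saves; the function names are hypothetical, and it does not model the C3x's registers.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Index sequence for an n-point radix-2 FFT; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]
```

An eight-point transform reads or writes its data in the order 0, 4, 2, 6, 1, 5, 3, 7. With hardware support the reordering costs no extra instructions.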
ory space. I/O can also use the external buses. A six-channel DMA subsystem with its own address and data buses moves data between the communications ports and memory without altering the CPU's sequential threads. Such data movements do not overload the DSP with servicing overhead, although some data contention for memory may slow CPU execution.
Addressing modes: The C4x supports register-direct, paged-memory-direct, register-indirect, immediate, and circular addressing to support single-sized circular buffers. The CPU applies bit-reversed operations to register-indirect addressing only.
Special instructions: The C4x performs single or block instruction, zero-overhead hardware looping (nestable block repeats but without automatic save and restore of status), standard/delayed branches, interlocked access for multiprocessing (load/store integer or floating-point value and signal interlocked), conversion of floating point to integer and vice versa, reciprocal and reciprocal square-root seed, and conversion to and from IEEE floating-point formats. You can also specify some instructions to execute in parallel.
Support: The development system includes scan-based emulation via the C4x's JTAG test port. External hardware can use the JTAG port to control the processor and to set and monitor registers or memory. You can string multiple C4x chips on a JTAG circuit for parallel debugging. One processor breakpoint can halt execution in an array of C4x chips, and you can single-step them all in lock step. TI sells a C4x evaluation board with four processors that works with a number of host platforms. Software tools include a C compiler, a source-level debugger for parallel debugging, an assembler/linker, and a simulator. TI offers an application library. Third-party support includes the Spox, Parallel C, Virtuoso, and Helios OSs, as well as a variety of hardware tools.
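Reciprocal and reciprocal-square-root seed instructions of the kind listed above typically return a coarse approximation that software then refines. A common refinement, shown here as a general technique rather than TI's documented sequence, is Newton-Raphson iteration. The seed values in the usage note are hypothetical stand-ins for what a seed instruction might return.

```python
def refine_reciprocal(d, seed, iterations=3):
    """Refine an approximation of 1/d with Newton-Raphson steps."""
    x = seed
    for _ in range(iterations):
        x = x * (2.0 - d * x)            # error roughly squares each step
    return x


def refine_rsqrt(d, seed, iterations=3):
    """Refine an approximation of 1/sqrt(d) with Newton-Raphson steps."""
    x = seed
    for _ in range(iterations):
        x = x * (1.5 - 0.5 * d * x * x)
    return x
```

Starting from a seed of 0.3 for 1/3, or 0.45 for the reciprocal square root of 4, three steps bring either result to within about one part in a million. Each step uses only multiplies and subtracts, which suits a processor without a hardware divider.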
dspdirectory 64
bits