Software Defined Radio (SDR) Based Implementation Ofieee 802.11 Wlan Baseband Protocols Thesis
Software Defined Radio (SDR) Based Implementation Ofieee 802.11 Wlan Baseband Protocols Thesis
Software Defined Radio (SDR) Based Implementation Ofieee 802.11 Wlan Baseband Protocols Thesis
BY
THESIS
Submitted to the University of New Hampshire in Partial Fulfillment of the Requirements for the Degree of
__________________________________________ Michael J. Carter, Ph.D. Thesis Director, Associate Professor, Department of Electrical and Computer Engineering.
__________________________________________ William H. Lenharth, Ph.D. Research Associate Professor, Department of Electrical and Computer Engineering, Director, UNH Research Computing Center.
__________________________________________ Kevin Karcz, M.S. Technical Lead, Wireless Consortium, UNH- InterOperability Lab.
__________________________________ Date
ii
DEDICATION
To my parents SREEMAN NARAYANA and LEELAMANI SURINENI for their support and encouragement
iii
ACKNOWLEGEMENTS
I would like to formally thank Prof. Michael Carter for his expert guidance and support throughout the thesis. I would also like to thank Prof. Sivaprasad and Dr. William Lenharth for giving me an opportunity to pursue graduate studies at UNH and for providing me an opportunity to work at the InterOperability Lab and also for serving on my thesis committee. I would like to express my deepest appreciation to Kevin Karcz for supporting me throughout my graduate career. I would also like to express my appreciation for the support of the staff of InterOperability Lab, especially Wireless Consortium. Finally, I want to express special thanks to all my friends for their support and encouragement.
iv
TABLE OF CONTENTS
CHAPTER
PAGE
DEDICATION .........................................................................................................................iii ACKNOWLEGEMENTS........................................................................................................ iv LIST OF FIGURES ...............................................................................................................viii LIST OF TABLES ................................................................................................................... ix ABSTRACT .............................................................................................................................. x Chapter I ................................................................................................................................... 1 Introduction .............................................................................................................................. 1 1.1 1.2 1.3 Motivation ................................................................................................................. 1 Contribution............................................................................................................... 2 Overview ................................................................................................................... 4
Chapter II.................................................................................................................................. 6 Background............................................................................................................................... 6 2.1 Introduction ............................................................................................................... 6 2.2 Spread Spectrum............................................................................................................... 7 2.2.1 FHSS Frequency Hopping Spread Spectrum............................................................ 8 2.2.2 DSSS Direct Sequence Spread Spectrum................................................................. 8 2.2.2.1 CCK Complementary Code Keying.................................................................. 8 2.2.2.2 PBCC Packet Binary Convolutional Coding ..................................................... 9 2.2.2.3 OFDM Orthogonal Frequency Division Multiplexing.......................................... 9 2.3 Summary ........................................................................................................................ 13 Chapter III .............................................................................................................................. 15 Prototype Hardware Selection................................................................................................ 15 3.1 Introduction .................................................................................................................... 15 3.2 Software Defined Radio (SDR) Architecture ................................................................... 15 3.2.1 Radio- Frequency Front End .................................................................................... 16 3.2.2 Analog/Digital Converters ....................................................................................... 17 3.2.3 Baseband Signal Processing ..................................................................................... 17 3.3 Comparison of DSP and FPGA systems .......................................................................... 18 3.3.1 Introduction to DSP and FPGA systems ................................................................... 18 3.3.2 DSP and FPGA Architectures .................................................................................. 19 3.3.2.1 Architecture of DSP Processor .......................................................................... 19 3.3.2.2 Architecture of FPGA ....................................................................................... 21
3.3.3 Memory Access and Parallel Processing................................................................... 23 3.3.4 Reconfigurability and Reuse .................................................................................... 24 3.3.5 Implementation of DSP Algorithms ......................................................................... 25 3.3.5.1 Digital Down Converter .................................................................................... 26 3.3.5.2 Implementing MAC based FIR filter ................................................................. 26 3.3.6 Availability of Libraries and IP Cores ...................................................................... 28 3.3.7. Architectures for Implementing DSP Algorithms in FPGA...................................... 29 3.3.7.1 Fast Fourier Transform (FFT) ........................................................................... 30 3.4. Summary.................................................................................................................. 31
Chapter IV .............................................................................................................................. 33 Simulation and Prototyping Methodologies ........................................................................... 33 4.1 . Simulation and Prototyping .................................................................................... 33 4.2 Simulation and Prototyping Environments....................................................................... 34 4.3. Prototype Design Flow................................................................................................... 36 4.4. Summary ....................................................................................................................... 38 Chapter V................................................................................................................................ 40 HRDSSS System Design.......................................................................................................... 40 5.1. Transmitter Design......................................................................................................... 40 5.1.1. Spread Spectrum Modulator .................................................................................... 40 5.1.1.1. CCK Modulation.............................................................................................. 41 5.2. Spread Spectrum Receiver Design.................................................................................. 42 5.2.1. Acquisition and ADC.............................................................................................. 42 5.2.2. Barker Matched Filter ............................................................................................. 43 5.2.3. Frequency and Phase tracking ................................................................................. 44 5.2.3.1. Data Demodulation .......................................................................................... 44 Chapter VI .............................................................................................................................. 46 OFDM System Design............................................................................................................. 46 6.1. Introduction to OFDM system........................................................................................ 46 6.1.1. OFDM Signal Representation.................................................................................. 46 6.1.2. Cyclic Prefix ............................................................................................................... 47 6.2. OFDM Frame Structure ................................................................................................. 47 6.3. OFDM Transmitter Description...................................................................................... 48 6.4. OFDM Receiver Design................................................................................................. 49 6.4.1. Synchronization ...................................................................................................... 50 6.4.1.1. Frame Detection............................................................................................... 50 6.4.1.2. AGC ................................................................................................................ 51 6.4.1.3. Frequency Offset Estimation ............................................................................ 51 6.4.1.4. Symbol Timing Estimation............................................................................... 53 6.4.1.5. Channel Estimation.......................................................................................... 54
vi
6.4.1.6. Residual Frequency Offset Correction.............................................................. 54 6.4.2. Demodulation and Decoding ................................................................................... 55 6.5. Summary ....................................................................................................................... 55 Chapter VII............................................................................................................................. 56 Low Level Design.................................................................................................................... 56 7.1. Design Requirements ..................................................................................................... 56 7.2. Design Flow................................................................................................................... 57 7.3. Prototype Implementation .............................................................................................. 58 7.3.1 Design Overview ..................................................................................................... 58 7.3.2. Prototype Architecture ............................................................................................ 60 7.4. Testing and Verification................................................................................................. 63 Chapter VIII ........................................................................................................................... 65 Conclusions and Future work................................................................................................. 65 8.1. Conclusions ................................................................................................................... 65 8.2. Future Work................................................................................................................... 66 LIST OF REFERENCES ....................................................................................................... 67 APPENDIX A.......................................................................................................................... 69
vii
LIST OF FIGURES
Figure 3-1 Generalized WLAN interface .......................................................................16 Figure 3-2 Architecture of a DSP processor ...................................................................19 Figure 3-3 Structure of a FPGA .....................................................................................22 Figure 3-4 FIR filter in graphical form...........................................................................27 Figure 3-5 Distributed arithmetic filter mechanism ........................................................30 Figure 5-1 Block diagram of DSSS modem ...................................................................42 Figure 6-1 OFDM frame format.....................................................................................48 Figure 6-2 OFDM training sequence structure ...............................................................48 Figure 6-3 Architecture of OFDM synchronizer.............................................................53 Figure 7-1 OFDM Prototype Design Flow .....................................................................59 Figure 7-2 OFDM Transmitter Functional Block Diagram.............................................61 Figure A-1 Detection and Timing Acquisition ...............................................................70 Figure A-2 Frequency offset estimation on Short and Long symbols..............................71 Figure A-3 Constellation before frequency offset correction ..........................................72 Figure A-4 Constellation after frequency offset correction.............................................72 Figure A-5 EVM per symbol for decoded packet ...........................................................73
viii
LIST OF TABLES
Table 2-1 WLAN Standards Comparison ................................................................................... 7 Table 3.1 Comparison of ALU performances .......................................................................... 20 Table 5.1 Data rates and symbol rates for HRDSSS system ...................................................... 40
ix
ABSTRACT Software Defined Radio Prototyping of IEEE 802.11 Wireless LAN Protocols
By Shravan K. Surineni Department of Electrical and Computer Engineering and UNH InterOperability Laboratory
The IEEE 802.11 family of wireless LAN protocols defines multiple physical layer implementations of which direct sequence spread spectrum (DSSS, 802.11b) and orthogonal frequency division multiplexing (OFDM, 802.11a/g) are currently the most popular. Market pressures are forcing the convergence of multiple wireless protocols into the same access device, and shortened product design cycles dictate rapid prototyping of new or enhanced protocols. The computationally intensive signal processing algorithms and high data rates associated with these protocols necessitate dedicated hardware implementation of some portions of the signal processing chain, yet allocating separate hardware resources for each of these standards would make the universal access device bulky and inefficient. Re-using the same software-
reconfigurable hardware to handle different processing algorithms would enable an efficient, flexible alternative to current prototyping and implementation methods. In this thesis, the feasibility of using Software Defined Radio architectures as a prototyping tool for wireless LAN baseband signal processor implementations is explored. Signal processing architectures and algorithms for DSSS and OFDM protocols were developed in the Simulink and Matlab environments, and were then translated to VHDL hardware descriptions. A reference design for a OFDM transmitter was synthesized for implementation on a Xilinx Virtex II FPGA, and functional and timing simulations verified the design correctness.
xi
Chapter I Introduction
1.1 Motivation
There has been rapid growth in the field of Wireless Local Area Networks (Wireless LAN or WLAN) over the last decade. Wireless LANs have transformed from bulky, costly, niche technologies into high performance systems operating at near-wire speeds. The Institute of Electrical and Electronics Engineers (IEEE) is driving the standardization of the wireless LAN technologies and the IEEE 802.11 standard [1] is currently the most popular wireless LAN protocol. Different wireless LAN standards (IEEE 802.11b, IEEE 802.11a and IEEE 802.11g, Bluetooth, HiperLAN etc.) exist that provide users with different levels of connectivity in terms of speed, area of coverage, and ease of use. The convergence of these existing and future technologies would produce a seamless, ubiquitous wireless network with voice, video and multimedia and broadband data services traveling across multiple wireless interfaces providing anytime, anywhere communications to its users. Such technology would enable users to always be connected to a network through a single device which has the ability to run different wireless LAN standards. This poses lot of challenges at the different layers of the network, right from the wireless interface (radio) to the application level. The device would have to monitor the different RF signals on different wireless interfaces and switch to standards appropriately. Implementing these different Wireless LAN technologies on a
single device that is limited in computational power and size would be challenging and exiting. In this thesis, we have focused on the issues in the implementation of the physical layer of the different Wireless LAN protocols, with specific attention to the defining architecture and implementation of the baseband processing of the IEEE 802.11 physical layer. The IEEE 802.11 defines multiple standards at the baseband, and spread spectrum and orthogonal frequency division multiplexing (OFDM) based standards are currently popular. Several complex and highly sophisticated signal processing algorithms have been developed for these standards to give better bit error rates and improve the quality of these standards. Implementing these multiple standards on the physical layer would require running these different algorithms for different standards on the same device. These computationally intensive signal processing algorithms require lots of hardware resources and allocating separate hardware resources for each of these standards would make the device bulky and inefficient. Reusing the same hardware to handle different algorithms, thus enabling a reconfigurable and flexible software radio system would be an efficient alternative. The feasibility of a Software Defined Radio (SDR) based implementation for two of the IEEE 802.11 physical layer technologies, Spread Spectrum and OFDM is studied in this thesis.
1.2 Contribution
The main contributions of this thesis include MATLAB-based reference implementations of two different physical layers for the Wireless LANs along with
architectures for SDR based prototyping. Analysis of algorithms for generation, detection and estimation of these modulations is carried out for performance evaluation, and architectures for implementing the same in software and hardware are proposed. Computationally efficient algorithms are studied and enhanced to enable simple architectures for SDR-based Wireless LAN implementation. There has been research in the field of implementation issues for the different physical layers for the Wireless LAN protocols, but little work has been done on the aspect of reconfigurability of these systems. Digital Signal Processing (DSP) would seem to be an obvious choice for a reconfigurable solution for WLAN protocols. The WLAN supports high data rates (in the range of 1-54 Mbps) and higher data rates would be supported by the next generation WLAN standards. To achieve these high rates, significant processing needs to be done in dedicated hardware. The most powerful current DSPs can only offer decoding rates of up to a few Mbps, and are not scalable with the evolving standards. As a result, we have chosen an FPGA implementation over a DSP implementation. Kevin Karcz at University of New Hampshires InterOperability Lab has designed a sytem level implementation of OFDM system. The developed system operates on a packet level and the receiver assumes the samples of the entire packet are available before demodulation. This system has been modified to suit a real-time operation working on a symbol by symbol basis. New architecture is proposed and system level design is carried out, to enable hardware prototyping. The proposed reference architecture for SDR based implementation uses a parallel and pipelined approach to achieve the desired rates, using significant hardware resources. Such a hardware intensive
approach is not suited for the uniprocessor architecture of a DSP. The proposed architecture enables implementation of the design on a Xilinx Virtex-II FPGA. The proposed architecture is capable of supporting up to 36 Mbps data rates. The SDR-based prototyping of the WLAN baseband processor is broken into three parts: 1. Understanding the communication algorithms used in the WLAN baseband processor and implementing these algorithms in software 2. Designing a flexible and simple hardware architecture based on the initial software implementation above. 3. Implementing these algorithms in hardware.
1.3 Overview
The rest of the thesis is organized as follows: In Chapter 2, we will briefly describe the evolution of different WLAN standards and build a better understanding of the different physical layers used in these protocols. Spread spectrum and OFDM modulation are described in detail and the main building blocks of these modulation schemes are discussed. In Chapter 3, we discuss the choice of hardware for prototyping the WLAN baseband. The advantages and disadvantages of the two hardware approaches, the DSPbased implementation and FPGA-based implementation are discussed in detail. Chapter 4 discusses the different prototyping and simulation methodologies available for rapid prototyping of communication systems and explains the data flow of the design process. Chapter 5 discusses the high-level design of an OFDM baseband processor and the modifications done to the original implementation done by Kevin, to enable a SDR based prototyping of the OFDM receiver. The implementation is done in MATLAB [2] . Some
well-known architectures and signal processing algorithms are presented and trade-offs for different architectures are examined. Chapter 6 discusses the low-level design using block diagrams and the implementation issues. The issues in implementing communication and DSP algorithms in hardware and using Xilinx Intellectual Property (IP) cores are described. Finally, we conclude with some directions for future work in Chapter 7.
Chapter II Background
2.1 Introduction
The Institute of Electrical and Electronic Engineers (IEEE) 802.11 standards are the most widely deployed Wireless Local Area Network (WLAN) standards compared to the other proprietary protocols. The IEEE 802.11 standard, completed in 1999 and reaffirmed in 2003 [1], specifies an operating air interface in the unlicensed 2.4 GHz Industrial, Scientific, and Medical (ISM) band. The base standard includes both Frequency Hopping (FHSS) and Direct Sequence (DSSS) spread spectrum methods. The DSSS mode is based on a 11-chip Barker spreading code. BPSK and QPSK modulations are used. The IEEE 802.11a standard, also released in 1999 defines an operating air interface in the 5.1-5.8 GHz Unlicensed National Information Infrastructure (UNII) bands. It uses Orthogonal Frequency Division Multiplexing (OFDM) to achieve data rates up to 54 Mbps. The IEEE 802.11g standard, released in June 2003, uses the same OFDM modulation as 802.11a, but enables operation in the 2.4 GHz ISM band. Table 2-1 gives a summary of these different standards and the technologies used. In the following sections we will describe these technologies, along with the basic concepts of spread spectrum and OFDM.
IEEE 802.11b RF Freq. Spreading Bandwidth Modulation Max. Power Throughput MAC Sensitivity 2.4 GHz ISM band Direct Sequence 22 MHz QPSK 100 mW 1-11 Mbps CSMA/CA -80 dBm
IEEE 802.11a 5.15-5.825GHz UNII-2 bands OFDM 20 MHz BPSK, QPSK, 16QAM, 64QAM 800 mW 6 54 Mbps CSMA/CA -80 dBm
IEEE 802.11g 2.4 GHz ISM band OFDM, Direct Sequence 20 MHz BPSK, QPSK, 16QAM, 64QAM 100 mW 1 54 Mbps CSMA/CA -80 dBm
data in 8-bit blocks (8*1.375 = 11). Six of the 8 bits are used to choose 1 of 64 complementary codes, which are 8 chips long and clocked out at 11 MHz. The other two bits are combined with the code in the QPSK modulator. 2.2.2.2 PBCC Packet Binary Convolutional Coding The optional PBCC is also defined in the IEEE 802.11b/g specifications. PBCC uses Forward Error Correction to improve the link performance when noise is the limitation. The data is encoded using a convolutional encoder, which utilizes a 6-stage memory. The encoder produces two bits for each input data bit and these output states are mapped to QPSK states. A code word controls how a chosen word alternates over time.
2.2.2.3 OFDM Orthogonal Frequency Division Multiplexing Orthogonal Frequency Division Multiplexing was adopted as the modulation technique for the IEEE 802.11a high rate wireless LAN and for the IEEE 802.11g highspeed extensions in the 2.4 GHz band. It uses a multi-carrier approach and is considered a robust method to overcome the hostile effects of a wireless channel such as multi-path propagation. The basic principle of multi-carrier modulation is to de-multiplex the original high rate data stream into a number of parallel lower data rate streams and then modulate a different carrier with each of these lower data rate streams with a different frequency. The resultant signals are transmitted together in the same band. The separation between the different carrier frequencies imparts the orthogonal nature to this frequency division-multiplexing scheme. The frequencies are chosen such that the spectral null of 9
one modulated carrier is exactly at the frequency of an adjacent carrier. Even though this would make the spectra of the modulated carriers overlap, it ensures that the signals at the different frequencies are recovered at the receiver without significant mutual interference. To achieve this end, it would require a large number of oscillators, each locked to precise frequencies, for the orthogonality of the carriers to hold. This is expensive and cumbersome to implement in practical systems. However, the advances in Digital Signal Processing, particularly the Fast Fourier Transform (FFT), make implementation of OFDM viable. Using IFFT and FFT, data can be digitally modulated/demodulated onto a large number of carriers. Another important feature of an OFDM system is the length of the FFT transform. The set of sub-carriers generated during one transform is called an OFDM symbol. The duration of the OFDM symbol, or the length of the FFT, is related to one of the main reasons for using OFDM, which is the reduction of multi-path or echo effects. The longer the length of the FFT, the longer the echoes that can be handled. For indoor wireless networks, the transmitter and the receiver may be separated by up to 100 m, and the delay spreads are up to 300 ns, and the channel bandwidth is in the range of 20 MHz. When using a 20 MHz channel and with a OFDM symbol duration of 3.2 us, the number of carriers required would be 64, and a 64 point FFT would be sufficient.
10
Figure 2.1 shows the various components of an OFDM transmitter and receiver. The data entering into the transmit chain is scrambled and convolutionally encoded. The encoded data is sent to the interleaver. The interleaver consists of a block interleaver and a matrix interleaver. The interleaver decorrelates the data and spreads adjacent data over many subcarriers. The data is then passed to a modulation mapper, which is commonly Phase Shift Keyed (PSK) or Quadrature Amplitude Modulation (QAM) depending on the type of communication system. The IEEE 802.11a standard specifies usage of Binary PSK, Quaternary PSK, 16 QAM and 64 QAM. While the wireless standards use lower order modulations, wired OFDM technologies such as the digital subscriber loop (xDSL) use higher order modulations up to 256 QAM and 1024 QAM. The modulator thus concurrently and separately modulates a set of tones in the OFDM spectrum. The output of the modulator is then converted from serial to parallel form and the complex frequency domain data is thus transformed to time domain using Inverse Fast Fourier Transform (IFFT). The time varying data is then cyclically extended with a cyclic prefix to reduce inter-symbol interference between successive OFDM symbols. The cyclic extension of the OFDM symbol, or the guard interval, also helps in reducing the effects of multipath
11
as long as the guard interval is greater than the maximum channel delay spread albeit at the cost of a part of signal bandwidth. Taking the cyclic prefix into account, the signal model for OFDM transmission over a multipath channel becomes very simple. The transmitted symbols Xl,k at time slot l and sub-carrier k are solely affected by a factor Hl,k which is the channel transfer function (Fourier transform of channel impulse response) at that sub carrier frequency and by additional white Gaussian noise. Equation 2.1 below, shows the influence of the channel and noise on the received signal.
(2.1)
An equalizer block follows the removal of the cyclic prefix and the FFT, and removes the influence of the channel. The FFT brings the data back to the frequency domain and the equalizer retrieves the input data symbols using an estimated channel inverse model. To help in estimation of the channel transfer function, special training symbols are embedded in the OFDM symbol.
12
The channel transfer function is obtained by interpolation between the coefficient values estimated from the training symbols. However, this simple equalization scheme might not work for those frequencies where the channel transfer function is relatively small because dividing by a small value amplifies the noise, and thus make detection of the input data difficult. Since the data was channel-coded at the transmitter, one can still retrieve the symbols that experienced a bad channel with the help of those that were received successfully. Following the equalizer block is the Viterbi decoder for decoding the convolutionally coded data. The descrambler following the decoder, outputs the original bit stream sent for transmission.
2.3 Summary
The two popular WLAN systems use two different protocols in the physical layer, namely Spread Spectrum and OFDM. The baseband processing of these two systems is vastly different with one using a single carrier modulation and the other using multi13
carrier modulation. In particular, the OFDM physical layer supports a variety of data rates by using convolutional coding and BPSK, QPSK and QAM modulations. The errorcorrection code used in OFDM systems has rate , K=7, and polynomials {133, 171}. The rate of the convolutional code determines the amount of redundancy added to the transmitted code word. A lower code rate is thus stronger, but makes inefficient use of bandwidth. The higher order modulations required for higher rates also demand high SNR at the receiver to be able to demodulate the signal. Thus the longer constraint length and higher order modulation increases the complexity of OFDM systems. Thus the OFDM systems require signal processing architectures that support data rates in the range of 6-54 Mbits/s. Before we describe the architecture used for simulating and prototyping the OFDM system, we shall review some of the important aspects of simulation and prototyping of communication systems and WLAN systems in particular.
14
15
architecture that would permit implementation of different WLAN protocols on a single design.
Host Interface
Antenna Switch
PA 2.4 GHz
802.11a/g Modem
16
Where b is the DAC resolution in bits, Fsampling is the sampling frequency and BW the bandwidth of interest. Fsampling is bound by the Nyquist theorem as twice the bandwidth of the signal to be sampled. The band of interest is 22 MHz for 802.11b systems and 20 MHz for 802.11a systems requires a sampling frequency of 44 MHz and 40 MHz respectively. The signal quality goals are defined by the requirements set by the standards. The goal of the thesis is to implement the baseband section of the 802.11a/b/g protocols and hence the ADC/DAC selection is not discussed further.
17
1) Multi- mode ASICs where device functionality can be switched according to the mode of operation, 2) DSP-based implementation 3) Programmable logic-based implementation (using FPGA or PLD). The long design cycle of ASICs makes them unsuitable for prototyping. Both DSP and programmable logic based designs have advantages over other techniques. The next section discusses the advantages of both FPGA-based and DSP-based implementations and how the FPGA-based implementation is ideal for Software Defined Radio based prototyping of WLAN baseband processing.
18
number of available gates on the FPGA limits its performance, and the performance of the FPGAs is generally given in terms of the number of available system gates and the maximum clock speeds. FPGAs are uniquely suited for repetitive DSP tasks, such as multiply and accumulate operations (MAC), because they can perform these repetitive tasks in parallel. The Virtex II from Xilinx represents the state of the art in FPGA with up to 6 Million system gates and typical clock frequencies up to 300 MHz. The TMS 6000 series from Texas Instruments represents the state of the art in DSP chips with up to 2400 MIPS and clock speeds up to 300 MHz. However, with FPGAs, achievable clock speeds are often far less than the stated maximum clock speed, whereas with DSPs the MIPS rate is more often achieved. Efficient architecture is the key in utilizing the full clock speed of an FPGA.
Program Memory
Instruction Decode
I/O
Registers
ALU
Memory
19
In a DSP processor, the ALU is tailored towards DSP functions. The main DSP function is the multiply and accumulate operation and the important factor is memory bandwidth. The program, which is stored in the program memory, determines the execution sequence and the program memory itself might be an overhead. The DSP structure is generally fixed with a fixed size data width, and some of the algorithms might not need the full data width. The ALU itself has instructions to support lots of different algorithms, and the algorithms needed in 802.11 signal processing might not need all the instructions supported. Flexibility of a DSP processor lies in the fact that it executes instructions sequentially. But it would be disadvantageous when executing loops and for doing data input and output operations.
Sample Rate
Operations per Sample Period Two Multiplier ALU at 200 MHz 50000 325 20 9 4 Four Multiplier ALU at 400 MHz 200000 1302 80 36 16
The algorithm performance in a DSP processor (which in turn determines the maximum sample rate) is determined by the speed (maximum clock rate) supported by ALU, the number of parallel operations supported by the ALU and the number of operations required by the algorithm on a per symbol basis.
20
As seen from the Table 3.1, the DSP processor has much flexibility for voice applications, which uses 8 kHz sampling rate, and for CDMA cellular phone communication chip rates, which uses 1.2288 MHz sample clock. The 20 MHz and 44 MHz sample rates are associated with 802.11a and 802.11b systems and the number of operations that can be done per sample is reduced to 36. High rate analog to digital converters typically use sample rates above 100 MHz, in applications such as digital down conversion, and the DSP processing power is greatly reduced at this rate. Some important features of the one of the state-of-art DSP processors, the TMS320C6414 from Texas Instruments (Ti) are as given below Highest performance fixed point DSP with 2, ns instruction time Up to 720 MHz clock rate Six ALUs, which support Single 32-bit, Dual 16-bit, or Quad 8-bit arithmetic per clock cycle. Two multipliers supporting four 16*16-Bit Multiplies per clock cycle. Viterbi decoder co-processor L1/L2 Memory architecture that supports 128 K-bit L1P program cache, 128 Kbit L1D Data Cache, 8 M-bit L2 Unified Mapped RAM/Cache Two external memory interfaces
3.3.2.2 Architecture of FPGA There is currently a wide range of FPGA products being offered by many semiconductor vendors, including Xilinx, Altera and Atmel. The architectural approaches
21
of different manufacturers are different. However, a generalization can be made in that most FPGAs are organized as An array of logic elements, and a set of Programmable interconnections between the logic elements, the I/O pins, and other resources such as on-chip memory. The logic elements: Each logic element typically consists of 1 or more n-input RAM based look-up tables where n is between 3 and 6, and several flip-flops. Configuration: Configuration of the FPGA and interconnection of logic elements, the I/O pins and other resources inside the FPGA is accomplished by downloading a bitstream into a static RAM configuration memory inside the FPGA. This bit stream defines the functionality of each of the logic elements and the internal routing between each of the logic elements. Different applications can be supported inside the FPGA by reconfiguring the FPGA with appropriate bit streams.
Block Memory
22
The Xilinx Virtex-II series: As a specific example, the Xilinx [2] Virtex-II family of FPGAs is considered. The basic logic element in this FPGA is a slice; it consists of Two 4-input look-up tables (LUTs) Two flip-flops Several multiplexers Additional silicon to support other applications
Four of these slices form a configurable logic block (CLBs). Some important features of the Xilinx Virtex-II FPGA are Largest FPGA with up to 8 million equivalent system gates Up to 3 Mbits of embedded BlockRAM memory Up to 168 18*18 multipliers Up to 12 DCMs (16 global clocks) Up to 1108 user I/O pins 64 to 16224 CLBs
23
these high rates of I/O. With the advantages of minimal signal loading and with no overhead for instruction fetching or data fetching, FPGAs become a more suitable choice in high bandwidth and high data rate systems. In DSP, large data sets can be used as it is optimized for use of external memory. FPGAs traditionally have limited amounts of internal memory, which makes smaller data sets ideal. The current FPGAs have overcome this obstacle with more than 3 Mbit onchip memory. Current DSP chips have multiple arithmetic units for performing
arithmetic operations. They include several MAC units and, up to a point parallelism can be used to increase the performance of MAC operations. The degree of concurrency is limited by the total number of MACs in the DSP chip and thus limits the full exploitation of potential concurrency in the DSP algorithm. At the cost of lower precision in calculation, the effective number of MAC units can be increased. FPGA hardware allows allocating separate resources for fully exploiting the concurrency in the signal processing algorithms. A 32-tap FIR filter has to perform 32 multiplications followed by an addition for each output of the filter. An FPGA can be configured to perform all of these operations in parallel using dedicated multipliers and then add their product.
Re-use of processing units is supported by DSP, where a multiplier that is used in implementation of an FIR filter can be reused in a FFT routine. In FPGAs reuse of processing units is not permitted, as the resources are not shared between different routines. In the Virtex II FPGA, this obstacle is overcome by having a large number of multipliers on the FPGA. The digital components found in a communication modem include DSP chips, RISC microprocessors, microcontrollers, RAMs and communication chipsets (such as PCI and Ethernet interfaces) that can be completely reprogrammed in software. FPGAs allow design of modems in which the majority of the system can be reconfigured remotely both in software and hardware. A DSP program has to take advantage of the architecture of the specific DSP processor, where only the program can be reconfigured. FPGAs with short turn around times allow custom architectures to be made for different applications.
25
3.3.5.1 Digital Down Converter Digital Down Conversion (DDC) is a fundamental part of the communication system. Digital radio receivers have fast A/D converters delivering vast amount of data, but in many cases the signal of interest is a small portion of that bandwidth. DDC allows the rest of the data to be discarded, thus allowing more intensive processing to be performed on the signal of interest. 3.3.5.2 Implementing MAC based FIR filter The FIR filter is one of the basic building blocks common in nearly all DSP systems. The output sample stream is generated by convolving the input sample stream with N filter coefficients. In equation 3.2, N multiplications and N-1 additions are required to compute each value in the output stream Y(n).. The operation of an FIR filter is described by equation 3.2,
Y ( n) =
i = N 1 i =0
k (i) S (n 1)
(3.2)
The filter coefficients k(i) determine the frequency response of the system. In applications that require a large number of filter coefficients, the arithmetic load would be substantial. Figure 3.3 shows that the same function can be realized as a set of delay elements and multipliers with one delay and one multiplier for each tap of the filter followed by a summation function.
26
S(n)
Z-1
Z-1
Z-1
Z-1
k0
k1
k2
k3
kN-1
The basic operation of the FIR filter equation is multiply and accumulate (MAC). It requires a multiplier that gives the product of the two operands presented at its input and an accumulator to accumulate the data samples. The main requirement is to provide operands (data samples and coefficients) to the multiplier input. Coefficients and data samples are stored in memory, and address generation units select the data and coefficient pairs to be presented to the multiplier. The memory should be able to supply two operands to the multiplier in a single clock cycle, which requires two separate memory banks or a dual-port memory. The MAC based filter operates at a rate of N times the sample rate. As the sample rate increases, the product of N and sample rate becomes very large. If a single MAC engine has to support an 8-tap filter at 100 MHz, it would require a MAC engine rate of 800 MHz and memory bandwidth of 1.6 GHz. A solution would be to split the number of taps (N) between a collection (called a farm) of MAC engines. If the above filter were split into 8 different MAC engines, then each MAC engine rate would be 100 MHz. The DSP farm is a term used to describe an array
27
of DSP processors on a PCB, each of which perform a part of a bigger function. Although the MAC rate is divided, the memory bandwidth requirement remains the same. In each clock cycle the filter is performing 8 multiplications, which requires 16 words. If the memory words are also split among blocks of memory using distributed RAM, it would relax the requirements on memory bandwidth. As sample rates and the number of taps on the filter increase, more calculation units and more memory accesses would be required. This leads to more individual memories of smaller capacity. By continuously dividing the number of taps between more and more MAC engines, a fully parallel filter structure can be achieved. The Virtex-II FPGA has dedicated 18-bit * 18-bit multiplier blocks that can be used in the MAC-based FIR filter. Higher order multipliers can be implemented using the logic slices. There is 3 Mb of dual-port RAM in 18 kbit block select RAM resources and 1.5 Mb of distributed RAM which can be used as storage for the N filter coefficients and N data samples. The configurable nature of the FPGAs would permit a designer to exploit the high degree of potential parallelism in many DSP algorithms to achieve even higher performances.
modules are being made available for FPGAs as predefined modules (whose parameters can be tuned as necessary). A variety of standard DSP algorithms for filters, correlators, 28
transforms, memories and math functions can be used to support a wide variety of applications including communication and signal processing. These cores make it much easier for DSP system designers to use FPGA technology and reduce the implementation time.
29
However, storage must be provided to store all of the multiplication tables used to accomplish partial product multiplication. x(n)
SR SR SR SR SR Lookup Table
Scaling Accumulator
Add/ Sub
R E G
y(n)
clock
A shift register can be used to shift the multiple-bit input value into the FPGA, one bit at a time. With an input variable of N bits length, N cycles are needed to complete a single inner product calculation. Speed is improved by splitting the input word into sub-words, and shifting each of these sub-words into FPGA in parallel.
3.3.7.1 Fast Fourier Transform (FFT)
The data bits in OFDM are mapped to complex signal vectors using BPSK, QPSK or QAM modulation, and these vectors modulate sub-carriers using 64-point complex Fast Fourier Transform (FFT). The FFT is computationally the most demanding block of the transmitter and receiver design in the OFDM radio. State of the art DSP chips can perform the 64-point FFT in 60 s. Additional time is required to perform bit-reversal operation if required. Using standard radix-4 Cooley-Tukey algorithm and using parallelism, the same 64-point FFT can be performed in a Virtex-II FPGA in less than 4 s. This includes the bit reversal operation. The IEEE 802.11a standard defines the 30
FFT/IFFT period to be 3.2us, which is very difficult with the DSP technology available (at the time the project was started). Xilinx provides an FFT core, which is optimized specifically for the Virtex-II FPGA and is guaranteed to run speeds up to 100 MHz. The Xilinx FFT core provides a result vector every 192 clock cycles, which is equivalent to 3.2 s at a 60 MHz clock speed . The Xilinx FFT core also eliminates the need to design the 64-point complex FFT. In order to reach the real-time speed requirement of 60 MHz for the IEEE 802.11a baseband, it is absolutely necessary to use a highly pipelined architecture as much as possible. This implies that all the arithmetic operations are to be performed in parallel, which is quite difficult using a DSP . The I/O bandwidth is another limiting factor in a DSP chip. In an FPGA, with the exception of some dedicated pins, most of the pins can be used for I/O, which provides the extremely high I/O bandwidth required for FPGA-based signal processing designs. The large number of I/O pins and the multiple banks of internal memory can be used to employ a high degree of parallelism in an FPGA design. Separate memory banks are used for input/output and working memory in commercially available FFT cores. This permits concurrent I/O operation and FFT computation.
3.4.
Summary
In this thesis we chose the FPGA as the target hardware for the baseband prototype
based on the speed, size and reconfigurability requirements of the baseband design. A benchmark for the performance of a hardware solution is the number of operations per second it can support. To support the 54 Mbit/s rate specified by the IEEE802.11a
31
standard, the baseband design requires at least 15 GOPS (giga operations per second), which is almost 10 times faster than the fastest Texas Instruments DSP on the market (C64X) which advertises up to 1.6 GOPS. On the other hand, current FPGAs can easily support up to 20 GOPS due to the parallel processing nature of an FPGA. This makes FPGAs the logical choice for prototyping the IEEE 802.11a baseband design. The Xilinx Virtex II FPGA [3] was chosen as the target hardware as it is one of the most advanced FPGAs on the market supporting large designs of around 6 million gates. Using the large Virtex II FPGAs minimizes the partitioning of a design, thereby reducing the complexity and greatly increasing the overall speed. The Xilinx Virtex II FPGA is also wellsupported by the Nallatech Extreme DSP system [3] which was chosen as the prototype platform in the design flow. The Extreme DSP platform enables real-time functional verification of a baseband design with system operating frequencies that exceed 100 MHz. The Extreme DSP prototyping platform supports the Virtex II FPGA and also has two ADC channels and two DAC channels. Extreme DSP also supports verification with the HP 16500C Logic Analyzer, which allows the capture of data from the baseband design running on Extreme DSP to compare to simulation results.
32
33
The computational models assumed by simulators and by prototyping environments explain the difference between the two approaches. In a simulation environment the host computers CPU provides all resources for performing signal processing. Most of the simulation languages and tools isolate the host-based simulation environment from the DSP-based execution environment. Real-time data acquired by the prototype hardware cannot be easily passed back to the host for analysis; likewise, simulated data generated on the host cannot be processed on the prototype. In this thesis we have reviewed different approaches to the rapid prototyping of communication protocols and propose a data flow optimized for SDR-based prototyping of Wireless LAN protocols.
34
diagram must be translated into a program suitable to execute on the prototype hardware. Mathematical equations are used in algorithm creation while block diagrams and simulation is preferred for system design. A digital signal processor is typically used in prototyping communication systems which require assembly language or C programming language to generate an executable routine. Both simulation and prototyping are inherent in communication system design, with initial design entry done on the host and final testing of the design is done on the prototyping hardware. In a block-based system level design, each block is represented by an equation, which specifies the algorithm implemented by that block. The proper operation of individual blocks and the entire system is verified by simulation. The system is then translated into C or HDL and compiled to run on the DSP or FPGA. The languages and design tools available today are largely incompatible with each other and are usually unable to execute both on the host and the prototype. Matlab, Simulink, C and hardware description languages interoperate poorly, and run either on a host or on a prototype. Matlab is the preferred programming language for algorithm designers, and is well suited to describe equations. Though add-on software packages exist that transform Matlab code to C code to run on a DSP, they are not very efficient. Though Simulink is the preferred choice for block diagram entry, it only runs on the host platform. C code written for a DSP typically uses DSP-only libraries, preventing it from executing on the host. Integrating C code or VHDL code with Matlab or Simulink is a very difficult task, and requires knowledge of the C-MEX interface or the Simulink S function interface.
35
The monolingual nature of todays languages and tools limits the rapid prototyping of complex communication designs. Though there is some flexibility offered by Matlab and Simulink to use both of them in designing a communication system and generating equivalent C code or HDL code, the compilers have limited capability to understand the design and the design is far from optimal. Another problem is the isolation of the hostbased environment from the prototype-based environment. This makes it difficult for prototype hardware to pass the acquired data to the host for further processing. Likewise, simulated data generated on the host cannot be passed to the prototype.
36
Another approach is to use fixed-point C code models which offer the advantage of fast simulation speed, but have the disadvantage of not being directly synthesizable to hardware. Using fixed-point C code in a prototype design flow requires another design flow step and more resources to translate the C code into a synthesizable HDL The EDA (Electronic Design Automation) industry is currently focusing on standardizing a fixedpoint subset of C code called System C which can be easily translated to a hardware description language in an effort to close the gap between system-level floating-point models and hardware. In comparison to standard HDLs, System C does not offer any significant advantages in terms of ease-of-use. Thus, fixed point C is not well suited to be used in the rapid prototyping design flow due to the additional step required to translate it to hardware description language. Simulink offers a graphical entry tool called System Generator. In this thesis we explored using Simulink for rapid prototyping of the Wireless LAN protocols. The Xilinx block set within the System Generator provides a system-level, fixed-point functional simulation environment as well as the means to create VHDL code that can be synthesized to a Xilinx Virtex FPGA. This VHDL code can be simulated on any of the various HDL simulation tools on the market. Since Xilinx Block-set is a component of Simulink, mixed-mode simulations with both floating-point and fixed point blocks are also possible. The disadvantage of System Generator is that the VHDL code generated within this tool is not easily readable and is not intended to be modified. This is a problem when transferring the tool-generated VHDL code to a development group that does not use System Generator. Another problem is that most of the Xilinx Cores used in the design are modeled as black boxes. Inside these black boxes the behavior of the
37
actual Xilinx cores are modeled for simulation purpose only and are not expanded until the Xilinx place and route stage in the design flow. The System Generator simulations are bit and cycle-true simulations of the actual hardware implementation, so it is very important to accurately emulate the timing and latency of the core black boxes. If the core behavior is not modeled correctly in System Generator, the resultant VHDL implementation will have timing errors thus causing unnecessary iterations through the design flow. Another problem was that this package was in its early stages of development and was not very efficient in using the available prototype resources. The developed executables and the VHDL code have very little ability to communicate with the original simulation. Another common approach for doing fixed-point modeling is to directly hand-code the HDL based on the floating-point algorithms and fixed-point algorithms developed in simulation. This approach has the advantage of creating optimized HDL that can be directly synthesized to hardware; however, it requires good knowledge of the system level design using Matlab and Simulink and the chosen HDL language, as well as the resources to code and simulate the HDL. Since creating HDL is a part of any prototype design flow which targets either an FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), the issue is to create HDL from the floating point design.
4.4. Summary
Today's traditional design methodology suffers from segregation. Each of the tools explained above have strengths, but these tools cannot be combined to avoid the
38
weaknesses inherent in each tool. An idea captured as a rough sketch of a block diagram, with equations noted beside each block, cannot be entered into a computer; the block diagram tool is incompatible with the equation tool. When debugging the system, particularly on hardware, the best analysis and debug tools available on the host must be abandoned for less usable hardware-specific tools designed for the hardware. This thesis addresses these problem areas by unifying different classes of tools, exploiting both block diagram design and equation entry to serve as a prototype for the final implementation. An easy path is described to migrate from equations coded in Matlab, which cannot be executed on a hardware platform, to synthesizable code written in VHDL that can be directly mapped into an FPGA. The system level design was carried out using floating-point algorithms in Matlab, and fixed-point simulations are done in Matlab where required. Each of the blocks is then coded in VHDL and the functionality of each block is verified with its Matlab counterpart. This development environment, where both simulation and prototyping environments are used enabled the implementation of the final design.
39
Chapter V
Table 5.1 Data rates and symbol rates for HRDSSS system
40
packet modulation depends on the choice of data rate. The preamble is used for synchronization while the header contains information necessary to demodulate and decode the message portion of the packet. For the 1 Mbit/s data rate, and for the header at all rates, the data encoder implements DBPSK coding by differentially encoding the serial data from the scrambler and driving the I and Q output channels together. For the 2 Mbit/s data rate, the data encoder implements DQPSK coding. This coding uses the differential coding of dibits (bit pairs). The serial data is formed into dibits in the differential encoder as described above. One of the bits from the differential encoder goes to the I channel and the other to the Q Channel. The I and Q channels are then both multiplied with the 11-bit Barker word at the spreading chip rate. This forms QPSK modulation at the symbol rate with BPSK modulation at the chip rate.
5.1.1.1. CCK Modulation
The spreading code length is 8 and is based on complementary codes. The chip rate is 11 Mchip/s and the symbol duration is exactly 8 complex chips long. The following formula is used to derive the CCK code words that are used for spreading at both 5.5 and 11 Mbit/s. c = {e j (1+ 2+ 3+ 4) , e j ( 1+ 3+ 4) , e j ( 1+ 2+ 4) , e j ( 1+ 4) , e j ( 1+ 2+ 3) , e j ( 1+ 3) , e j (1+ 2) , e j 1} (LSB to MSB), where c is the code word. The terms: 1, 2, 3, 4 are defined below for 5.5 Mbit/s and 11 Mbit/s [1], [14]. This formula creates 8 complex chips (LSB to MSB) that are transmitted LSB first. The coding is a form of the generalized Hadamard transform coding where 1 is
41
added to all code chips, 2 is added to all odd code chips, 3 is added to all odd pairs of code chips and 4 is added to all odd quads of code chips. For the 5.5 Mbit/s and 11 Mbit/s rates, the data is partitioned into nibbles (4 bits) or octets (8 bits), and one of the 16 possible CCK spread sequences is selected for 5.5 Mbit/s and one of the 256 possible CCK spread sequences is selected for 11 Mbit/s.
Scrambler
Symbol Modulator
Barker Spreading
Pulse Shaping To TX
CCK Spreading Processor Board Interface SFD detection Descramble CCK Correlator Barker Matched Filter
Acquisition/ AGC
correlation peak is used to establish the correct symbol timing. Peak energy detection is used to determine the AGC scaling factor to use prior to the channel matched filter. Peak detected energy for M samples =
M 1 i =0
+ Q2
(5.1)
Successive correlation peaks observed at 11*over (11 times) sample ratio, give the timing offset estimation and is used to correct symbol timing errors. Once signal acquisition is declared and symbol timing is achieved, the demodulator starts demodulating the data. The frame detection flag is set by comparing the demodulated header data to the Signal Frame Delimiter (SFD) for either long preamble mode or the short preamble mode.
43
The rest of the data demodulator and decoder use implementations developed by Richard Lynch at the University of New Hampshires InterOperability Laboratory. A description of the spread spectrum demodulator is provided below. To demodulate the data at 1 Mbit/s and 2 Mbit/s, the Barker matched filter method described in section 5.2.2 is used. The received signal samples are correlated with the known Barker sequence and the data bits are determined based on the highest correlation peak. For the 5.5 Mbit/s and the 11 Mbit/s modes, the data is demodulated using a CCK correlator. Once the header is demodulated, the phase of the last bit of the header is captured and used as a phase reference for the high rate data portion of the packet. The data is demodulated and decoded per symbol. In 5.5 Mbit/s mode, a bank of correlators compare the received data to the 16 possible CCK chip sequences. The biggest correlation of the above comparison demodulates 2 bits and 2 more bits are demodulated by the QPSK demodulation. For 11 Mbit/s mode, the input samples are processed in a bank of 256 correlators, and the biggest correlation peak is used to
44
demodulate 6 bits, while the last two bits are demodulated by the QPSK differential demodulation of the output.
45
X ( k )e
j 2
k n N
(6.1)
where X(k) is the complex modulation vector and n [-N/2, N/2]. At the receiver side, the data is recovered by performing an FFT on the received signal, i.e. 1 X (k ) = N where k [-N/2, N/2].
N / 2 1 n = N / 2
x (n)e
j 2
n k N
(6.2)
46
where
max indicates
47
Figure 5.2 shows the OFDM training sequence structure (PLCP preamble), where t1 to t10 denote short training symbols and T1 and T2 are long training symbols. The PLCP preamble is followed by SIGNAL and DATA. GI indicates the Guard Interval for the OFDM data symbols. The first 10 symbols are used for AGC, frame detection, and coarse frequency offset estimation. Each of the symbols is 16 samples long, or equivalently 0.8 s. The long training symbols are used for fine frequency offset estimation and also for channel coefficient estimation.
1. Generate the short training sequence and long training sequence 2. Generate the SIGNAL field bits, coding and interleaving SIGNAL field bits, and map them into frequency domain, insert pilots and transform into time domain. 3. Prepend the SERVICE field, and add pad bits to the octet stream and form the DATA. 4. Scramble and encode the DATA using convolutional encoding and puncture to get higher rates, and map them into complex BPSK, QPSK, QAM16 or QAM64 symbols followed by pilot insertion. 5. Transform from frequency domain to time domain and add a cyclic prefix and concatenate the OFDM symbols into a single time-domain signal.
49
and the architectures developed in this thesis for enabling an SDR based implementation of the OFDM baseband and the algorithms used.
6.4.1. Synchronization
Synchronization in OFDM has been the topic of much research in recent years and numerous algorithms have been developed in this area. Synchronization in OFDM contains the following stages: 1. Frame detection 2. AGC adjustment 3. Coarse frequency offset estimation and correction 4. Symbol timing error estimation and correction 5. Fine frequency estimation and correction
6.4.1.1. Frame Detection
The preamble structure explained in section 5.4 is used for frame start synchronization. The periodicity of the short symbols, combined with energy detection, is used for frame detection and for frame start synchronization. The energy of the received signal is calculated and the signal is correlated with the known short symbols. The problem with correlation approach is that multi-path components are combined before decision is made. A new algorithm is used instead where two correlations are employed. In the second correlation step, the received signal is also correlated with itself with a delay of one short symbol. The autocorrelation of these known symbols creates peaks. Detection can be declared if the peaks exceed a threshold. The self-correlation of the
50
signal creates peaks at the length of the short symbols. The last peak is used as the position from where start of the next symbol is determined
6.4.1.2. AGC
AGC is run in parallel with the frame detection. The first 6 symbols of the 10 short symbols are used for AGC. A simple AGC is used where measurements over a window of samples is taken and the delay-multiply values are compared to the power measured. After the signal detection, these auto-correlations are averaged to get average signal power and the gain is determined to scale the input power to 1. This gain is used to scale all the samples afterwards within the same packet frame.
6.4.1.3. Frequency Offset Estimation
Due to the RF carrier frequency difference of the transmitter and receiver, each signal sample at time t contains an unknown phase factor e j 2fc t ,where f c is the unknown carrier frequency offset. This unknown phase factor must be estimated and compensated for each sample before the FFT operation at the receiver; otherwise, the orthogonality between sub-carriers is lost. The two long symbols following the ten short symbols in the OFDM training structure are used for frequency offset estimation. The two long symbols are essentially the same sequence repeated twice. The samples corresponding to the long symbols are correlated to estimate the frequency offset. To estimate = f c t , the frequency offset, the correlation sum J is calculated. Where,
J=
y(l ) y *(l + N ) = e
l =0
N 1
j 2
y(l )
l =0
N 1
(6.3)
J * 1 arg 2 J
(6.4)
A coarse frequency offset estimation using short symbols is also performed. Correlating the adjacent short symbols we have, K= so that
N / 4 1
l =0
y (l ) y *(l + N / 4) = e j 2 / 4
N / 4 1
l =0
y (l )
(6.5)
K * 1 arg 2 K
(6.6)
since /4 is less than 1 even at 100 ppm, there would be no ambiguity in determining the value of . Since the metric K involves only 1/4th of the total samples in the symbol, the result is less accurate. In the proposed algorithm, the short preamble is used to obtain a rough estimation by accumulation and averaging over 16 samples and the long preamble is used to get the fine estimation by integrating across 64 samples. The combination of both frequency offset methods increases the robustness of the frequency estimation algorithm [7], [8]. Combining both coarse and fine estimations into the frequency offset estimation where we would get,
4 K * 1 J * arg arg + K J 2 2
(6.7)
Where indicates truncation towards zero. Research done by Jian Li, and Guoqing Liu [missing citation] has shown that using only the short OFDM symbols results in a 2.7 dB accuracy with respect to using only the long OFDM symbols. It has also been shown that
52
using both long and short OFDM symbols yields a performance gain of 1.9 dB compared to using only short symbols.
ADC
AGC
GI Removal
FFT
Remove Pilots
Channel Correction
Phase Tracker
Timing Acquisition
Channel Estimator
Descrambler
Viterbi Decoder
Depuncturer
Deinterleaver
Demodulator
Due to the difference in the sampling clocks between the transmitter and receiver, each signal sample is offset from its correct sampling time by a small amount which linearly increases with the index of the sample. The long preamble is followed by the SIGNAL symbol which contains information required to decode the data portion of the packet. Correct timing on the last symbol of the long training sequence is required, so the channel coefficient estimate from the header can be directly applied to demodulation of the PLCP header without any phase correction. Once the initial timing is acquired, the timing offset estimation is done in parallel with the frequency offset estimation. Due to the excellent autocorrelation properties of the preamble, the two correlation methods used in the detection process result in periodic strong peaks that enable the detection of the symbol boundary precisely. The timing offset estimation is verified by observing another 53
detection which would be 16 samples later. By using these periodic peaks, false alarm probability is reduced. By over-sampling the receive signal by a factor of two, better timing synchronization is achieved.
6.4.1.5. Channel Estimation
Channel estimation uses the two long symbols from the preamble, which are also used in frequency offset estimation. Once the frame start is detected, frequency offset is estimated, and the signal samples are offset compensated. They are transformed into the frequency domain by FFT. Each received sub-carrier is represented in frequency domain as, Y(k) = H(k) X(k) + N(k) and the channel H can be estimated as, (6.8)
H (k ) =
Y (k ) X (k )
(6.9)
The frequency offset correction is not perfect and the residual error tends to accumulate over samples. This effect is minor for short packets, but for large packets the accumulated error would be enough to cause orthogonality loss among sub-carriers. Four pilot sub-carriers provided in the data symbol are used for estimating the phase factor due to the residual error and for correction of the frequency offset.
54
6.5. Summary
The OFDM system design is explained. The original OFDM system designed by Kevin is enhanced to operate on a symbol by symbol basis. The various algorithms used in detection, synchronization, timing and frequency offset estimations are described. Enhanced algorithms are proposed to make the design more efficient. The enhanced system simulates a real hardware level implementation and provides easy migration from system-level design to hardware prototyping.
55
between intermediate operations leads to a much faster implementation at the cost of a few clock cycles of latency depending on the structure of the operation. In order to meet the clock speed requirement of 60 MHz for the real-time OFDM baseband implementation, it is absolutely necessary to use a highly pipelined architecture as much as possible. To achieve this goal, most of the HDL blocks designed in the system were replaced with Xilinx core library math blocks, which are pipelined and optimized for the Virtex FPGA.
57
step in the design was to develop synthesizable VHDL code to implement the design in the Xilinx Virtex II FPGA. Individual blocks of the transmitter were hand coded in VHDL and the entire system was synthesized. It was also verified that the design meets the timing requirements during functional simulation. The next step in the design flow would be to place and route the design on the target hardware using vendor specific tools and accomplish the working prototype of the design. Due to the non-availability of the Nalltech Extreme DSP platform, this functionality was not verified.
using the 64-point inverse FFT. A cyclic prefix is added to the generated time domain signal and a time domain OFDM symbol is generated for transmission..
Floating point modeling and simulation in MATLAB and SIMULINK
Comparison of Results
Comparison of Results
Data Retrieval
VHDL RTL synthesis, FPGA Place and route (Xilinx ISE 5.1)
Logic Analyzer
59
An important step is algorithm simplification. The trade-off between performance and computational complexity is crucial. Minor changes in performance often lead to a large reduction in hardware. It was found that using ROM-based look-up tables for preambles, encoder, interleaver and modulation mapping cuts the complexity of computing at the expense of increasing the memory requirement. The hardware size is dominated by memory and the FFT processor. A 16-bit fixed point format is used for the system in modulation mapping and FFT, as the Xilinx FFT core requires 16-bit fixed point data format. Using smaller bit widths would improve the performance of the receiver architecture, as it reduces the computational complexity in frame
part of the hardware resources used by the Data Encoder block. Encoding of the PLCP header and data are done in parallel. The two interleavers (general block interleaver and matrix interleaver) are combined into one block, and the interleaver indices for various block sizes are pre- computed and are stored in the dual-port RAM. The Pilot generator block also runs in parallel with the Signal Encoder and Data Encoder and generates the pilot signals to be inserted into the OFDM symbols. The Arrange Data For FFT block arranges the data into the correct format for the FFT.
Processor Board Interface Data Rate Baseband Controller (Timing/control)
Scrambler
Interleave
Encode/ Puncture
Pilot Generation
Serial to Parallel Frame Buffer/ Arrange Data for FFT FFT Control Unit
Modulaiton Mapping
FFT/IFFT
61
The significant processing stage in the OFDM transmitter is the FFT block. This block is used both on the transmit and receive sides by inverting the FWD/REV signal in the FFT core. The FFT control block and the baseband controller block together generate the writing of data samples into the FFT data input memory and reading data out of the data output memory. The output of the FFT block is written to a FIFO, which contains the time domain OFDM signal to be transmitted. Two dual-port RAMs are used at each stage, one for real data and the other for imaginary data. The FIFOs are implemented with dual-port RAMs. The FFT controller also generates addressing for the output FIFOs. All the signal processing is done on the positive edge of the clock and registered outputs are used at each stage. The FFT core developed for the transmitter employs a CooleyTukey radix-2 decimation-in-time (DIT) FFT to compute the DFT of a complex sequence. In general, this algorithm requires the calculation of columns or ranks of radix-2 butterflies. These radix-2 butterflies are sometimes are referred to as dragonflies. For N=64 there are 4 dragonfly ranks, with each rank comprising 8 dragonflies. The input data for the core is a vector of 64 complex samples. The real and imaginary components of each sample are represented as 16-bit 2s complement numbers. The data is stored in on-chip dual-port block RAM. The complex output samples are defined with 16 bits of precision for each of the real and imaginary components. The FFT processor required 256 clock cycles to generate a block of output. The FFT processor in the final prototype was replaced with a Xilinx FFT core, which provides a netlist which is optimized specifically for the Virtex II FPGA and is guaranteed to run at 60 MHz. The Xilinx 64-point transform engine employs a Cooley-Tukey radix-4 decimation-in-frequency (DIF) FFT to compute the DFT of a complex sequence and provides output every 192 clock cycles,
62
which equates to 3.2 s at 60 MHz. In terms of optimizing the baseband design for speed, pipelining using core blocks can contribute to a 3 to 4 times improvement in speed. Using a parallel processing architecture and the optimized Xilinx core library blocks directly contributed to reaching the goal of prototyping a real-time implementation of the IEEE802.11a OFDM baseband transmitter. All the blocks of the transmitter side can be used in the receiver. The receiver would require additional Synchronization and Frequency offset correction blocks in addition to the transmitter blocks. Multiple transmitter chains in conjunction with a spatial processor can be used to model a MIMO OFDM baseband processor.
63
was achieved and after further optimization the design achieved 160 MHz clock rates in functional verification. Full working prototype implementation on the FPGA (on the hardware platform) was not completed due to non-availability of required hardware.
64
65
This thesis also develops novel algorithms for detection and synchronization, capable of robust operation in the unknown noise environments encountered in wireless communication systems. Using double correlation detection and synchronization offers performance gain compared to simple correlation detection and reducing the false alarm probability at the cost of increase in hardware. Simulations of these algorithms demonstrated a performance gain which can be traded with implementation complexity.
66
LIST OF REFERENCES
[1] IEEE Std 802.11_R2003, IEEE Wireless LAN Edition A compilation based on IEEE Std 802.11TM -1999 (Reaffirmed 2003) and its amendments,2003. [2] <http://www.mathworks.com> [3] <http://www.xilinx.com> [4] Richard Van Nee, Ramjee Prasad, OFDM for Wireless Multimedia Communications (Artech House Publishers Boston, 2000) [5] John G. Proakis Digital Communications, McGraw Hill, New York, third edition 1995. [6] J. Cyrus Sy, Implementation of an OFDM transceiver using an SDR platform Wireless Design Magazine, July 2002 [7] P.H. Moose, A Technique for Orthogonal Frequency Division Multiplexing Frequency
Offset Correction, IEEE Transactions on Communications, vol. 42, no. 10, October 1994 [8] T. M. Schmidl and D. Cox, Robust frequency and timing synchronization for OFDM, IEEE Transactions on Communications, vol. 45, pp. 16131621, December 1997.
[9] Anthony Jamin et.all, Software Radio Implementability of Wireless LANs , The 12th Thyrrhenian International Workshop on Digital Communications 2001. [10] Yun Chiu et.all, OFDM receiver design Project Report, Berkeley University 2000 [10] H. Minn, M. Zeng, and V. K. Bhargava, On timing offset estimation for OFDM systems, IEEE Communications Letters, 2000. [11] < http://www.andraka.com/ >
67
[12] Yalamanchili, Sudhakar, VHDL Starters Guide (Prentice Hall, 1st edition September 1997) [13] Chang, K. C. (Kou-Chuan), Digital systems design with VHDL and synthesis: an integrated approach (Los Alamitos, Calif. : IEEE Computer Society, c1999). [14] < www.agilent.com > [15] Mark Webster, Carl Andren Code word description for the Harris-Lucent Updated Compromise Proposal for TGb, IEEE 802.11-98/246-r2. [16] S. Rajagopal, Baseband Architecture Design for Future Wireless Base-Station Receivers, Masters Thesis, Rice University, (May 2000). < http://cmc.rice.edu/docs/docs/Raj2000May2BasebandAr.pdf > [17] K. Chadha, A Reconfigurable Decoder Architecture for Wireless LAN and Cellular Systems, Masters Thesis, Rice University, (April 2001). < http://cmc.rice.edu/docs/docs/Cha2001Apr2AReconfigu.pdf> [18] B. Jones, Rapid Prototyping of Wireless Communications Systems, Masters Thesis, Houston, TX, Rice University (May 2002). < http://cmc.rice.edu/docs/docs/Jon2002May2RapidProto.pdf >
68
APPENDIX A
69
Packet detection 0.2 0.15 0.1 0.05 0 0 100 200 300 400 500 600 IQ data PacketDetect
0.1
20
40
60
80
100
120
140
160
180
0.2
20
40
60
80
100
120
140
160
180
Power in dBm for Short and Long training symbols 0 -20 -40 -60
0.2
0.4
0.6
0.8
1.2
1.4
1.6 x 10
-5
70
5 z H n i t e s f f o . q e r F
x 10
-5
0 x 10
4
50 samples ----->
100
150
5 z H n i t e s f f o . q e r F
-5
10
20
30 40 samples ----->
50
60
70
71
-5
-4
-3
-2
-1
0 In phase
1 e r u t a r d a u Q
12
14
15
13
-1
10
11
-2
-3 -4 -3
0 -2 -1
2 0 In phase 1
3 2
1 17 symbols 3 4
72
EVM per symbol. starting at service field 0.2 0.18 0.16 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0
14
16
18
73
80 00 18 0d 2e d9 e0
00 40 4e 6d 64 25 73
00 00 00 30 b4 01 22 c8 00 40 96 46 84 03 00 40 96 46 84 03 30 96 46 84 03 ff 83 df 17 32 09 4e d1 e7 cd 8a 91 c6 d5 c4 c4 40 55 86 f4 dc 8a 15 a7 ec 92 df 93 53 30 18 ca 34 bf a2 c7 59 67 8f d8 2d 7d 54 0a 57 97 70 39 d2 7a ea 24 33 85 ed 9a 1d e1 ff 07 12 9d a3 cf 9b 15 23 8d ab 89 88 80 42 30 9c ab 0d e9 b9 14 2b bf 26 a6 60 31 94 69 7f 45 8e b2 cf 1f 74 1a db b0 5a fa a8 14 af a4 f5 d4 48 67 0b db 34 3b c3 bc 4e 89 aa
00 21 ba be 4f 2e
calculatedCRC32= 0x bc 4e 89 aa , receivedCRC32= 0x bc 4e 89 aa Decoded Frame (hex) FC: 80 00 (Beacon --------) Dur: 00 00 Addr1: 00 30 b4 01 22 c8 Addr2: 00 40 96 46 84 03 Addr3: 00 40 96 46 84 03 SeqCnt: 30 00 seq: 03 frg: 00 rcvdCRC32: bc 4e 89 aa calcCRC32: bc 4e 89 aa
74
75