Computer Organization and Architecture Module 1


Module 1

Functional units of a Computer: A computer consists of five functionally independent main parts: input, memory, arithmetic and logic, output, and control units, as shown below.

Input Unit
Computers accept coded information through input units. The most common input device is the keyboard. Whenever a key is pressed, the corresponding letter or digit is automatically translated into its binary code and transmitted to the processor. Many other kinds of input devices are used for human-computer interaction: touchpad, mouse, joystick, trackball, etc. These are often used as graphic input devices in conjunction with displays. Microphones can be used to capture audio input, which is then sampled and converted into digital codes for storage and processing. Cameras can be used to capture video input. Digital communication facilities, such as the Internet, can also provide input to a computer from other computers and database servers.

Output Unit
The output unit is the counterpart of the input unit. It sends processed results to the outside world. A familiar example of an output device is a printer. Some units, such as graphic displays, provide both an output function, showing text and graphics, and an input function, through touch-screen capability. The dual role of such units is the reason for using the single name input/output (I/O) unit in many cases.

Memory Unit
The function of the memory unit is to store programs and data. There are two classes of storage, called primary and secondary.

Primary Memory
Primary memory, also called main memory, is a fast memory that operates at electronic speeds. Programs must be stored in this memory while they are being executed. The memory consists of a large number of semiconductor storage cells, each capable of storing one bit of information. These cells are rarely read or written individually.


Instead, they are handled in groups of fixed size called words. The memory is organized so that one word can be stored or retrieved in one basic operation. The number of bits in each word is referred to as the word length of the computer, typically 16, 32, or 64 bits. To provide easy access to any word in the memory, a distinct address is associated with each word location. Addresses are consecutive numbers, starting from 0, that identify successive locations. A memory in which any location can be accessed in a short and fixed amount of time after specifying its address is called a random-access memory (RAM). The time required to access one word is called the memory access time. This time is independent of the location of the word being accessed. It typically ranges from a few nanoseconds (ns) to about 100 ns for current RAM units.

Cache Memory
As an adjunct to the main memory, a smaller, faster RAM unit, called a cache, is used to hold sections of a program that are currently being executed, along with any associated data. The cache is tightly coupled with the processor and is usually contained on the same integrated-circuit chip. The purpose of the cache is to facilitate high instruction execution rates.

Secondary Storage
Although primary memory is essential, it tends to be expensive and does not retain information when power is turned off. Thus additional, less expensive, permanent secondary storage is used when large amounts of data and many programs have to be stored, particularly for information that is accessed infrequently. Access times for secondary storage are longer than for primary memory. A wide selection of secondary storage devices is available, including magnetic disks, optical disks (DVD and CD), and flash memory devices.

Arithmetic and Logic Unit
Most computer operations are executed in the arithmetic and logic unit (ALU) of the processor. Any arithmetic or logic operation, such as addition, subtraction, multiplication, division, or comparison of numbers, is initiated by bringing the required operands into the processor, where the operation is performed by the ALU. For example, if two numbers located in the memory are to be added, they are brought into the processor, and the addition is carried out by the ALU. The sum may then be stored in the memory or retained in the processor for immediate use.

Control Unit
The memory, arithmetic and logic, and I/O units store and process information and perform input and output operations. The operation of these units must be coordinated by the control unit, which is effectively the nerve center that sends control signals to other units and senses their states. Control circuits are responsible for generating the timing signals that govern the transfers and determine when a given action is to take place. Data transfers between the processor and the memory are also managed by the control unit through timing signals. It is convenient to think of the control unit as a well-defined, physically separate unit that interacts with the other parts of the computer.
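This word-oriented organization is easy to model in software. The following is a minimal sketch with hypothetical parameters (16 locations of 32 bits each), illustrating that any location is read or written in one basic operation once its address is given:

    # Minimal model of a word-addressable random-access memory.
    # Hypothetical parameters: 16 locations, 32-bit word length.
    WORD_LENGTH = 32
    NUM_WORDS = 16
    memory = [0] * NUM_WORDS            # word addresses 0 .. NUM_WORDS-1

    def write_word(address, value):
        # Any location is reachable in one basic operation, given its address.
        memory[address] = value & ((1 << WORD_LENGTH) - 1)

    def read_word(address):
        return memory[address]

    write_word(5, 0xDEADBEEF)
    print(hex(read_word(5)))            # -> 0xdeadbeef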


Von-Neumann architecture
- A single bus is used for both data transfers and instruction fetches, so data transfers and instruction fetches must be scheduled; they cannot be performed at the same time
- Only one memory, which holds both data and programs
- Needs a minimum of two cycles to complete an instruction fetch
- Poor memory throughput
- Pipelining is not possible
- Allows easy storing and loading of programs between main memory and processor
- Older architecture, used for economical and general-purpose processors

Harvard architecture
- Separate data and instruction buses, allowing transfers to be performed simultaneously on both buses
- Possible to have two separate memory systems for data and program
- Allows two simultaneous memory fetch operations
- Greater memory bandwidth (throughput)
- Easier to pipeline instructions
- Most DSP processors use this architecture


Steps involved in Execution of an instruction
The CPU executes the binary representation of an instruction, called machine code. The Program Counter (PC) determines which instruction is executed, and it is updated accordingly to point to the next instruction to be run. Consider the connection between the processor and main memory:

There are five major steps to execute a single instruction:
Step 1: Fetch instruction. Fetching an instruction involves the following steps:
- CPU places an address in the Memory Address Register (MAR) from the PC.
- CPU places the MAR contents on the address bus.
- CPU sends a read signal to memory.
- The memory unit puts the instruction on the data bus.
- Memory sends an acknowledge signal to the CPU.
- CPU loads the instruction into the Memory Data Register (MDR).
- CPU transfers the instruction from the MDR to the Instruction Register (IR).
- CPU sends an acknowledge signal to memory that fetching the instruction is over.
Step 2: Decode instruction and fetch operands. The CPU decodes the instruction in the IR and, if needed, fetches operands. Fetching an operand from memory involves the following steps:
- CPU places the address of the operand in the MAR.
- CPU places the MAR contents on the address bus.
- CPU sends a read signal to memory.
- The memory unit puts the operand on the data bus.
- Memory sends an acknowledge signal to the CPU.


- CPU loads the operand into the Memory Data Register (MDR).
- CPU moves the operand to the ALU.
If one operand is already in the CPU, it is moved to the ALU for the operation.
Step 3: Perform operation. The CPU performs the operation encoded in the instruction. If it is an ALU operation, the CPU performs it using the operands.
Step 4: Store the result. If the result is to be stored in memory, the CPU follows these steps:
- CPU places the address of the result in the MAR and on the address bus.
- CPU places the result in the MDR.
- CPU sends a write signal to memory.
- Memory stores the result at the address on the address bus.
- Memory sends an acknowledge signal to the CPU.
Step 5: Update the Program Counter (PC). The CPU updates the PC (incrementing it by 1, 2 or 4) to point to the next instruction to be executed.
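The five steps can be traced in a toy software model of the processor-memory connection. This is only a sketch, using a hypothetical one-operand 'ADD addr' instruction format rather than any real ISA; the MAR, MDR, IR, PC and an accumulator appear as plain variables:

    # Toy model of the five-step instruction cycle. Hypothetical ISA: each
    # instruction ('ADD', addr) adds the memory operand at addr to an accumulator.
    memory = {0: ('ADD', 100), 1: ('ADD', 101), 100: 7, 101: 6}
    pc, acc = 0, 0

    while pc in memory and isinstance(memory[pc], tuple):
        mar = pc                    # Step 1: fetch - PC -> MAR -> address bus
        mdr = memory[mar]           #         memory puts the instruction on the data bus
        ir = mdr                    #         MDR -> IR
        opcode, operand_addr = ir   # Step 2: decode, then fetch the operand
        mar = operand_addr
        mdr = memory[mar]
        if opcode == 'ADD':         # Step 3: perform the ALU operation
            result = acc + mdr
        acc = result                # Step 4: store the result (kept in the CPU here)
        pc = pc + 1                 # Step 5: update the PC

    print(acc)                      # -> 13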


Instruction formats:
- Define the layout of the bits in an instruction
- Include the opcode and implicit or explicit operand(s)
- Usually there are several instruction formats in an instruction set
- A variety of instruction formats have been designed; they vary widely from processor to processor
The length of an instruction depends on: memory size, memory organization, bus structure, CPU complexity and CPU speed. With a large instruction set, small programs can be created; a small instruction set results in large programs. Fixed-length instructions, of the same size as or a multiple of the bus width, give fast fetches; variable-length instructions may need extra bus cycles.

Survey of addressing modes:
For a given instruction set architecture (ISA), addressing modes define how machine language instructions identify the operand (or operands) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere. Different types of addresses involve tradeoffs between instruction length, addressing flexibility, and complexity of address calculation. Common addressing modes are:
1. Direct Addressing
2. Immediate Addressing
3. Indirect Addressing
4. Register Addressing
5. Register Indirect Addressing
6. Displacement Addressing
7. Implied (stack) Addressing

Direct Addressing
The instruction tells where the value can be found, but the value itself is out in memory. The address field contains the address of the operand:
Effective address (EA) = address field (A)
In a high-level language, direct addressing is frequently used for things like global variables. Advantages: a single memory reference to access data, and more flexible than immediate addressing.

Immediate Addressing
The instruction itself contains the operand to be used, located in the address field of the instruction; the operand is stored in memory immediately after the instruction opcode. This is similar to using a constant in a high-level language.

Indirect Addressing
The memory cell pointed to by the address field contains the address of (a pointer to) the operand: EA = (A).


Register Addressing
Operands are registers. There is a limited number of registers. Very fast execution, but a very limited address space. Multiple registers can help performance. Requires good assembly programming or compiler writing.


Register-Indirect Addressing
Similar to memory-indirect addressing: the operand is in the memory cell pointed to by the contents of register R, i.e. EA = (R). Large address space.

Displacement Addressing
Combines register-indirect addressing and direct addressing:
EA = A + (R)
The instruction holds two values: A, a base value, and R, a register that holds the displacement (or vice versa).
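A small sketch of the effective-address (EA) calculations for several of these modes; the register file and memory contents here are hypothetical, with A the instruction's address field and R a register number:

    registers = [0, 0, 0x2000, 0, 0, 0, 0, 0]
    memory = {0x1000: 0x3000}

    def ea_direct(A):            return A                  # EA = A
    def ea_indirect(A):          return memory[A]          # EA = (A)
    def ea_register_indirect(R): return registers[R]       # EA = (R)
    def ea_displacement(A, R):   return A + registers[R]   # EA = A + (R)

    print(hex(ea_direct(0x1000)))           # -> 0x1000
    print(hex(ea_indirect(0x1000)))         # -> 0x3000 (pointer fetched from memory)
    print(hex(ea_register_indirect(2)))     # -> 0x2000
    print(hex(ea_displacement(0x10, 2)))    # -> 0x2010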

Types of Displacement Addressing
1. Relative Addressing
2. Base-register Addressing
3. Indexing


Relative Addressing
EA = A + (PC)
The address field A is treated as a 2s-complement integer to allow backward references; the operand is fetched from PC + A. This can be very efficient because of locality of reference and cache usage, but in large programs code and data may be widely separated in memory.

Base-Register Addressing
A holds a displacement, and R holds a pointer to the base address. R may be explicit or implicit.

Indexed Addressing
A = base, R = displacement, EA = A + R.

Stack Addressing
The operand is implicitly on top of the stack.

Performance measurement and benchmarking:
The performance of computers can be differentiated by response time (the time between the start and the completion of an event, also referred to as execution time) as well as throughput (the total amount of work done in a given time). To compare the relative performance of two computers, X and Y, the phrase "X is faster than Y" is used to mean that the response time or execution time is lower on X than on Y for the given task. In particular, "X is n times faster than Y" means
Execution time of Y / Execution time of X = n
Since execution time is the reciprocal of performance,
n = Execution time of Y / Execution time of X = Performance of X / Performance of Y
This shows that the performance of X is n times higher than that of Y. The execution time can be defined in different ways: clock time, response time, or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities, operating system overhead, etc.

Processor Performance Equation 1
All computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed as
CPU time = CPU clock cycles for a program x Clock cycle time


This is referred to as performance equation 1.
Problem: A program runs in 20 seconds on computer A, which has an 8 GHz clock. Another computer, B, is to run this program in 12 seconds, but B requires 2.4 times as many clock cycles as A for this program. What is the clock rate of B?
Solution:
CPU time of A = 20 seconds; clock rate of A = 8 GHz
CPU clock cycles for the program on A = CPU time of A x clock rate of A = 20 x 8 x 10^9 = 160 x 10^9
CPU time of B = 12 seconds
CPU clock cycles for the program on B = 2.4 x 160 x 10^9
Clock rate of B = CPU clock cycles for the program on B / CPU time of B = (2.4 x 160 x 10^9) / 12 = 32 GHz

Processor Performance Equation 2
In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed, the instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI). Designers sometimes also use instructions per clock (IPC), which is the inverse of CPI. CPI is computed as
CPI = CPU clock cycles for a program / Instruction count

This allows us to use CPI in the execution time formula:
CPU time = Instruction count x CPI x Clock cycle time

This is called performance equation 2. Expanding it into the units of measurement shows how the pieces fit together:
(Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle) = Seconds / Program = CPU time

This formula shows that processor performance depends on three characteristics: clock cycle time (or clock rate), clock cycles per instruction (CPI), and instruction count.
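Both performance equations take only a few lines of code to apply. The sketch below reproduces the clock-rate problem worked above (computer A: 20 s at 8 GHz; computer B: 2.4 times as many cycles, finishing in 12 s):

    def cpu_time(instruction_count, cpi, clock_rate_hz):
        # Performance equation 2: CPU time = IC x CPI x clock cycle time
        return instruction_count * cpi / clock_rate_hz

    cycles_A = 20 * 8e9                    # cycles = CPU time x clock rate
    clock_rate_B = 2.4 * cycles_A / 12     # rate = cycles / CPU time
    print(clock_rate_B / 1e9, 'GHz')       # -> 32.0 GHz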


CPU time is equally dependent on these three characteristics: a 10% improvement in any one of them leads to a 10% improvement in CPU time. It is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:
- Clock cycle time: hardware technology and organization
- CPI: organization and instruction set architecture
- Instruction count: instruction set architecture and compiler technology
Problem:
                      C1         C2
Clock                 3 GHz      3 GHz
CPI                   1.5        1
No. of instructions   8 billion  9 billion
Which computer is faster, and by how much? Which has the higher MIPS rate?
Solution:
CPU time of C1 = (8 x 10^9 x 1.5) / (3 x 10^9) = 4 s
CPU time of C2 = (9 x 10^9 x 1) / (3 x 10^9) = 3 s
CPU time of C2 / CPU time of C1 = 3 / 4 = 0.75, so computer C2 is 1.33 times faster than C1.
MIPS rate of C1 = 8 x 10^9 / (4 x 10^6) = 2000
MIPS rate of C2 = 9 x 10^9 / (3 x 10^6) = 3000
The MIPS rate of C2 is higher than that of C1.

CISC and RISC:
CISC (Complex Instruction Set Computer)
CISC aims at chips that are easy to program and that make efficient use of memory. Since the earliest machines were programmed in assembly language and memory was slow and expensive, the CISC philosophy made sense, and it was commonly implemented in such large computers as the PDP-11 and the DECsystem 10 and 20 machines. Most common microprocessor designs, such as the Intel 80x86 and Motorola 68K series, followed the CISC philosophy. CISC was developed to make compiler development simpler: it shifts most of the burden of generating machine instructions to the processor. For example, instead of having to make a compiler write long sequences of machine instructions to calculate a square root, a CISC processor would have a built-in ability to do this. Some common characteristics of CISC instructions are:
1. Two-operand format, where instructions have a source and a destination.
2. Register-to-register, register-to-memory, and memory-to-register commands.


3. Multiple addressing modes for memory, including specialized modes for indexing through arrays.
4. Variable-length instructions, where the length often varies according to the addressing mode.
5. Instructions which require multiple clock cycles to execute. E.g., the Pentium is considered a modern CISC processor.
CISC hardware architectures have several characteristics in common:
1. Complex instruction-decoding logic, driven by the need for a single instruction to support multiple addressing modes.
2. A small number of general-purpose registers. This is the direct result of having instructions which can operate directly on memory, and of the limited amount of chip space not dedicated to instruction decoding, execution, and microcode storage.
3. Several special-purpose registers. Many CISC designs set aside special registers for the stack pointer, interrupt handling, and so on. This can simplify the hardware design somewhat, at the expense of making the instruction set more complex.
4. A "condition code" register which is set as a side effect of most instructions. This register reflects whether the result of the last operation is less than, equal to, or greater than zero, and records whether certain error conditions occur.
Advantages of CISC processors:
1. Microprogramming is as easy as assembly language to implement, and much less expensive than hardwiring a control unit.
2. The ease of microcoding new instructions allowed designers to make CISC machines upwardly compatible: a new computer could run the same programs as earlier computers, because the new computer would contain a superset of the instructions of the earlier computers.
3. As each instruction became more capable, fewer instructions could be used to implement a given task. This made more efficient use of the relatively slow main memory.
4. Because microprogram instruction sets can be written to match the constructs of high-level languages, the compiler does not have to be as complicated.
Disadvantages of CISC processors:
1. Earlier generations of a processor family generally were contained as a subset in every new version, so the instruction set and chip hardware become more complex with each generation of computers.
2. So that as many instructions as possible could be stored in memory with the least possible wasted space, individual instructions could be of almost any length; this means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine.
3. Many specialized instructions aren't used frequently enough to justify their existence; approximately 20% of the available instructions are used in a typical program.
4. CISC instructions typically set the condition codes as a side effect of the instruction. Not only does setting the condition codes take time, but programmers have to remember to examine the condition-code bits before a subsequent instruction changes them.
RISC (Reduced Instruction Set Computer)
A type of microprocessor architecture that utilizes a small, highly optimized set of instructions, rather than the more specialized set of instructions often found in other types of architectures.


Some characteristics of most RISC processors are:
1. One-cycle execution time: RISC processors have a CPI (clock cycles per instruction) of one cycle, due to the optimization of each instruction on the CPU.
2. Pipelining: a technique that allows the simultaneous execution of parts, or stages, of instructions, to process instructions more efficiently.
3. A large number of registers: the RISC design philosophy generally incorporates a larger number of registers, to reduce the amount of interaction with memory.
In comparison with CISC, RISC processors have the following features:
1. Reduced instruction set.
2. Less complex, simple instructions.
3. Hardwired control unit and machine instructions.
4. Few addressing schemes for memory operands, with only two basic instructions, LOAD and STORE.
5. Many symmetric registers, organized into a register file.
Comparison between CISC and RISC processors:
CISC                                                              | RISC
Emphasis on hardware                                              | Emphasis on software
Includes multi-clock complex instructions                         | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register-to-register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per second                          | Low cycles per second, large code sizes
Transistors used for storing complex instructions                 | Spends more transistors on memory registers

Computer Arithmetic: Addition/Subtraction
Addition and subtraction of two numbers are basic operations at the machine-instruction level in all computers. These operations, as well as other arithmetic and logic operations, are implemented in the arithmetic and logic unit (ALU) of the processor.
Addition and Subtraction of Signed Numbers
The truth table for the sum and carry-out functions for adding equally weighted bits x_i and y_i of two numbers X and Y is shown below.


The logic expressions for the sum and carry functions are
s_i = x_i ⊕ y_i ⊕ c_i
c_i+1 = x_i y_i + x_i c_i + y_i c_i
An example of the addition of the 4-bit unsigned numbers 7 and 6 is shown below.

The logic expression for the sum bit s_i can be implemented with a 3-input XOR gate, which is part of the logic required for a single stage of binary addition. The carry-out function, c_i+1, is implemented with an AND-OR circuit. The resulting circuit is called a full adder (FA) and is shown below.

A cascaded connection of n full-adder blocks can be used to add two n-bit numbers, as shown below. Since the carries must propagate, or ripple, through this cascade, the configuration is also called a ripple-carry adder.
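A bit-level sketch of the full adder and the ripple-carry cascade described above; operands are given as bit lists, least-significant bit first:

    def full_adder(x, y, c):
        s = x ^ y ^ c                          # sum: 3-input XOR
        c_out = (x & y) | (x & c) | (y & c)    # carry-out: AND-OR network
        return s, c_out

    def ripple_carry_add(x_bits, y_bits, c0=0):
        carry, sum_bits = c0, []
        for x, y in zip(x_bits, y_bits):       # carries ripple stage by stage
            s, carry = full_adder(x, y, carry)
            sum_bits.append(s)
        return sum_bits, carry                 # the final carry is c_n

    # 7 + 6 with 4-bit operands (LSB first): 0111 + 0110 = 1101 (13)
    s, cn = ripple_carry_add([1, 1, 1, 0], [0, 1, 1, 0])
    print(s[::-1], cn)                         # -> [1, 1, 0, 1] 0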


Addition/Subtraction Logic Unit:
The n-bit adder can be used to add 2s-complement numbers X and Y, where the MSBs x_n-1 and y_n-1 are the sign bits. The carry-out bit c_n is not part of the answer. Arithmetic overflow occurs when the signs of the two operands are the same, but the sign of the result is different. A circuit to detect overflow can be added to the n-bit adder. It can also be shown that overflow occurs when the carry bits c_n and c_n-1 are different; therefore, a simpler circuit for detecting overflow can be obtained by implementing the expression c_n ⊕ c_n-1 with an XOR gate. To perform the subtraction X - Y on 2s-complement numbers X and Y, we form the 2s-complement of Y and add it to X. The logic circuit shown below can be used to perform either addition or subtraction based on the value applied to the Add/Sub input control line.

The Add/Sub control line is set to 0 for addition, applying Y unchanged to one of the adder inputs along with a carry-in signal, c_0, of 0. When the Add/Sub control line is set to 1, the Y number is 1s-complemented by the XOR gates and c_0 is set to 1 to complete the 2s-complementation of Y. The 2s-complement of a negative number is formed in exactly the same manner as for a positive number. An XOR gate can be added to the above circuit to detect the overflow condition c_n ⊕ c_n-1.

(Work out more 4-bit examples: a) 7+2 b) 5+2 c) 7+(-2) d) (-7)+2 e) (-7)+(-2) f) 7-2)
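A sketch that works through these exercises in 4-bit 2s-complement arithmetic, using the c_n ⊕ c_n-1 overflow test described above:

    N = 4

    def to_signed(v):                          # interpret a 4-bit pattern as signed
        return v - (1 << N) if v & (1 << (N - 1)) else v

    def add_sub(x, y, sub=False):
        y = ((~y + 1) if sub else y) & 0xF     # Add/Sub control: 2s-complement Y
        total = x + y
        c_n = (total >> N) & 1                 # carry out of the sign position
        c_n_1 = ((x & 0x7) + (y & 0x7)) >> (N - 1) & 1   # carry into the sign position
        return to_signed(total & 0xF), c_n ^ c_n_1       # (value, overflow flag)

    for a, b, sub in [(7, 2, False), (5, 2, False), (7, -2, False),
                      (-7, 2, False), (-7, -2, False), (7, 2, True)]:
        print(a, '-' if sub else '+', b, '=', add_sub(a & 0xF, b & 0xF, sub))

Cases a) and e) report an overflow flag of 1, since +9 and -9 lie outside the 4-bit range -8 to +7.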


Fast Adders:
An n-bit ripple-carry adder used for addition/subtraction may have too much delay in developing its outputs, s_0 through s_n-1 and c_n. Carry bit c_n-1 is available after 2(n-1) gate delays, and sum bit s_n-1 needs one further XOR gate delay. The final carry-out, c_n, is available after 2n gate delays. Therefore, if a ripple-carry adder is used to implement addition/subtraction, all sum bits are available in 2n gate delays, including the delay through the XOR gates on the Y input. Using the implementation c_n ⊕ c_n-1 for overflow, the indicator is available after 2n + 2 gate delays. Two approaches can be taken to reduce delay in adders: one is to use the fastest possible electronic technology; the other is to use a logic-gate network called a carry-lookahead adder.
Carry-Lookahead Addition
The logic expressions for s_i (sum) and c_i+1 (carry-out) of stage i are:
s_i = x_i ⊕ y_i ⊕ c_i
c_i+1 = x_i y_i + x_i c_i + y_i c_i
Factoring the second equation gives
c_i+1 = x_i y_i + (x_i + y_i) c_i
which can also be written as
c_i+1 = G_i + P_i c_i
where
G_i = x_i y_i and P_i = x_i + y_i
The expressions G_i and P_i are called the generate and propagate functions for stage i. If G_i equals 1, then c_i+1 = 1, independent of the input carry c_i; this occurs when both x_i and y_i are 1. The propagate function P_i means that an input carry will produce an output carry when either x_i is 1 or y_i is 1. All G_i and P_i functions can be formed independently and in parallel, in one logic-gate delay, after the X and Y operands are applied to the inputs of an n-bit adder. Each bit stage contains an AND gate to form G_i, an OR gate to form P_i, and a three-input XOR gate to form s_i. A simpler circuit can be derived by observing that P_i can be realized as P_i = x_i ⊕ y_i, which differs from P_i = x_i + y_i only when x_i = y_i = 1; but in that case G_i = 1, so it does not matter whether P_i is 0 or 1. Then, using a cascade of two 2-input XOR gates to realize the 3-input XOR function for s_i, the basic B cell used in each bit stage is as shown below.


Expanding c_i in terms of i-1 subscripted variables and substituting into the c_i+1 expression gives the final expression for any carry variable:
c_i+1 = G_i + P_i G_i-1 + P_i P_i-1 G_i-2 + ... + P_i P_i-1 ... P_1 G_0 + P_i P_i-1 ... P_0 c_0
Consider the design of a 4-bit adder. The carries can be implemented as
c_1 = G_0 + P_0 c_0
c_2 = G_1 + P_1 G_0 + P_1 P_0 c_0
c_3 = G_2 + P_2 G_1 + P_2 P_1 G_0 + P_2 P_1 P_0 c_0
c_4 = G_3 + P_3 G_2 + P_3 P_2 G_1 + P_3 P_2 P_1 G_0 + P_3 P_2 P_1 P_0 c_0

The complete 4-bit adder is shown below.

The carries are produced in the block labeled carry-lookahead logic. An adder implemented in this form is called a carry-lookahead adder. The delay through the adder is 3 gate delays for all carry bits and 4 gate delays for all sum bits. In comparison, a 4-bit ripple-carry adder requires 7 gate delays for s_3 and 8 gate delays for c_4.
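A sketch of the 4-bit carry-lookahead equations above, with all four carries formed directly from the G_i and P_i functions rather than by rippling:

    def cla_4bit(x, y, c0=0):
        xb = [(x >> i) & 1 for i in range(4)]
        yb = [(y >> i) & 1 for i in range(4)]
        G = [xb[i] & yb[i] for i in range(4)]      # G_i = x_i y_i
        P = [xb[i] | yb[i] for i in range(4)]      # P_i = x_i + y_i
        c1 = G[0] | (P[0] & c0)
        c2 = G[1] | (P[1] & G[0]) | (P[1] & P[0] & c0)
        c3 = G[2] | (P[2] & G[1]) | (P[2] & P[1] & G[0]) | (P[2] & P[1] & P[0] & c0)
        c4 = (G[3] | (P[3] & G[2]) | (P[3] & P[2] & G[1])
              | (P[3] & P[2] & P[1] & G[0]) | (P[3] & P[2] & P[1] & P[0] & c0))
        c = [c0, c1, c2, c3]
        s = [xb[i] ^ yb[i] ^ c[i] for i in range(4)]
        return sum(bit << i for i, bit in enumerate(s)), c4

    print(cla_4bit(7, 6))    # -> (13, 0)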

Add/Subtraction implementation: The elements of the hardware used to implement addition/subtraction are shown below.


Registers A and B hold the operands, registers As and Bs hold the signs of the operands, AVF holds the overflow bit when A and B are added, and E holds the carry during the addition of A and B; register A also holds the result. Subtraction is done by adding A to the 2s complement of B using the parallel adder, when the mode-control bit M is 1. Addition is done through the parallel adder with the complementer disabled, i.e. with M set to 0.

Flow chart for Add/Subtraction:


Multiplication of Unsigned Numbers
The usual algorithm for multiplying integers by hand is illustrated below.

The product of two unsigned n-digit numbers can be accommodated in 2n digits, so the product of the two 4-bit numbers in the above example is accommodated in 8 bits. In the binary system, multiplication of the multiplicand by one bit of the multiplier is easy: if the multiplier bit is 1, the multiplicand is entered in the appropriate shifted position; if the multiplier bit is 0, then 0s are entered, as in the third row of the example. The product is computed one bit at a time by adding the bit columns from right to left and propagating carry values between columns.
Array Multiplier
Binary multiplication of unsigned operands can be implemented in a combinational, 2-dimensional logic array, as shown below for the 4-bit operand case.
Illustration

Block diagram

The main component in each cell is a full adder or half adder. The AND gate in each cell determines whether a multiplicand bit, m_j, is added to the incoming partial-product bit, based on the value of the multiplier bit, q_i. The circuit requires 4 x 4 = 16 AND gates, 4 half adders and 8 full adders. If t_ad is the delay through an adder and t_g the longest AND-gate delay, then the maximum delay for completion of the multiplication is 8 t_ad + t_g. In general, an n x n bit array multiplier requires n^2 AND gates, n half adders and n(n-2) full adders, with a delay of 2n t_ad + t_g. The worst-case signal propagation delay path is from the upper right corner of the array to the high-order product bit output at the bottom left corner of the array. This critical path consists of the staircase pattern that includes the two cells at the right end of each row, followed by all the cells in the bottom row (HA10 FA11 HA20 FA21 FA22 FA23 FA32 FA33).


Sequential Circuit Multiplier: The combinational array multiplier uses a large number of logic gates for multiplying numbers of practical size, such as 32 bit or 64 bit. Multiplication of two n-bit numbers can also be performed in a sequential circuit that uses a single n-bit adder. The block diagram of the hardware arrangement for sequential multiplication is shown below.

This circuit performs multiplication using a single n-bit adder. Registers A and Q are shift registers, and together they hold the partial product PP_i, while multiplier bit q_i generates the signal Add/Noadd. The Add/Noadd signal causes the multiplexer MUX to select 0 when q_i = 0, or to select the multiplicand M when q_i = 1, to be added to PP_i to generate PP(i+1). The product is computed in n cycles. The partial product grows in length by one bit per cycle from the initial vector, PP0, of n 0s in register A. The carry-out from the adder is stored in flip-flop C, shown at the left end of register A.
Operation: Initially, the multiplier is loaded into register Q, the multiplicand into register M, and C and A are cleared to 0. At the end of each cycle, C, A, and Q are shifted right one bit position to allow for growth of the partial product PP_i as the multiplier is shifted out of register Q. Due to the shifting, multiplier bit q_i appears at the LSB position of Q to generate the Add/Noadd signal at the correct time, starting with q_0 during the first cycle, q_1 during the second cycle, and so on. After being used, the multiplier bits are discarded by the right-shift operation.


The carry-out from the adder is the leftmost bit of PP(i+1), and it must be held in the C flip-flop to be shifted right with the contents of A and Q. After n cycles, the high-order half of the product is held in register A and the low-order half is in register Q. A multiplication example using this hardware is shown below.

Flow chart of sequential multiplier:
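The same register-level algorithm can be modeled in software. A minimal sketch for unsigned n-bit operands, with C, A and Q playing the roles of the hardware registers described above:

    def sequential_multiply(m, q, n=4):
        C, A, Q = 0, 0, q                      # multiplicand m stays in register M
        for _ in range(n):                     # the product is computed in n cycles
            if Q & 1:                          # q_i selects Add (M) or Noadd (0)
                A = A + m
                C = (A >> n) & 1               # carry-out goes to flip-flop C
                A = A & ((1 << n) - 1)
            # shift C, A, Q right one bit position
            Q = ((A & 1) << (n - 1)) | (Q >> 1)
            A = (C << (n - 1)) | (A >> 1)
            C = 0
        return (A << n) | Q                    # high half in A, low half in Q

    print(sequential_multiply(13, 11))         # -> 143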


Multiplication of Signed Numbers:
Different methods exist; two common ones are the sign-extension method and the Booth algorithm.
1) Sign-extension method: Consider the case of a positive multiplier and a negative multiplicand. When we add a negative multiplicand to a partial product, we must extend the sign-bit value of the multiplicand to the left as far as the product will extend. An example in which a 5-bit signed operand, -13, is the multiplicand, multiplied by +11 to get the 10-bit product -143, is shown below.

The sequential multiplier hardware can be used for negative multiplicands if it is augmented to provide sign extension of the partial products PP_i. This method does not work for a negative multiplier. In such a case, a straightforward solution is to form the 2s-complement of both the multiplier and the multiplicand and proceed as in the case of a positive multiplier; this technique is applicable when both operands are negative.

2) Booth algorithm: This generates a 2n-bit product and works uniformly for both positive and negative 2s-complement n-bit operands. Consider a multiplication operation in which the multiplier is positive and has a single block of 1s, for example 0011110. To derive the product, we could add four appropriately shifted versions of the multiplicand, as in the standard procedure. In the Booth algorithm, -1 times the shifted multiplicand is selected when moving from 0 to 1, and +1 times the shifted multiplicand is selected when moving from 1 to 0, as the multiplier is scanned from right to left.


An illustration of the normal and Booth algorithms for an example is shown below.

Another example of recoding a multiplier is shown below.

The case in which the least significant bit (LSB) of the multiplier is 1 is handled by assuming that an implied 0 lies to its right. The Booth algorithm can also be used directly for negative multipliers, as shown below.
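A sketch of Booth recoding as described above: the multiplier is scanned from right to left, with an implied 0 to the right of the LSB, and each (current bit, previous bit) pair selects -1, +1 or 0:

    def booth_recode(bits):                    # bits: MSB-first, e.g. [0,0,1,1,1,1,0]
        recoded, prev = [], 0                  # implied 0 to the right of the LSB
        for b in reversed(bits):               # scan from LSB to MSB
            recoded.append({(1, 0): -1, (0, 1): +1}.get((b, prev), 0))
            prev = b
        return recoded[::-1]                   # one signed digit per bit position

    print(booth_recode([0, 0, 1, 1, 1, 1, 0]))   # 0011110 -> [0, 1, 0, 0, 0, -1, 0]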

Booth multiplier recoding table:

The transformation 011...110 → +1 0 0...0 -1 0 is called skipping over 1s. Here only a few versions of the shifted multiplicand (the summands) need to be added to generate the product, thus speeding up the multiplication operation. A 16-bit worst-case multiplier, an ordinary multiplier, and a good multiplier are shown below.

Advantages of the Booth algorithm:
1) It handles both positive and negative multipliers uniformly.
2) It achieves some efficiency in the number of additions required when the multiplier has a few large blocks of 1s.
Fast Multiplication:
There are two main techniques for speeding up the multiplication operation:
1) Bit-pair recoding of multipliers: this technique guarantees that the maximum number of summands (versions of the multiplicand) that must be added is n/2 for n-bit operands.
2) Carry-save addition of summands: this technique adds the summands in parallel.
Bit-Pair Recoding of Multipliers:
This is derived directly from the Booth algorithm, by grouping the Booth-recoded multiplier digits in pairs. For example, the pair (+1 -1) is equivalent to the pair (0 +1): instead of adding -1 times the multiplicand M at shift position i and +1 times M at position i+1, the same result is obtained by adding +1 times M at position i.


Bit-pair multiplier recoding table:

The multiplication operation using the normal Booth algorithm and the bit-pair algorithm is shown below.

Normal:

Bit-pair:

Carry-Save Addition of Summands:
Consider the 4 x 4 multiplication array shown below, in which all half adders are replaced by full adders with the third input set to 0.

Carry-save addition (CSA) can be used to speed up the process by introducing carry into the next row, at the correct weighted positions, as shown below.


This frees up an input to each of three full adders in the first row. These inputs can be used to introduce the third summand bits m_2q_2, m_1q_2 and m_0q_2. Then two inputs of each of three full adders in the second row are fed by the sum and carry outputs from the first row, and the third input is used to introduce the bits m_2q_3, m_1q_3 and m_0q_3 of the fourth summand. The high-order bits m_3q_2 and m_3q_3 of the third and fourth summands are introduced into the remaining free full-adder inputs at the left end of the second and third rows. The saved carry bits and the sum bits from the second row are then added in the third row, which is a ripple-carry adder, to produce the final product bits. The delay through this technique is somewhat less than the delay through the ripple-carry array, because the sum and carry vector outputs from each row are produced in parallel in one full-adder delay.
Integer Division:
Consider the following examples of decimal division and binary division of the same values.

Consider the decimal version first. The 2 in the quotient is determined by the following reasoning: first, we try to divide 13 into 2, and it does not work. Next, we try to divide 13 into 27. We go through the trial exercise of multiplying 13 by 2 to get 26 and, observing that 27 - 26 = 1 is less than 13, we enter 2 in the quotient and perform the required subtraction. The next digit of the dividend, 4, is brought down, and we finish by deciding that 13 goes into 14 once, leaving a remainder of 1. Binary division is done in a similar way. Hardware for division is shown below.


An n-bit positive divisor is loaded into register M and an n-bit positive dividend into register Q at the start of the operation; register A is initially set to 0. After the division is complete, the n-bit quotient is in register Q and the remainder is in register A. The required subtractions are performed using 2s-complement arithmetic. The extra bit position at the left end of both A and M accommodates the sign bit during subtractions. There are two types of algorithm for division using this hardware: 1) restoring division and 2) non-restoring division.
1) Restoring division: The operation is as follows: position the divisor appropriately with respect to the dividend and perform a subtraction. If the remainder is zero or positive, a quotient bit of 1 is determined, the remainder is extended by another bit of the dividend, the divisor is repositioned, and another subtraction is performed. If the remainder is negative, a quotient bit of 0 is determined, the dividend is restored by adding back the divisor, and the divisor is repositioned for another subtraction. The steps of the algorithm (performed n times) are:
1. Shift A and Q left one bit position.
2. Subtract M from A, and place the answer back in A.
3. If the sign of A is 1, set q_0 to 0 and add M back to A (restoring A); otherwise, set q_0 to 1.
A 4-bit example processed by restoring division is shown below:
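A software sketch of the restoring algorithm for n-bit positive operands, with A, Q and M modeling the registers above:

    def restoring_divide(dividend, divisor, n=4):
        A, Q, M = 0, dividend, divisor
        for _ in range(n):
            A = ((A << 1) | (Q >> (n - 1))) & ((1 << (n + 1)) - 1)   # shift A,Q left
            Q = (Q << 1) & ((1 << n) - 1)
            A = (A - M) & ((1 << (n + 1)) - 1)    # subtract M (2s-complement)
            if A & (1 << n):                      # sign of A is 1: q_0 = 0, restore A
                A = (A + M) & ((1 << (n + 1)) - 1)
            else:                                 # sign of A is 0: q_0 = 1
                Q |= 1
        return Q, A                               # (quotient, remainder)

    print(restoring_divide(8, 3))    # -> (2, 2)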


2) Non-restoring division: This algorithm improves the speed of division by avoiding the need to restore A after an unsuccessful subtraction (a subtraction is said to be unsuccessful if the result is negative). Consider the sequence of operations that takes place after the subtraction in the preceding algorithm. If A is positive, we shift left and subtract M; that is, we perform 2A - M. If A is negative, we restore it by performing A + M, then shift it left and subtract M; this is equivalent to performing 2A + M. The q_0 bit is appropriately set to 0 or 1 after the correct operation has been performed. The algorithm has two stages: Stage 1 produces the quotient and requires n cycles; Stage 2 restores the remainder and is optional: if the sign of A is 0 at the end of Stage 1, no restore is needed; otherwise a restore operation is required to obtain a valid remainder. The steps are as follows:
Stage 1: Do the following two steps n times:
1. If the sign of A is 0, shift A and Q left one bit position and subtract M from A; otherwise, shift A and Q left and add M to A.
2. Now, if the sign of A is 0, set q_0 to 1; otherwise, set q_0 to 0.
Stage 2: If the sign of A is 1, add M to A.
A division example executed by the non-restoring division algorithm is shown below:


Signed division can be performed by preprocessing the signs of the operands: the sign of the result is obtained by XORing the signs of the operands, and the magnitude division is performed by the above algorithms.
Floating-Point Numbers and Operations
Floating-Point Numbers: Floating-point numbers are used to represent real numbers. In the 2s-complement system, the signed value F represented by the n-bit binary fraction
B = b_0 . b_-1 b_-2 ... b_-(n-1)
is given by
F(B) = -b_0 x 2^0 + b_-1 x 2^-1 + b_-2 x 2^-2 + ... + b_-(n-1) x 2^-(n-1)
where the range of F is
-1 <= F <= 1 - 2^-(n-1)
Using this number representation, a computer must be able to represent real numbers and operate on them. To represent real numbers, the position of the binary point is variable and is automatically adjusted as computation proceeds; the binary point is said to float, and the numbers are called floating-point numbers. For example, in decimal scientific notation, numbers may be written as 6.0247 x 10^23, 3.7291 x 10^27, 1.0341 x 10^2, 7.3000 x 10^14, and so on. These numbers are given to 5 significant digits of precision. The scale factors 10^23, 10^27, 10^2 and 10^14 indicate the actual position of the decimal point with respect to the significant digits. The same approach is used to represent binary floating-point numbers in a computer, except that 2 is the base of the scale factor; because the base is fixed, it does not need to be stored explicitly. A binary floating-point number can be represented by:
1) a sign for the number
2) some significant bits, commonly called the mantissa
3) a signed scale factor (exponent) for an implied base of 2.
IEEE standard for floating-point numbers:
The general form of floating-point numbers is related to fractional decimal numbers; a usual form is
+/- X_1 . X_2 X_3 X_4 X_5 X_6 X_7 x 10^(+/- Y_1 Y_2)
where the X_i and Y_i are decimal digits, giving 7 significant digits and an exponent range of +/-99. IEEE (Institute of Electrical and Electronics Engineers) Standard 754 uses this style to represent floating-point numbers. There are 32-bit (single precision) and 64-bit (double precision) floating-point representations in the IEEE Standard. The 32-bit (single precision) format is shown below.

The first bit, S, represents the sign of the number.


The next field, E', is an 8-bit exponent in excess-127 format: E' = E + 127, where E is the actual exponent value. E' lies in the range 0 to 255, but E is restricted to the range -126 to +127. The excess-x representation enables efficient comparison of the relative sizes of two floating-point numbers. The last 23 bits represent the mantissa. A number can be represented in unnormalized or normalized form. In unnormalized form the number is represented as it stands; in normalized form the most significant bit is always equal to 1, and this bit is not explicitly stored. An example of a single-precision floating-point number is shown below.

S = 0
E' = E + 127 = -87 + 127 = 40 = 0010 1000
M = (1).0010 100...
A 32-bit single-precision number has a scale factor range of 2^-126 to 2^+127, i.e. approximately 10^+/-38, and the 23-bit mantissa represents about 7 decimal digits. The 64-bit (double precision) IEEE format is shown below.

S is the sign bit. The exponent field E' is 11 bits, with E' = E + 1023, where E has the range -1022 to +1023, providing a scale factor range of 2^-1022 to 2^+1023 (approximately 10^+/-308); the 52-bit mantissa field (53 bits of precision including the implied 1) provides about 16 decimal digits. A floating-point number in unnormalized form is normalized by shifting the mantissa right or left one bit position at a time, each shift being compensated by an increase or decrease of 1 in the exponent, respectively. If the exponent becomes less than -126, underflow has occurred; if the exponent becomes greater than +127, overflow has occurred. Consider the unnormalized value +0.0010110 x 2^9; its normalized version is +1.0110... x 2^6. The single-precision representations of both values are:
+0.0010110 x 2^9
S = 0
E' = 9 + 127 = 136 = 1000 1000
M = (0).0010 110...


+1.0110... x 2^6
S = 0
E' = 6 + 127 = 133 = 1000 0101
M = (1).0110...0


Problem 1: Find the decimal value of the following IEEE encoding: 1 01111100 1100 0000 0000 0000 0000 000 (normalized)
Answer: S = 1, so the number is -ve.
Exponent E = E' - 127 = (01111100)_2 - 127 = 124 - 127 = -3
Mantissa = (1).11000000000000000000000 = 2^0 + 2^-1 + 2^-2 = 1 + 0.5 + 0.25 = 1.75
The decimal number is -1.75 x 2^-3 = -0.21875
Problem 2: Represent -0.4375 in IEEE single-precision format.
Answer: -0.4375 = -0.0111_2 x 2^0
Normalized version: -0.4375 = -1.11_2 x 2^-2
S = 1
E' = E + 127 = -2 + 127 = 125 = 0111 1101_2
IEEE format: 1 01111101 11000000000000000000000
Problem 3: Represent 0.5 in IEEE single-precision format.
Answer: 0.5 = 0.1_2 x 2^0
Normalized version: 0.5 = 1.0000_2 x 2^-1
E' = E + 127 = -1 + 127 = 126 = 0111 1110_2
IEEE format: 0 01111110 00000000000000000000000
Special values:
1. Sign = 0, E' = 00000000 and M = 00...0: the value is +0
2. Sign = 1, E' = 00000000 and M = 00...0: the value is -0
3. Sign = 0, E' = 11111111 and M = 00...0: the value is +∞
4. Sign = 1, E' = 11111111 and M = 00...0: the value is -∞
5. E' = 11111111 and M != 00...0: the value is Not a Number (NaN), as results from 0.0/0.0
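The worked examples can be checked against Python's struct module, which packs a float into the standard IEEE single-precision bit pattern:

    import struct

    def ieee_bits(x):
        (w,) = struct.unpack('>I', struct.pack('>f', x))   # 32-bit encoding of x
        return f"{w >> 31} {(w >> 23) & 0xFF:08b} {w & 0x7FFFFF:023b}"

    print(ieee_bits(-0.4375))    # -> 1 01111101 11000000000000000000000
    print(ieee_bits(0.5))        # -> 0 01111110 00000000000000000000000
    print(ieee_bits(-0.21875))   # -> 1 01111100 11000000000000000000000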


Arithmetic operations on floating-point numbers:
Add/Subtract rule:
1. Choose the number with the smaller exponent and shift its mantissa right a number of steps equal to the difference in exponents.
2. Set the exponent of the result equal to the larger exponent.
3. Perform addition/subtraction on the mantissas and determine the sign of the result.
4. Normalize the result, if necessary.
Multiplication rule:
1. Add the exponents and subtract 127.
2. Multiply the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary.
Division rule:
1. Subtract the exponents and add 127.
2. Divide the mantissas and determine the sign of the result.
3. Normalize the resulting value, if necessary.

Sign = 0 XOR 1 = 1 0.510 x -0.437510 in IEEE format (Normalized form): 1 01111100 11000000000000000000000 S=1 E = E - 127 = 01111100 127 = -3 M = (1).110..0 = 20+2-1+2- = 1.7510 Product = -1.75 x 2-3 = -0.2187510 Dept. of ECE, VKCET Page 33


2. 1.110 x 10^10 x 9.200 x 10^-5
Add the exponents: 10 + (-5) = 5
Multiply the mantissas: 1.110 x 9.200 = 10.212, giving 10.212 x 10^5
Normalized result: 1.0212 x 10^6
2. Showing all the steps, add the numbers 0.5_10 and -0.4375_10 in binary. (Nov 2012)
Solution:
0.5_10 = +0.1_2 x 2^0; IEEE format: 0 01111111 10000...0
-0.4375_10 = -0.0111_2 x 2^0; IEEE format: 1 01111111 01110...0
The second operand's sign is -ve, so the operation can be performed by taking the 2s complement of the second operand's mantissa.
Adding the mantissas: 1000 0...0 + 1001 0...0 (2s complement of the second operand) = 0001 0...0 (discard the carry)
IEEE format of the result (unnormalized form): 0 01111111 0001 0...0
S = 0, so +ve; E = E' - 127 = 127 - 127 = 0; M = (0).0001 0...0
Normalized form: E' = 127 - 4 = 123 = 0111 1011; M = (1).0000...0 = 1.0; result: 0 01111011 00000...
Result = +1.0 x 2^(123-127) = +1.0 x 2^-4 = +1.0 x 0.0625 = +0.0625_10
3. Perform -0.25_10 / 0.5_10
Solution:
-0.25_10 = -0.01_2 x 2^0 = -1.0_2 x 2^-2
S = 1; E' = -2 + 127 = 125 = 0111 1101; M = (1).0000...0
1 01111101 00000...
0.5_10 = +0.1_2 x 2^0 = +1.0_2 x 2^-1
S = 0; E' = -1 + 127 = 126 = 0111 1110; M = (1).0000...0
0 01111110 00000...
Subtract the exponents and add 127:
0111 1101 + 1000 0010 (2s complement of the second exponent) = 1111 1111
1111 1111 + 0111 1111 (127) = 0111 1110 (discard the carry), so E' = 0111 1110

S = 1 XOR 0 = 1
Divide the mantissas: 1.0000...0 / 1.0000...0 = 1.0000...0
Result in IEEE format: 1 01111110 00000...
E = 01111110 - 127 = 126 - 127 = -1
Result = -1.0 x 2^-1 = -0.5_10


MIPS Architecture
Introduction: MIPS stands for Microprocessor without Interlocked Pipeline Stages. It is an implementation of a RISC Instruction Set Architecture (ISA), developed by MIPS Technologies (formerly MIPS Computer Systems, Inc.). The early MIPS architectures were 32-bit; 64-bit versions are now also in use. Multiple revisions of the MIPS instruction set exist, including MIPS I, MIPS II, MIPS III, MIPS IV, MIPS V, MIPS32 and MIPS64. The current revisions are MIPS32 (for 32-bit implementations) and MIPS64 (for 64-bit implementations). MIPS implementations were originally targeted at computer-like applications such as workstations and servers; MIPS implementations have since had significant success in embedded applications. The MIPS32 and MIPS64 Architectures provide a substantial cost/performance advantage over microprocessor implementations based on traditional architectures. The MIPS32 Architecture is based on the MIPS II ISA, adding selected instructions from MIPS III, MIPS IV and MIPS V to improve the efficiency of generated code and of data movement. The MIPS64 Architecture is based on the MIPS V ISA and is backward compatible with the MIPS32 Architecture. MIPS emphasizes:
- A simple load-store instruction set
- Design for pipelining efficiency, including a fixed instruction encoding
- Efficiency as a compiler target
Instruction Set Architecture (ISA): the critical interface between hardware and software. An ISA includes the following:
- Instructions and instruction formats
- Data types, encodings, and representations
- Programmable storage: registers and memory
- Addressing modes: to address instructions and data
- Handling of exceptional conditions (like division by zero)


Instructions: Instructions are the language of the machine. The MIPS instruction set architecture:
- Based on RISC, with a relatively simple design
- Similar to RISC architectures developed in the mid-1980s and '90s
- Very popular, used in many products of Silicon Graphics, ATI, Cisco, Sony, etc.
- Comes next in sales after Intel IA-32 processors; almost 100 million MIPS processors were sold in 2002 (and the number was increasing)
- An alternative design: Intel IA-32 (CISC)
MIPS Architecture:

MIPS General-Purpose Registers (Execution & Integer Unit):
- 32 General-Purpose Registers (GPRs)
- The assembler uses dollar notation to name registers: $0 is register 0, $1 is register 1, ..., and $31 is register 31
- All registers are 32 bits wide in MIPS32
- Register $0 is always zero; any value written to $0 is discarded
- Three special-purpose registers (SPRs):
PC - Program Counter register

HI - higher result register for multiply and divide
LO - lower result register for multiply and divide
MIPS Register Conventions: the assembler can refer to registers by name or by number.


MIPS FPU Registers: Coprocessor 1, for real-number manipulation. The MIPS32 Architecture defines the following FPU registers:
- 32 floating-point registers (FPRs), each 32 bits wide
- Five FPU control registers, used to identify and control the FPU
Trap and Memory Unit: Coprocessor 0. CP0 is incorporated on the MIPS CPU chip, and it provides the functions necessary to support an operating system: exception handling, memory management, scheduling, and control of critical resources. Coprocessor 0 (CP0) registers:

MIPS Data Types: MIPS operates on:
- 32-bit (unsigned or 2s-complement) integers
- 32-bit (single-precision floating-point) real numbers
- 64-bit (double-precision floating-point) real numbers
32-bit words, half-words and bytes can be loaded into GPRs. After loading into GPRs, bytes and half-words are either zero-extended or sign-extended to fill the 32 bits.


Only 32-bit units can be loaded into FPRs, and 32-bit real numbers are stored in even-numbered FPRs. 64-bit real numbers are stored in two consecutive FPRs, starting with an even-numbered register.

MIPS Addressing Modes:
1. Register addressing: a source or destination operand is specified as the content of one of the registers $0-$31.
2. Immediate addressing: a numeric value embedded in the instruction is the actual operand.
3. PC-relative addressing: a data or instruction memory location is specified as an offset relative to the incremented PC.
4. Base addressing: a data or instruction memory location is specified as a signed offset from a register.
5. Register-direct addressing: the value of the effective address is in a register.
6. Pseudo-direct addressing: the memory address is (mostly) embedded in the instruction.
Register addressing
Operands are in registers. Example: add $s3,$s4,$s5

Immediate Addressing
The operand is encoded in the instruction itself.
16-bit immediate value range: -2^15 = -32,768 to +2^15 - 1 = +32,767


PC-relative addressing
The value in the immediate field is interpreted as an offset from the next instruction (PC + 4 of the current instruction). Example: beq $s0,$s3,Label

The immediate value is a 16-bit 2s-complement number, so the range of values is -2^15 up to 2^15 - 1.
Detail of PC-relative addressing:

Base addressing
Also called indirect addressing: the address of the operand is the sum of the immediate and the value in a register.


Register-direct addressing
The effective address is in a register. This is a special case of base addressing where the offset is 0. It is used with the jump-register instructions.


Pseudo-direct addressing
The 26 bits of the address field are embedded as the immediate and used as the instruction's word offset within the current 256 MB (64 MWord) region defined by the PC.
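A sketch of the target-address arithmetic for PC-relative and pseudo-direct addressing; the addresses used in the example calls are hypothetical:

    def branch_target(pc, offset16):
        # beq/bne: target = (PC + 4) + (sign-extended 16-bit offset << 2)
        if offset16 & 0x8000:                  # sign-extend the 16-bit immediate
            offset16 -= 1 << 16
        return (pc + 4) + (offset16 << 2)

    def jump_target(pc, index26):
        # j/jal: upper 4 bits of PC + 4, then the 26-bit word index shifted left 2
        return ((pc + 4) & 0xF0000000) | (index26 << 2)

    print(hex(branch_target(0x00400000, 3)))        # -> 0x400010
    print(hex(jump_target(0x00400000, 0x0100000)))  # -> 0x400000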

MIPS Instruction Set Formats
Basic formats:
1. R-Type (Register)
2. I-Type (Immediate)
3. J-Type (Jump)
Floating-point formats:
1. FR-Type (FP operations)
2. FI-Type (FP branches)


Register-Type Format:

op (opcode): basic operation of the instruction; it also determines the format. op = 0 for all R-type instructions.
rs: first source operand
rt: second source operand
rd: destination
shamt: shift amount
funct: function variant (e.g. add and sub have the same op, but add has funct = 32 and sub has funct = 34)
Immediate-Type format:

op (opcode): basic operation of the instruction (e.g. lw opcode = 35, addi opcode = 8, beq opcode = 4)
rs/base: register containing the source operand, or the base register
rt: destination register for addi or loads; source register for stores; second operand for beq
offset/immediate: immediate field in computation instructions; byte-address offset (w.r.t. rs) in load/store instructions; word-address offset (w.r.t. PC) in branch instructions; always sign-extended to a 32-bit value
Jump-Type format:

op (opcode): basic operation of the instruction (e.g. j opcode = 2)
target: target word address of the instruction to jump to
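As an illustration of the field layout, the sketch below packs the R-type fields of add $t0, $s1, $s2 into a 32-bit word ($t0 is register 8, $s1 is 17, $s2 is 18; add has op = 0 and funct = 32):

    def encode_r(op, rs, rt, rd, shamt, funct):
        return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

    word = encode_r(0, 17, 18, 8, 0, 32)    # add $t0, $s1, $s2
    print(hex(word))                        # -> 0x2324020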

Examples: R-Type:

I-Type:


J-Type:

FR-Type:

FI-Type:

fmt: format; .S for single precision and .D for double precision.
MIPS Instruction Set
Different classes of MIPS instructions:
1. Load and Store Instructions (data transfer instructions): these instructions load immediate values and move data between memory and general-purpose registers.
2. Computational Instructions (arithmetic and logical instructions): these instructions perform arithmetic and logical operations on values in registers.
3. Jump and Branch Instructions (control instructions): these instructions change program control flow.
4. Coprocessor Interface Instructions: these instructions provide standard interfaces to the coprocessors.
Load and Store Instructions:

Load/Store memory concept:


Computational Instructions:
1) Arithmetic Instructions:

Example 1: Consider the C statement containing the five variables f, g, h, i and j:
f = (g + h) - (i + j);
What is the MIPS assembly code produced by the C compiler? (Assume the variables g, h, i, j and f are in registers $s1, $s2, $s3, $s4 and $s5 respectively.)
MIPS assembly code:
add $t0, $s1, $s2    # t0 = s1 + s2, i.e. (g + h)
add $t1, $s3, $s4    # t1 = s3 + s4, i.e. (i + j)
sub $s5, $t0, $t1    # s5 = t0 - t1, i.e. f = (g + h) - (i + j)
Example 2: Compile the statement A[12] = h + A[8], where h is in $s2 and the base address of the word array A is in $s3.

MIPS assembly code:
lw $t0, 32($s3)      # t0 gets A[8]; 32 is the offset value 8 x 4, and s3 has the base
add $t0, $s2, $t0    # t0 = h + A[8]
sw $t0, 48($s3)      # A[12] = h + A[8]; 48 is the offset value 12 x 4
Other instructions:


2) Logical Instructions:

Jump and Branch Instructions:

Example 1: What is the MIPS code for the following C statements? Assume f, g, h, i and j are in registers $s1 through $s5.
if (i == j) f = g + h; else f = g - h;
MIPS code:
      bne $s4, $s5, Else   # go to Else if i != j
      add $s1, $s2, $s3    # f = g + h (skipped if i != j)
      j Exit               # go to Exit
Else: sub $s1, $s2, $s3    # f = g - h (skipped if i == j)
Exit:


Example 2: Consider the following C code; what is the MIPS code for this segment?
while (save[i] == k) i += 1;
Assume that i and k correspond to the registers $s3 and $s5, and that the base of the array save is in $s6.
MIPS code:
Loop: sll $t1, $s3, 2      # t1 = i x 4, the byte offset used to reach save[i]
      add $t1, $t1, $s6    # t1 = s6 + t1, the address of save[i]
      lw $t0, 0($t1)       # t0 = save[i]
      bne $t0, $s5, Exit   # go to Exit if save[i] != k
      addi $s3, $s3, 1     # i = i + 1
      j Loop               # go to Loop
Exit:
Procedures and transfer of control:
A procedure or function is a tool that C or Java programmers use to structure programs, both to make them easier to understand and to allow code to be reused. A procedure allows the programmer to concentrate on one portion of the task, with the parameters acting as a barrier between the procedure and the rest of the program: parameters are passed to the procedure, and results computed in it are returned through parameters. For the execution of a procedure, the program must follow these steps:
1. Place the parameters in a place where the procedure can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5. Place the result value in a place where the calling program can access it.
6. Return control to the point of origin.
MIPS registers to support procedures:

MIPS instructions to support procedures:

Jump-and-link instruction: it jumps to an address and simultaneously saves the address of the following instruction in register $ra, called the return address. If the MIPS compiler needs more registers for a procedure than the 4 argument and 2 return-value registers, the contents of registers used in the main program can be saved in a data structure in stack memory, which is a last-in first-out structure. The stack pointer $sp is used to point into stack memory. Placing data onto the stack is called a push, and removing data from the stack is called a pop.


Usually, for each push operation the stack pointer is decremented by one word, and for each pop operation it is incremented by one word.
Example 1: Consider the following C procedure:
int leaf_example(int g, int h, int i, int j)
{
    int f;
    f = (g + h) - (i + j);
    return f;
}

What is the compiled MIPS assembly code?
For the given procedure:
1. We require 4 argument registers for g, h, i and j; assume they are $a0, $a1, $a2 and $a3 respectively.
2. We require 2 temporary registers for (g + h) and (i + j); assume they are $t0 and $t1 respectively. We save the contents of these registers to stack memory at the beginning of the procedure and restore them at the end.
3. We require 1 saved register for f; assume it is $s0. The content of this register is also saved to and restored from the stack.
MIPS code:
leaf_example:
      addi $sp, $sp, -12   # adjust the stack to make room for 3 registers
      sw $t0, 8($sp)       # push t0
      sw $t1, 4($sp)       # push t1
      sw $s0, 0($sp)       # push s0
      add $t0, $a0, $a1    # t0 = g + h
      add $t1, $a2, $a3    # t1 = i + j
      sub $s0, $t0, $t1    # f = t0 - t1
      add $v0, $s0, $zero  # return f (v0 = s0 + 0)
      # restore the old values of the registers
      lw $s0, 0($sp)       # pop s0
      lw $t1, 4($sp)       # pop t1
      lw $t0, 8($sp)       # pop t0
      addi $sp, $sp, 12    # release the stack room made for the 3 registers
      # return from the procedure using the return address
      jr $ra               # return to the calling routine
