FPGA-Based Laboratory Assignments for NoC-Based Manycore Systems

IEEE TRANSACTIONS ON EDUCATION, VOL. 55, NO. 2, MAY 2012

Christos Ttofis, Student Member, IEEE, Theocharis Theocharides, Member, IEEE, and Maria K. Michael, Member, IEEE
Abstract: Manycore systems have emerged as one of the dominant architectural trends in next-generation computer systems. These highly parallel systems are expected to be interconnected via packet-based networks-on-chip (NoC). The complexity of such systems poses novel and exciting challenges in academia, as teaching their design requires the students to understand a large number of NoC-based design-space parameters. Moreover, the industry has only recently attempted to design large-scale NoC-based manycore prototypes; the use of NoCs, therefore, has not yet reached a mature stage. Consequently, academia still lacks standardized tools and methodologies to teach NoC-based manycore systems, which, in turn, demand a solid educational background in a wide variety of areas, thus raising several teaching challenges. This paper presents an FPGA-based teaching framework composed of a sequence of laboratory assignments. The framework provides instructors with a practical teaching approach and helps them teach students how to emulate NoC-based manycore systems and how to evaluate and explore their design parameters. The proposed framework can be integrated into existing senior undergraduate courses or can be taught as an independent course. The course has been taught three times at the University of Cyprus, and initial course evaluation results, instructor observations, and suggested grading policies are also provided.
Index Terms: Computer architecture, embedded systems design, field-programmable gate arrays (FPGAs), manycore systems, networks-on-chip (NoC).
I. INTRODUCTION
MANYCORE systems have emerged as the dominant architectural trend in next-generation, high-performance, homogeneous, and heterogeneous microprocessors [1], [2]. These systems integrate multiple processor cores on a single die, yielding increased performance in a power-efficient manner [1], [2]. However, as the number of on-chip processor cores [processing elements (PEs)] increases, the design space for interconnecting these cores becomes more complicated. The use of global interconnects, such as buses and rings, causes severe on-chip synchronization errors, unpredictable delays, and inefficient power consumption [1], [2]. These effects give rise to new challenges in next-generation
Manuscript received December 02, 2010; revised March 11, 2011 and May 15, 2011; accepted June 02, 2011. Date of publication July 14, 2011; date of current version May 01, 2012. This work was supported in part by the Cyprus Research Promotion Foundation under Contract TE/HPO/0308(BIE)/04. The authors are with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia 1678, Cyprus (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TE.2011.2159795
manycore systems design, evidenced by the emergence of networks-on-chip (NoC) [3], [4]. These micro-networks, which provide packet-based communication, have been proposed to replace traditional buses and rings as the on-chip communication infrastructure, offering reusability, scalability, and predictability [3]–[8].
Traditionally, students learn multicore/manycore design as an introductory topic in the core undergraduate computer architecture course [9]. Rarely, however, do they see how such a system is implemented, as typical architectural curricula cover the architecture of the processor and topics such as instruction- and thread-level parallelism. As NoCs are a relatively new interconnect architecture, students rarely experience NoC-based systems in the undergraduate curriculum; those who do encounter them only through introductory theory, as existing practical teaching covers only traditional, bus-based multiprocessor systems [9].
NoC-based manycore systems present novel and exciting challenges in both industry and academia [10]. NoCs were introduced in the early years of the 21st century, and while they received worldwide recognition as the anticipated on-chip communication infrastructure of future generations of multicore systems, they are still only to be found in various experimental prototypes and have yet to be widely adopted by the industry. As such, and given the paradigm's immaturity, including NoC-based systems in undergraduate curricula had not been considered useful; such systems were instead included in graduate and research topics [11]. Recently, however, industry has successfully integrated tens of cores on a single die prototype using on-chip interconnection networks [12]. Furthermore, other commercial and research products have surfaced to support the NoC paradigm shift [13], [14]. Promising results stemming from these recent prototypes indicate that these systems will indeed be the de facto communication standard of the future [1].
As such, academia now faces the challenge of integrating NoCs into undergraduate curricula. NoC-based manycore systems design demands a solid educational background on multiple levels; students need to understand all aspects of system design, ranging from the on-chip interconnection infrastructure and the interaction of the network with the cores to the overall system integration and evaluation methodology. Furthermore, it is anticipated that a hands-on design-space exploration of the emerging design parameters involved in the development and sustainability of next-generation manycore systems will help students understand the issues and challenges inherent in designing such systems [15].
This paper presents a course framework that introduces senior undergraduate students to the principles of manycore systems and their interconnect architecture from a practical designer's point of view. This is expected to strengthen the students' manycore design skills, which are required in several areas other than traditional microprocessor design, such as embedded and real-time systems, high-end server farms, and high-performance and scientific computing. The paper introduces a practical teaching framework consisting of a sequence of VHDL-based laboratory assignments, intended to help students learn the design principles of NoC-based manycore systems. The framework utilizes field-programmable gate array (FPGA)-based emulation, particularly targeting design parameter exploration associated with the interconnection network supporting the manycore system. The course framework can be either taught as an independent senior-level undergraduate course or integrated as a laboratory component of an undergraduate senior-level computer architecture course (or a related, more practical course such as FPGA design). The framework can also be taught at the graduate level as a fundamental advanced computer architecture course. Furthermore, the course provides a debugging and evaluation suite allowing students to explore and evaluate many aspects of manycore architectures. A preliminary description of this laboratory course, which had at that time been taught once, was presented at MSE 2009 [16], receiving a Best Paper Award. This paper presents additional details related to the targeted curriculum topics and the individual laboratory assignments. Furthermore, this paper presents the monitoring, debugging, and evaluation suite used in the assignments and, because the course has since been taught twice more, includes significant feedback from student evaluations and the instructors' experience. The rest of this paper is organized as follows.
Section II presents the organization and characteristics of the course. Section III presents an overview of the targeted manycore curriculum topics providing the learning outcomes of the course. Section IV describes the sequence of laboratory assignments. Section V presents an evaluation of the laboratory as well as feedback from students.
II. COURSE ORGANIZATION AND CHARACTERISTICS
A. General Course Description
The NoC-based manycore design course presented here is expected to be an elective course offered to undergraduate seniors and is designed for a 14-week semester. It is also structured so that it can be integrated into an existing senior undergraduate course that teaches principles of hardware design, such as an FPGA design course. The laboratory-oriented course aims to teach the theory, design, implementation, and validation of NoC-based manycore systems in a way that goes beyond the typical lecture-then-test format of teaching. Through a combination of weekly, short (hour-long) lectures held in the lab, and a series of laboratory assignments, the proposed lab-based learning approach can speed up the learning process and motivate students to learn actively [15], [17]. During the lectures, students are introduced to the fundamentals behind NoC-based
Fig. 1. NoC-based manycore architecture that students are expected to implement by the end of the course.
manycore design. They are then expected to turn these fundamentals into practice by completing a series of FPGA-based laboratory assignments covering principles learned during the lectures. Moreover, the lab assignments are completed in two- or three-student teams, emulating the behavior of industrial design teams. This improves students' motivation for cooperative design and helps them understand the complexity of such large-scale systems [17]. Given the relative novelty of NoC-based systems, their recent adoption by industry, and the lack of a standardized design flow and architecture for such systems, no single textbook currently covers all aspects of this new topic. Therefore, the concepts mentioned previously are taught through a combination of technical books [8], [18], [19], book chapters [5], [7], research articles [1]–[4], [20], [21], and instructor notes. Course prerequisites include the courses Computer Architecture and Computer Organization, basic programming skills, and basic knowledge of hardware description languages.
B. Targeted Manycore Architecture
By the end of the course, students will be able to design a manycore architecture like that shown in Fig. 1, which can be represented as a set of PEs (processor cores, memory cores, etc.) that communicate via a packet-based communication network (NoC). The NoC provides global chip communication by employing a grid of routers, which are connected to each other, forming a specific network topology. The main function of the routers is to route packets from sources to destinations according to a chosen communication protocol [18]. More precisely, when a source router sends a packet to a destination router (see Fig. 1), the packet is first generated and transmitted from the local core to the attached router via a network interface (NI). The NI is needed to connect each core to the NoC, serving as the interfacing mechanism between the interconnection architecture and the cores.
Given the variety of processor and memory cores, as well as the variety of data formats supported by each processor, the NI acts as a translator that sends and receives data from each core in miscellaneous formats
Fig. 2. Course sequence diagram showing the logical order of the labs.
by assembling and disassembling packets, following the NoC communication protocol. The packet is then stored at the input channels, and the router starts servicing it. This service time includes the time needed to make a routing decision, allocate a channel, and traverse the switch fabric (a crossbar switch). After being serviced, the packet moves to the next router on its path, and the process repeats until the packet arrives at its final destination [22].
C. Course Curriculum Organization
The course (and the lab assignments) is divided into three major curriculum topics:
Curriculum Topic 1, NoC Infrastructure: Lab 1 (Design of the NoC Router); Lab 2 (Design and Verification of a 4 × 4 NoC); Lab 3 (Network Interface).
Curriculum Topic 2, System Integration: Lab 4 (Integrating Processor Cores to the NoC).
Curriculum Topic 3, Evaluation/Debugging System and Methodology: Lab 5 (Debugging and Evaluation).
The first curriculum topic covers the design of the on-chip interconnection network and the network interface, the second covers integration of processor and memory cores with the NoC, and the last teaches a methodology for testing and evaluating the manycore architecture. The three topics are taught in this order during the lecture sessions before each lab. The block diagram in Fig. 2 shows the course sequence and the logical order of the labs. The dashed arrow shows the typical sequence of the curriculum topics, while the dark-shaded arrows show the relationship between the lab assignments. The diagram also shows the curriculum topic under which each lab assignment is conducted.
III. NOC-BASED MANYCORE DESIGN CURRICULUM TOPICS
A. NoC Infrastructure (Curriculum Topic 1)
The design of the on-chip interconnection network and its impact on the manycore system performance is among the primary learning outcomes of the course. As such, it is important to understand the benefits of emerging network-on-chip interconnects over traditional bus-based interconnect architectures and to realize the differences in their design process. From this topic, students are expected to learn how to design the on-chip router, the links connecting the routers of the network, and the links connecting the network interfaces to the routers and the PEs. Emphasis is placed on understanding how each parameter associated with the design of the routers and links impacts the performance of the manycore architecture, which is typically measured based on a standard set of metrics, such as throughput, latency, power consumption, and area constraints imposed on the system [23]. Students are engaged in early design exploration of a range of network characteristics (topology, routing, flow control, quality of service) and learn how to associate these characteristics with the performance of each system. The material related to the network topic is classified in two subtopics: the interconnection network that covers the routers and links, and the network interface.
1) Interconnection Network: The primary objective of this subtopic is to teach the fundamental goals of the NoC design process and the importance of designing in a modular plug-and-play way that permits the development of a variety of NoC implementations, and subsequently of manycore architectures. This subtopic covers several design parameter tradeoffs that help students develop a flexible NoC. These parameters include the following.
Topology exploration: Network topology is an important part of the network architecture as it determines the network distance of all communication nodes in the network [18] and thus impacts the performance, choice of routing algorithms, and the associated power/area overhead. Students are guided to understand this by being asked to explore networks of various dimensions and topologies.
Reconfigurability: Reconfigurability is particularly essential in order to integrate the NoC with different types of PEs and adapt to various application constraints. Students learn how to make NoC designs configurable in terms of data width, FIFO depth, and packet length using HDL packages and generics.
Router modularity: Students learn the benefits of designing the architecture of the NoC router in a modular manner; they learn how to create well-defined interfaces and how to follow black-box design approaches for each router component for debugging and experimentation purposes.
Virtual channels (VCs): One of the major learning objectives of this subtopic is the concept of VCs [7] and their associated area/energy overhead costs and throughput benefits. VCs have been used extensively in traditional networks and have been adopted by on-chip networks for priority and quality of service (QoS), as well as for blocking and deadlock avoidance. Their importance in the course is emphasized both in theory and in practice. Students learn about the theoretical issues of VCs and apply them in practice by being asked to design the NoC router with and without VCs. They can subsequently experiment with different NoC implementations, featuring VCs and QoS algorithms, as well as simple implementations without VCs. VCs, while very useful, incur substantial hardware overhead and energy cost; it is thus important to determine whether to include VCs in an NoC design or not, depending on the targeted application.
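The effect of the parameters listed above can be previewed in software before any VHDL is written. The following is a minimal Python sketch, not part of the course material: the `NocConfig` fields stand in for the VHDL generics mentioned in the text, and their names and default values are assumptions chosen for illustration. It computes the average hop distance of a 2-D mesh (a simple proxy for the network distance that topology determines) and the input-buffer storage that VCs and FIFO depth add to each router.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class NocConfig:
    """Hypothetical parameter set mirroring VHDL generics (names are assumptions)."""
    rows: int = 4
    cols: int = 4
    data_width: int = 32   # bits per flit
    fifo_depth: int = 4    # flits stored per virtual-channel FIFO
    num_vcs: int = 2       # virtual channels per input port
    ports: int = 5         # North, South, East, West, plus the local PE port


def average_hops(cfg: NocConfig) -> float:
    """Mean Manhattan distance between distinct nodes of a 2-D mesh."""
    nodes = list(product(range(cfg.rows), range(cfg.cols)))
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1), (r2, c2) in product(nodes, nodes))
    pairs = len(nodes) * (len(nodes) - 1)  # ordered pairs of distinct nodes
    return total / pairs


def buffer_bits_per_router(cfg: NocConfig) -> int:
    """Input-buffer storage of one router, in bits."""
    return cfg.ports * cfg.num_vcs * cfg.fifo_depth * cfg.data_width


if __name__ == "__main__":
    for cfg in (NocConfig(), NocConfig(rows=3, cols=3), NocConfig(num_vcs=1)):
        print(cfg, average_hops(cfg), buffer_bits_per_router(cfg))
```

Under these assumed defaults, halving the number of VCs halves the buffer storage per router, which is the kind of area/throughput tradeoff students are asked to justify in their reports.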
Switching mechanisms: Students learn different types of switching mechanisms such as circuit switching and packet switching in order to understand how packets are delivered in the network. Particular emphasis is given to the concept of wormhole switching [18], a concept that is well understood only through practical implementation; wormhole switching reduces the amount of buffer storage required within the routers [20], and therefore has been adopted as the de facto switching mechanism by the industry [12]. This is a particularly useful concept to cover here since the course features educational FPGAs, which typically are also area-constrained.
Packet format: Each packet traveling in the network is organized into various fields, much like traditional networks. Currently, there is no standard packet format adopted. Hence, the packet format used in the course was chosen to facilitate scalability and flexibility and help students understand the associated overheads. Each packet is split into flits (a flit being the smallest bufferable packet chunk [7]). For simplicity, the packet has a fixed length, since a fixed-length packet leads to a very simple packet parsing process, simplifying the hardware implementation. The
packet fields contain the data payload, destination node address, sequence number, error correction field, and application-specific fields. The size of each field is left for students to choose in order to help them understand scalability and how to add application-specific fields to meet the needs of each application running on the system.
Routing algorithm: The impact of the routing algorithm on the throughput and area/power overhead is one of the most important learning outcomes. Students learn the characteristics of the different routing schemes (source versus distributed routing, deterministic versus adaptive routing). The concept of a deadlock-free routing algorithm, and how VCs are used in order to avoid channel dependency and thus make a routing algorithm deadlock-free, is also explained. Through the use of VCs, students explore the flexibility of designing different deadlock-free routing algorithms at the cost of hardware complexity, more area, and thus higher power consumption.
Router-PE communication: The last topic teaches how each router interfaces with the other routers and with the PEs. The functionality of the crossbar switch and the concept of link pipelining are also taught.
2) Network Interface: The NI translates the packet-based communication of the network into a form that is required by the PEs. A simple and generic NI is a key issue for a rapid and cost-effective realization of an NoC-based manycore system, whether homogeneous or heterogeneous. Furthermore, decoupling computation from communication is another key ingredient in managing the complexity of NoC-based manycore systems, as this allows the cores and the interconnection network to be designed independently [24]. Both these issues are major teaching objectives of this subtopic and are taught by asking students to divide the internal design of the NI into two parts: one dealing with the core (core-dependent) and one dealing with the network (protocol-dependent).
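The fixed-length packet format described above lends itself to a short software model. The sketch below is only an illustration: the course leaves the field sizes to the students, so the field names and the 4/4/4/4/16-bit widths in `FIELDS` are assumptions, and the whole packet is modeled as a single 32-bit word for brevity. It shows why a fixed layout keeps parsing simple, since every field sits at a constant bit offset.

```python
# Hypothetical field layout, most significant field first (widths in bits).
# The course leaves these sizes to the students; the values here are assumptions.
FIELDS = [("dest", 4), ("seq", 4), ("ecc", 4), ("app", 4), ("payload", 16)]
WORD_WIDTH = sum(width for _, width in FIELDS)  # 32 bits in this sketch


def pack_fields(**values: int) -> int:
    """Pack named fields into one fixed-width word; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_fields(word: int) -> dict:
    """Recover every field from a packed word using constant bit offsets."""
    fields = {}
    shift = WORD_WIDTH
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    word = pack_fields(dest=0xA, seq=1, payload=0xBEEF)
    print(hex(word), unpack_fields(word))
```

Adding an application-specific field in this model only means appending an entry to `FIELDS`, which mirrors the scalability exercise posed to the students.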
The core-dependent part relies on the data format and I/O interface of each specific core, and its major functionality includes assembling and disassembling packets. The core-dependent part usually offers a generic, parametrizable interface with two kinds of signals: the handshake signals that synchronize the data transfer between the core and the NI, and the payload signals that carry the data. The protocol-dependent part complies with the NoC communication protocol and is responsible for timing, buffering, and synchronization aspects during data communication, as well as packetization and depacketization of the data to/from the NoC. It is important to emphasize how having a standardized NI similar to the Open Core Protocol [25] accommodates a wide range of PEs, whether they be processor cores, memory, or any other computational structure.
B. System Integration (Curriculum Topic 2)
Understanding the process of interfacing PEs to the on-chip network is another important course objective. Taking into consideration that students should be already familiar with fundamental microprocessor and memory architecture design, this topic focuses on integration issues, specifically, how to interface existing processor and memory cores with the NoC. As such, two cores were selected as examples of processor
cores: the MIPS32 5-stage RISC core [26], widely popular in education, and the Microblaze processor core [27], available with the Xilinx FPGA design tools. Xilinx additionally provides Core Generator [28], which can be used to generate memory cores and other custom processor cores.
1) MIPS32 Core: The MIPS32 core is a 32-bit RISC processor core, widely used in the teaching of computer architecture courses [26]. Due to its simple architecture and platform-independent nature, it is used in several applications in both industrial and academic environments, and students should be familiar with its architecture and organization. A VHDL implementation of the MIPS32 core is provided to the students. It features a basic five-stage pipeline that supports 32-bit load/store instructions, detects and resolves hazards, and reduces stall cycles through data forwarding and cache prefetching. The processor copes with branches by always assuming that branches are not taken and uses an L1 direct-mapped instruction cache and an L1 direct-mapped data cache. The core is used to illustrate how the overall processor operation and design is affected when the processor is connected to the NoC and to describe the design process for the NI. The given core supports only a subset of the instructions used in the original MIPS ISA [26]. However, instructions can be easily added or removed, depending on whether the instructor wants to extend the course to cover specific applications.
2) Microblaze Core: The course also provides students with the opportunity to integrate the Xilinx Microblaze core into their designs. Microblaze is another 32-bit RISC processor used in FPGA designs targeting supported Xilinx Spartan or Virtex families of physical FPGA devices and is licensed as part of the Xilinx Embedded Development Kit (EDK) [29] that accompanies popular academic FPGA boards.
The EDK tool provides a range of choices when configuring the Microblaze architecture, allowing the students to specify the memory depth (number of 32-bit words), the ALU functionality, the number and types of peripherals, and memory address space parameters at design time [27], [29]. Most importantly, however, Microblaze comes with a compiler, allowing high-level software to be used on the NoC system. This allows the students to run high-level software on their designs using a Microblaze-compatible NI, which can translate from the dedicated Fast Simplex Link (FSL) bus protocol [30] used in the Microblaze core to the NoC communication protocol. The Microblaze core along with the Microblaze-compatible NI are both provided to students.
C. Evaluation/Debugging System and Methodology (Curriculum Topic 3)
Verification and debugging of NoC-based manycore systems is an important learning outcome of the course. Being able to trace and debug the design is a skill acquired through experience and, hence, intensive lab work. Taking into consideration that visual learning is an important method for exploiting students' visual senses to enhance learning and engage their interest [31], the course uses a visual evaluation and debugging environment purpose-developed by the course instructors, called Video-Aided Debugging (ViAD). In addition to system-level functional verification done in Xilinx ISE Simulator (ISim) [32]
Fig. 3. ViAD monitoring/debugging environment eases design-space exploration.
using VHDL test benches, ViAD can be used as a real-time evaluation tool, allowing students to visualize selected high-level (system-level) signals of their NoC-based manycore system as they actually occur in real time. Using dedicated hardware counters within the system modules, students obtain high-level information about the operation of the system (e.g., delays, throughput, and so on) and can potentially identify performance bottlenecks in their system. ViAD can also be modified to act as a debugging tool, by displaying characters and signals on the screen during debugging of components, by mapping VHDL signals as inputs to the ViAD environment. The generic nature of ViAD allows it to be synthesized and implemented on most FPGA platforms that have a VGA-compatible port and user-defined switch inputs with very little hardware overhead. It is therefore appropriate for small-sized educational FPGAs. The basic characteristics of the ViAD environment are illustrated through a number of practical examples intended to help students understand how to integrate ViAD in their NoC-based manycore designs. Students learn to identify possible design errors by monitoring only those signals that are most critical for correct system operation, which helps them form a preliminary picture of what is happening in the manycore design. Such a high-level picture may include events such as the arrival and departure of packets from the input/output port of a router. This is easy to observe in ViAD by just monitoring the handshaking signals of the router (valid/ready bits). In this way, students can detect flaws in the design of the router. Similarly, ViAD can be used to identify design errors in internal components of the routers (e.g., buffers, crossbar, and so on). Additionally, the result of each computation performed by the system is known a priori, allowing identification of design errors by using ViAD to monitor the architectural registers of the processors.
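The counter-based monitoring described above can be mimicked in software on a recorded signal trace. The following Python sketch is not ViAD itself; it is a hypothetical stand-in that assumes the common convention that a flit crosses a link in any cycle where both the valid and ready bits are asserted. The function names and the trace format (one `(valid, ready)` pair per clock cycle) are assumptions made for illustration.

```python
def count_transfers(trace):
    """Count cycles in which a flit moves (assumed: valid and ready both high)."""
    return sum(1 for valid, ready in trace if valid and ready)


def link_stats(trace):
    """Summarize one link, like a set of hardware counters read out after a run."""
    trace = list(trace)
    cycles = len(trace)
    transfers = count_transfers(trace)
    # A sender holding valid high while ready is low is being back-pressured.
    stalled = sum(1 for valid, ready in trace if valid and not ready)
    return {
        "cycles": cycles,
        "transfers": transfers,
        "stalled": stalled,
        "throughput": transfers / cycles if cycles else 0.0,  # flits per cycle
    }


if __name__ == "__main__":
    # Hypothetical 8-cycle trace of (valid, ready) samples on one router port.
    trace = [(1, 1), (1, 0), (1, 1), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1)]
    print(link_stats(trace))
```

A port whose stalled count grows much faster than its transfer count is a natural first candidate when looking for a bottleneck or a handshake bug.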
ViAD can easily be configured to monitor the architectural registers of all processors running in parallel, as well as to monitor specific memory locations. Fig. 3 illustrates how the ViAD environment can ease design-space exploration, evaluation, and debugging of potential manycore architectures. Fig. 4 shows an experimental setup of the ViAD environment.
IV. INDIVIDUAL LABORATORY ASSIGNMENTS
Each of the curriculum topics covered in the course is implemented and explored practically in a sequence of FPGA-based lab assignments, tightly coupled with the lectures. The
TABLE I LAB ASSIGNMENTS INFORMATION AND ASSESSMENT CRITERIA
overhead versus the throughput. This report is used as a quantitative evaluation document. Furthermore, students are also required to take an individual oral quiz at the end of each lab, where the learning outcomes are evaluated, ensuring that even though the students operate in groups, each student independently achieves the expected course outcomes. The grades allocated to each individual lab assignment, as well as the suggested grading weights associated with each outcome of each lab, are also shown in Table I (the quiz grade is part of the total lab grade shown in the table). A description of each of the laboratory exercises follows.
A. Lab 1 (3 Weeks): Design of the NoC Router
The first assignment focuses on the design of an NoC router using the design principles and parameters discussed in Section III-A-1. Students are required to design the architecture of an NoC router, which typically consists of five bidirectional ports: one dedicated to the PE and four (North, South, West, and East) to communicate with the rest of the network [5]. Each input port should contain a routing decision unit (RDU) that implements the routing algorithm, a demultiplexer that forwards the packet (flit) from the RDU to the appropriate FIFO queue (VC), circular FIFO queues (VCs) for storing the incoming flits, a VC arbitrator, and the crossbar switch. A sample block diagram illustrating a single input port and a single output port of an NoC router is shown in Fig. 5. It should be noted that there are several variations in the structure of the router architecture shown in Fig. 5 (for example, the VCs could be placed at the output port of the router instead of at the input port), and that students are not restricted to designing that specific architecture. Instead, students are free to make their own choice of design, but are required to explain their choice in their lab
Fig. 4. Configuration of a system consisting of the FPGA board (XUP Virtex-2 Pro) and a VGA monitor.
labs are structured for two popular academic FPGA boards: the XUP Virtex-II Pro and the XUP Virtex-5 LX110T, both part of Xilinx's University Program [34], [35]. Each lab expects students to download and evaluate their implementations on the FPGA board. In this way, students can understand how their design selections are also impacted by the targeted FPGA constraints. Table I summarizes the relevant information for each lab assignment, including what students learn/explore in each lab, which parts of the system architecture are designed by students, and which parts are provided by the instructor. At the end of each lab, students are asked to write an individual cost-implementation analysis report that details the reasoning behind their selection of network parameters in terms of the area/power
of which are shown in Table I) and explore the impact of each parameter on system performance, thus gaining a deeper understanding of the NoC operation.

C. Lab 3 (3 Weeks): Network Interface

Having completed the interconnection network, the next lab concentrates on the design of the NIs. The students are first asked to design the NI that interfaces the 32-bit MIPS core with the NoC. Next, they are asked to create an NI that allows a memory core to be attached to the NoC. In both cases, the signals and synchronization mechanisms for the communication protocols between the processor cores and the memory cores are implemented, so as to emphasize the importance of having a communication protocol that is standardized, yet flexible enough to accommodate various PEs. The MIPS NoC NI is expected to be designed in such a way as to facilitate concurrent data and instruction requests from the simplified MIPS32 core. This is necessary since a processor core such as MIPS can generate data read and data write requests, and instruction requests, concurrently if the typical five-stage pipeline operation is followed. Hence, the NI should be able to deal concurrently with both types of requests by creating packets containing addresses of data and addresses of instructions. Moreover, students have to design the memory (RAM) network interface (RNI) that interfaces the memory core with the NoC. The RNI has to be designed so as to target generic memory cores with standard memory signals. The RNI should be able to handle processor requests, acting as a packet assembler/disassembler. It should first disassemble incoming packets received from the NoC by decoding them into read or write requests for the memory to serve, and then assemble packets containing the return data and/or instructions from the memory so that they can be returned to the requesting PE through the network.

D. Lab 4 (2 Weeks): Integrating Processor Cores to the NoC

Once they create the NIs, students are given a basic MIPS32 core as described in Section III-B-1 and are expected to integrate it with the NoC designed in Lab 2, using the network interfaces developed in Lab 3. They are also given memory cores, which they also have to integrate in their system. The objective of the lab is to create a complete 16-core NoC-based manycore system. The students are asked to include at least eight processor cores. However, they can select a larger number of cores, provided they took into consideration the area constraints when they were designing their network parameters in Labs 1-3. Students need to select a number of cores that fits on the targeted FPGA. They are expected to fill the remaining available slots with memory cores (as L2 caches). This allows students to see the tradeoffs involved in the NoC versus the core area overhead and how the NoC design decisions impact the overall system. Additionally, students are given the Microblaze core and a Microblaze-compatible NI and are expected to follow the same methodology to build a 3 × 3 Microblaze-based multicore architecture (the smaller size is due to space limitations on the FPGAs, as the Microblaze core is larger than the MIPS core).
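The packet assembler/disassembler role of the memory network interface (RNI) designed in Lab 3 and integrated here can be illustrated with a short software model. The sketch below is written in Python rather than the VHDL used in the labs. The packet layout (4-bit destination and source tile IDs, a 2-bit request kind, a 32-bit address, and a 32-bit data word) and the names `Packet`, `assemble`, `disassemble`, and `MemoryNI` are assumptions made purely for illustration; in the labs the packet format is chosen by the students.

```python
from dataclasses import dataclass
from enum import IntEnum


class Kind(IntEnum):
    """Hypothetical 2-bit request type carried in each packet."""
    READ = 0
    WRITE = 1
    RESPONSE = 2


@dataclass(frozen=True)
class Packet:
    dest: int      # 4-bit tile ID of the receiver (0..15 on a 16-tile NoC)
    src: int       # 4-bit tile ID of the sender
    kind: Kind
    addr: int      # 32-bit memory address
    data: int = 0  # 32-bit payload (unused for READ requests)


def assemble(p: Packet) -> int:
    """Pack the fields into one word: [dest:4][src:4][kind:2][addr:32][data:32]."""
    word = p.dest & 0xF
    word = (word << 4) | (p.src & 0xF)
    word = (word << 2) | int(p.kind)
    word = (word << 32) | (p.addr & 0xFFFFFFFF)
    word = (word << 32) | (p.data & 0xFFFFFFFF)
    return word


def disassemble(word: int) -> Packet:
    """Inverse of assemble(): decode a raw word back into its fields."""
    data = word & 0xFFFFFFFF
    addr = (word >> 32) & 0xFFFFFFFF
    kind = Kind((word >> 64) & 0x3)
    src = (word >> 66) & 0xF
    dest = (word >> 70) & 0xF
    return Packet(dest, src, kind, addr, data)


class MemoryNI:
    """Behavioral model of an RNI sitting in front of a generic memory core."""

    def __init__(self, tile_id: int):
        self.tile_id = tile_id
        self.mem = {}  # address -> 32-bit word; stands in for the memory core

    def serve(self, word: int):
        """Decode one incoming packet; return a response word for reads, else None."""
        req = disassemble(word)
        if req.kind is Kind.WRITE:
            self.mem[req.addr] = req.data
            return None
        if req.kind is Kind.READ:
            value = self.mem.get(req.addr, 0)
            # The response travels back to the requesting PE, so src/dest swap.
            return assemble(Packet(req.src, self.tile_id, Kind.RESPONSE,
                                   req.addr, value))
        return None
```

In this model a write followed by a read of the same address returns a `RESPONSE` packet addressed to the original requester, which mirrors the disassemble-serve-assemble sequence described for the RNI. A hardware NI would additionally need handshaking and buffering, which are omitted here.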
Fig. 5. Example block diagram of a typical NoC router.
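For a router such as the one in Fig. 5, two pieces of control logic lend themselves to a compact behavioral illustration: dimension-ordered XY routing and round-robin arbitration. The Python sketch below models both; it is not the VHDL that students write in the labs. The coordinate convention (x grows toward EAST, y toward NORTH), the port names, and the identifiers `xy_route` and `RoundRobinArbiter` are assumptions introduced for this example.

```python
def xy_route(cur, dst):
    """Pick the output port for a packet at router `cur` heading to `dst`.

    Both arguments are (x, y) coordinates on a 2-D mesh. XY routing first
    corrects the X offset completely, and only then the Y offset.
    """
    cx, cy = cur
    dx, dy = dst
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # packet has arrived; deliver it to the attached PE


class RoundRobinArbiter:
    """Grant one of `n` requesters per call, rotating priority after each grant."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # requester 0 gets the highest priority initially

    def grant(self, requests):
        """`requests` is a sequence of n booleans; return the granted index or None."""
        for offset in range(1, self.n + 1):
            i = (self.last + offset) % self.n
            if requests[i]:
                self.last = i  # the winner becomes lowest priority next time
                return i
        return None
```

Because each decision depends only on the current and destination coordinates, the routing function needs no state, which is one reason dimension-ordered schemes map to simple and compact logic. Swapping the order of the X and Y tests would give YX routing, and the arbiter could be replaced by a different policy without touching the routing function.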
report. Students are also free to choose the packet format. This format, however, must comply with the requirements discussed in Section III-A-1. The implementation of the routing algorithm, and the way in which the routing logic (RDU) is modified so that other routing algorithms and topologies can be adopted, is the primary learning objective of this lab. As such, students are asked to design a generic RDU that can easily be adapted to implement different routing protocols. Among the algorithms given to the students for exploration are dimension-ordered, deterministic algorithms, such as XY and YX [21], since these algorithms lead to simple, fast, and compact routing logic. In addition to the routing algorithm, the complexity of the VCs and the allocation-arbitration hardware, as well as the power inefficiency of the crossbar switch, are also learning objectives targeted in this lab. Students are required to implement a round-robin algorithm to be integrated in the arbitration logic of the VCs and the crossbar switch. The arbitration for both the VCs and the crossbar is expected to be implemented as independent VHDL modules so that alternate arbitration schemes, which can feature QoS criteria, can be added if the instructor wishes. Moreover, students are required to experiment with both wormhole and virtual-cut-through switching in order to understand how the buffer size depends on the switching mechanism.

B. Lab 2 (2 Weeks): Design and Verification of a 4 × 4 NoC
Once the design of the router is completed, students are asked to create a 16-tile NoC, where they are expected to address communication and synchronization signals between the routers and experiment with a chosen topology. The router created in the previous lab can be used to rapidly build mesh and torus networks. Students are asked to design at least these two topologies so that they can understand their impact on performance and the associated hardware/energy overhead. Students are also asked to evaluate and debug the constructed NoC using the methodology described in Section III-C. To do this, they are asked to generate 16 dummy PEs, each with its own memory created with Xilinx's Core Generator. These memories store randomly generated packets created with C++ code (which is provided). The students are required to experiment with 4 × 4 mesh or torus networks by downloading the NoC along with the dummy memories onto the FPGA board and using the ViAD environment to monitor the network traffic during execution. ViAD allows students to experiment practically with several NoC parameters (all
E. Lab 5 (4 Weeks): Debugging and Evaluation

The final lab targets system integration, debugging, evaluation, and analysis of the NoC-based manycore systems developed in Lab 4 (MIPS-based and Microblaze-based systems). Students are given a set of benchmark applications in MIPS assembly and C, suitable for testing and evaluating their implementations and capable of providing them with useful observations on design parameter exploration. The applications consist of four popular algorithms: a matrix multiplication algorithm, a producer-consumer problem, addition of operands fetched in sequential and random memory accesses, and summation of 100 000 numbers using parallel processing. Students are asked to load the programs onto the corresponding system (assembly for the MIPS32-based system and C for the Microblaze-based system) in order to evaluate their platforms. They are also expected to integrate the ViAD environment in their system in order to debug and monitor the system's operation through the procedure outlined in Section III-C. In addition to the lab report and the oral quiz upon completion of this lab, students are also expected to complete a comprehensive lab report that covers all lab assignments completed so far.

V. LAB EVALUATION

A. Grading Policy

While Table I can be used as an indicative lab-by-lab evaluation scheme based on individual lab objectives, the framework presented can follow a flexible grading policy, where students can be graded based on individual lab-by-lab deliverables or comprehensively at the end of the five-lab sequence. The course instructors, however, believe that students should be given the chance to complete their lab assignments, within a reasonable deadline. If students have not completed an individual lab, they must be supplied with the working modules that they should have completed in that lab so that they can move on to the next lab.
Given the labs' intensive coding and complexity, simple HDL errors can spoil an otherwise extremely good effort by a student. As such, measures need to be taken to ensure that students are not discouraged by such scenarios and that they understand that they can still follow the lab assignments. A small penalty can be imposed on students receiving help in the form of lab deadline extensions or ready-made modules so that students give their full effort in each lab. The proposed lab sequence has been offered three times so far (as of the end of 2010) as part of the senior undergraduate FPGA-based course within the Department of Electrical and Computer Engineering at the University of Cyprus, Nicosia, Cyprus. The host course features principles of FPGA design and used the lab assignments as a practical approach to FPGA design through the use of NoC-based manycore systems. The course followed a lab-by-lab grading policy, with the suggested weights shown in Table I for each lab. One of the policies followed was to penalize students with a 20% late penalty if they did not finish the lab on time, but they were allowed to continue working on each lab until they were able to complete it. However, they were also given the option of receiving a 20% penalty for each lab they did not complete and move to the next lab with the instructors
providing working modules to the students. Through feedback, students indicated that this was a fair grading scheme, as the grading policy rewarded the most successful students during each lab while remaining encouraging for students who had some errors appearing in their designs. Students who were consistently late in finishing the lab assignments, or who needed completed modules for more than two or three labs, received a lower grade. Of the 46 students who took the course over the three semesters, only five failed the course at their first attempt, and four of those five students passed the course at their second attempt. The average mark for all three semesters was 71.6% (68.5% for 13 students the first semester, 69.8% for 19 students the second, and 76.4% for 14 students the third). It would be interesting to evaluate how well students who have taken the course perform in a course that features multicore systems, such as a graduate computer architecture course or another senior undergraduate course, but at this point, this data is not available.

B. Evaluation

The course was a controlled elective, available only to fourth-year senior undergraduate/graduate-level Computer Engineering students at the University of Cyprus. These students had already taken introductory computer architecture and organization courses and had some introductory knowledge of VHDL design. The course featured two weekly lectures of 60 min each as well as a 4-h lab session; at least one of the two weekly lectures was given in the lab and included a question-and-answer session, giving hands-on answers to student problems. The course was evaluated at the end of each semester as part of the general course evaluation policy followed at the University of Cyprus, which is based on Florida State University's Student Perception of Teaching (SPOT) system [33].
The SPOT evaluation form integrates general questions related to the course, such as whether it is a mandatory or elective course, general questions related to the quality of the instructors, the department, and the university, expected grades, and other standard SPOT questions as in [33]. The form provided the instructors with the opportunity to include their own questions, related to the course content, which are shown in Table II. The form is completed online through the university's evaluation system, which is offered by the independent Center for Teaching and Learning and guarantees anonymity. All students enrolled in the course completed their evaluation forms, and all students completed the course-related questions. The average score for each question (accumulated from answers from the three course offerings) is also shown in Table II. Fig. 6 shows a semester-by-semester analysis of the questions Q1-Q5. Q6 was omitted for readability purposes (its average results are shown in Table II), and Q7 essentially cannot be compared on a semester-by-semester basis as fourth-year students typically take different courses every semester. Furthermore, the students were also asked to provide typed (for anonymity purposes) remarks about the course. The general conclusion that was drawn is that students did appreciate the practical nature of the framework, as well as the teach-then-implement policy applied and the interactive teaching during each lab. A large
TABLE II. AVERAGE SCORE (1-5) OF THE ADDITIONAL QUESTIONS OF THE EVALUATION FORM. SCALE LABELS: POOR; NEEDS IMPROVEMENT; FAIR, AS EXPECTED; GOOD, MORE THAN I EXPECTED; EXCELLENT
Fig. 6. Analytical evaluation results for Q1-Q5 (average scores and scores per semester).
were asked instead to design the VC arbitration module. Lab 1 evaluations from the second offering indicated that students understood how demanding the VC subsystem of an NoC router is. At the end of the second offering, the student feedback revealed that several students had trouble understanding the assembly programs given as part of Lab 5. Consequently, the third time the course was given, the instructors introduced the Microblaze core, which adds high-level software functionality to the system using the associated supplied compiler.

VI. CONCLUSION

While manycore systems are emerging as the dominant trend in future microprocessor design, teaching these large-scale and very complex systems requires curriculum modifications. Such modifications include the practical teaching of NoC-based manycores. This paper presented a framework consisting of a sequence of lab assignments, with the objective of giving the students practical, hands-on experience in NoC-based manycore design. The labs were structured to facilitate learning the fundamental concepts of the interconnection network and the way in which such networks can support and impact the performance of manycore architectures. The assignments included the evaluation, visual monitoring, and debugging of system operation for better understanding of NoC-based manycore architectures.

REFERENCES
[1] S. Borkar, "Thousand core chips: A technology perspective," in Proc. 44th DAC, 2007, pp. 746-749.
[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The landscape of parallel computing research: A view from Berkeley," University of California, Berkeley, CA, Tech. Rep. No. UCB/EECS-2006-183, Dec. 18, 2006.
[3] L. Benini and G. D. Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, vol. 35, no. 1, pp. 70-78, Jan. 2002.
[4] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. 38th DAC, Las Vegas, NV, Jun. 2001, pp. 648-689.
[5] T. Theocharides, G. M. Link, N. Vijaykrishnan, and M. J. Irwin, "Networks on Chip (NoC): Interconnects of next generation systems on chip," Adv. Comput., vol. 63, pp. 36-92, 2005.
number (78%) indicated that the ViAD environment helped them debug and understand system operation better. A small number of students (18%) thought that the material was a bit excessive for a senior-level elective course; approximately a quarter of the students also thought that the lab framework should be taught as an independent course. The grading policy of the course provoked a minor disagreement in student evaluations. A large number of students (more than 70%) agreed with the grading policy. However, around 22% thought that the course did not reward students who completed everything on time and by themselves (i.e., not needing completed modules from the instructor). Those students argued that a student failing to complete one or two labs, or a student given a ready-made component, could still receive a grade above 75%, which was not significantly less than that for a student receiving 90% overall. Each time the course was offered, there were some revisions to the course content based on student feedback and course evaluations. In particular, the first time the course was offered, students were given the VC arbitration module already made and were asked to create all the memory elements in the course. It was then observed in the evaluation of Lab 1 that students did not appreciate the hardware overhead associated with the concept of VCs. Moreover, through the course evaluation at the end of the course, students indicated that they spent a lot of time implementing memory modules, something that was not an associated learning objective. Therefore, the second time the course was offered, students were provided with directions on how to easily generate memories using Xilinx Core Generator and
[6] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hermani, "A network on chip architecture and design methodology," in Proc. IEEE ISVLSI, Apr. 2002, pp. 105-112.
[7] C. Nicopoulos, V. Narayanan, and C. R. Das, Network-on-Chip Architectures: A Holistic Design Exploration, ser. Lecture Notes in Electrical Engineering, 1st ed. New York: Springer, 2010, vol. 45, XXII.
[8] A. Jantsch and H. Tenhunen, Networks on Chip. Norwell, MA: Kluwer, 2003.
[9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, 4th ed. San Mateo, CA: Morgan Kaufman.
[10] G. Martin, "Overview of the MPSoC design challenge," in Proc. 43rd ACM/IEEE DAC, New York, NY, 2006, pp. 274-279.
[11] Bradley Department of ECE, "ECE 5514: Design of systems on a chip," Virginia Tech, Blacksburg, VA, Accessed Feb. 2011 [Online]. Available: http://www.ece.vt.edu/graduate/courses/viewcourse.php?number=5514-93
[12] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29-41, Jan. 2008.
[13] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, "TILE64 processor: A 64-core SoC with mesh interconnect," in Proc. IEEE ISSCC, Feb. 3-7, 2008, pp. 88-598.
[14] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W. Keckler, and D. Burger, "On-chip interconnection networks of the TRIPS chip," IEEE Micro, vol. 27, no. 5, pp. 41-50, Sep.-Oct. 2007.
[15] L. D. Feisel and A. J. Rosa, "The role of the laboratory in undergraduate engineering education," J. Eng. Educ., vol. 94, no. 1, pp. 121-130, 2005.
[16] C. Ttofis, C. Kyrkou, T. Theocharides, and M. K. Michael, "FPGA-based NoC-driven sequence of lab assignments for manycore systems," in Proc. IEEE MSE, San Francisco, CA, Jul. 25-27, 2009, pp. 5-8.
[17] B. A. Oakley, D. M. Hanna, Z. Kuzmyn, and R. M. Felder, "Best practices involving teamwork in the classroom: Results from a survey of 6435 engineering student respondents," IEEE Trans. Educ., vol. 50, no. 3, pp. 266-272, Aug. 2007.
[18] G. D. Micheli and L. Benini, Networks on Chips. San Mateo, CA: Morgan Kaufmann, 2006.
[19] F. Gebali, H. Elmiligi, and M. W. El-Kharashi, Networks-on-Chips: Theory and Practice. Boca Raton, FL: CRC Press, 2009.
[20] P. Mohapatra, "Wormhole routing techniques for directly connected multicomputer systems," Comput. Surv., vol. 30, no. 3, pp. 374-410, Sep. 1998.
[21] R. Gindin, I. Cidon, and I. Keidar, "NoC-based FPGA: Architecture and routing," in Proc. 1st NOCS, May 7-9, 2007, pp. 253-264.
[22] R. Marculescu, U. Y. Ogras, P. Li-Shiuan, N. E. Jerger, and Y. Hoskote, "Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, no. 1, pp. 3-21, Jan. 2009.
[23] P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance evaluation and design trade-offs for network-on-chip interconnect architectures," IEEE Trans. Comput., vol. 54, no. 8, pp. 1025-1040, Aug. 2005.
[24] S. P. Singh, S. Bhoj, D. Balasubramanian, T. Nagda, D. Bhatia, and P. Balsara, "Generic network interfaces for plug and play NoC based architecture," Lecture Notes Comput. Sci., vol. 3985, pp. 287-298, 2006.
[25] "Open Core Protocol specification 2.1," OCP-IP Association, Beaverton, OR, Doc. rev. 1.0, 2005.
[26] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3rd ed. San Mateo, CA: Morgan Kaufman.
[27] "MicroBlaze," Xilinx, San Jose, CA, Accessed Jun. 2010 [Online]. Available: http://www.xilinx.com
[28] "Xilinx core generator system," Xilinx, San Jose, CA, Accessed Jun. 2010 [Online]. Available: http://www.xilinx.com/tools/coregen.htm
[29] "Xilinx embedded development tool," Xilinx, San Jose, CA, Accessed Aug. 2010 [Online]. Available: http://www.xilinx.com/tools/embedded.htm
[30] "Connecting customized IP to the MicroBlaze soft processor using the Fast Simplex Link (FSL) channel," Xilinx, San Jose, CA, XAPP529 (v1.3), May 12, 2004.
[31] M. B. McGrath and J. R. Brown, "Visual learning for science and engineering," IEEE Comput. Graph. Appl., vol. 25, no. 5, pp. 56-63, Sep.-Oct. 2005.
[32] "Xilinx ISE simulator (ISim)," Xilinx, San Jose, CA, Accessed Aug. 2010 [Online]. Available: http://www.xilinx.com/tools/isim.htm
[33] Center for Teaching & Learning, "Instruction at FSU: A Guide to Teaching & Learning Practices," The Florida State University, Tallahassee, FL, Accessed Feb. 2011 [Online]. Available: http://www.learningforlife.fsu.edu/ctl/explore/onlineresources/[email protected]
[34] "Xilinx university program Virtex-II pro development system," Xilinx, San Jose, CA, Accessed Aug. 2009 [Online]. Available: http://www.xilinx.com/products/devkits/XUPV2P.htm
[35] "Virtex-5 LXT FPGA ML505 evaluation platform," Xilinx, San Jose, CA, Accessed Aug. 2009 [Online]. Available: http://www.xilinx.com/products/devkits/HW-V5-ML505-UNI-G.htm
Christos Ttofis (S'10) received the B.S. and the M.S. degrees in computer engineering from the University of Cyprus, Nicosia, Cyprus, in 2009 and 2011, respectively, and is currently pursuing the Ph.D. degree in electrical and computer engineering at the University of Cyprus. His research interests mainly focus on the area of embedded systems design, with emphasis on the design of digital hardware architectures targeting real-time and low-power computer vision and image processing applications. Mr. Ttofis is a recipient of a Best Paper Award of the MSE 2009.
Theocharis Theocharides (M'09) received the Ph.D. degree in computer science and engineering from Pennsylvania State University, University Park, in 2005. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus. His research focuses on the broad area of intelligent embedded systems design, with emphasis on the design of reliable and low-power embedded and application-specific processors, media processors, and real-time digital artificial intelligence applications. Dr. Theocharides is a co-recipient of a Best Paper Award of the MSE 2009.
Maria K. Michael (S'01-M'03) received the B.S. and M.S. degrees in computer science and the Ph.D. degree in electrical and computer engineering (ECE) from Southern Illinois University, Carbondale, in 1996, 1998, and 2002, respectively. She taught as a Lecturer with the ECE Department, Southern Illinois University, from 2001 to 2002, and as an Assistant Professor of computer science and engineering with the University of Notre Dame, Notre Dame, IN, from 2002 to 2003. She is currently an Assistant Professor with the Electrical and Computer Engineering Department, University of Cyprus, Nicosia, Cyprus. Her research interests are in the area of computer-aided design for the design, test, and reliability of modern digital VLSI circuits and embedded systems, including SoCs, NoCs, and chip multiprocessors. Dr. Michael is a co-recipient of a Best Paper Award of the MSE 2009.
