# Programming of the T-CREST real-time multi-processor platform

Rasmus Bo Sørensen



Kongens Lyngby 2013 IMM-MSc-2013-5

Technical University of Denmark Informatics and Mathematical Modelling Building 321, DK-2800 Kongens Lyngby, Denmark Phone +45 45253351, Fax +45 45882673 reception@imm.dtu.dk www.imm.dtu.dk IMM-MSc-2013-5

# Abstract (English)

The goal of this thesis is to integrate the T-CREST components into a tool chain for programming a multi-processor real-time platform. We present our view of the T-CREST tool chain, and we implement an initial tool chain, restricted by the current state of the T-CREST project. The T-CREST project is an ongoing research project supported by the European Union's 7th Framework Programme, aiming to develop a homogeneous time-predictable multi-processor platform. To integrate the components into a common tool chain, we define the interfaces between the components of the tool chain, and extend the components to implement the specified interfaces. With the intuition gained from integrating the components, we propose extensions to improve performance or reduce cost.

# Resumé (Danish)

The goal of this thesis is to integrate the T-CREST components into a tool chain for programming a multi-processor real-time platform. We present our view of the T-CREST tool chain, and we implement a preliminary tool chain, limited by the current state of the T-CREST project. The T-CREST project is an ongoing research project supported by the European Union's 7th Framework Programme, whose goal is to develop a homogeneous time-predictable multi-processor platform. To integrate the components into a common tool chain, we define the interfaces between the components of the tool chain. We extend the components to implement the specified interfaces. Based on the intuition we have gained from integrating the components, we propose extensions to improve performance or reduce cost.


# Preface

This thesis was carried out at the Department of Informatics and Mathematical Modelling, at the Technical University of Denmark, in fulfillment of the requirements for acquiring an M.Sc. (Hons.) in Informatics.

During my master studies I was enrolled in the Honors program, at the Technical University of Denmark. The objective of the Honors program is to familiarize the student with research projects on an international level. During the main part of my studies, I have been participating in the early stages of the time-predictable multi-core architecture for embedded systems project (T-CREST), supported by the European Union's 7th Framework Programme. I have been taking part in the weekly project meetings, and most of the projects I have worked on during my master studies have been related to the T-CREST project. This thesis deals with the programming of the T-CREST platform. As a whole the thesis presents a first attempt at developing a coherent tool chain for software programming and hardware configuration of the T-CREST platform. The tool chain draws upon, and extends, several of the projects I have worked on during my master studies.

This thesis should be seen as a stand-alone report on the first iteration of the tool chain for the T-CREST platform. The T-CREST processor and the T-CREST worst-case execution time (WCET) compiler are under development, and are not yet stable enough for integration. These components in the tool chain are therefore replaced with a stable processor and compiler. This compiler does not optimize for WCET, and the thesis will therefore not go into the details of WCET-aware compilation. I have concentrated on integrating the tools into the tool chain, to enable the developers of the T-CREST platform to test various ideas.

My work for this thesis has been carried out simultaneously with the first iteration of the T-CREST project. The uncertainty about which hardware components would become available, and when, has proved an additional, unforeseen challenge. I spent the first month of my thesis working on the S4NoC platform, before the T-CREST NoC platform was available. The work I did on the S4NoC platform is published in [1]. The S4NoC is only mentioned briefly in the thesis.

Lyngby, 07-December-2012

Rasmus Bo Sørensen

# Acknowledgements

I would like to thank my supervisor Jens Sparsø for all his great input to my work and my report, and my co-supervisor Martin Schoeberl for his comments and his advice on the JOP processor. Also I would like to thank the rest of the T-CREST members at the Technical University of Denmark for the good discussions we have had about the T-CREST project. Finally I would like to thank Mark Ruvald Pedersen, Lars Bo Sørensen, Tabita Niemann Kristensen and Madava Dilshan Vithanage for their sparring.

# Contents

- Abstract (English)
- Resumé (Danish)
- Preface
- Acknowledgements
- 1 Introduction
- 2 Background
  - 2.1 Real-time systems
  - 2.2 Network-on-Chip
    - 2.2.1 Static routing in real-time Network-on-Chip
    - 2.2.2 Source routing
    - 2.2.3 Distributed routing
  - 2.3 Message passing
- 3 Tool chain
  - 3.1 Our tool chain
  - 3.2 The T-CREST tool chain
- 4 Hardware platforms
  - 4.1 Related work
  - 4.2 The S4NoC platform
    - 4.2.1 Programming the platform
  - 4.3 The T-CREST NoC platform
    - 4.3.1 Configuration interface
    - 4.3.2 Integration of the hardware platform
  - 4.4 Storage of static routing information
  - 4.5 Discussion
    - 4.5.1 Scheduling limitations
    - 4.5.2 Backwards flow control
- 5 TDM scheduler
  - 5.1 Related work
  - 5.2 Static routing
  - 5.3 All-to-all scheduling
  - 5.4 Application specific scheduling
  - 5.5 Schedule converter
  - 5.6 WCET-aware compiler
  - 5.7 Discussion
- 6 Message passing interface
  - 6.1 Related work
  - 6.2 Communication primitives
  - 6.3 MPI in the T-CREST platform
    - 6.3.1 Address space
    - 6.3.2 Communication primitives
  - 6.4 Discussion
    - 6.4.1 Dynamic allocation of buffering space
    - 6.4.2 Compiler optimizations
- 7 Test
  - 7.1 Hello World!
  - 7.2 Microbenchmarks
- 8 Conclusion
  - 8.1 Summary of findings
  - 8.2 Project contribution
  - 8.3 Future work
- A S4NoC paper
- B T-CREST NoC source code
- C JOP infrastructure
- D TDM scheduler source code
- E MPI source code
- F Test and benchmark source code
- Bibliography

### Chapter 1

# Introduction

This thesis is concerned with programming the T-CREST[2] real-time multi-processor platform.

The T-CREST research project is supported by the European Union, and its goal is to develop a real-time multi-processor hardware platform in which all components (processors, interconnection network, and compiler) are designed to facilitate a predictable worst-case execution time of the application executing on the platform. The T-CREST project creates a novel hard real-time multi-processor platform.

To program the T-CREST platform four tasks need to be performed in order:

1. Creation of a static schedule for the time-predictable interconnect.
2. Compilation of the source code.
3. Calculation of the worst-case execution time (WCET) of the application.
4. Configuration of the hardware platform.

The work of this thesis is carried out in close interaction with the T-CREST project. As we often need to describe details of the T-CREST work, it can be difficult for the reader to distinguish our work from the work of T-CREST. We will refer to the work done in this thesis as "our *something*" or "we have *done*". The work of the T-CREST project will be referred to as "the T-CREST *component name*". When referring to this thesis we mean both the report and the work behind it.

Figure 1.1: Conceptual view of the homogeneous multi-processor T-CREST platform with the three components: a processor (P), a network interface (NI) and a router (R).

In this thesis we integrate the T-CREST tools into a coherent tool chain. Our goal is to provide a tool chain that allows developers to investigate architectural features and flaws in the system. Our tool chain can help the developer to gain insight into the challenges of developing the T-CREST platform, and to improve the platform. Our tool chain will be a modular design, enabling individual testing of the modules and testing of the evolving T-CREST platform. Even though our tool chain is not the final T-CREST tool chain it can help avoid flaws in the final T-CREST platform. To support portability of applications between hardware revisions, a message passing interface (MPI) is needed.

A conceptual view of the homogeneous T-CREST hardware platform is shown in Figure 1.1: a regular grid of identical processors connected by a network-on-chip. The interconnect in the T-CREST platform is statically scheduled to achieve time-predictability. The static schedule is created by a scheduler at compile time. Each processor has a local memory to support message passing between processors. An application calls the MPI when it needs to communicate with another processor. The T-CREST platform was under development during the work of this thesis, so the requirement to align our work with the availability of the T-CREST components had an impact on the topics to which we have contributed. The following points describe the decisions we took to limit the scope of our work in this thesis.

- **Allocation and mapping** The bandwidth allocation and hardware mapping of an application could be found by static analysis of the source code. In this thesis we assume the bandwidth allocation and hardware mapping for an application are supplied to the tool chain along with the application source code. The bandwidth allocation depends on the hardware mapping.
- **Code generation** The T-CREST processor Patmos[3] and its compiler are, at the time of writing, still unstable. In this project we use the stable JOP[4] processor with its compiler. As the processor we use is not the final T-CREST processor, we will in this thesis not be concerned with code generation for this processor. The final compiler should optimize worst-case paths and not average-case paths as regular compilers do.
- **Worst-case execution time analysis** Analyzing the worst-case execution time (WCET) of an application is highly dependent on the processor architecture. In this thesis we do not integrate WCET analysis in the tool chain, but we will make the tool chain ready for it.
- **Evaluation** The tool chain should enable the designers of the T-CREST tools and the hardware platform to evaluate them. It is difficult to evaluate each component in the tool chain without the whole tool chain. We create a modular tool chain where each module can be evaluated and optimized while decoupled from the other components. In this thesis we concentrate on functional evaluation.

**Contributions** In this thesis we have worked in three main areas of research: a time-predictable multi-processor platform, a time division multiplexing scheduler for routing traffic statically, and a message passing interface. The contributions of this thesis are described in the following bullet points, along with an indication of the chapter in which each contribution is described.

- Proposal of interfaces between the T-CREST tools. [Chapter 3]
- An implementation and publication of the minimalistic time-predictable S4NoC[1] platform. [Chapter 4, Appendix A]
- Integration of the T-CREST NoC platform and the JOP processor. [Chapter 4]

- A theoretical comparison of the demand for storage bits in source routing and in distributed routing. [Chapter 4]
- Proposals of extensions to the T-CREST NoC platform. [Chapter 4]
- Extension of the TDM scheduler to integrate it into our tool chain. [Chapter 5]
- Proposal of an extension to the TDM scheduler reducing the worst-case latency in the static schedule. [Chapter 5]
- Implementation of an MPI for the T-CREST platform in Java. [Chapter 6]
- Proposal of improvements to the MPI to reduce the buffering space. [Chapter 6]
- The first working tool chain for programming our homogeneous multiprocessor platform. [Chapter 7]

This thesis is organized as follows. The main terms and concepts related to the T-CREST platform are explained in Chapter 2. The tool chain and its components are outlined in Chapter 3. In Chapter 4 we present two hardware platforms, the S4NoC platform and the T-CREST Network-on-Chip platform, and discuss improvements to the latter. In Chapter 5, a time division multiplexing (TDM) scheduler is presented and the implementation of its interaction with the WCET-aware compiler is described. Chapter 6 contains a discussion of which communication primitives to implement in the MPI library and a description of the implementation. A test of our tool chain and hardware platform is presented in Chapter 7. A conclusion of the project is given in Chapter 8. The related work is presented at the beginning of the chapters it relates to. The source code we have written or changed is shown in the appendices, and at the beginning of each appendix we have written a short explanation of what we have done. The source files are also available online at http://rbscloud.dk/sourcecode.zip. We reference the appendices where relevant.

### Chapter 2

# Background

In this chapter we introduce the main topics of our project: real-time systems, network-on-chip and message passing.

### 2.1 Real-time systems

There are basically two kinds of real-time systems: soft real-time and hard real-time. Soft real-time systems may miss a deadline once in a while and are typically applied in non-safety-critical systems, such as TV set-top boxes or other streaming applications. Hard real-time systems require all timing requirements to be met at all times, and are typically applied in safety-critical systems, such as aviation and train-control systems. The systems addressed in this thesis belong to the hard real-time category. Implementing hardware for real-time systems requires that the hardware is time-predictable and analyzable.

In Figure 2.1 we show how the different run times of a program relate to each other on the time axis. Due to different inputs to the program, the execution time can vary. A varying system state before and during the execution can also change the execution time of the program. The system state includes, for example, the state of the caches and the state of other programs running in parallel. The shortest possible execution time is called the best-case execution time (BCET), which is generally not of interest in any kind of computer system. The average execution time when the program is executed multiple times is called the average-case execution time (ACET), which in general purpose systems is regarded as the performance of a program. The longest possible execution time of the program is called the worst-case execution time (WCET). The WCET is reached when the system receives the worst-case input in the worst-case system state. The performance of a real-time system is equal to the WCET of the given application.

Figure 2.1: Relating the best-case execution time (BCET), the average-case execution time (ACET), the worst-case execution time (WCET) and the calculated WCET.

The WCET is found by analyzing the application together with the hardware platform. To find the actual WCET all possible inputs and system states must be analyzed. An exhaustive analysis is usually not feasible, and in that case a pessimistic estimate of the WCET can be calculated. Depending on the complexity of the system, the calculated WCET might be far from the actual WCET. The gap between the calculated WCET and the actual WCET can be minimized by using a more accurate model or by making the hardware easier to predict. A more accurate model results in a more complicated WCET analysis. The calculated WCET is regarded as the system performance in real-time systems. More accurate WCET models may result in tighter and lower WCET bounds, but will increase the complexity of the calculation. In parallel real-time systems the WCET analysis is known to be difficult, maybe even impossible. The T-CREST project addresses analysability in parallel real-time systems. The hardware for real-time systems must be deterministic. If the hardware is non-deterministic the analysis must always rely on worst-case latencies. The worst case in non-deterministic systems might not even be bounded, making the analysis impossible.
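The ordering of the quantities discussed above can be summarized in one line. The symbol $\mathrm{WCET}_{\mathrm{calc}}$ is notation introduced here for the calculated WCET, and the last inequality assumes that the analysis is safe, that is, that it yields a pessimistic estimate:

$$
\mathrm{BCET} \;\le\; \mathrm{ACET} \;\le\; \mathrm{WCET} \;\le\; \mathrm{WCET}_{\mathrm{calc}}
$$

The gap $\mathrm{WCET}_{\mathrm{calc}} - \mathrm{WCET}$ is the pessimism that a more accurate model or more predictable hardware aims to reduce.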



Figure 2.2: The basic Network-on-Chip component, a tile (T), contains a router, a network interface and a processor. The interface between the processor and the network interface is a transaction-based master-slave interface, and the interface between the network interface and the router is a streaming interface.

### 2.2 Network-on-Chip

A Network-on-Chip (NoC) is a type of interconnect supporting many communicating nodes. The basic component of a multi-processor platform with a NoC interconnect is illustrated in Figure 2.2. There are two main types of components in a NoC: routers and network interfaces. The processing cores connected to a NoC are each connected through a network interface to the network of routers. In this structured design we group a processor, a network interface and a router into a tile. The connections between routers are called links. To ease the description of the ports of a router, the ports are usually named after the compass directions. The north port of one router is thus connected to the south port of the router "above".

These routers can be connected in many different topologies. In this thesis we concentrate on grid topologies, such as the torus, bidirectional-torus (bitorus) and mesh shown in Figure 2.3. In general, a network-on-chip places no restrictions on the topology of the network of routers. The network of routers consists of the routers and the links between them. The links are just wires, but as the wires can be very long they can introduce a considerable latency in the path.



Figure 2.3: The grid topologies relevant to this thesis: (a) a torus network, (b) a bidirectional-torus network and (c) a mesh network.

To increase the clock frequency the links can be pipelined, which increases the amount of buffering in the network. When packets are sent through the network they are chopped into smaller pieces. The smallest amount of data that the network enforces flow control on is called a flit and each flit can be divided into smaller chunks called phits. The phits are the smallest physical data units transmitted over a link, usually equal to the link width.
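The packet, flit and phit hierarchy described above can be illustrated with a minimal Java sketch. The class name `PacketSplitter` and the sizes (a 20-byte packet, 8-byte flits, 4-byte phits) are assumptions chosen for illustration; they are not taken from the T-CREST hardware.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: cuts a packet into flits and each flit into phits. */
class PacketSplitter {
    /** Cuts data into consecutive chunks of at most chunkSize bytes. */
    static List<byte[]> split(byte[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<byte[]>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        byte[] packet = new byte[20]; // hypothetical packet size
        int flitBytes = 8;            // assumed flit size (unit of flow control)
        int phitBytes = 4;            // assumed link width (unit sent per transfer)
        for (byte[] flit : split(packet, flitBytes)) {
            int phits = split(flit, phitBytes).size();
            System.out.println("flit of " + flit.length + " bytes, " + phits + " phit(s)");
        }
    }
}
```

The same helper is applied twice because the two levels differ only in chunk size: flow control acts on flits, while the link width determines the phit.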

The basic functionality of the router is very simple. When the router receives a flit on an input port, the router decides to which output port it is sent. Implementing the logic to make routing decisions dynamically can be a complex problem, because the hardware has to ensure that no deadlocks can happen, and that all packets are routed to their destination. Dynamic routing can be implemented in many different ways. In the T-CREST project all routing decisions are made statically. This simplifies the hardware as well as the WCET analysis, because the latency is known in advance. The router implementation scales quite well, as the number of ports stays constant when the size of the whole system grows.

The basic functionality of the NI is to convert the transactional requests from a processing core to the streaming interface of the network. The detailed functionality of the NI depends very much on the routing scheme in the network. The scalability issues of Network-on-Chip are most dominant in the NI, because this is where the hardware needs to consider all the cores with which it communicates.

#### 2.2.1 Static routing in real-time Network-on-Chip

Static routing is applied and enforced in such a way that the communication behavior of one processor cannot affect the communication of other processors. Making the scheduler responsible for avoiding colliding flits allows the hardware to be very simple and efficient. Deciding the routes statically is done by dividing access to the transmission medium in time. This approach is called time division multiplexing (TDM). At most one communication channel may be scheduled on a link in a given time slot. The scheduling of communication channels in the network is done at compile time, before the WCET analysis.
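The rule that a link carries at most one communication channel per time slot can be checked mechanically. The following Java sketch does so under stated assumptions: the class `TdmConflictCheck`, the `Reservation` type and the string names for channels and links are hypothetical and are not part of the T-CREST scheduler.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative check that no link is reserved twice in the same TDM slot. */
class TdmConflictCheck {
    /** One reservation: a channel uses a link in a time slot (all names hypothetical). */
    static final class Reservation {
        final String channel;
        final String link;
        final int slot;

        Reservation(String channel, String link, int slot) {
            this.channel = channel;
            this.link = link;
            this.slot = slot;
        }
    }

    /** Returns true if every (link, slot) pair is reserved at most once. */
    static boolean isContentionFree(List<Reservation> schedule, int period) {
        Set<String> used = new HashSet<String>();
        for (Reservation r : schedule) {
            // The schedule repeats every `period` slots, so slots are compared modulo the period.
            int slot = Math.floorMod(r.slot, period);
            if (!used.add(r.link + "@" + slot)) {
                return false; // this link is already reserved in this slot
            }
        }
        return true;
    }
}
```

A scheduler that only emits schedules passing such a check lets the routers forward flits without arbitration, which is what keeps the hardware simple.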

If the communication behavior of one processor can affect the communication behavior of another processor, the WCET will increase dramatically. If, for example, a real-time system runs on a general purpose platform, the communication between one pair of communicating processors can interfere with the communication of another pair of communicating processors. In this case the WCET analysis will have to assume the worst case of interfering communication, which will increase the WCET estimate by orders of magnitude.

#### 2.2.2 Source routing

In source routing, the static route of a flit is stored in the flit header. The header is read by the routers as the flit travels through the network. The sending NI needs to store a route for each time slot and the destination ID of that route. When a flit reaches a router, the router reads the header to determine the output port to which the incoming flit should be routed. The receiving NI can see the origin of a flit in its header.
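A minimal sketch of how a route could be carried in a flit header is given below. The encoding is an assumption made for illustration (3 bits per hop, port codes N=0, E=1, S=2, W=3, L=4, first hop in the lowest bits); the thesis does not specify the header layout at this point, and the class name `SourceRouteHeader` is hypothetical.

```java
/** Illustrative source-routing header: one 3-bit output-port code per hop (assumed encoding). */
class SourceRouteHeader {
    static final int BITS_PER_HOP = 3;
    static final String PORTS = "NESWL"; // index = port code; L is the local port

    /** Packs a route such as "EEL" so that the first hop sits in the lowest bits. */
    static int encode(String route) {
        if (route.length() * BITS_PER_HOP > 31) {
            throw new IllegalArgumentException("route too long for a 32-bit header");
        }
        int header = 0;
        for (int hop = route.length() - 1; hop >= 0; hop--) {
            int code = PORTS.indexOf(route.charAt(hop));
            if (code < 0) {
                throw new IllegalArgumentException("unknown port: " + route.charAt(hop));
            }
            header = (header << BITS_PER_HOP) | code;
        }
        return header;
    }

    /** What a router does on arrival: read the output port for the current hop. */
    static char outputPort(int header) {
        return PORTS.charAt(header & ((1 << BITS_PER_HOP) - 1));
    }

    /** Before forwarding, the router strips the hop it has just used. */
    static int consumeHop(int header) {
        return header >>> BITS_PER_HOP;
    }
}
```

With this layout each router only inspects the lowest bits and shifts the header, so no router needs any stored routing state; the cost is the header bits carried by every flit.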

#### 2.2.3 Distributed routing

In distributed routing the static routing information is distributed to the routers where it is needed. This implies that the flits do not need a header, which increases the bandwidth. The sending NI stores an entry table with the destination ID of the flit that is allowed to enter the network in the given time slot. The router stores how its input ports are connected to its output ports in each time slot. The receiving NI stores an exit table with the source ID of the flits which can arrive from the network in the given time slot.
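The per-router state in distributed routing can be pictured as a small table indexed by time slot. The Java sketch below is an illustration under assumptions: the class `RouterSlotTable`, its methods, and the choice to represent an idle output as `null` are hypothetical and do not describe the T-CREST router implementation.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative per-router slot table for distributed routing. */
class RouterSlotTable {
    /** Router ports: four neighbours plus the local port. */
    enum Port { N, S, E, W, L }

    // One map per time slot: output port -> input port that drives it in that slot.
    private final List<Map<Port, Port>> slots = new ArrayList<Map<Port, Port>>();

    RouterSlotTable(int period) {
        if (period <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        for (int i = 0; i < period; i++) {
            slots.add(new EnumMap<Port, Port>(Port.class));
        }
    }

    /** Connects input port `in` to output port `out` in time slot `slot`. */
    void connect(int slot, Port out, Port in) {
        slots.get(slot).put(out, in);
    }

    /** Returns the input feeding `out` in the slot active at `cycle`, or null if the output is idle. */
    Port inputFor(long cycle, Port out) {
        int slot = (int) Math.floorMod(cycle, (long) slots.size());
        return slots.get(slot).get(out);
    }
}
```

Compared with source routing, the routing state moves from the flit header into every router, which is the storage trade-off between the two schemes.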

### 2.3 Message passing

Message passing is a way of communicating between tasks, as opposed to shared memory communication. The conceptual difference of communicating with message passing and shared memory in multi-processor systems is illustrated in Figure 2.4.







Figure 2.4: Conceptual illustration of message passing and shared memory communication. (a) With message passing the processors (CPU) communicate directly through the network interface (NI) and the interconnect. (b) With shared memory communication the communication goes through the main memory (MM).

Message passing is when a packet of data is sent directly from one processing core to another. When tasks are being executed on different processors, which have no locally shared memory, a message can be sent to the other processor through the interconnect. Data that is transported via message passing must only reside in the local memory of the processor or the internal registers of the processor. Message passing is a benefit when the data is transferred directly from one local memory to another local memory. The data may be written to the main memory only once, when it is no longer going to be used. Thus applications with a high level of interprocessor communication, such as streaming applications, are well suited for message passing architectures.

Shared memory is widely used in many kinds of computer systems today. When processing cores communicate using shared memory, the data goes through the main memory, or a lower level cache, which adds complexity due to cache coherency. Message passing can increase the bandwidth and lower the latency of interprocessor communication compared to shared memory communication. Some systems will offer both message passing and shared memory communication. When the local on-chip memory is no longer sufficient, the off-chip main memory will come into play.
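The idea that a message moves directly from one local memory to another can be mimicked in plain Java. The sketch below is a software stand-in only, not the MPI of this thesis: the class `MessagePassingCore`, the buffer depth of 4 and the non-blocking behavior are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative stand-in for a core with a private receive buffer (not the thesis MPI). */
class MessagePassingCore {
    private static final int BUFFER_SLOTS = 4; // assumed size of the local receive buffer

    // Models the local memory of this core; no other core reads it directly.
    private final Deque<int[]> inbox = new ArrayDeque<int[]>();

    /** Copies the message into the destination's local buffer; returns false if it is full. */
    boolean send(MessagePassingCore destination, int[] message) {
        if (destination.inbox.size() >= BUFFER_SLOTS) {
            return false;
        }
        destination.inbox.addLast(message.clone()); // copy: sender and receiver share no memory
        return true;
    }

    /** Returns the oldest received message, or null if none has arrived. */
    int[] receive() {
        return inbox.pollFirst();
    }
}
```

The copy in `send` is the essential point: after the call, the sender can overwrite its own array without affecting the receiver, and no shared main memory is involved.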

### Chapter 3

# Tool chain

In this chapter we present two tool chains. The first is the tool chain we have implemented; we refer to it as our tool chain. The second is how we imagine the final version of the T-CREST tools will work together; we refer to it simply as the T-CREST tool chain. Our tool chain is slightly different from the T-CREST tool chain because not all the T-CREST tools are in a stable state.

The component-based structure of the tool chains helps the integration and testing of new components. Each component can be replaced by a component with the same interfaces. The T-CREST project dictates that the tool chain should be compatible with multi-processor platforms using TDM in the interconnect.

In the T-CREST platform, the interprocessor communication and the communication to the main memory are decoupled because there is a dedicated interconnect for each of the two communication types. The NoC is only used for interprocessor communication, in the form of message passing. Managing the main memory and the caches is done entirely by the compiler; as the compiler is outside the scope of this thesis, we have not focused on main memory access. In the following section we describe how our tool chain is implemented, and in the section after that we describe how the implementation of the T-CREST tool chain differs.



Figure 3.1: An illustration of our tool chain programmed in Java.

### 3.1 Our tool chain

A block diagram of our tool chain is shown in Figure 3.1. The arrows between blocks are the flow of data, and the labels on the arrows are the file formats of the interfaces. The diamond shapes are inputs to the tool chain, describing the application and its requirements. The elliptic shape is a platform specific library, mapping an abstract application interface to the given hardware platform. The rectangular shapes are the tools in the tool chain. The tool chain takes source code and a bandwidth graph as inputs. The source code describes the desired application, and the bandwidth graph describes the bandwidth requirements between all processing cores and the topology of the NoC.

First, the schedule is generated by the TDM scheduler. Then the schedule converter converts the schedule to either hardware or software tables, depending on the hardware platform. The compiler then compiles the application source code along with the MPI (message passing interface) and possibly the software tables. In the end the application is loaded onto the hardware platform. Loading the application onto the hardware can be done before or after synthesis, depending on the hardware platform. Since we use the Java-programmed JOP processor, we also use the JOP compiler.

The mapping between processes and processors is done by the application programmer. The mapping is specified in the source code and in the input to the TDM scheduler. The input to the TDM scheduler is an XML-formatted file describing the topology of the network-on-chip and the communication pattern of the application. We use XML because it is human-readable, flexible and extensible. An example of the XML input format can be seen in Listing 3.1. In the example a **bitorus** topology is specified; other possible topology types are the **mesh** type and the **arbitrary** type. If an **arbitrary** topology is specified, each link in the topology must be specified inside the **graph** tag using a **link** tag. When **bitorus** or **mesh** is specified the **link** tags are ignored. The TDM scheduler schedules the specified traffic in the topology graph. The amount of traffic scheduled between two nodes in the NoC is specified in the **bandwidth** attribute of the **channel** tag in the input XML file.

Besides the schedule, the TDM scheduler also calculates the worst-case latency (WCL) of each communication channel. The WCL of a communication channel is the worst-case time separation in time slots between access to two communication paths. If the scheduled latencies are not sufficient to meet a given real-time deadline, re-scheduling with a different bandwidth specification could decrease the WCL. The output of the scheduler is an XML file. This XML file describes how the router and network interface of each tile should be configured in every time slot and the WCL of each communication channel. An example of the output XML file can be seen in Listing 3.2, specifying a schedule of length 9. The data is separated into tiles, and in a tile each time slot describes the NI and the router. In each tile the WCL for each destination is specified.

**Listing 3.2:** XML example of the scheduled communication channels in a 3 by 3 bitorus.

```
<?xml version="1.0"?>
<schedule length="9">
  <tile id="(2,0)">
    <timeslot value="0">
      <ni rx="(2,1)" tx="(1,1)" />
      <router>
        <output id="N" input="D" />
        <output id="S" input="D" />
        <output id="E" input="L" />
        <output id="W" input="D" />
        <output id="L" input="S" />
      </router>
    </timeslot>
    <latency>
      <destination id="(0,0)" WCL="8" />
      <destination id="(1,0)" WCL="8" />
      <destination id="(0,2)" WCL="8" />
    </latency>
  </tile>
</schedule>
```

Our MPI implements the communication primitives. The MPI hides all the hardware specific implementation details of the communication primitives from the application programmer. In the case of embedded real-time systems, the MPI could just as well be called the communication driver. If the hardware is changed, the driver also needs to be changed, but the application source code does not have to be changed. The WCET-aware compiler compiles and analyzes the application source code, along with the MPI library and the timing information from the TDM scheduler.
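To illustrate how a downstream tool might consume the scheduler output, the following is a minimal sketch that reads an excerpt in the format of Listing 3.2 and collects the WCL per channel. The tag and attribute names are taken from the listing; the function names `read_wcl` and `channels_missing_deadline`, and the idea of comparing the WCL against a deadline expressed in time slots, are assumptions made for illustration and are not part of the tool chain.

```python
import xml.etree.ElementTree as ET

# Excerpt in the format of Listing 3.2 (one tile, one time slot).
SCHEDULE_XML = """<?xml version="1.0"?>
<schedule length="9">
  <tile id="(2,0)">
    <timeslot value="0">
      <ni rx="(2,1)" tx="(1,1)" />
      <router>
        <output id="N" input="D" />
        <output id="S" input="D" />
        <output id="E" input="L" />
        <output id="W" input="D" />
        <output id="L" input="S" />
      </router>
    </timeslot>
    <latency>
      <destination id="(0,0)" WCL="8" />
      <destination id="(1,0)" WCL="8" />
      <destination id="(0,2)" WCL="8" />
    </latency>
  </tile>
</schedule>"""


def read_wcl(xml_text):
    """Return (schedule length, {(source tile, destination tile): WCL in slots})."""
    root = ET.fromstring(xml_text)
    wcl = {}
    for tile in root.findall("tile"):
        for dest in tile.findall("latency/destination"):
            wcl[(tile.get("id"), dest.get("id"))] = int(dest.get("WCL"))
    return int(root.get("length")), wcl


def channels_missing_deadline(wcl, deadline_slots):
    """Hypothetical check: channels whose WCL exceeds a deadline given in slots."""
    return sorted(ch for ch, slots in wcl.items() if slots > deadline_slots)


length, wcl = read_wcl(SCHEDULE_XML)
print(length, wcl)
print(channels_missing_deadline(wcl, deadline_slots=6))
```

A channel reported by the second function would be a candidate for re-scheduling with a higher bandwidth specification.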

### 3.2 The T-CREST tool chain

A block diagram of the T-CREST tool chain is shown in Figure 3.2. The differences from our tool chain are the programming language and the WCET-aware compiler. The T-CREST platform is programmed in C, and the WCET-aware compiler optimizes the worst-case execution path in the control flow graph (CFG) instead of the average-case execution path. The WCET-aware compiler uses the information from the TDM scheduler to find the worst-case execution path.



Figure 3.2: An illustration of how we imagine the T-CREST tool chain.

# Chapter 4

# Hardware platforms

In this chapter we present two hardware platforms and a theoretical comparison of methods to store static routing information. The current state-of-the-art real-time NoC platforms, presented in the related work section, are complex hardware devices. The first platform we present is the S4NoC platform, an attempt to design a minimalistic hardware platform. We made this minimalistic NoC to gain intuition about how simple a NoC can be, and to serve as the first hardware platform for our tool chain. The second hardware platform is the first version of the T-CREST platform.

### 4.1 Related work

Network-on-chip has been an active research area for many years. In this project we need time-predictability to support real-time guarantees. The following NoC platforms provide time-predictability.

**Æthereal lite** The Æthereal[5] NoC was developed at Philips. Æthereal provides guaranteed service and best-effort traffic. Guaranteed service is provided using TDM. A lite version of Æthereal, called aelite, has been developed that provides only guaranteed service. aelite is an application specific NoC, which can be instantiated in a topology that fits the application. The Æthereal design flow is proprietary and application specific. The hardware generation and mapping are carried out in one step. Several versions of Æthereal have been made, some using source routing and others using distributed routing.

**Mango** The MANGO[6] NoC was developed at the Technical University of Denmark. MANGO is an asynchronous NoC with delay-insensitive links. MANGO provides both best-effort traffic and guaranteed service. The guaranteed service is provided using virtual channels and rate control, as opposed to TDM.

**Nostrum** The Nostrum[7] NoC was developed at the Royal Institute of Technology in Sweden. Nostrum implements guaranteed service with the concept of looped containers, which are statically scheduled containers looping in the network. A flit can be sent via a looped container to its destination.

### 4.2 The S4NoC platform

The S4NoC<sup>1</sup> [8, 1] is a light-weight time-predictable NoC using distributed routing. The paper we wrote about the S4NoC is presented in Appendix A. To enable time-predictability the S4NoC implements TDM. The S4NoC consists of a very simple router and network interface (NI). We show a 64-core FPGA implementation of the S4NoC connected to Leros processors in [1]. The Leros processor is presented in [9]. Leros is an accumulator machine programmed in assembler or in Java. A block diagram of an S4NoC tile is shown in Figure 4.1. The router is very simple, containing one slot counter, one slot table and 5 output ports. Each output port consists of a register and a 4-to-1 multiplexer. The NI has a one-word queue for each communication channel in and out of the node. These queues are placed in the RX and TX buffers, implemented in block RAM. The processor can read and write single words to the RX and TX buffers. Status registers indicate when words are sent and received; these status registers can also be accessed by the processor. The limited buffering makes it hard to program this platform such that the full bandwidth is utilized for all communication channels.

 $<sup>^1{\</sup>rm The~S4NoC}$  is open source and is publicly available at <code>https://github.com/t-crest/s4noc</code>.



Figure 4.1: Block diagram of the S4NoC. The Leros processor can read the status bits, write the Tx buffer and read the Rx buffer.

#### 4.2.1 Programming the platform

The Leros processors run from a local instruction ROM; no code can be loaded into it at run time. The NI is connected through the 8-bit I/O address space of Leros. Figure 4.2 shows the address space of the NI connected to Leros. The communication channel to and from each other processor in the system is mapped to the address matching its core ID. Writing to that address sends a data word to the given core, and reading from the address receives a data word from the given core. The status registers indicate the receive and transmit status of each communication channel at word level. Flow control at a higher level than single data words needs to be implemented by the processor. The upper addresses of the address space are used for the UART, the CPU ID and the total number of cores in the system.

Figure 4.2: The address space of the S4NoC NI connected to the I/O address space of the Leros processor.

To synchronize with the TDM slots, the NI has a counter, an entry table and an exit table, dictating when and which packets enter or exit the network. The router has a routing table dictating which input port should be routed to which output port. The tables in the NI and the router are ROMs, which in an FPGA implementation can be implemented in look-up tables (LUTs). These tables are generated at design time, and the whole design has to be re-synthesized to change the tables. The S4NoC tables can be generated by our application specific TDM scheduler.

### 4.3 The T-CREST NoC platform

The T-CREST NoC, presented in [10], is a time-predictable interconnect using source routing. A block diagram of the T-CREST NI and router connected to the JOP processor can be seen in Figure 4.3. The T-CREST NoC uses direct memory accesses (DMA) to move data from one processor to another. To utilize as much of the bandwidth as possible, several DMAs can be interleaved, each waiting for its time slot. This enables utilization of time slots from different communication channels at the same time. Controlling these interleaved DMAs is the core function of the NI. The interleaved DMA controller moves data from its local scratch pad to the local scratch pads of other processors. The local scratch pad memory is a dual-ported memory, with one port connected to the processor and one port connected to the NI. The individual DMA transfers are stored in a DMA table along with the route to the destination.

Figure 4.3: A block diagram of the DMA NoC. In the router, each input port is connected to a header parsing unit (HPU), each output port is connected to the crossbar (Xbar). The JOP processor can access the data in the local scratch pad memory and configure the slot tables (ST) and the DMA table in the NI.

To stay synchronized with the TDM slots, a slot table is indexed by the slot counter. The slot table indexes into the DMA table. When a flit is sent from the NI, its route is stored in the flit header, along with the write pointer. The router needs to decode the header of a flit before it routes the flit to an output port. The decoding of the flit header is done in the header parsing unit (HPU) in the router. The flits are then sent to the crossbar and multiplexed to the output port.
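The lookup chain described above (slot counter, slot table, DMA table, flit header) can be sketched in software as follows. This is only a behavioural model under stated assumptions: the field names of `DmaEntry`, the use of `None` for an idle slot, and the choice of one data word per flit are all hypothetical and are not taken from the hardware description in [10].

```python
from dataclasses import dataclass


@dataclass
class DmaEntry:
    route: int       # encoded route to the destination, copied into the flit header
    read_ptr: int    # next word to read from the local scratch pad
    write_ptr: int   # next word to write in the remote scratch pad
    words_left: int  # remaining words of this transfer


def ni_step(slot_counter, slot_table, dma_table, scratch_pad):
    """Return the flit sent in this time slot, or None if the slot is idle."""
    dma_index = slot_table[slot_counter % len(slot_table)]
    if dma_index is None:
        return None  # no DMA owns this slot
    dma = dma_table[dma_index]
    if dma.words_left == 0:
        return None  # the transfer owning this slot has finished
    flit = {"route": dma.route, "write_ptr": dma.write_ptr,
            "data": scratch_pad[dma.read_ptr]}
    dma.read_ptr += 1
    dma.write_ptr += 1
    dma.words_left -= 1
    return flit


# Two interleaved DMA transfers sharing a schedule period of three slots.
slot_table = [0, None, 1]
dma_table = [DmaEntry(route=0b0110, read_ptr=0, write_ptr=4, words_left=2),
             DmaEntry(route=0b01, read_ptr=2, write_ptr=0, words_left=1)]
scratch_pad = [10, 11, 20]
sent = [ni_step(t, slot_table, dma_table, scratch_pad) for t in range(6)]
print([f["data"] for f in sent if f is not None])
```

In the model, each DMA only advances in the slots assigned to it, which is how the interleaving lets several channels make progress within one schedule period.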

#### 4.3.1 Configuration interface

The configuration of the DMA table and the slot table is carried out by the processor. Before the processor can send any packets through the NoC, it has to configure the slot table and the routes in the DMA table. The first operation is to write the route of each DMA entry to the DMA table. Then each entry of the slot table is written with the DMA that is allowed to send in that time slot.

Figure 4.4: The local address space for each processor in the system. The 21st bit is the protection bit, indicating whether to access the protected part of the local address space or not. The protected part of the address space is only for configuration of the TDM schedule.

There are 4 segments in the local address space of each processor: the scratch pad, the DMA tables, the protected DMA routes, and the protected slot table. Neither the number of accessible addresses in the 4 segments nor the ratio between them is constant, because they vary with the current system configuration. We decided to make the simple division of the address space seen in Figure 4.4. This is a flexible solution, but it wastes address space. The most significant bit of the 21 bits of the address space is the protection bit. The part of the address space addressed with the protection bit set should only be changed during the configuration phase of the NI.

The task of configuring the NI is done in software. Each processor needs access to the configuration data for the TDM schedule. To give access, we write the configuration data in static arrays, and compile it along with the application source code.



Figure 4.5: Conversion from SimpCon to OCP. The reset (rst), set and enable (en) are all synchronous signals. The circuit adds 2 clock cycles of latency to a request. The latency can be removed by adding multiplexers to bypass the registers, delivering stable signals one clock cycle earlier in each direction.

#### 4.3.2 Integration of the hardware platform

The T-CREST processor Patmos[3] was not in a stable state at the time we were ready for integration into our tool chain. To ease the work of integrating all these alpha-state components, we decided to use the well-tested and well-supported JOP[4] processor. The JOP processor has a SimpCon[11] interface, whereas the T-CREST NoC has an interface implementing a simple subset of the open core protocol (OCP)[12]; thus some conversion is needed. To integrate the JOP<sup>2</sup> processors and the T-CREST NoC<sup>3</sup> platform we have wrapped the NoC in an array of SimpCon interfaces. The wrapped NoC is then instantiated in the JOP top level and connected to the JOP processors. The source files we have written and changed for wrapping the T-CREST NoC platform in SimpCon interfaces can be seen in Appendix B. These files also include a testbench for the NoC wrapped in SimpCon interfaces. The source files we have changed to connect the JOP processors and the T-CREST NoC platform can be seen in Appendix C.

<sup>2</sup>The JOP processor is open source and is publicly available at https://github.com/jop-devel/jop.

<sup>3</sup>The T-CREST NoC platform is open source and is publicly available at https://github.com/t-crest/t-crest-noc.

A diagram of the conversion between the SimpCon and the OCP interfaces can be seen in Figure 4.5. The SimpCon interface supports pipelined accesses: it holds the master signals stable for one clock cycle and waits for rdy\_cnt to be 0. The reply data of the SimpCon interface is expected to be stable until the next request is started. The OCP interface needs the master signals to be stable until the slave acknowledges the request. The reply data of the OCP interface is valid for one clock cycle. The incompatibilities between the SimpCon interface and the OCP interface add 2 clock cycles of extra latency to a request. The latency can be removed by adding multiplexers to bypass the registers. These bypass multiplexers should be controlled by the enable signals, and would deliver stable signals one clock cycle earlier. Since the final T-CREST processor has an OCP interface, we have focused on getting this preliminary platform to work rather than on optimizing it.

### 4.4 Storage of static routing information

Through our work with the hardware platforms described in this chapter and the TDM scheduler, we have made two observations:

- Storage of the static routing information is a considerable part of the total resource consumption. This can be seen in [1] for the S4NoC and in [10] for the T-CREST NoC platform.
- The port configurations of the router in distributed routing store redundant information.

The first observation has made us interested in removing the redundancy noted in the second observation. The following comparison has not been verified in an implementation; it is a theoretical comparison, which could be the target of future T-CREST experiments. In the literature there are several methods to compress routing tables, such as [13] and [14]. These methods do not work in TDM NoCs. In [14] it is described how static routing tables are compressed, but there static routing means that the path of a flit is static, not that its arrival time is static as in our TDM NoCs. We have not found any compression methods in the literature that work for TDM NoCs. We compare the required storage bits for source routing, distributed routing and compressed distributed routing. A summary of the storage bits for each routing method is shown in Table 4.1.

**Storage bits in source routing** In source routing, one node in the network needs to store a route and a destination ID in each time slot. The number of storage bits for storing a route depends on the size of the network; in the T-CREST platform 2 bits are used for each hop in the network. The route can therefore be stored in $2 \times H$ bits, where H is the maximum number of hops between two processors in the network.

**Storage bits in distributed routing** In distributed routing, one node in the network needs to store a destination ID, a source ID and the router configuration in each time slot. An ID of a processor can be stored in $\lceil \log_2(N) \rceil$ bits, where N is the number of CPUs in the system. The common way of storing the router configuration in a time slot is to store 2 bits for each output port per time slot. The 2 bits describe which one of the 4 possible input ports is connected to the given output port. With 5 ports this is 10 bits, which gives 1024 possible combinations, but not all of these combinations are valid.

**Compressing the distributed routing tables** The distributed routing tables can be compressed because far fewer than 1024 router configurations are valid. A router configuration can be perceived as a permutation of the 5 input ports, connected to the 5 output ports. An example of a port permutation is shown in Figure 4.6a. The sequence of the output ports is static and the sequence of the input ports changes. In the following we calculate the number of valid port permutations.

In the actual system there are 3 restrictions on the router configurations:

1. No output port must be connected to multiple input ports.
2. No input port must be connected to multiple output ports.
3. No incoming flit must be routed out the same direction that it arrived, e.g., a flit arriving at the south input port must not depart from the south output port.

The first and second restrictions imply that a valid router configuration must be a permutation of the 5 distinct input ports. Such a valid router configuration can be seen in Figure 4.6a. We call a router configuration a port permutation.

The third restriction implies that if an input port is connected to the output port in the same direction, the port is considered unconnected.

A port permutation where two or more ports are unconnected is redundant because it can be represented by a permutation where the ports are swapped, i.e., the port permutation in Figure 4.6b can be represented by the port permutation in Figure 4.6a. The static schedule guarantees that no flit is routed through the ports that are swapped.

Figure 4.6: Router configurations, perceived as permutations of the 5 input ports. (a) is a valid port permutation and (b) is a redundant port permutation because it can be represented by (a).

In combinatorics a port permutation with no unconnected ports is called a derangement, and a permutation with one unconnected port is called a partial derangement with one fixed point. We call the number of all valid port permutations V. A port permutation can then be stored in $\lceil \log_2(V) \rceil$ bits. V is found using Equation 4.1; the general formula is shown in [15].

$$V = D_{5,0} + D_{5,1} = \left[\frac{5!}{e}\right] + \binom{5}{1} \cdot \left[\frac{4!}{e}\right] = 89 \tag{4.1}$$

where $D_{5,0}$ is the number of derangements of 5 elements, $D_{5,1}$ is the number of permutations of 5 elements with exactly one fixed point, and $[x]$ denotes rounding to the nearest integer. The number of bits to store the router configuration in one time slot is:

$$\lceil \log_2(\mathbf{V}) \rceil = \lceil \log_2(89) \rceil = 7 \tag{4.2}$$
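The count in Equation 4.1 can be cross-checked by brute force. The sketch below enumerates all 120 permutations of the 5 ports and counts those with at most one fixed point (an unconnected port). The representation of a router configuration as a tuple of port indices is an assumption made for illustration.

```python
from itertools import permutations
from math import ceil, log2

PORTS = 5  # north, south, east, west and the local port


def fixed_points(perm):
    """Number of ports connected to themselves, i.e. unconnected ports."""
    return sum(1 for out_port, in_port in enumerate(perm) if out_port == in_port)


# Histogram: number of fixed points -> number of permutations.
counts = {}
for perm in permutations(range(PORTS)):
    k = fixed_points(perm)
    counts[k] = counts.get(k, 0) + 1

valid = counts[0] + counts[1]  # derangements plus one-fixed-point permutations
bits = ceil(log2(valid))
print(counts[0], counts[1], valid, bits)  # 44 45 89 7
```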

**Comparing the storage of the routing methods** A summary of the storage bits for distributed routing and source routing is shown in Table 4.1.

Table 4.1: Storage bits per time slot per core for source routing (Src), distributed routing (Dist) and distributed routing with compression (Dist w/comp).

|             | NI ID (Bit)                                 | NI Route (Bit)                     | Router (Bit) |
|-------------|---------------------------------------------|------------------------------------|--------------|
| Src         | $\lceil \log_2(N) \rceil$                   | $2 \times H$                       | _            |
| Dist        | $2 \times \lceil \log_2(N) \rceil$          | _                                  | 10           |
| Dist w/comp | $2 \times \lceil \log_2(N) \rceil$          | _                                  | 7            |



Figure 4.7: The number of storage bits for one node per time slot for source routing in a bitorus and in a mesh, and distributed routing with and without compression, as a function of network size. The storage requirements for distributed routing are the same for bitorus and mesh.

In Figure 4.7 we show the storage requirements for distributed routing and source routing as a function of the network size. The number of storage bits with source routing in a mesh network scales very poorly compared to the other routing methods. Using source routing in a bitorus scales better, due to a smaller maximum distance between two nodes in a bitorus network. The storage requirements in distributed routing are the same for the bitorus topology and the mesh topology. The difference is in the number of time slots needed: routing a communication pattern in a bitorus requires fewer time slots than routing the same communication pattern in a mesh topology. As can be seen in Figure 4.7, distributed routing with compression is the most efficient method in terms of storage bits for networks as small as 64 nodes. In the case of 36 and 49 nodes it can be argued that the increase in bandwidth due to distributed routing makes distributed routing with compression the most efficient routing method. This comes at the expense of only one more storage bit per time slot for each network node.
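The comparison can be reproduced numerically. The sketch below assumes that a source route occupies 2 bits per hop (so $2 \times H$ bits in total, following the 2 bits per hop stated for the T-CREST platform), that the maximum hop count is $2(n-1)$ in an $n \times n$ mesh and $2\lfloor n/2 \rfloor$ in an $n \times n$ bitorus, and that the ID and router terms follow Table 4.1. The function names are hypothetical. Under these assumptions the 36-node and 49-node bitorus cases differ by exactly one bit between source routing and compressed distributed routing, in line with the text.

```python
def id_bits(n_nodes):
    """Bits needed to store one processor ID: ceil(log2(N))."""
    return (n_nodes - 1).bit_length()


def max_hops(side, topology):
    """Assumed maximum hop count between two nodes in a side x side network."""
    if topology == "mesh":
        return 2 * (side - 1)
    if topology == "bitorus":
        return 2 * (side // 2)
    raise ValueError(topology)


def storage_bits(side, method, topology="bitorus"):
    """Storage bits per time slot per node for a side x side network."""
    n = side * side
    if method == "src":
        return id_bits(n) + 2 * max_hops(side, topology)  # ID + 2 bits per hop
    if method == "dist":
        return 2 * id_bits(n) + 10                        # two IDs + 10-bit router word
    if method == "dist_comp":
        return 2 * id_bits(n) + 7                         # two IDs + 7-bit router word
    raise ValueError(method)


print("nodes  src/mesh  src/bitorus  dist  dist_comp")
for side in range(3, 9):
    print(side * side,
          storage_bits(side, "src", "mesh"),
          storage_bits(side, "src", "bitorus"),
          storage_bits(side, "dist"),
          storage_bits(side, "dist_comp"))
```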

One could argue that it is not a fair comparison, because the source routing information could also be compressed. The problem with compressing the source routing information is that it adds latency through a router. The router cannot start decompressing the route before the flit arrives, whereas the decompression of the router configuration with distributed routing can be pipelined.

Compressed distributed routing requires fewer storage bits, the decompression adds no latency to the routing, and the bandwidth is higher because no header is sent through the network.

### 4.5 Discussion

In this section we discuss the current limitations of the T-CREST platform, and we propose possibilities to avoid these limitations.

#### 4.5.1 Scheduling limitations

We have found two limitations of the current T-CREST NoC platform that limit the schedules:

- The platform does not support TDM schedules where one communication channel can send in multiple time slots on different routes.
- The platform only allows the schedule period length to be changed at synthesis time.

The first limitation adds another restriction to the scheduler, and this additional restriction can increase the TDM period. At this point the T-CREST NoC platform, together with the TDM scheduler we are using, only supports schedules where each communication channel has a bandwidth of one time slot per schedule period. The second limitation means that re-synthesis is required to change the schedule period. For larger systems synthesis can be very time consuming, and in an ASIC, which is the target of the T-CREST project, it is even impossible.

A block diagram of a redesigned version of the T-CREST hardware platform can be seen in Figure 4.8. To resolve the mentioned restrictions we propose to extend the architecture by moving the routes out of the DMA table and into the slot table.

Figure 4.8: Extended block diagram of the DMA NoC NI. The JOP processor can access data in the scratch pad memory, and configure the slot counter (SC), the slot table (ST) and the routes.

The packets of a communication channel can then be routed on different paths in each time slot of a schedule period. This extension requires more configuration storage, but is more flexible and can decrease the TDM period. One slot entry and one route can be written in the same configuration write, reducing the configuration time. The size of the static array is also reduced, because the two values can be saved in the same 32-bit integer. We propose to make the TDM period configurable at run time to support variable-length TDM schedules. This can be supported by extending the counter to have a variable reset value, configured along with the slot table.

#### 4.5.2 Backwards flow control

In the current T-CREST NoC platform there is no backward flow control. In real-time systems where performance is analyzed, we can guarantee that tokens can be consumed at a certain rate. As long as this rate is higher than the rate at which tokens are produced, there are no problems. The problem arises during application development. The developer might not want to analyze the prototype because it takes time, or might need to lower the speed to output debugging information. In these cases backward flow control can ease development. The backward flow control can be implemented in hardware or software. In hardware it can be implemented by sending empty tokens back when a place in the receiving buffer is freed by software. These empty tokens can be sent back as a specially formatted packet that is processed by the NI. In software it can be implemented by sending a normal message back to a special address that the software at the other end polls when it tries to send. Analyzing systems with backward flow control might be difficult, so backward flow control should not be used when the application is analyzed.

# Chapter 5

# TDM scheduler

In this chapter we describe the scheduling problem and two types of schedulers. We also describe how we integrate an application specific scheduler into our tool chain.

### 5.1 Related work

The scheduling problem that the TDM scheduler needs to solve is known as an integer multi-commodity flow problem. This problem has been proven to be NP-hard in [16]. A scheduler for scheduling all-to-all communication in these kinds of networks is shown in [17]. The advantage of this scheduler is that the schedules are symmetric, meaning that the routing tables for each router are the same, allowing for resource sharing. A scheduler for Æthereal is shown in [18]. This scheduler schedules in two phases: the first phase is path allocation and the second is allocation of TDM time slots. The Æthereal scheduler instantiates extra hardware to increase the capacity on links if needed. Our tool chain needs a scheduler for scheduling application specific communication requirements onto the homogeneous T-CREST NoC platform.

### 5.2 Static routing

In the TDM interconnect we route packets statically, to guarantee that no packets collide. This guarantee enables us to make very simple hardware, with no arbitration mechanism or buffering. We need a TDM scheduler to create virtual end-to-end circuits. The interconnect in the T-CREST platform is a time-predictable TDM NoC. Time-predictability in the TDM NoC is enforced by static routing. In the following we define the routing terms.

A static routing is a mapping of communication channels to the TDM links fulfilling the specification. This mapping is performed by a TDM scheduler. The communication channels are specified by the application.

**Definition 5.1** The communication channel from a to b is a collection of communication paths which can route data from a to b.

The TDM scheduler finds a collection of communication paths that satisfy the specification of the communication channels. The bandwidth of a communication channel is the number of communication paths in the communication channel, divided by the schedule period.

**Definition 5.2** A communication path from a to b, is a sequence of neighboring links mapped to consecutive time slots. This sequence starts in a and ends in b.

A valid communication path is mapped to one of the shortest paths from a to b. In regular topologies the length of a valid communication path is equal to the Manhattan distance from a to b.
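As a small illustration of the shortest-path condition, the sketch below computes the Manhattan distance in a mesh and in a bitorus, and checks whether a given node sequence is a valid communication path in the sense above. Representing nodes as `(x, y)` tuples and the function names are assumptions made for this example; for the bitorus the distance takes the wrap-around links into account.

```python
def manhattan_mesh(a, b):
    """Manhattan distance between nodes a and b in a mesh."""
    (ax, ay), (bx, by) = a, b
    return abs(ax - bx) + abs(ay - by)


def manhattan_bitorus(a, b, width, height):
    """Manhattan distance in a bitorus, where each dimension wraps around."""
    (ax, ay), (bx, by) = a, b
    dx, dy = abs(ax - bx), abs(ay - by)
    return min(dx, width - dx) + min(dy, height - dy)


def is_valid_path(path, distance):
    """True if consecutive nodes are neighbours and the path is a shortest path."""
    neighbours = all(distance(u, v) == 1 for u, v in zip(path, path[1:]))
    return neighbours and len(path) - 1 == distance(path[0], path[-1])


direct = [(0, 0), (1, 0), (2, 0), (2, 1)]
detour = [(0, 0), (0, 1), (1, 1), (1, 0), (2, 0), (2, 1)]
print(is_valid_path(direct, manhattan_mesh))  # True
print(is_valid_path(detour, manhattan_mesh))  # False
```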

**Definition 5.3** The schedule period is equal to the length of the complete schedule in time slots. The complete schedule is a schedule that satisfies all communication channels given in the application requirements.

The requirements for a schedule are specified in the XML input of the TDM scheduler. The input specifies which communication channels the application needs and the bandwidth of each communication channel. When the application developer needs a schedule for an application, the developer specifies the platform topology and the communication channels in the XML input. There are two types of schedules: the all-to-all schedule and the application specific schedule.

### 5.3 All-to-all scheduling

An all-to-all schedule is a schedule where all processors in the network can communicate directly with all other processors in the network with equal bandwidth. All-to-all schedules have advantages and disadvantages. It is an advantage that the schedule is application independent, that the schedule needs to be configured only once, and that it can be implemented in hardware. For small networks the latency of an all-to-all schedule is quite small, and it is more likely that all processors need to communicate with all other processors. The all-to-all schedule is ideal in systems where the communication pattern is very uniform between all processors. It could also be an advantage to use the all-to-all schedule when prototyping a system, as long as the developer is testing functionality and not run-time requirements. An all-to-all scheduler is shown in [8]. The advantage of this approach is that the schedule for each router is the same, allowing for resource sharing, and that the generated all-to-all schedules are close to optimal in terms of a short schedule period.

In large networks the number of processors each processor talks to is very dependent on the application; therefore an all-to-all schedule might waste a considerable amount of bandwidth. In systems with low latency requirements or high bandwidth requirements, an application specific schedule should be calculated.

### 5.4 Application specific scheduling

An application specific schedule is a schedule where only processors that are specified to communicate can communicate. Creating an application specific schedule can lower the schedule period compared to an all-to-all schedule. The lowered schedule period decreases the latency and increases the bandwidth.

In our tool chain we used the Static NoC TDM scheduler<sup>1</sup> (SNTs)[19] to schedule the communication channels described in the XML input. The SNTs is a metaheuristic scheduler using adaptive large neighborhood search (ALNS)[20] and greedy randomized adaptive search procedures (GRASP)[21] to optimize the schedule period. The SNTs schedules the static routes in a time expanded graph of the NoC topology. A scheduled communication path in a time expanded graph of a 3 by 3 mesh is shown in Figure 5.1. The communication path marked with yellow is routed from processor 0 to processor 5. Each consecutive link of the routed communication path is routed in a consecutive time slot.

<sup>1</sup>The SNTs is open source and is publicly available at https://github.com/t-crest/SNTs.

Figure 5.1: Time expanded graph of a 3 by 3 mesh topology, with a communication path routed from processor 0 to processor 5.

The metaheuristic optimization algorithms break down part of the initial solution and rebuild it, trying to make the new solution shorter than the initial one. The SNTs is designed to run for days, or as long as the application designer wants, continuously trying to optimize the solution.
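To make the notion of a time expanded graph concrete, the sketch below reserves (link, time slot) pairs so that each consecutive link of a path occupies a consecutive slot and no two paths share a link in the same slot. It is not the ALNS/GRASP algorithm of the SNTs: it uses plain XY routing and a first-fit search for the start slot, purely for illustration. The function names, the `(x, y)` node coordinates, the row-major mapping of processor 0 to `(0, 0)` and processor 5 to `(2, 1)`, and the wrap-around of slots at the period boundary are all assumptions.

```python
def xy_route(src, dst):
    """Links of a shortest path in a mesh: first along x, then along y."""
    x, y = src
    links = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def schedule_path(links, period, reserved):
    """Reserve consecutive links in consecutive slots; return the start slot or None."""
    for start in range(period):
        wanted = [(link, (start + hop) % period) for hop, link in enumerate(links)]
        if all(pair not in reserved for pair in wanted):
            reserved.update(wanted)  # claim every (link, slot) pair of the path
            return start
    return None  # the path does not fit in a schedule of this period


reserved = set()  # occupied (link, slot) pairs of the time expanded graph
period = 4
first = schedule_path(xy_route((0, 0), (2, 1)), period, reserved)
second = schedule_path(xy_route((0, 0), (1, 0)), period, reserved)
print(first, second)
```

The second path shares its first link with the first path, so the first-fit search has to move it to a later start slot.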

To integrate the scheduler into the tool chain, we have made the following extensions to the scheduler:

- Support of arbitrary bandwidths for any communication channel.
- Calculation of the WCL for all communication channels.
- XML formatted output of the calculated schedule and WCL.

The files we have added to the scheduler are the .cpp and .h files listed and described in Appendix D. Support for arbitrary bandwidth is provided by allowing multiple communication paths to be routed from the source to the destination of a communication channel. The bandwidth is specified in the input XML file. The WCL of each communication channel is calculated when the scheduler is done: the scheduler goes through the finished schedule and finds the maximum spacing between any two consecutive communication paths belonging to the same communication channel. The scheduler writes the schedule and the WCL together into an XML-formatted output file, using the open source pugixml[22] library.

The representation of the schedule in the TDM scheduler is router centric. The schedule describes the configuration of each router in each time slot; this description aligns with distributed routing. A description that aligns with source routing is a schedule describing the NI and the flit routes from the NI, called an NI centric schedule. To avoid multiple conversions back and forth between router centric and NI centric schedules, the schedule in the XML file is router centric.
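The WCL calculation described above, the maximum spacing between consecutive communication paths of one channel, can be sketched as follows. The sketch assumes that each path is identified by the slot in which it leaves the source NI and that the schedule repeats every period, so the gap from the last path wraps around to the first. The exact counting convention of the SNTs (for example whether the path length or an offset of one slot is included) is not stated in the text, so the values may differ from the scheduler output by a constant. The function name is hypothetical.

```python
def worst_case_gap(start_slots, period):
    """Largest cyclic distance, in time slots, between consecutive paths of a channel."""
    if not start_slots:
        raise ValueError("channel has no communication paths")
    slots = sorted(set(s % period for s in start_slots))
    # Pair every slot with its successor; the last slot wraps around to the first.
    gaps = [(nxt - cur) % period for cur, nxt in zip(slots, slots[1:] + slots[:1])]
    # A single path per period repeats after exactly one period.
    return max(gap if gap else period for gap in gaps)


print(worst_case_gap([0], 9))
print(worst_case_gap([0, 3], 9))
print(worst_case_gap([2, 5, 8], 9))
```

Spreading the paths of a channel evenly over the period gives the smallest gap, which is why re-scheduling with a higher bandwidth can reduce the WCL.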

### 5.5 Schedule converter

To keep the design modular, we have made a schedule converter that writes the platform specific details of the schedule. The source files of the schedule converter are the .java files in Appendix D. It converts the XML file into the format that is supported by the JOP processor and the T-CREST NoC platform. This involves a conversion from a router centric schedule to an NI centric schedule. The conversion is performed by following the outgoing routes from each NI in each time slot. The static routing can be configured in the NoC in two ways: compile time configuration and run time configuration.
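The conversion from a router centric to an NI centric schedule can be sketched as follows. This is a simplified model and not the code of the actual converter (which is written in Java): it assumes that each router configuration is a mapping from output port to input port as in Listing 3.2 (with don't-care outputs omitted), that a flit advances one hop per time slot, that N/S/E/W correspond to the coordinate steps given in `STEP`, and that the topology wraps around as in a bitorus. All names are hypothetical.

```python
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
STEP = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}  # assumed orientation


def follow_route(routers, src, slot, period, width, height):
    """Follow a flit injected at `src` in `slot`; return (destination, output ports)."""
    tile, in_port, route = src, "L", []
    for _ in range(width * height + 1):  # guard against schedules that loop
        config = routers[tile][slot % period]
        out_port = next((o for o, i in config.items() if i == in_port), None)
        if out_port is None:
            return None, route  # nothing injected here, or the path is broken
        route.append(out_port)
        if out_port == "L":
            return tile, route  # delivered to the local NI
        dx, dy = STEP[out_port]
        tile = ((tile[0] + dx) % width, (tile[1] + dy) % height)
        in_port = OPPOSITE[out_port]  # the flit enters the neighbour on the facing port
        slot += 1
    raise ValueError("route does not terminate")


def to_ni_centric(routers, period, width, height):
    """Map (source tile, slot) to (destination tile, route) for every injected flit."""
    table = {}
    for tile in routers:
        for slot in range(period):
            dest, route = follow_route(routers, tile, slot, period, width, height)
            if dest is not None:
                table[(tile, slot)] = (dest, route)
    return table


period, width, height = 3, 3, 3
routers = {(x, y): [dict() for _ in range(period)]
           for x in range(width) for y in range(height)}
routers[(0, 0)][0] = {"E": "L"}  # slot 0: local input leaves to the east
routers[(1, 0)][1] = {"S": "W"}  # slot 1: west input leaves to the south
routers[(1, 1)][2] = {"L": "N"}  # slot 2: north input is delivered locally
ni_table = to_ni_centric(routers, period, width, height)
print(ni_table)
```

Each entry of the resulting table corresponds to what an NI would need for source routing: the slot in which it may send, the destination, and the sequence of output ports along the route.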

Compile time configuration is done by configuring the schedule in hardware tables at compile time. Compile time configuration is normally used in FPGA implementations. For compile time configuration we convert the schedule to a VHDL table for each node in the network, and each table is connected directly to the resource it controls. Conversion to VHDL tables is integrated into the SNTs scheduler. It prints out one VHDL entity containing a table for each router; the tables are indexed with the node ID.

Run time configuration is done by programming the schedule from the processor at system startup. For run time configuration we convert the schedule to a static array of integers such that each processor indexes the array with its processor ID and loads the contents of the array into the hardware configuration tables. The conversion of the XML file to a static Java array is done by a small Java program, which wraps the static array in a Table class that also defines methods for loading and verifying the schedule.

## 5.6 WCET-aware compiler

To get good real-time performance the compiler needs to optimize the WCET path in the control flow graph (CFG). To optimize the WCET path, the compiler has to know the WCET path, thus the analysis and compilation could benefit from being performed by one tool. To find the WCET path, the compiler needs to make a pessimistic estimate of the run time of the given path. With more precise models the estimate can be less pessimistic. The estimate of the WCET path is found by assigning worst-case latencies to each instruction. The worst-case latencies might vary with the state of the system. Routing the interprocessor communication statically decouples this communication from the state of the system, reducing it to the WCL and the bandwidth between the two communicating processes. For many-core systems such a reduction in the state space is a great benefit and makes it possible to analyze the system. As an example, the latency of a memory access to the communication scratch pad varies depending on the latency and bandwidth given by the scheduler. The latency does not have to be constant even for the same instruction; it can vary with the system's state. The latency of transferring a message, ML, can be calculated as follows:

$$ML = WCL + \frac{MSG_{Size}}{Channel_{Bandwidth}} \tag{5.1}$$

Here WCL is the worst-case latency of waiting for a channel time slot, $MSG_{Size}$ is the size of the message and $Channel_{Bandwidth}$ is the bandwidth of the channel to the message destination. If the WCET is higher than allowed, the ML can be lowered by scheduling more communication paths for the given communication channel or by spreading out the communication paths in the schedule.
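As a small worked example of Equation 5.1, the sketch below evaluates the message latency for given parameters. The class and method names are hypothetical, and the units (clock cycles for the WCL, words for the message size, and words per clock cycle for the bandwidth) are assumptions made for illustration.

```java
final class MessageLatency {
    /**
     * Evaluates ML = WCL + MSG_Size / Channel_Bandwidth (Equation 5.1).
     * Assumed units: cycles, words, and words per cycle.
     */
    static double messageLatency(double wclCycles, double msgSizeWords,
                                 double bandwidthWordsPerCycle) {
        if (bandwidthWordsPerCycle <= 0) {
            throw new IllegalArgumentException("bandwidth must be positive");
        }
        return wclCycles + msgSizeWords / bandwidthWordsPerCycle;
    }
}
```

For a WCL of 8 cycles and a message of 8 words, a channel bandwidth of 0.5 words per cycle gives ML = 8 + 16 = 24 cycles. Doubling the bandwidth to 1 word per cycle lowers ML to 16 cycles, provided the WCL stays the same.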

#### 5.7 Discussion

In this section we suggest two improvements to the scheduler that could decrease the latency of a communication channel in a schedule. If the bandwidth of a communication channel needs to be increased, more communication paths are routed for that channel. If the latency of a communication channel needs to be decreased, more communication paths can also be added, but adding more communication paths does not guarantee a lower latency. If all the communication paths of a communication channel are routed closely together, the latency of the communication channel is worse than if the communication paths were evenly distributed throughout the schedule. The first improvement is to make the scheduler aware of the proximity of other communication paths when routing. A low latency channel could be specified by a low latency tag in the XML file.

The second improvement decreases the latency of a complete transaction through the network: the scheduler could be made to support reply messages. A reply message is a message sent from a to b followed by a reply from b to a. If we know the time separation between the arrival of the first message and the departure of the reply message, called the response delay, we can schedule the two communication channels such that only the departure of the first message needs to wait for its time slot. When the reply message is ready for departure it gets its time slot right away. This could be useful when slave components are accessed with known response delays, especially for single-word reads where the WCL is the largest contributor to the latency.

# Chapter 6

# Message passing interface

In this chapter we will create a message passing interface (MPI) for use with our tool chain. We will discuss the communication primitives in communicating sequential processes (CSP)[23] and Kahn process network (KPN)[24] and choose which communication type to implement. We will describe the software for transferring data from one processing core to another. This software takes care of the low level, hardware specific details of data transfers.

### 6.1 Related work

For message passing in large computer systems, the MPI [25] standard has been made. The MPI standard specifies an interface for sending and receiving messages in a large computer system without shared memory. The MPI standard defines a set of operations for communication through message passing and run time management of processes on massively parallel systems. An open source implementation of the MPI standard is the Open MPI [26]. The MPI standard is made for large computer systems made up of many computers connected together in a cluster. What we need for our tool chain at this point is a very simple MPI with only the most basic communication primitives.

#### 6.2 Communication primitives

To design a correct and efficient parallel application the parallelism should be considered from the early design phase. A specification of the application could be written in a formal language that supports message passing natively, such as CSP or KPN. Our hardware platform is designed to run one process on one processor. This design feature comes from the fact that running multiple processes on one processor will make the processes interfere, and the uncertainty of this interference will increase the WCET. When mapping an application onto the platform of this thesis, the application should be divided into different processes, to utilize multiple processing cores. The number of processors to map one application to, is determined by the timing requirements of the application and the resources available to the application.

Processes in both KPNs and CSP communicate by passing messages between each other. The CSP semantics implement synchronous message passing and the KPN semantics implement asynchronous message passing. In synchronous message passing the two processes synchronize when they exchange a message. The two processes are connected directly, which means that the sender and the receiver return from the send and receive function calls at the same time. In asynchronous message passing the two processes are connected by an infinite FIFO, meaning that the sender can send multiple messages without the receiver attempting to receive anything. Infinite FIFOs can of course not be implemented, and in practice the FIFOs are bounded in size. Asynchronous message passing makes it possible to interleave calculation and communication. Both synchronous and asynchronous message passing can be implemented on top of each other. We chose the style of message passing with the lowest implementation overhead. The hardware implements asynchronous message passing with bounded FIFOs, so this is our choice. If needed, synchronous message passing can be implemented on top of our MPI, but this results in poor performance. The communication primitives we have chosen to implement are:

- Send() The Send() primitive sends the specified data to the specified recipient. If the bounded FIFO towards the recipient is full the Send() primitive blocks until there is room in the FIFO.
- Receive() The Receive() primitive receives data from a specified sender. If the bounded FIFO from the sender is empty the Receive() primitive blocks until there is data in the FIFO.
- RdySend() The RdySend() primitive is a way of avoiding blocking Send() calls. The RdySend() primitive checks if there is room in the bounded FIFO towards the specified recipient. RdySend() returns true if there is room in the FIFO and false if the FIFO is full.
- RdyReceive() The RdyReceive() primitive can be used to avoid blocking Receive() calls. The RdyReceive() primitive checks if there is data in the bounded FIFO from the specified sender. RdyReceive() returns true if there is data in the FIFO and false if it is empty.
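The semantics of the four primitives can be illustrated with a software model of a single bounded FIFO. The sketch below runs on a desktop JVM and uses `java.util.concurrent.ArrayBlockingQueue`; it is not the JOP implementation from Appendix E, and the class name `BoundedChannel` and the lower-case method names are assumptions made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;

/** Software model of one bounded FIFO between a sender and a receiver. */
final class BoundedChannel {
    private final ArrayBlockingQueue<int[]> fifo;

    BoundedChannel(int capacity) {
        fifo = new ArrayBlockingQueue<>(capacity);
    }

    /** Blocks until there is room in the FIFO, then enqueues the message. */
    void send(int[] message) {
        try {
            fifo.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Blocks until there is a message in the FIFO, then dequeues it. */
    int[] receive() {
        try {
            return fifo.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** True if a send() would not block. */
    boolean rdySend() {
        return fifo.remainingCapacity() > 0;
    }

    /** True if a receive() would not block. */
    boolean rdyReceive() {
        return !fifo.isEmpty();
    }
}
```

A process that wants to interleave calculation and communication would poll `rdySend()` or `rdyReceive()` between pieces of work and only call the blocking primitive once the check returns true.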

#### 6.3 MPI in the T-CREST platform

As the programming language in our tool chain is Java, and the programming language in the T-CREST tool chain is C, we will only use basic Java for our MPI, which can easily be ported to C. Many of the observations we make will also be applicable in C. The source code for our MPI can be seen in Appendix E. The Tables.java file is the static array written by the schedule converter. In the embedded Java of JOP it is difficult to manage the location of variables and objects. This is a problem because the performance of message passing depends on placing the data for communication locally. In this JOP multiprocessor system, garbage collection was not available, which limits the memory footprint and the run time of the applications running on the system.

In the T-CREST platform, a processor sets up DMAs to transfer data from its local scratch pad to the local scratch pads of other processors. The hardware platform we use in our tool chain is limited because it has to copy data in and out of the local scratch pad memory. Setting up a DMA requires a read pointer and a write pointer. The sending and receiving processors of a DMA transfer have to agree on the write pointer. One way of agreeing on the write pointer is to let the receiver send the next write pointer to the sender each time it is ready to receive. Another way is to lay out the address space statically, fixing which buffers are placed where. Allocating buffers statically is easy to analyze for the WCET-aware compiler, and it avoids the extra latency of sending new write pointers back. The downside of allocating buffers statically is that it might waste space in the already limited local scratch pad if not all buffers are used.

#### 6.3.1 Address space

The size of the local scratch pad of a single processor varies with the configuration of the system. Therefore the system should be designed not to depend on a specific address space. Local scratch pad memory is very limited and accesses to main memory are very time consuming because many cores need to share the same off-chip memory. This means that the address space of the local scratch pad should be compact. In this first version we support all-to-all communication by having buffers for all cores in the network in each tile. The static address space of the $n^{\rm th}$ processor is shown in Figure 6.1.

Figure 6.1: The static DMA NI address space of the n<sup>th</sup> processor in a system with N processors.

Figure 6.2: The address space in the local DMA NI of the buffers for one processor.

In a network with N nodes each NI has buffers for the N-1 other nodes. The nodes are zero indexed. We need to know the addresses statically, and we need to compact the address space. A node needs no buffers for itself, so the buffers for node $N-1$, the last node, are positioned in the place that the buffers of the local tile would otherwise occupy. In this way all nodes can calculate their buffer address in all other cores. The address space of the buffers for one processor is shown in Figure 6.2.
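One reading of this compaction can be written down as a small address calculation. The sketch below is an assumption-laden illustration and not the code from Appendix E: the class `BufferLayout`, the parameters `base` and `wordsPerBuffer`, and the exact slot order are hypothetical, since the real layout is defined by Figures 6.1 and 6.2.

```java
final class BufferLayout {
    /**
     * Index of the buffer slot that receiver reserves for sender, in a
     * network of nodes zero-indexed nodes. Slots 0..nodes-2 are used.
     */
    static int slotIndex(int receiver, int sender, int nodes) {
        if (receiver < 0 || receiver >= nodes || sender < 0 || sender >= nodes) {
            throw new IllegalArgumentException("node ID out of range");
        }
        if (receiver == sender) {
            throw new IllegalArgumentException("a node has no buffer for itself");
        }
        // The last node has no slot of its own; it takes the slot that the
        // receiver would otherwise have reserved for itself.
        return (sender == nodes - 1) ? receiver : sender;
    }

    /** Word address of that buffer, given a base address and a fixed buffer size. */
    static int bufferAddress(int receiver, int sender, int nodes,
                             int base, int wordsPerBuffer) {
        return base + slotIndex(receiver, sender, nodes) * wordsPerBuffer;
    }
}
```

Because the calculation depends only on the two node IDs and the network size, a sender can compute the write pointer into any receiver without exchanging pointers at run time.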

The hardware does not support any way of signaling that a DMA transfer is finished. To signal that a DMA transfer is done we wrap the data in a header and a footer phit. The header phit carries the length of the complete DMA transfer and the footer carries 0xFFFFFFFF. The maximum size of a data message is 8 words (32 bytes).

Listing 6.1: Pseudo code for the Send() primitive.

```
Send()
while not RdySend() do
    do nothing
od;
copy message to mem
swap receive buffer
setup DMA
```

Listing 6.2: Pseudo code for the Receive() primitive.

```
Receive()
while not RdyReceive() do
    do nothing
od;
copy message from mem
swap receive buffer
```

#### 6.3.2 Communication primitives

In this section we describe how the communication primitives are implemented.

Send() The Send() primitive sets up a DMA transfer to transmit the data to the recipient. To send a packet we first need to check that no DMA transfer is in progress. If no DMA is in progress the message is copied into the transmit buffer, and the buffer in the receiving end is swapped. To complete the send operation we set up the DMA transfer. The pseudo code for the Send() primitive is shown in Listing 6.1.

Receive() The Receive() primitive waits until a DMA transfer has completed. When the transfer has completed, the message is copied out and the receive buffer is swapped. The pseudo code for the Receive() primitive is shown in Listing 6.2.

RdySend() The RdySend() primitive checks if the DMA is ready to be set up. To check the status of the DMA, the DMA done bit is read from the NI. The pseudo code for the RdySend() primitive is shown in Listing 6.3.

Listing 6.3: Pseudo code for the RdySend() primitive.

```
RdySend()
read DMA done bit
if done bit equals 1
    return true
fi;
return false
```

Listing 6.4: Pseudo code for the RdyReceive() primitive.

```
RdyReceive()
read header
if footer equals -1
    return true
fi;
return false
```

RdyReceive() The RdyReceive() primitive checks if a DMA transfer has completed. To check if a DMA transfer has completed we read the header to get the length of the transfer. We then check whether the footer of the transfer is 0xFFFFFFFF, which is -1 as a signed 32-bit integer. The pseudo code for the RdyReceive() primitive is shown in Listing 6.4.
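The header and footer framing and the completion check can be sketched on a plain integer array standing in for the scratch pad. This is a minimal sketch and not the code from Appendix E: the class `Framing` is hypothetical, and it assumes that the length in the header is counted in words and includes the header and the footer themselves.

```java
final class Framing {
    static final int FOOTER = 0xFFFFFFFF; // equals -1 as a signed 32-bit int
    static final int MAX_DATA_WORDS = 8;  // maximum data message size

    /** Wraps data in a header (total length in words) and a footer word. */
    static int[] frame(int[] data) {
        if (data.length > MAX_DATA_WORDS) {
            throw new IllegalArgumentException("message too long");
        }
        int[] framed = new int[data.length + 2];
        framed[0] = framed.length;
        System.arraycopy(data, 0, framed, 1, data.length);
        framed[framed.length - 1] = FOOTER;
        return framed;
    }

    /** RdyReceive()-style check: true once the footer word has arrived. */
    static boolean rdyReceive(int[] receiveBuffer) {
        int length = receiveBuffer[0];
        if (length < 2 || length > receiveBuffer.length) {
            return false; // no valid header has been written yet
        }
        return receiveBuffer[length - 1] == FOOTER;
    }
}
```

The check relies on the words of a transfer arriving in order, so that the footer is the last word to be written into the receive buffer.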

#### 6.4 Discussion

In this section we discuss possible improvements to our MPI.

#### 6.4.1 Dynamic allocation of buffering space

To make better use of the scratch pad, dynamic allocation of the buffering space can be applied. The first step in sending a message would then be to allocate a buffer of the size of the message. After the message has been sent, the buffering space would be freed. Allocating the message buffers in the local scratch pad could be done using the first fit algorithm, starting from the lowest address and finding the first possible place to allocate the buffer. In real-time systems dynamic behavior can make analysis more difficult, because the system's state is more complicated. One way of modelling the state is to fix the maximum message size and then only allocate buffers of this size. Then the compiler can keep a worst-case count of the number of outstanding packets. The first problem with dynamic allocation is to determine who will free the allocated buffers. The hardware is the last to use the transmit buffers, and the software is the last to use the receive buffers. If the software frees the buffers, it needs to poll the transmit buffers to check if they are done. If the hardware frees the buffers, the bookkeeping needs to be in the communication buffer, which is already crowded. This will introduce an unwanted overhead into the communication primitives. Dynamic allocation still suffers from each processor having to send information about its receive buffers to the processors trying to transmit to it. A compromise that avoids sending addresses of receive buffers back is to allocate the receive buffers statically and the transmit buffers dynamically. This would also simplify the analysis because the dynamic behavior is local and independent of other processors.
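A first fit allocator for the transmit buffers could look like the sketch below. It is an illustration of the algorithm only, written for a desktop JVM with `java.util.TreeMap`; the class `FirstFitAllocator` and its word-addressed interface are assumptions, and an implementation for JOP would have to avoid the dynamic data structure used here.

```java
import java.util.Map;
import java.util.TreeMap;

/** First fit allocation of word-addressed buffers in a scratch pad. */
final class FirstFitAllocator {
    private final int capacityWords;
    // Allocated blocks, ordered by start address: start -> length in words.
    private final TreeMap<Integer, Integer> used = new TreeMap<>();

    FirstFitAllocator(int capacityWords) {
        this.capacityWords = capacityWords;
    }

    /** Returns the start address of the new buffer, or -1 if nothing fits. */
    int allocate(int words) {
        if (words <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        int candidate = 0; // lowest address not yet ruled out
        for (Map.Entry<Integer, Integer> block : used.entrySet()) {
            if (block.getKey() - candidate >= words) {
                break; // the gap before this block is large enough
            }
            candidate = block.getKey() + block.getValue();
        }
        if (capacityWords - candidate < words) {
            return -1;
        }
        used.put(candidate, words);
        return candidate;
    }

    /** Frees the buffer that starts at the given address. */
    void free(int start) {
        if (used.remove(start) == null) {
            throw new IllegalArgumentException("no buffer at this address");
        }
    }
}
```

The search loop visits every live block in the worst case, so the allocation time depends on the number of outstanding buffers. Fixing the buffer size, as suggested above, bounds both this number and the fragmentation.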

#### 6.4.2 Compiler optimizations

If the WCET-aware compiler can find the message sizes when it analyzes an application, it can allocate both the static and dynamic buffers in the local scratch pad, avoiding unused buffering space. The tool chain and programming model should help the programmer to parallelize the applications. The WCET-aware compiler should help the programmer by giving feedback. Such feedback could be information on the load of the different processors, helping the programmer to load balance the application. In this first iteration, where we do not have a compiler with these abilities, we choose to implement the simple all-to-all address space and communication primitives.


# Chapter 7

# Test

In this chapter we show how a Hello World program is implemented with our tool chain, and show how the system could be benchmarked.

## 7.1 Hello World!

We will show how a Hello World program is taken through our tool chain and finally implemented on the T-CREST NoC platform. Our Hello World program sends a message through all processors in a ring. In Listing 7.1 we show a piece of the source code for processor zero. The full source code can be seen in Appendix F. Processor zero initializes the DMA NI by writing the static tables. Then it initializes the other processors by setting a runnable. When the other processors are started, processor zero starts by sending the message to the processor with the highest ID. When the message reaches processor zero again "Hello World!" is written to the console. The stringbuffer is a message queue for the other processors for writing out their start messages.

Listing 7.2 shows the XML input for the TDM scheduler. In the XML file we specify that the topology is a 3 by 3 bitorus and the communication pattern of the application is all-to-all. For the Hello World application the actual communication pattern is a ring. We will show application specific schedules in the following section. The application output to the console is shown in Listing 7.3. First we see the JOP processor starting, then each processor in the network starts. At last the message is passed around the network and "Hello World!" is written to the console.

Listing 7.1: Source code of the Hello World application.

```
// Initialization of DMA NI
Tables.load(0);
System.out.println("Core 0 started");
for (int i = 0; i < sys.nrCpu - 1; ++i) {
    Runnable r = new HelloDMA(i + 1);
    Startup.setRunnable(r, i);
}
// start the other CPUs
sys.signal = 1;
int[] message = {0, 1, 2, 3, 4, 5, 6, 7};
int[] rmessage = {0, 0, 0, 0, 0, 0, 0, 0, 0};
NoC.send(message, sys.nrCpu - 1, 0);
for (;;) {
    // print the start messages queued by the other processors
    int size = msg.size();
    if (size != 0) {
        StringBuffer sb = (StringBuffer) msg.remove(0);
        System.out.println(sb);
    }
    // the message has travelled around the ring
    if (NoC.recvRdy(1, 0)) {
        NoC.recv(rmessage, 1, 0);
        for (int i = 0; i < message.length; i++) {
            if (message[i] != rmessage[i]) { System.exit(1); }
        }
        System.out.println("Hello World!");
    }
}
```

Listing 7.2: XML input for the Hello World application.

Listing 7.3: Console output of the Hello World application running on the hardware.

```
JOP start V 20110107
60 MHz, 2048 KB RAM, 9 CPUs
Core 0 started
Core 1 started
Core 2 started
Core 3 started
Core 4 started
Core 5 started
Core 6 started
Core 7 started
Core 8 started
Hello World!
```

#### 7.2 Microbenchmarks

To measure the performance of a system, real applications should be used as benchmarks. For our current tool chain we do not have any real applications, so microbenchmarks are the only way of benchmarking the system. With the current state of the T-CREST platform, microbenchmarks are a good way of characterizing specific design features. Microbenchmarks are good for evaluating different design alternatives, which is what the T-CREST project needs in its current state. The NoC Benchmark by OCP-IP states that microbenchmarks[27, sec. 2.3] for a NoC should benchmark:

- Packets and transactions
- Temporal and spatial distribution
- Best effort and guaranteed services
- Network size

Not all of these benchmarks apply to our statically scheduled NoC. The latency and bandwidth of the communication channels are independent of the communication load in the network. Therefore it is irrelevant to benchmark temporal and spatial distributions of the traffic in the network. The system also does not provide best effort services. For our network it is only relevant to do microbenchmarks for packet transfers and complete transactions as a function of the network size. Our FPGA only fits 9 JOP processors and the T-CREST NoC, so varying the network size is not interesting.

In general purpose systems benchmarks are used to measure the performance of the system. In real-time systems, where the performance is equal to the calculated WCET, the benchmarks should be analyzed by the WCET-aware compiler. In the current state of the T-CREST project, where we do not have a WCET-aware compiler, we do the measurements in hardware. In our benchmark we will measure the following operations:

- **Send()** The time it takes to perform a send operation when there is room in the transmit buffer.
- **Recv()** The time it takes to perform a receive operation when there is a message in the receive buffer.
- **Echo()** The time from sending a message to another processor until a message is received from that processor.
- **Roundtrip()** The time it takes to send a message through all processors in the network.

The source code of the microbenchmark can be seen in Appendix F. In Figure 7.1 we show the execution time of the microbenchmarks as a function of the message size. For echo and roundtrip we also show two interleaved operations. The execution time of the interleaved roundtrip operation is around 6 percent larger than the normal operation, sending twice the data around in the network. The execution time of the interleaved echo operation is around 60 percent larger than the normal operation, sending twice the data.



Figure 7.1: Measured execution time of the microbenchmarks as a function of the message size with an all-to-all schedule.

To optimize the execution time of the microbenchmarks we make an application specific schedule. The XML file specifying this application specific communication pattern can be seen in Listing 7.4. In the XML file only the communication channels that are needed are specified. Using an application specific schedule decreases the execution time slightly. A plot of the execution times of the roundtrip benchmark and the echo benchmark, with and without an application specific schedule, is shown in Figure 7.2. One reason why the application specific schedule does not give a greater improvement could be the size of the network. For small networks such as the 3 by 3 bidirectional torus, the all-to-all schedules have quite low latencies, and an application specific schedule cannot improve the run time by an order of magnitude. The application specific schedule will result in larger improvements for larger networks. Since the design is optimized for WCET we expect to see a larger improvement when the benchmarks are analyzed.

Another reason could be that the execution time of the benchmarks is dominated by processor I/O. We have measured the execution time of a single read from the local scratch pad memory to be 28 clock cycles. With a schedule period of 10 or less, this I/O cost overshadows the NoC delay.

Listing 7.4: XML bandwidth specification for the microbenchmark.

```
<?xml version="1.0" encoding="UTF-8"?>
<topology width="3" height="3">
    <graph type="bitorus"></graph>
</topology>
<channels type="arbitrary">
    <channel from="(0,0)" to="(2,2)" bandwidth="1" />
    <channel from="(0,0)" to="(1,1)" bandwidth="1" />
    <channel from="(0,0)" to="(0,1)" bandwidth="1" />
    <channel from="(0,0)" to="(2,0)" bandwidth="1" />
    <channel from="(0,0)" to="(1,0)" bandwidth="1" />
    <channel from="(2,0)" to="(1,0)" bandwidth="1" />
    <channel from="(2,0)" to="(0,0)" bandwidth="1" />
    <channel from="(0,1)" to="(0,0)" bandwidth="1" />
    <channel from="(1,1)" to="(0,0)" bandwidth="1" />
    <channel from="(1,2)" to="(1,1)" bandwidth="1" />
    <channel from="(2,2)" to="(1,2)" bandwidth="1" />
    <!-- further channel entries are illegible in the source -->
</channels>
```



Figure 7.2: Measured execution time of a message round trip, with all-to-all and application specific schedules (APS).

# Chapter 8

# Conclusion

In this chapter we conclude the work carried out in this thesis. First we summarize our findings and describe the contributions of the thesis. Finally we point to future areas of work.

## 8.1 Summary of findings

In this thesis we have presented our tool chain for programming a real-time multi-processor platform. This tool chain is very similar to how we imagine the final T-CREST tool chain. Our tool chain is ready for integration in the full T-CREST platform.

We used our tool chain to implement the first application that passes messages between processors on the T-CREST NoC platform in an FPGA. While integrating the hardware platform into the tool chain, we identified limitations and suggested how they can be removed. The current platform is limited by low run-time configurability, no optimization of the WCL, and static buffer allocation in the MPI. We have changed the interface of the TDM scheduler to be compatible with our tool chain. We have implemented a Kahn process network style message passing interface in Java to communicate asynchronously between processors. We have also tested our tool chain by implementing a "Hello World" program that uses 9 processors to send one message around. Another test was to implement a few microbenchmarks and measure their execution time. The benchmarks show that the tool chain is ready to help the developers of the T-CREST project evaluate their components.

## 8.2 Project contribution

The contribution bullets from the introduction are elaborated in the following bullets in a one-to-one correspondence.

- We have defined the file format of the interfaces in the block diagram for both tool chains. We have presented the structure and tags of the XML files.
- We have implemented a minimalistic time-predictable NoC platform that is published in [1].
- We have integrated the T-CREST NoC platform and the JOP processor into one hardware platform that can be programmed by our tool chain.
- We have shown that distributed routing is more efficient than source routing in terms of storage bits for larger networks. In networks of 36 nodes or larger, we have shown that distributed routing is a better trade-off, because bandwidth is higher, hardware is simpler, and the storage bits are about the same or less.
- Working with the T-CREST NoC platform, we have suggested extending the hardware to make it more configurable at run time. We proposed moving the routes in the DMA table to the slot table and making the reset of the slot counter programmable. These extensions will make the hardware more flexible and enable better utilization of the hardware.
- We have changed the scheduler to support arbitrary bandwidths for communication channels, to calculate the WCL, and to output schedule and latency information to the WCET-aware compiler in XML format. To configure the network interfaces we convert the schedule into a static array, which is loaded into the network adapters at run time. This approach can be extended to enable loading of a new schedule at run time, if the mode of operation changes.
- To reduce the WCL we propose to make the scheduler aware of the location of other paths in the same communication channel when scheduling a path. When the scheduler knows the path locations it can spread out the paths, minimizing the WCL for the given bandwidth.

- We have implemented a message passing interface using statically allocated buffers. These statically allocated buffers can be made more efficient with the suggested compiler support. The MPI implements asynchronous message passing with bounded buffers.
- We have proposed to change the MPI to use dynamic allocation of transmit buffers and static allocation of receive buffers.
- We have implemented a Hello World program on the T-CREST NoC platform using our tool chain. We have made microbenchmarks to enable evaluation of design features for the developers of the T-CREST platform.
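The effect of spreading the paths of one channel over the schedule period, as proposed in the scheduler bullet above, can be illustrated with a small Python sketch. It is not the scheduler's algorithm. It assumes, as a simplification, that the WCL of a channel is dominated by the longest wait for the next slot owned by that channel; the function name `worst_case_wait` and the slot positions are hypothetical.

```python
def worst_case_wait(slots, period):
    """Largest cyclic gap (in slots) between consecutive slots of one channel.

    Assumption: a word that just missed a slot waits until the next slot
    owned by the same channel, wrapping around the schedule period.
    """
    ordered = sorted(slots)
    gaps = []
    for i, start in enumerate(ordered):
        nxt = ordered[(i + 1) % len(ordered)]
        # A single slot per period gives a gap of one full period.
        gaps.append((nxt - start) % period or period)
    return max(gaps)


PERIOD = 12                      # hypothetical schedule period
clustered = [0, 1, 2]            # three paths scheduled back to back
spread = [0, 4, 8]               # same bandwidth, evenly spread

clustered_wait = worst_case_wait(clustered, PERIOD)
spread_wait = worst_case_wait(spread, PERIOD)
print(clustered_wait, spread_wait)
```

With three slots in a period of 12, clustering them in slots 0 to 2 gives a worst-case wait of 10 slots, whereas spreading them evenly gives 4, at the same bandwidth.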

## 8.3 Future work

The T-CREST NoC platform should be updated with the suggested improvements, to increase the flexibility of the platform. A version of the T-CREST platform using compressed distributed routing should be investigated. Our results indicate that the resource consumption should decrease and the bandwidth should increase. Also the hardware complexity of the router should decrease to something similar to the router of the S4NoC platform. When the T-CREST WCET-aware compiler and the Patmos processor are stable, they should be integrated into the T-CREST tool chain.



# Appendix A

# S4NoC paper

Our published paper about the S4NoC is attached in the following pages of this appendix.

### A Light-Weight Statically Scheduled Network-on-Chip

Rasmus Bo Sørensen, Martin Schoeberl, Jens Sparsø

Department of Informatics and Mathematical Modeling, Technical University of Denmark

Email: rasmus@rbscloud.dk, masca@imm.dtu.dk, jsp@imm.dtu.dk

Abstract—This paper investigates how a light-weight, statically scheduled network-on-chip (NoC) for real-time systems can be designed and implemented. The NoC provides communication channels between all cores with equal bandwidth and latency. The design is FPGA-friendly and consumes a minimum of resources. We implemented a 64 core 16-bit multiprocessor connected with the proposed NoC in a low-cost FPGA.

#### I. INTRODUCTION

For chip-multiprocessor (CMP) systems used in real-time systems we need time-predictable processors, memories, and communication channels. For on-chip core-to-core communication, a network-on-chip (NoC) is a scalable solution. In order to build a time-predictable CMP, the NoC is time-division-multiplexed (TDM). The NoC uses a static schedule; tables implementing this schedule are stored in each router and each network adapter. We use a schedule that provides all-to-all communication between all nodes, as depicted conceptually in Figure 1.

In [11] we have shown that a router for a statically scheduled NoC is very small. In this paper we explore the full design, containing a processor, the network adapter, and the router. We explore how small this system can be and still represent a usable architecture. In other words we aim at a many-core architecture in a medium size FPGA. With our size-optimized processor Leros [10], which can be implemented in about 190 logic cells (LC), we set a very low bar for a NoC. One expects that the communication infrastructure is smaller than the processing node.

One TDM-based router and one minimalistic network adapter consume 665 LCs and 2 on-chip memory blocks for an 8x8 bi-torus configuration. Therefore, we were able to synthesize and run an 8x8 CMP system, containing 64 processors, network adapters, and routers, in the low-cost Cyclone II FPGA EP2C70 on the DE2-70 board. The contributions of the paper are:

- The design of a minimal network adapter for a TDM based NoC
- A 64 core CMP, running a simple test application
- Providing the NoC in open-source form

The paper is organized as follows: The following section presents related work in the area of real-time NoCs. Section III presents the design of the TDM scheduled network-on-chip. A minimal network adapter is described in Section IV. The simple implementation of the system is presented in Section V.



Fig. 1. A conceptual interconnect providing all-to-all connection between microprocessors ($\mu$P).

An alternative implementation of the system is described in Section VI. We present our results in Section VII. In Section VIII we discuss the strengths and weaknesses of the design. The paper is concluded in Section IX.

#### II. RELATED WORK

Æthereal [4] uses TDM, i.e., reserves resources for certain points in time. In each time slot a block of data is forwarded through a router without waiting or blocking traffic, hence, contention cannot occur. Slot tables with routing information are contained in the routers and no arbitration or link-to-link flow control is required. Instead, a credit-based flow control is applied for end-to-end control, saving buffer space between links. Guaranteed services are combined with best effort routing in order to utilize unreserved resources. aelite, a light version of Æthereal, only offers guaranteed services resulting in a simpler router design [5]. Slot tables are placed in the network adapter and routing is done through message headers. In the latest version of aelite, called dAElite [12], the static routing tables are back in the routers to support multicast routing.

SoCBUS [13] and the NoC presented in [14] use a circuit-switching NoC, i.e., no resources, such as wires and router buffers, are shared between connections. This lowers utilization and increases costs. However, once a connection has been established, real-time guarantees are trivially achieved. It is, however, possible that a requested connection cannot be set up due to lack of resources (links); this may compromise the real-time properties.

Fig. 2. The network architecture: (a) the bidirectional torus topology, (b) a node/tile, and (c) the router.

MANGO [1] is an asynchronous NoC, which supports both guaranteed service (GS) and best effort (BE) traffic, by using non-blocking routers and rate control. A non-blocking router requires a separate physical buffer for each virtual circuit, an elaborate arbitration mechanism for each router output port, and a credit-based flow control mechanism among output buffers in neighboring routers. This indicates a considerable hardware cost of the rate-controlled routers.

A time-triggered NoC (TT-NoC) applies the concepts of the time-triggered architecture (TTA) [6] to NoCs [9]. The TT-NoC consists of a ring structure and is therefore only intended for a small number of IP-cores. As the ring is built out of simple multiplexers and registers, it is clocked at double the frequency of the computation nodes. Similar to our presented design, the communication schedule is static and predetermined.

Paukovits and Kopetz use a time-triggered NoC for the time-triggered system-on-chip (TTSoC) architecture [7]. The messages use wormhole routing and the TDM slotting is based on complete message transmissions. The TTSoC is topology agnostic. The prototype uses an uncommon version of a mesh topology: a 3x2 mesh supporting 10 computation nodes. Therefore, the corner routers are connected to two computation nodes. Our design shares the idea of static scheduling based on TDM. However, we base our schedule on the finer-grained network clock and take pipeline effects in the network into account.

#### III. A STATICALLY SCHEDULED NOC

In [11] we presented the idea of a statically scheduled TDM-based NoC, called S4NoC, that provides all-to-all communication in regular topologies (e.g., mesh, torus, bi-torus, tree). We presented results on the minimum period of a schedule that provides all-to-all communication and derived first resource estimates for the routers. All-to-all communication schedules, which are only 15% to 20% longer than theoretical lower bounds, can be calculated with a heuristic [2].

In this paper we design a simple network adapter to go along with the simple router and we implement the whole system. The network adapter has to do some bookkeeping and buffering of data and thus the design will be more complex than that of the router. We still aim to keep the design as simple as possible.

A router for the NoC is very simple, which is one of the motivations for using a statically scheduled TDM-based NoC. For a mesh or a bi-torus a router has 5 bi-directional ports (north, east, south, west, local) and each output port is a pipeline stage consisting of a register with a 4-to-1 multiplexer on its input (a word cannot leave through the port it arrived on). The multiplexers are controlled by schedule tables indexed by a slot counter. This avoids the need to transmit address information with the packet. Without the pressure to amortize the header overhead we can use arbitrarily short packets. Therefore, we transmit and schedule single words as packets, which helps to keep the schedule period short and the latency moderate.
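The router structure described above can be sketched as a small behavioural model in Python. This is an illustration under assumptions, not the VHDL implementation: the class name `Router`, the dictionary-based schedule format, and the two-slot example schedule are hypothetical.

```python
PORTS = ("north", "east", "south", "west", "local")


class Router:
    """Behavioural model: one output register per port, fed by a multiplexer."""

    def __init__(self, schedule):
        # schedule[slot] maps an output port to the input port it forwards
        # in that slot; output ports that are not listed stay idle.
        self.schedule = schedule
        self.slot = 0
        self.out_regs = {port: None for port in PORTS}

    def clock(self, inputs):
        """Advance one clock cycle; `inputs` maps input port -> word."""
        selects = self.schedule[self.slot]
        next_regs = {}
        for out_port in PORTS:
            in_port = selects.get(out_port)
            # A word never leaves through the port it arrived on.
            assert in_port != out_port
            next_regs[out_port] = None if in_port is None else inputs.get(in_port)
        self.out_regs = next_regs
        # The slot counter wraps around at the end of the schedule period.
        self.slot = (self.slot + 1) % len(self.schedule)
        return self.out_regs


# Hypothetical two-slot schedule: slot 0 forwards local -> east,
# slot 1 forwards west -> local.
schedule = [{"east": "local"}, {"local": "west"}]
router = Router(schedule)
first = router.clock({"local": 0x1234})
second = router.clock({"west": 0x00FF})
```

Because the multiplexer selects come only from the table and the slot counter, the words themselves carry no address, which is what allows single-word packets.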

For the evaluation described in the following sections we assume a bi-torus topology, as shown in Figure 2(a). Each node consists of a processor with local instruction and data memories, a network adapter, and a router, as shown in Figure 2(b). The processors execute from their local memories and communicate by sending messages across the network. The NoC provides (virtual) channels, all with the same bandwidth, allowing a processor to send messages to and receive messages from all other processors. For simplicity we restrict to single word messages, and by using the same width of the links and routers in the NoC we get a simple design, as illustrated in Figure 2(c), where a message traverses a router in one clock cycle.

The router is obviously very simple (i.e. small and fast) and the sizes of different processors targeting FPGA implementations are also quite well known. The third and last component in a tile is the network adapter. Despite our aim for simplicity its function is non-trivial, and its size and speed are difficult to assess. This is one of the main reasons for the design experiment reported in this paper: to get reliable speed and area figures and to gain insight into the design of this critical component.

Fig. 3. A tile including the Leros processor, the network adapter with receive, transmit, and status registers, the interface to the router, and the router.

The network adapter's interface towards the processor is similar to a memory mapped IO-device, and it offers input and output registers corresponding to all the incoming and outgoing (virtual) channels connecting it to all the other processors. The design is described in more detail in the following.

#### IV. THE NETWORK ADAPTER

The basic functionality of a minimalistic network adapter (NA) is to present an interface to the processing core, which enables the processor to access communication channels to other cores efficiently. The processing core should not be concerned with managing time slots. To fully utilize the network, the minimalistic NA has to meet the following requirements:

- Provide an interface to view the status of all channels to the processor
- Send and receive single data words to and from the network in line with the TDM mindset to all other cores in the system
- The NA must be able to transmit and receive data in all consecutive time slots

To synchronize the sending and receiving of flits (transmitted logical words) to the router, the NA uses a time slot table. The time slot table is generated from the static schedule for the size and topology of the desired system. The time slot table in an NA maps a given time slot to a source and a destination address. These addresses are the IDs of the receiving or transmitting processing cores, thus the time slot tables are different for all NAs. The time slot table is driven by a counter in the NA.

The block diagram of the NA is shown in Figure 3. The processor can write to the transmit (Tx) buffer, or read from either the receive (Rx) buffer or the status registers. The status registers show the status of each register in the Tx or Rx channels, i.e., whether the Tx register is ready to receive or whether the Rx register is ready to be read out.

The interface to the processor is an address space, where each communication channel is mapped to one address and status registers are mapped to several registers depending on the number of cores in the network. In each of the two status registers, each bit represents a communication channel to one other core in the system. Maximizing the utilization of the given hardware, the static schedule is made such that the NA can both send and receive flits in each time slot.

In this simple NA not much control is needed. The task of the control logic is to set and reset bits in the status registers when flits are received and transmitted. The task of controlling each bit of the status registers individually is not complicated, but an increasing number of bits leads to an increasing amount of control logic. To set and reset each bit of the status registers efficiently, the status registers should be implemented in flip-flops.
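The interplay of the time slot table, the Tx and Rx buffers, and the status bits can be sketched as a behavioural model in Python. This is a minimal sketch under assumptions, not the published design: the class and method names are hypothetical, the slot table is assumed to hold one `(destination, source)` pair per slot, and a status bit is assumed to be set when a word is waiting and cleared when it has been sent or read.

```python
class NetworkAdapter:
    """Behavioural sketch of the NA: one Tx and one Rx word per channel."""

    def __init__(self, slot_table):
        # slot_table[slot] = (dest_core, src_core): the core this NA sends
        # to and the core it receives from in that slot (assumed layout).
        self.slot_table = slot_table
        self.slot = 0
        cores = 1 + max(max(dest, src) for dest, src in slot_table)
        self.tx = [None] * cores
        self.rx = [None] * cores
        self.tx_pending = 0   # bit i set: a word is waiting to go to core i
        self.rx_valid = 0     # bit i set: an unread word from core i

    def write(self, dest, word):
        """Processor write; refused while the Tx register is still occupied."""
        if (self.tx_pending >> dest) & 1:
            return False
        self.tx[dest] = word
        self.tx_pending |= 1 << dest
        return True

    def read(self, src):
        """Processor read; returns None when no new word has arrived."""
        if not (self.rx_valid >> src) & 1:
            return None
        self.rx_valid &= ~(1 << src)
        return self.rx[src]

    def clock(self, flit_in):
        """One TDM slot: possibly send one flit and receive one flit."""
        dest, src = self.slot_table[self.slot]
        flit_out = None
        if (self.tx_pending >> dest) & 1:
            flit_out = self.tx[dest]
            self.tx_pending &= ~(1 << dest)
        if flit_in is not None:
            self.rx[src] = flit_in
            self.rx_valid |= 1 << src
        self.slot = (self.slot + 1) % len(self.slot_table)
        return flit_out


# Hypothetical table for core 0 in a 3-core system.
na = NetworkAdapter([(1, 2), (2, 1)])
accepted = na.write(2, 0xBEEF)
refused = na.write(2, 0x0001)
sent_slot0 = na.clock(0x0011)   # nothing pending for core 1; receive from core 2
word_from_2 = na.read(2)
sent_slot1 = na.clock(None)     # the word for core 2 leaves in its slot
```

The processor only touches the buffers and status bits; the slot counter alone decides when a word enters or leaves the network, which keeps the core unaware of time slots.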

#### V. IMPLEMENTATION

In this first simple implementation the whole system resides in one global clock domain. Our design is technology agnostic, but in this section the implementation decisions are related to the Cyclone II FPGA we have used for testing.

*Processor Interface:* On the processor side of the network adapter, the processor needs the ability to read out the status of the communication channels and to read or write data to the individual communication channels. The data for the communication channels are written directly to, or read directly from, the block RAM. Because the address on the block RAM is registered, there is a one-cycle delay on a read from the communication buffer. The simple way of solving this problem is to require that when the processor wants to read data, the same read instruction is executed twice. When the status registers are read, a multiplexer selects which part of the status register is returned. In a design where a read from the NA would limit the clock frequency of the processor, the NA could implement indirect addressing. For indirect addressing the processor writes the address of a request into an address register and in the following clock cycle the data at that address can be accessed. Indirect addressing will cut the processor I/O bandwidth in half, but the clock frequency of the system could increase.
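The need to execute the read twice can be illustrated with a tiny Python model of a memory with a registered address. It is a sketch only; the class name `RegisteredRam` and the memory contents are hypothetical, and the model abstracts each read instruction to one clock cycle.

```python
class RegisteredRam:
    """Memory whose data output reflects the address registered one cycle ago."""

    def __init__(self, words):
        self.mem = list(words)
        self.addr_reg = 0          # address captured on the previous access

    def read(self, addr):
        # The output still corresponds to the previously registered address.
        data = self.mem[self.addr_reg]
        # The new address becomes visible on the next access.
        self.addr_reg = addr
        return data


ram = RegisteredRam([10, 20, 30])
first = ram.read(2)    # stale: data of the previously registered address
second = ram.read(2)   # the repeated read returns the requested word
```

In this model the first read returns the word at the old address and only the repeated read returns the word at address 2, which mirrors the rule of issuing the same read instruction twice.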

*Communication Buffer:* In this simple implementation we use one dual-ported block RAM for each communication buffer. A block RAM in a Cyclone II FPGA holds 4 Kbit. When the block RAMs are used as buffers, one port is used only for writing and the other only for reading. Our system supports all-to-all communication and each communication channel requires two 16-bit words of storage in the NA. The two RAM blocks will support systems of up to 16x16 nodes. In the Cyclone II FPGA the RAM blocks will not be fully utilized. In systems where block RAMs can be instantiated with a finer granularity the resource consumption can be decreased. In an ASIC design the utilization can reach 100%.

*Control Logic:* Our implementation of the described design is not tuned for any specific number of processing cores. The circuitry for selecting and updating the status registers consists of (Number of cores)-to-1 multiplexers and 1-to-(Number of cores) decoders. Updating the status registers can limit the clock frequency when the number of cores in the system is sufficiently large. When instantiating a system of a specific size, the control logic can be pipelined if the desired frequency is not obtained. This pipelining results in a longer latency for status register updates, both for setting and resetting; the software should be aware of this longer latency.

#### VI. ALTERNATIVE DOUBLE-CLOCK IMPLEMENTATION

The routers are simple: just registers connected with 4:1 multiplexers. Therefore, they can be run at a higher frequency than a processor. If we use a second clock, synchronous to the main clock and at double the frequency, we can time-share the router, thus reducing the size of the router by 50%. The block RAM can usually also run at a higher frequency (e.g., at up to 250 MHz in the Cyclone II device). Therefore, it can also use the double-frequency clock. Then we need only one block RAM for both communication buffers. The block RAM consumption for an FPGA implementation can be reduced for all NoC sizes up to 11x11.
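The 16x16 and 11x11 limits follow from the block RAM capacity. The short Python calculation below reproduces them as a sketch; it assumes that 4 Kbit means 4096 bits, that each core ID occupies one 16-bit word in the Tx buffer and one in the Rx buffer, and that the network is a square grid. The function name `max_grid_side` is hypothetical.

```python
from math import isqrt

BLOCK_BITS = 4 * 1024   # one Cyclone II block RAM, assumed to be 4096 bits
WORD_BITS = 16          # one flit


def max_grid_side(blocks_for_both_buffers):
    """Largest n such that an n x n system fits the Tx and Rx buffers."""
    words_total = blocks_for_both_buffers * BLOCK_BITS // WORD_BITS
    # Half of the words hold Tx channels, the other half Rx channels.
    words_per_direction = words_total // 2
    return isqrt(words_per_direction)


single_clock_side = max_grid_side(2)   # one block RAM per buffer
double_clock_side = max_grid_side(1)   # one shared, double-clocked block RAM
print(single_clock_side, double_clock_side)
```

Two block RAMs give 256 words per direction, enough for 16x16 nodes, while a single shared block RAM gives 128 words per direction, enough for 11x11 (121) but not 12x12 (144) nodes.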

Running the network at a higher frequency requires the NA to split a single flit into two phits. An extra pipeline stage is needed to align phits (physically transmitted words) to flits in a TDM time slot. The processors in the dual-clock design all run in the primary clock domain. The network, i.e., the routers and part of the NAs, runs in the double-clock domain. As the clocks are synchronous there is no real clock domain crossing needed. Only the back-end of the NA needs to handle the splitting and merging of phits between the two clock domains. The block RAM uses the double clock to double the number of ports.

#### TABLE I

Resource consumption and maximum frequency of one network adapter (NA) and one router (R) implemented in a Cyclone II (EP2C70) FPGA. The numbers include the TDM time slot tables.

| Cores       | 16    | 25    | 36    | 49    | 64    |
|-------------|-------|-------|-------|-------|-------|
| LUT         | 278   | 383   | 484   | 517   | 665   |
| Reg         | 145   | 171   | 186   | 197   | 217   |
| RAM (KBits) | 1     | 1     | 2     | 2     | 4     |
| Freq. (MHz) | 106.6 | 106.7 | 104.3 | 106.0 | 104.8 |

If both sides of the block RAM are clocked with the network clock, the NA can return the value of a read to the processor in the same clock cycle as the read is made, thus the need to execute the read instruction twice is eliminated. Furthermore, the NA can be made to support simultaneous read/write from the processor, which is supported in the processor interface but not in the Leros processor. Implementing the Tx and Rx buffers in one block RAM requires three ports to the block RAM: one read/write port for the processor interface, one read port for the Tx channel on the network side of the block RAM, and one write port for the Rx channel on the network side of the block RAM. The NA buffers the phits of a flit until the entire flit is sent or received. A flit can be read from the Tx buffer in every even clock cycle and a flit can be written to the Rx buffer in every odd clock cycle.

Additional complexity is added to the NA when the data width of the router is cut in half. The reduction in data width calls for serialization in the NA, taking more area. Also the multipumped block RAM increases complexity, multiplexing the Rx and Tx data through the same port on the block RAM. A not so obvious source of added complexity is the control logic. If the large multiplexers for selecting the status bit to update are clocked on the fast clock, they may need pipelining.
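The serialization that the NA has to perform in the double-clock design can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical and sending the upper half of the flit first is an assumption, since the text does not state the phit order.

```python
FLIT_BITS = 16
PHIT_BITS = FLIT_BITS // 2
PHIT_MASK = (1 << PHIT_BITS) - 1


def split_flit(flit):
    """Split one 16-bit flit into two 8-bit phits (upper half first, assumed)."""
    return (flit >> PHIT_BITS) & PHIT_MASK, flit & PHIT_MASK


def merge_phits(first, second):
    """Reassemble a flit from two phits received in consecutive fast cycles."""
    return (first << PHIT_BITS) | second


phits = split_flit(0xABCD)
flit = merge_phits(*phits)
```

The two functions correspond to the extra packing and unpacking logic in the NA, which is area that the single-clock design does not need.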

#### VII. RESULTS

To obtain results for resource consumption and maximum frequency we have used Quartus II 12.0 to compile and synthesize the design. We have also tested the implementation in our Cyclone II FPGA with a small program sending messages between all cores and when all messages are received a message is written to the UART connected to core zero. The test program is written in assembler for a 16-core system, but can easily be extended to 64 cores.

The resources shown in Table I are for one tile, excluding the processor itself, for the different network sizes that fit in our Cyclone II FPGA. The resource consumption is shown in lookup tables (LUT), registers (Reg), and memory bits (RAM). Along with the resource consumption we also show the maximum frequency that the components can operate at. The numbers include the TDM time slot tables in the router and the network adapter.

#### TABLE II

Resource consumption and schedule period of the TDM time slot tables for the network adapter (NA) and the router (R).

| Cores           | 9  | 16 | 25 | 36 | 49 | 64  | 81  |
|-----------------|----|----|----|----|----|-----|-----|
| Period (clocks) | 10 | 19 | 27 | 42 | 58 | 87  | 113 |
| NA (LUT)        | 6  | 12 | 23 | 39 | 46 | 71  | 96  |
| R (LUT)         | 12 | 28 | 38 | 63 | 78 | 127 | 173 |

#### TABLE III

A relative comparison between the single-clocked and the dual-clocked implementations.

| Cores | LUT  | Reg  | RAM  | Freq. |
|-------|------|------|------|-------|
| 16    | 1.18 | 1.48 | 0.50 | 0.97  |
| 64    | 1.44 | 1.91 | 0.50 | 0.74  |

The resource consumption of the different entities of the system differs from core to core. The numbers in Table I are from the entities of core zero (upper left corner) for the given network size. Core zero is usually the largest entity, but this can vary with the network size. The resource consumption of tiles is not uniform throughout the implemented systems.

The major reasons for the increase in the resource consumption of one network adapter and one router as the number of cores in the system grows are:

1. Bookkeeping of the status bits in the NA, which increases both the Reg and the LUT count
2. The size of the routing tables grows, which increases the LUT count
3. Buffering more data channels, which increases the RAM size

The frequency appears to be close to constant for the network sizes we have synthesized, with small fluctuations from run to run of the synthesis. We expect the frequency to decrease when the system size grows larger than what we have experimented with, because of the increase in bookkeeping. To avoid the frequency slowdown for larger systems the bookkeeping mechanism can be pipelined.

In Table II we present numbers for the resource consumption of the slot tables located in the routers and network adapters along with the period of the TDM schedule. The number of lookup tables increases in proportion to the period of the TDM schedules. The numbers for these slot tables are not specific to our implementation of the network adapter, but more general for these types of TDM schedules.
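The proportionality can be checked directly from the numbers in Table II. The following Python snippet is a small worked calculation over those values; the variable names are chosen for illustration.

```python
# Values copied from Table II.
cores = [9, 16, 25, 36, 49, 64, 81]
period = [10, 19, 27, 42, 58, 87, 113]    # schedule period in clock cycles
na_lut = [6, 12, 23, 39, 46, 71, 96]      # slot table in the network adapter
r_lut = [12, 28, 38, 63, 78, 127, 173]    # slot table in the router

# LUTs per schedule slot; a roughly constant value indicates proportionality.
na_per_slot = [lut / p for lut, p in zip(na_lut, period)]
r_per_slot = [lut / p for lut, p in zip(r_lut, period)]

for n, na, r in zip(cores, na_per_slot, r_per_slot):
    print(f"{n:3d} cores: NA {na:.2f} LUT/slot, R {r:.2f} LUT/slot")
```

For these numbers the router tables cost roughly 1.2 to 1.5 LUTs per schedule slot and the NA tables roughly 0.6 to 0.9, so the ratio stays within a narrow band while the period grows by more than a factor of ten.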

In Table III we show the relative size of the double-clock design compared to the single-clock design. The dual-clocked design was intended to be smaller as the router multiplexers are only half the size. However, only the RAM consumption is lower due to double clocking. The logic consumes more resources. The additional circuitry in the NA for packing and unpacking offsets the reduction in the routers.

Furthermore, there is also a relative decrease in frequency when using the double-clocked implementation. The disadvantages of the double-clocked implementation increase as the system size grows. On top of that, the complexity of the dual-clocked network is higher, making it more complicated to debug and harder to maintain.

Therefore, the double clocking of the network structure proves not to be beneficial. However, the double clocking of the communication memory reduced the number of block RAMs to a single one. Therefore, one design point can be a single clock per packet NoC and NA, but double-clock the block RAM.

#### VIII. DISCUSSION

In Section VI we have described an alternative implementation with double clocked routers. However, the results presented in Section VII show a higher resource consumption for this alternative. This is another indication that simplicity often wins, as the simple NA implementation was the smallest and fastest.

With a higher number of nodes, the resource consumption of the routing tables increases per node. However, it has to be noted that the router tables start very small and therefore the increase is moderate. A complete NA and router with the routing tables for a 64-core system is still just 665 LCs.

If one would even like to reduce this size further, an application specific schedule can be used, i.e., a schedule where not all cores can communicate to all other cores. An application specific schedule can reduce the period length of the slot table schedule and thereby the resource consumption. It will also reduce the latency due to the shorter period.

An application-specific schedule requires reconfigurable hardware. However, this extra hardware complexity could reduce the benefit of an application-specific schedule. In the natively reconfigurable hardware of an FPGA the application-specific schedule can be part of the FPGA configuration and therefore be quite efficient. No programming of the schedule during run time is needed.

The scheduler presented in [8] is capable of making such application specific schedules. These schedules have been tested on the implementation of our NoC.

As our target is to explore many-core architectures in medium sized FPGAs, we decided to use a small microprocessor, Leros [10], as the processing node. Leros is a 16-bit processor intended for small applications and utility functions similar to Xilinx's PicoBlaze [15]. Leros is an accumulator machine and uses on-chip memory for instructions and data. The data memory also contains a register file, i.e., the first 256 data locations can be directly addressed. Leros implements a two-stage pipeline and can be clocked faster than 100 MHz in Cyclone and Spartan devices.

Tiny microprocessors, like Leros, are usually programmed in assembler. Leros also comes with an assembler. However, to provide a higher level programming language, the muvium Java system has been adapted for Leros [3]. Muvium compiles Java class files into Leros assembler. The Java supported by Muvium/Leros is a *very* restricted subset. However, it is enough to write test and example programs for the presented NoC configuration.

#### IX. CONCLUSION

This paper presents a network-on-chip for real-time systems. The communication is scheduled statically in a time-division-multiplexed manner. This static schedule provides all-to-all communication for the chip-multiprocessor system. The resulting router is quite small and calls for an efficient implementation of the network adapter. The presented network adapter provides one word of buffer for each transmit and receive channel. By time-multiplexing a single on-chip memory it can be used to buffer input and output channels, even with one receive and one transmit word per clock cycle.

The presented network adapter is small and is therefore a good fit for the small and simple router. With a tiny processor we were able to build a 64-core system connected via a bidirectional torus network-on-chip in a medium-sized FPGA from the low-cost Altera Cyclone II series.

#### Acknowledgment

We would like to thank James Caska for his support on the Java bytecode compiler muvium for Leros. Furthermore, we thank Florian Brandner, who has helped us with the schedule generation for the router and NA tables. This work was partially funded under the European Union's 7th Framework Programme under grant agreement no. 288008: Time-predictable Multi-Core Architecture for Embedded Systems (T-CREST).

#### Source Access

We provide the VHDL code of the NoC and Leros in open source. The design is vendor agnostic; only the Makefile is board specific, with the Altera DE2-70 board as its default target. The source can be found at

https://github.com/t-crest/s4noc

The source can be downloaded as a zip file or with git:

    git clone git://github.com/t-crest/s4noc.git

With a DE2-70 board attached, the whole design can be built and downloaded with a simple:

    cd s4noc
    make

See the Makefile for different build options. The build process on a Windows PC needs Altera Quartus, a Java compiler for the Leros application compilation, and a Cygwin environment for the make and git commands.

#### REFERENCES

- [1] T. Bjerregaard and J. Sparsø. A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip. In *Proceedings of the Design, Automation and Test in Europe Conference (DATE 2005)*, pages 1226–1231. IEEE Computer Society Press, 2005.
- [2] Florian Brandner and Martin Schoeberl. Static routing in symmetric real-time network-on-chips. In *Proceedings of the 20th International Conference on Real-Time and Network Systems (RTNS 2012)*, Pont a Mousson, France, November 2012.
- [3] James Caska and Martin Schoeberl. Java dust: How small can embedded Java be? In *Proceedings of the 9th International Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2011)*, York, UK, September 2011. ACM.
- [4] Kees Goossens and Andreas Hansson. The AEthereal network on chip after ten years: Goals, evolution, lessons, and future. In *Proceedings of the 47th ACM/IEEE Design Automation Conference (DAC 2010)*, pages 306–311, 2010.
- [5] Andreas Hansson, Mahesh Subburaman, and Kees Goossens. aelite: a flit-synchronous network on chip with composable and predictable services. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE 2009), pages 250–255, Leuven, Belgium, 2009.
- [6] Hermann Kopetz and Günther Bauer. The time-triggered architecture. Proceedings of the IEEE, 91(1):112–126, 2003.
- [7] C. Paukovits and H. Kopetz. Concepts of switching in the time-triggered network-on-chip. In Proceedings of the 14th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2008), pages 120 – 129, August 2008.
- [8] Mark Ruvald Pedersen, Jaspur Højgaard, and Rasmus Bo Sørensen. Scheduling in a real-time network-on-chip. Technical report, https://github.com/t-crest/SNTs, 2012.
- [9] Martin Schoeberl. A time-triggered network-on-chip. In International Conference on Field-Programmable Logic and its Applications (FPL 2007), pages 377–382, Amsterdam, Netherlands, August 2007. IEEE.
- [10] Martin Schoeberl. Leros: A tiny microcontroller for FPGAs. In Proceedings of the 21st International Conference on Field Programmable Logic and Applications (FPL 2011), Chania, Crete, Greece, September 2011. IEEE Computer Society.
- [11] Martin Schoeberl, Florian Brandner, Jens Sparsø, and Evangelia Kasapaki. A statically scheduled time-division-multiplexed network-on-chip for real-time systems. In *Proceedings of the 6th International Symposium* on *Networks-on-Chip (NOCS)*, Lyngby, Denmark, May 2012. IEEE.
- [12] Radu Stefan, Anca Molnos, Angelo Ambrose, and Kees Goossens. A TDM NoC supporting QoS, multicast, and fast connection set-up. In Proceedings of the Design, Automation and Test in Europe Conference (DATE 2012), 2012.
- [13] Daniel Wiklund and Dake Liu. SoCBUS: Switched network on chip for hard real time embedded systems. In *International Parallel and Distributed Processing Symposium (IPDPS'03)*, page 78a, Los Alamitos, CA, USA, 2003. IEEE Computer Society.
- [14] Pascal T. Wolkotte, Gerard J.M. Smit, Gerard K. Rauwerda, and L. T. Smit. An energy-efficient reconfigurable circuit switched network-onchip. In Proc. Int'l Parallel and Distributed Processing Symposium (IPDPS), April 2005.
- [15] Xilinx. PicoBlaze 8-bit embedded microcontroller user guide, 2010.

### Appendix B

### **T-CREST NoC source code**

This appendix contains the following files:

- sc2ocp\_noc.vhd is a wrapper for the whole T-CREST NoC platform. The wrapper converts from the OCP interface on the NoC to the SimpCon interface on the JOP. The file starts on page 70.
- tb\_sc2ocp.vhd is a testbench for the T-CREST NoC wrapper. The testbench tests the different parts of the T-CREST NoC address space. The file starts on page 72.
- test.vhd is a package with procedures for the testbench. The file starts on page 75.
- noc\_node.vhd comes from the T-CREST NoC platform and describes a scratch pad and a network interface. We modified this file to merge the port to the scratch pad with the port to the network interface, and to add address decoding. The file starts on page 76.
- nAdapter.vhd comes from the T-CREST NoC platform and describes the network interface. We modified the address decoding and the command types of the OCP interface in this file. The file starts on page 81.
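The command conversion performed by the wrapper can be isolated in a few lines. The following is a minimal sketch, not part of the T-CREST sources: the entity names `sc_cmd_encode` and `tb_sc_cmd_encode` are hypothetical, and the 2-bit command width is an assumption based on how `MCmd(1)` and `MCmd(0)` are indexed in the listings. It reuses the expression from sc2ocp\_noc.vhd that combines the SimpCon `rd` and `wr` strobes into one command, so that idle is "00", a read is "10", and a write is "11".

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper entity (not in the T-CREST sources): maps the
-- SimpCon rd/wr strobes to a 2-bit command as done in sc2ocp_noc.vhd.
entity sc_cmd_encode is
  port (
    rd   : in  std_logic;
    wr   : in  std_logic;
    mcmd : out std_logic_vector(1 downto 0)
  );
end entity sc_cmd_encode;

architecture rtl of sc_cmd_encode is
begin
  -- Bit 1 flags any request, bit 0 flags a write.
  mcmd <= (wr or rd) & wr;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;

-- Self-checking testbench for the sketch above.
entity tb_sc_cmd_encode is
end entity tb_sc_cmd_encode;

architecture sim of tb_sc_cmd_encode is
  signal rd, wr : std_logic := '0';
  signal mcmd   : std_logic_vector(1 downto 0);
begin
  dut : entity work.sc_cmd_encode
    port map (rd => rd, wr => wr, mcmd => mcmd);

  check : process
  begin
    wait for 1 ns;
    assert mcmd = "00" report "idle should encode as 00" severity failure;
    rd <= '1';
    wait for 1 ns;
    assert mcmd = "10" report "read should encode as 10" severity failure;
    rd <= '0';
    wr <= '1';
    wait for 1 ns;
    assert mcmd = "11" report "write should encode as 11" severity failure;
    wait;
  end process check;
end architecture sim;
```

With this encoding, the slave side only needs bit 0 to detect a write, and bit 1 set with bit 0 cleared to detect a read, which is the check noc\_node.vhd applies to `proc_in.MCmd`.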

Listing B.1: sc2ocp\_noc.vhd

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.defs.all;
use work.sc_pack.all;

entity sc2ocp_noc is
  port (
    clk        : in  std_logic;
    reset      : in  std_logic;

    sc_noc_out : in  sc_out_array_type(0 to (N*N)-1);
    sc_noc_in  : out sc_in_array_type(0 to (N*N)-1)
  );
end sc2ocp_noc;

architecture struct of sc2ocp_noc is
  -- NoC signals
  signal procM : procMasters;
  signal procS : procSlaves;

  -- SimpCon signals
  signal sc_noc_out_reg, sc_noc_out_next : sc_out_array_type(0 to (N*N)-1);
  signal sc_noc_in_reg, sc_noc_in_next   : sc_in_array_type(0 to (N*N)-1);

begin

  noc : entity work.noc
    port map (
      p_clk       => clk,
      n_clk       => clk,
      reset       => reset,
      p_ports_in  => procM,
      p_ports_out => procS
    );

  sc_noc_in <= sc_noc_in_reg;

  -- Connecting the NoC to the processors
  process (sc_noc_out_reg, sc_noc_in_reg, sc_noc_out, procS)
  begin
    NoC2Proc : for i in 0 to N-1 loop
      innerNoC2Proc : for j in 0 to N-1 loop
        procM(i)(j).MCmd  <= (sc_noc_out_reg(i*N+j).wr or sc_noc_out_reg(i*N+j).rd)
                             & sc_noc_out_reg(i*N+j).wr;
        procM(i)(j).MAddr <= std_logic_vector(to_unsigned(0, OCP_ADDR_WIDTH-SC_ADDR_SIZE))
                             & sc_noc_out_reg(i*N+j).address;
        procM(i)(j).MData <= sc_noc_out_reg(i*N+j).wr_data;

        sc_noc_out_next(i*N+j) <= sc_noc_out_reg(i*N+j);
        sc_noc_in_next(i*N+j)  <= sc_noc_in_reg(i*N+j);

        -- The acknowledge of a write
        if procS(i)(j).SCmdAccept = '1' and sc_noc_out_reg(i*N+j).wr = '1' then
          sc_noc_out_next(i*N+j).rd      <= '0';
          sc_noc_out_next(i*N+j).wr      <= '0';
          sc_noc_out_next(i*N+j).address <= (others => '0');
          sc_noc_out_next(i*N+j).wr_data <= (others => '0');
          sc_noc_in_next(i*N+j).rd_data  <= (others => '0');
          sc_noc_in_next(i*N+j).rdy_cnt  <= (others => '0');
        end if;

        -- The acknowledge of a read
        if procS(i)(j).SResp = '1' and sc_noc_out_reg(i*N+j).rd = '1' then
          sc_noc_out_next(i*N+j).rd      <= '0';
          sc_noc_out_next(i*N+j).wr      <= '0';
          sc_noc_out_next(i*N+j).address <= (others => '0');
          sc_noc_out_next(i*N+j).wr_data <= (others => '0');
          sc_noc_in_next(i*N+j).rd_data  <= procS(i)(j).SData;
          sc_noc_in_next(i*N+j).rdy_cnt  <= (others => '0');
        end if;

        if sc_noc_out(i*N+j).wr = '1' or sc_noc_out(i*N+j).rd = '1' then
          sc_noc_out_next(i*N+j)        <= sc_noc_out(i*N+j);
          sc_noc_in_next(i*N+j).rdy_cnt <= (others => '1');
        end if;
      end loop;
    end loop; -- NoC2Proc
  end process;

  noc_reg : process (clk, reset) is
  begin
    if rising_edge(clk) then
      for i in 0 to N-1 loop
        for j in 0 to N-1 loop
          if reset = '1' then
            sc_noc_out_reg(i*N+j).rd      <= '0';
            sc_noc_out_reg(i*N+j).wr      <= '0';
            sc_noc_out_reg(i*N+j).address <= (others => '0');
            sc_noc_out_reg(i*N+j).wr_data <= (others => '0');
            sc_noc_in_reg(i*N+j).rd_data  <= (others => '0');
            sc_noc_in_reg(i*N+j).rdy_cnt  <= (others => '1');
          else
            sc_noc_in_reg(i*N+j)  <= sc_noc_in_next(i*N+j);
            sc_noc_out_reg(i*N+j) <= sc_noc_out_next(i*N+j);
          end if;
        end loop;
      end loop;
    end if;
  end process noc_reg;

end struct;
```

Listing B.2: tb\_sc2ocp.vhd

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.defs.all;
use work.sc_pack.all;
use work.test.all;
use work.txt_util.all;

entity tb_sc2ocp is
end tb_sc2ocp;

architecture RTL of tb_sc2ocp is
  constant CLOCK_PERIOD : time := 10 ns;
  constant RESET_TIME   : time := 21 ns;

  constant temp1 : std_logic_vector(31 downto 0) := DMA_P_MASK &
      std_logic_vector(to_unsigned(0, OCP_ADDR_WIDTH-ADDR_MASK_W));
  constant DMA_P_ADDR : natural := to_integer(unsigned(temp1));

  constant temp2 : std_logic_vector(31 downto 0) := SPM_MASK &
      std_logic_vector(to_unsigned(0, OCP_ADDR_WIDTH-ADDR_MASK_W));
  constant SPM_ADDR : natural := to_integer(unsigned(temp2));

  constant temp3 : std_logic_vector(31 downto 0) := DMA_MASK &
      std_logic_vector(to_unsigned(0, OCP_ADDR_WIDTH-ADDR_MASK_W));
  constant DMA_ADDR : natural := to_integer(unsigned(temp3));

  constant temp4 : std_logic_vector(31 downto 0) := ST_MASK &
      std_logic_vector(to_unsigned(0, OCP_ADDR_WIDTH-ADDR_MASK_W));
  constant ST_ADDR : natural := to_integer(unsigned(temp4));

  signal clk   : std_logic;
  signal reset : std_logic;

  signal sc_noc_out : sc_out_array_type(0 to (N*N)-1);
  signal sc_noc_in  : sc_in_array_type(0 to (N*N)-1);

  alias sc_in    is sc_noc_in(0);
  alias sc_out   is sc_noc_out(0);
  alias sc_in_2  is sc_noc_in(2);
  alias sc_out_2 is sc_noc_out(2);
  alias sc_in_4  is sc_noc_in(4);
  alias sc_out_4 is sc_noc_out(4);

begin
  -- Clock and reset
  clock_generator : clockGen(clk, CLOCK_PERIOD);
  reset_generator : resetGen(reset, RESET_TIME);

  dut : entity work.sc2ocp_noc
    port map (
      clk        => clk,
      reset      => reset,
      sc_noc_out => sc_noc_out,
      sc_noc_in  => sc_noc_in);

  stimuli_process : process
    variable result : natural;
  begin
    for i in 0 to N-1 loop
      for j in 0 to N-1 loop
        if reset = '1' then
          sc_noc_out(i*N+j).rd      <= '0';
          sc_noc_out(i*N+j).wr      <= '0';
          sc_noc_out(i*N+j).address <= (others => '0');
          sc_noc_out(i*N+j).wr_data <= (others => '0');
        end if;
      end loop;
    end loop;
    wait until reset = '0';
    wait for 11 ns;

    report "----- Testing DMA P -----";
    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_write(clk, DMA_P_ADDR+i, i, sc_out, sc_in, 5);
      wait for CLOCK_PERIOD;
    end loop;

    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_read(clk, DMA_P_ADDR+i, result, sc_out, sc_in, 6);
      assert result = i report "Wrong result read out!" severity failure;
      wait for CLOCK_PERIOD;
    end loop;
    report "----- DMA test passed -----";

    wait until rising_edge(clk);
    sc_write(clk, DMA_P_ADDR, 54, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, ST_ADDR+2, 16, sc_out_4, sc_in_4, 5);

    -- Write to spm
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22, 3, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+1, 6, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+2, 9, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+3, 12, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+4, 15, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+5, 18, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+6, 21, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+7, 24, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+8, 27, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+22+9, 30, sc_out_4, sc_in_4, 5);

    -- Setup dma
    wait until rising_edge(clk);
    sc_write(clk, DMA_ADDR+1, 1441922, sc_out_4, sc_in_4, 5);
    wait until rising_edge(clk);
    sc_write(clk, DMA_ADDR, 32778, sc_out_4, sc_in_4, 5);

    wait until rising_edge(clk);
    sc_write(clk, SPM_ADDR+224, 4, sc_out, sc_in, 5);
    wait until rising_edge(clk);
    sc_read(clk, SPM_ADDR+224, result, sc_out, sc_in, 5);
    assert result = 4 report "Something is very wrong!" severity failure;

    wait until rising_edge(clk);
    wait for 300 ns;

    test_function(clk, sc_out, sc_in, DMA_P_ADDR, CLOCK_PERIOD);
    test_function(clk, sc_out, sc_in, SPM_ADDR, CLOCK_PERIOD);
    test_function(clk, sc_out, sc_in, DMA_ADDR, CLOCK_PERIOD);
    test_function(clk, sc_out, sc_in, ST_ADDR, CLOCK_PERIOD);

    report "----- Testing SPM -----";
    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_write(clk, SPM_ADDR+i, i, sc_out, sc_in, 5);
      wait for CLOCK_PERIOD;
    end loop;

    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_read(clk, SPM_ADDR+i, result, sc_out, sc_in, 6);
      assert result = i report "Wrong result read out!" severity failure;
      wait for CLOCK_PERIOD;
    end loop;
    report "----- SPM test passed -----";

    wait;
  end process;

end architecture RTL;
```

Listing B.3: test.vhd

```
library ieee;
use ieee.std_logic_1164.all;
use work.sc_pack.all;

package test is
  procedure clockGen (signal clk : out std_logic; constant period : in time);

  procedure resetGen (signal reset : out std_logic;
                      constant reset_time : in time);

  procedure test_function (signal clk : in std_logic;
                           signal sc_out : out sc_out_type;
                           signal sc_in : in sc_in_type;
                           constant addr : in natural;
                           constant period : in time);
end package test;

package body test is

  procedure clockGen (signal clk : out std_logic; constant period : in time) is
    variable clk_int : std_logic := '0';
  begin -- Careful this process runs forever:
    loop
      clk_int := not clk_int;
      clk <= clk_int;
      wait for period/2;
    end loop;
  end;

  procedure resetGen (signal reset : out std_logic;
                      constant reset_time : in time) is
  begin -- Careful this process runs forever:
    reset <= '1';
    wait for reset_time;
    reset <= '0';
    wait;
  end;

  procedure test_function (signal clk : in std_logic;
                           signal sc_out : out sc_out_type;
                           signal sc_in : in sc_in_type;
                           constant addr : in natural;
                           constant period : in time) is
    variable result : natural;
  begin
    report "----- Testing " & addr'simple_name & " ------";
    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_write(clk, addr+i, i, sc_out, sc_in, 5);
      wait for period;
    end loop;

    for i in 0 to 10 loop
      wait until rising_edge(clk);
      sc_read(clk, addr+i, result, sc_out, sc_in, 6);
      assert result = i report "Wrong result read out!" severity failure;
      wait for period;
    end loop;
    report "----- " & addr'simple_name & " test passed -----";
  end;

end package body test;
```

Listing B.4: noc\_node.vhd

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.defs.all;

entity noc_node is
  port (
    p_clk    : std_logic;
    n_clk    : std_logic;
    reset    : std_logic;

    proc_in  : in  ocp_master;
    proc_out : out ocp_slave;

    inNorth  : in  network_link;
    inSouth  : in  network_link;
    inEast   : in  network_link;
    inWest   : in  network_link;

    outNorth : out network_link;
    outSouth : out network_link;
    outEast  : out network_link;
    outWest  : out network_link
  );
end noc_node;

architecture struct of noc_node is

  -- component declarations
  -- 2 spms
  component bram_tdp is
    generic (
      DATA : integer := 32;
      ADDR : integer := 14
    );
    port (
      -- Port A
      a_clk  : in  std_logic;
      a_wr   : in  std_logic;
      a_addr : in  std_logic_vector(ADDR-1 downto 0);
      a_din  : in  std_logic_vector(DATA-1 downto 0);
      a_dout : out std_logic_vector(DATA-1 downto 0);

      -- Port B
      b_clk  : in  std_logic;
      b_wr   : in  std_logic;
      b_addr : in  std_logic_vector(ADDR-1 downto 0);
      b_din  : in  std_logic_vector(DATA-1 downto 0);
      b_dout : out std_logic_vector(DATA-1 downto 0)
    );
  end component;

  -- 1 na
  component nAdapter is
    port (
      -- General
      na_clk   : in  std_logic;
      na_reset : in  std_logic;

      -- Processor Ports
      -- DMA Configuration Port - OCP
      proc_in  : in  ocp_master;
      proc_out : out ocp_slave;

      -- SPM Data Port - OCP?
      spm_in   : in  ocp_slave_spm;
      spm_out  : out ocp_master_spm;

      -- Network Ports
      -- Incoming Port
      pkt_in   : in  network_link;

      -- Outgoing Port
      pkt_out  : out network_link
    );
  end component;

  -- 1 router
  component router is
    port (
      clk     : in  std_logic;
      reset   : in  std_logic;
      inPort  : in  routerPort;
      outPort : out routerPort
    );
  end component;

  signal ip_to_net      : network_link;
  signal net_to_ip      : network_link;
  signal spm_to_net     : ocp_slave_spm;
  signal net_to_spm     : ocp_master_spm;

  signal proc_spm_out_h : ocp_slave;
  signal proc_spm_out_l : ocp_slave;
  signal proc_noc_in    : ocp_master;
  signal proc_noc_out   : ocp_slave;

  signal spm_h_access   : std_logic;
  signal spm_l_access   : std_logic;
  signal dma_access     : std_logic;

  signal spm_h_wr       : std_logic;
  signal spm_l_wr       : std_logic;

  signal rd_rdy, next_rd_rdy : std_logic;

  type proc_sel is (spm_h_sel, spm_l_sel, dma_sel, none);
  signal proc_out_sel   : proc_sel;
  signal cmd_acc        : std_logic;

begin

  -- High SPM instance
  spm_h : bram_tdp
    generic map (DATA => DATA_WIDTH, ADDR => SPM_ADDR_WIDTH-1)
    port map (
      a_clk  => p_clk,
      a_wr   => spm_h_wr,
      a_addr => proc_in.MAddr(SPM_ADDR_WIDTH-1 downto 1),
      a_din  => proc_in.MData,
      a_dout => proc_spm_out_h.SData,
      b_clk  => n_clk,
      b_wr   => net_to_spm.MCmd(0),
      b_addr => net_to_spm.MAddr(SPM_ADDR_WIDTH-2 downto 0),
      b_din  => net_to_spm.MData(63 downto 32),
      b_dout => spm_to_net.SData(63 downto 32));

  spm_h_wr <= proc_in.MCmd(0) and spm_h_access;

  -- Low SPM instance
  spm_l : bram_tdp
    generic map (DATA => DATA_WIDTH, ADDR => SPM_ADDR_WIDTH-1)
    port map (
      a_clk  => p_clk,
      a_wr   => spm_l_wr,
      a_addr => proc_in.MAddr(SPM_ADDR_WIDTH-1 downto 1),
      a_din  => proc_in.MData,
      a_dout => proc_spm_out_l.SData,
      b_clk  => n_clk,
      b_wr   => net_to_spm.MCmd(0),
      b_addr => net_to_spm.MAddr(SPM_ADDR_WIDTH-2 downto 0),
      b_din  => net_to_spm.MData(31 downto 0),
      b_dout => spm_to_net.SData(31 downto 0));

  spm_l_wr <= proc_in.MCmd(0) and spm_l_access;

  -- NA instance
  na : nAdapter
    port map (
      -- General
      na_clk   => n_clk,
      na_reset => reset,

      -- Processor Ports
      -- DMA Configuration Port - OCP
      proc_in  => proc_noc_in,
      proc_out => proc_out,

      -- SPM Data Port - OCP?
      spm_in   => spm_to_net,
      spm_out  => net_to_spm,

      -- Network Ports
      -- Incoming Port
      pkt_in   => net_to_ip,

      -- Outgoing Port
      pkt_out  => ip_to_net
    );

  proc_noc_in.MData   <= proc_in.MData;
  proc_noc_in.MAddr   <= proc_in.MAddr;
  proc_noc_in.MCmd(1) <= proc_in.MCmd(1) and dma_access;
  proc_noc_in.MCmd(0) <= proc_in.MCmd(0) and dma_access;

  -- router instance
  r : router
    port map (
      clk        => n_clk,
      reset      => reset,
      inPort(0)  => inSouth,
      inPort(1)  => inWest,
      inPort(2)  => inNorth,
      inPort(3)  => inEast,
      inPort(4)  => ip_to_net,
      outPort(0) => outSouth,
      outPort(1) => outWest,
      outPort(2) => outNorth,
      outPort(3) => outEast,
      outPort(4) => net_to_ip
    );

  proc_logic : process (proc_in, proc_spm_out_h, proc_spm_out_l, rd_rdy,
                        proc_noc_out, dma_access)
  begin
    spm_h_access <= '0';
    spm_l_access <= '0';
    dma_access   <= '0';
    next_rd_rdy  <= '0';
    cmd_acc      <= '0';
    proc_out_sel <= none;

    if proc_in.MAddr(OCP_ADDR_WIDTH-1 downto SPM_ADDR_WIDTH) =
       SPM_MASK & std_logic_vector(to_unsigned(0,
           OCP_ADDR_WIDTH-ADDR_MASK_W-(SPM_ADDR_WIDTH))) then
      -- Access to the spm port
      if proc_in.MAddr(0) = '0' then -- Access high spm
        spm_h_access <= '1';
        proc_out_sel <= spm_h_sel;
      else -- Access low spm
        spm_l_access <= '1';
        proc_out_sel <= spm_l_sel;
      end if;
      -- Write operation
      if proc_in.MCmd(0) = '1' then
        cmd_acc <= '1';
      end if;
      -- Read operation
      if proc_in.MCmd(1) = '1' and proc_in.MCmd(0) = '0' then
        next_rd_rdy <= '1';
      end if;
    else -- Access to the dma configuration port
      dma_access   <= '1';
      proc_out_sel <= dma_sel;
    end if;
  end process;

  process (proc_out_sel, proc_spm_out_h, proc_spm_out_l, proc_out,
           rd_rdy, cmd_acc)
  -- process(all)
  begin
    proc_out.SData      <= (others => '0');
    proc_out.SResp      <= '0';
    proc_out.SCmdAccept <= '0';

    -- Proc_out mux
    case proc_out_sel is
      when spm_h_sel =>
        proc_out.SData      <= proc_spm_out_h.SData;
        proc_out.SCmdAccept <= cmd_acc;
        proc_out.SResp      <= rd_rdy;
      when spm_l_sel =>
        proc_out.SData      <= proc_spm_out_l.SData;
        proc_out.SCmdAccept <= cmd_acc;
        proc_out.SResp      <= rd_rdy;
      when dma_sel =>
        proc_out <= proc_noc_out;
      when none =>
        proc_out.SData      <= (others => '0');
        proc_out.SResp      <= '0';
        proc_out.SCmdAccept <= '0';
    end case;
  end process;

  process (p_clk)
  begin
    if rising_edge(p_clk) then
      if reset = '1' then
        rd_rdy <= '0';
      else
        rd_rdy <= next_rd_rdy;
      end if;
    end if;
  end process;

end struct;
```

Listing B.5: nAdapter.vhd

```
--
-- Copyright Technical University of Denmark. All rights reserved.
-- This file is part of the T-CREST project.
--
-- Redistribution and use in source and binary forms, with or without
-- modification, are permitted provided that the following conditions are met:
--
--    1. Redistributions of source code must retain the above copyright notice,
--       this list of conditions and the following disclaimer.
--
--    2. Redistributions in binary form must reproduce the above copyright
--       notice, this list of conditions and the following disclaimer in the
--       documentation and/or other materials provided with the distribution.
--
-- THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER 'AS IS' AND ANY EXPRESS
-- OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
-- OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN
-- NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY
-- DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-- (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-- LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-- ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-- (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
-- THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
--
-- The views and conclusions contained in the software and documentation are
-- those of the authors and should not be interpreted as representing official
-- policies, either expressed or implied, of the copyright holder.
--

--
-- Network Adaptor (NI) for the TDM NoC with DMAs.
--
-- Author: Evangelia Kasapaki
--

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.defs.all;

entity nAdapter is
  port (
    -- General
    na_clk   : in  std_logic;
    na_reset : in  std_logic;

    -- Processor Ports
    -- DMA Configuration Port - OCP
    proc_in  : in  ocp_master;
    proc_out : out ocp_slave;

    -- SPM Data Port - OCP?
    spm_in   : in  ocp_slave_spm;
    spm_out  : out ocp_master_spm;

    -- Network Ports
    -- Incoming Port
    pkt_in   : in  network_link;

    -- Outgoing Port
    pkt_out  : out network_link
  );
end nAdapter;

architecture rtl of nAdapter is

  -- signal declarations
  signal slt_index  : std_logic_vector(SLT_WIDTH-1 downto 0);
  signal sc_en      : std_logic;
  signal slt_en     : std_logic;
  signal slt_entry  : std_logic_vector(DMA_IND_WIDTH downto 0);
  signal vld_slt    : std_logic;

  signal config     : std_logic_vector(3 downto 0);
  signal config_reg : std_logic_vector(4 downto 0);

  signal dma_index  : std_logic_vector(DMA_IND_WIDTH-1 downto 0);
  signal dma_entry  : std_logic_vector(DMA_WIDTH-1 downto 0);
  signal dma_entry_updated : std_logic_vector(DMA_WIDTH-1 downto 0);

  signal dma_ren    : std_logic_vector(2 downto 0);
  signal dma_wen    : std_logic_vector(2 downto 0);
  signal dma_waddr  : std_logic_vector(DMA_IND_WIDTH-1 downto 0);
  signal dma_wdata  : std_logic_vector(DMA_WIDTH-1 downto 0);
  signal dma_raddr  : std_logic_vector(DMA_IND_WIDTH-1 downto 0);
  signal dma_rdata  : std_logic_vector(DMA_WIDTH-1 downto 0);

  signal dma_cnt      : unsigned(BLK_CNT-1 downto 0);
  signal dma_cnt_new  : unsigned(BLK_CNT-1 downto 0);
  signal dma_rp_new   : unsigned(SPM_ADDR_WIDTH-1 downto 0);
  signal dma_wp_new   : unsigned(SPM_ADDR_WIDTH-1 downto 0);
  signal dma_ctrl     : std_logic;
  signal dma_ctrl_new : std_logic_vector(1 downto 0);

  signal done       : std_logic;
  signal done_new   : std_logic;
  signal state_cnt  : unsigned(1 downto 0);
  signal val        : unsigned(1 downto 0);

  signal dIn_h      : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal dOut_l     : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal address    : std_logic_vector(SPM_ADDR_WIDTH-1 downto 0);
  signal m_cmd      : std_logic;

  signal dOutreg_ld : std_logic;
  signal dInreg_ld  : std_logic;
  signal adreg_ld   : std_logic;

  signal mux_out    : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal hdr_phit   : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal phitOut    : std_logic_vector(PHIT_WIDTH-1 downto 0);
  signal phitIn     : std_logic_vector(PHIT_WIDTH-1 downto 0);

  signal pkt_ctrl      : std_logic;
  signal dma_ctrl_reg  : std_logic;
  signal ctrlOutreg_ld : std_logic;

  -- Components declarations
  component counter
    generic (
      WIDTH : integer
    );
    port (
      clk    : in  std_logic;
      reset  : in  std_logic;
      enable : in  std_logic;
      cnt    : out std_logic_vector(WIDTH-1 downto 0)
    );
  end component;

  component dma_sdp
    generic (
      DATA : integer := 64;
      ADDR : integer := 2
    );
    port (
      clk   : in  std_logic;
      reset : in  std_logic;
      ren   : in  std_logic_vector(2 downto 0);
      wen   : in  std_logic_vector(2 downto 0);
      waddr : in  std_logic_vector(ADDR-1 downto 0);
      wdata : in  std_logic_vector(DATA-1 downto 0);
      raddr : in  std_logic_vector(ADDR-1 downto 0);
      rdata : out std_logic_vector(DATA-1 downto 0)
    );
  end component;

  component bram
    generic (
      DATA : integer := 32;
      ADDR : integer := 14
    );
    port (
      clk     : in  std_logic;
      rd_addr : in  std_logic_vector(ADDR-1 downto 0);
      wr_addr : in  std_logic_vector(ADDR-1 downto 0);
      wr_data : in  std_logic_vector(DATA-1 downto 0);
      wr_ena  : in  std_logic;
      rd_data : out std_logic_vector(DATA-1 downto 0)
    );
  end component;

begin

  -- component instantiations
  -- Slot Counter
  slt_cnt : counter
    generic map (WIDTH => SLT_WIDTH)
    port map (clk => na_clk, reset => na_reset, enable => sc_en,
              cnt => slt_index);

  -- DMA Table - simple block ram
  dma_table : dma_sdp
    generic map (DATA => DMA_WIDTH, ADDR => DMA_IND_WIDTH)
    port map (
      clk   => na_clk,
      reset => na_reset,
      ren   => dma_ren,
      wen   => dma_wen,
      waddr => dma_waddr,
      wdata => dma_wdata,
      raddr => dma_raddr,
      rdata => dma_rdata
    );

  slt_en <= '1' when config = ST_ACCESS and proc_in.MCmd(0) = '1'
            else '0';

  -- Slot Table
  slt_table : bram
    generic map (DATA => DMA_IND_WIDTH+1, ADDR => SLT_WIDTH)
    port map (
      clk     => na_clk,
      rd_addr => slt_index,
      wr_addr => proc_in.MAddr(SLT_WIDTH-1 downto 0),
      wr_data => proc_in.MData(DMA_IND_WIDTH downto 0),
      wr_ena  => slt_en,
      rd_data => slt_entry
    );

  dma_index <= slt_entry(DMA_IND_WIDTH-1 downto 0);
  vld_slt   <= slt_entry(DMA_IND_WIDTH);

  -- configuration interface

  -- decode inputs
  -- address map decoding
  ocp_decode : process (proc_in.MAddr) begin
    config <= CNULL;
    -- ST configuration
    if proc_in.MAddr(OCP_ADDR_WIDTH-1 downto OCP_ADDR_WIDTH-ADDR_MASK_W)
       = ST_MASK then
      config <= ST_ACCESS;
    -- DMA 3/route configuration
    elsif proc_in.MAddr(OCP_ADDR_WIDTH-1 downto OCP_ADDR_WIDTH-ADDR_MASK_W)
       = DMA_P_MASK then
      config <= DMA_R_ACCESS;
    -- DMA 1,2 configuration
    elsif proc_in.MAddr(OCP_ADDR_WIDTH-1 downto OCP_ADDR_WIDTH-ADDR_MASK_W)
       = DMA_MASK and proc_in.MAddr(0) = '0' then
      config <= DMA_H_ACCESS;
    elsif proc_in.MAddr(OCP_ADDR_WIDTH-1 downto OCP_ADDR_WIDTH-ADDR_MASK_W)
       = DMA_MASK and proc_in.MAddr(0) = '1' then
      config <= DMA_L_ACCESS;
    -- not configuration
    else
      config <= CNULL;
    end if;
  end process;

  -- build outputs
  -- ocp data response
  ocp_response : process (state_cnt, config_reg, dma_rdata) begin
    proc_out.SData <= (others => '0');
    proc_out.SResp <= '0';

    case state_cnt is
      when "00" =>
        if config_reg = ('1' & DMA_R_ACCESS) or
           config_reg = ('1' & DMA_H_ACCESS) or
           config_reg = ('1' & DMA_L_ACCESS) then
          proc_out.SData <= dma_rdata(OCP_DATA_WIDTH-1 downto 0);
          proc_out.SResp <= '1';
        end if;
      when "01" =>
        if config_reg = ('1' & DMA_R_ACCESS) or
           config_reg = ('1' & DMA_H_ACCESS) or
           config_reg = ('1' & DMA_L_ACCESS) then
          proc_out.SData <= dma_rdata(OCP_DATA_WIDTH-1 downto 0);
          proc_out.SResp <= '1';
        end if;
      when others =>
        proc_out.SData <= (others => '0');
        proc_out.SResp <= '0';
    end case;
  end process;

  -- SPM interface
  -- construct SPM interface signals --->ocp???
  spm_interface : process (state_cnt, pkt_ctrl, dma_entry, address) begin
    if state_cnt = "00" and pkt_ctrl = '1' then
      spm_out.MCmd <= "11";
```

```
      spm_out.MAddr <= std_logic_vector(to_unsigned(0,
          OCP_ADDR_WIDTH-(SPM_ADDR_WIDTH-1))) & address(
          SPM_ADDR_WIDTH-1 downto 1);
    else
      spm_out.MCmd <= "00";
      spm_out.MAddr <= x"0000" & '0' & dma_entry(47 downto 33);
    end if;
  end process;
  spm_out.MData(SPM_DATA_WIDTH-1 downto DATA_WIDTH) <= dIn_h;
  spm_out.MData(DATA_WIDTH-1 downto 0) <= phitIn(DATA_WIDTH-1
      downto 0);

  -- network interface

  -- input pkt control
  -- decode incoming packet
  pkt_ctrl <= phitIn(PHIT_WIDTH-1) or phitIn(PHIT_WIDTH-2) or
      phitIn(PHIT_WIDTH-3);

  -- output pkt construction
  -- build hdr phit
  hdr_phit <= dma_entry(DATA_WIDTH-1 downto 0);

  -- mux to choose outgoing data
  nout_select : process(state_cnt, dma_ctrl, dma_ctrl_reg, hdr_phit
      , spm_in.SData(63 downto 32), dOut_l) begin
    case state_cnt is
    when "00" =>
      if dma_ctrl_reg='1' then
        --mux on 1 (data1)
        mux_out <= spm_in.SData(63 downto 32) after PDELAY;
      else
        mux_out <= (others=>'0') after PDELAY;
      end if;
    when "01" =>
      if dma_ctrl_reg='1' then
        --mux on 2 (data2)
        mux_out <= dOut_l after PDELAY;
      else
        mux_out <= (others=>'0') after PDELAY;
      end if;
    when "10" =>
      if dma_ctrl='1' then
        --mux on 0 (hdr)
        mux_out <= hdr_phit after PDELAY;
      else
        mux_out <= (others=>'0') after PDELAY;
      end if;
    when others =>
      mux_out <= (others=>'0');
    end case;
  end process;
```

```
  -- build outgoing packet
  -- control bits
  phitOut(PHIT_WIDTH-1) <= state_cnt(1) and dma_ctrl; --hdr
  phitOut(PHIT_WIDTH-2) <= not (state_cnt(0) or state_cnt(1)) and
      dma_ctrl_reg; --md
  phitOut(PHIT_WIDTH-3) <= state_cnt(0) and dma_ctrl_reg; --eop
  -- hdr or payload
  phitOut(PHIT_WIDTH-4 downto 0) <= mux_out;

  -- DMA signals
  dma_state_control : process (state_cnt, config, proc_in, dma_ctrl
      , dma_index, dma_entry_updated, dma_rdata) begin
    dma_waddr <= (others => '0');
    dma_wdata <= (others => '0');
    dma_wen <= (others => '0');
    dma_raddr <= (others => '0');
    dma_ren <= (others => '0');
    proc_out.SCmdAccept <= '0';
    dma_entry <= (others => '0');

    case state_cnt is
    when "00" =>
      -- configuration write
      if proc_in.MCmd(0) = '1' then
        if config=DMA_R_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH-1 downto 0);
          dma_wdata <= x"00000000" & proc_in.MData;
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_H_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_wdata <= proc_in.MData(BANK0_W-1 downto 0) &
              x"0000000000000000";
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_L_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_wdata <= x"0000" & proc_in.MData & x"0000";
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=ST_ACCESS then
          dma_waddr <= (others => '0');
          dma_wdata <= (others => '0');
          dma_wen <= (others => '0');
          proc_out.SCmdAccept <= '1';
        else
          dma_waddr <= (others => '0');
          dma_wdata <= (others => '0');
          dma_wen <= (others => '0');
          proc_out.SCmdAccept <= '0';
        end if;
      -- configuration read or no read
```

```
      else
        if config=DMA_R_ACCESS then
          dma_raddr <= proc_in.MAddr(DMA_IND_WIDTH-1 downto 0);
          dma_ren <= config(2 downto 0);
          -- build ocp slave signals
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_H_ACCESS or config=DMA_L_ACCESS then
          dma_raddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_ren <= config(2 downto 0);
          -- build ocp read data
          proc_out.SCmdAccept <= '1';
        else
          dma_waddr <= (others => '0');
          dma_wdata <= (others => '0');
          dma_wen <= (others => '0');
          -- build ocp read data
          proc_out.SCmdAccept <= '0';
        end if;
      end if;

    when "01" =>
      if proc_in.MCmd(0) = '1' then
        if config=DMA_R_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH-1 downto 0);
          dma_wdata <= x"00000000" & proc_in.MData;
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_H_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_wdata <= proc_in.MData(BANK0_W-1 downto 0) &
              x"0000000000000000";
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_L_ACCESS then
          dma_waddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_wdata <= x"0000" & proc_in.MData & x"0000";
          dma_wen <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=ST_ACCESS then
          dma_waddr <= (others => '0');
          dma_wdata <= (others => '0');
          dma_wen <= (others => '0');
          proc_out.SCmdAccept <= '1';
        else
          dma_waddr <= (others => '0');
          dma_wdata <= (others => '0');
          dma_wen <= (others => '0');
          proc_out.SCmdAccept <= '0';
        end if;
      end if;
      dma_raddr <= dma_index;
      dma_ren <= "111";

    when "10" =>
      dma_waddr <= dma_index;
```

```
      dma_wdata <= dma_entry_updated;
      if dma_ctrl='1' then
        dma_wen <= "110";
      else
        dma_wen <= "000";
      end if;

      if proc_in.MCmd(0) = '0' then
        if config=DMA_R_ACCESS then
          dma_raddr <= proc_in.MAddr(DMA_IND_WIDTH-1 downto 0);
          dma_ren <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        elsif config=DMA_H_ACCESS or config=DMA_L_ACCESS then
          dma_raddr <= proc_in.MAddr(DMA_IND_WIDTH downto 1);
          dma_ren <= config(2 downto 0);
          proc_out.SCmdAccept <= '1';
        else
          dma_raddr <= (others => '0');
          dma_ren <= (others => '0');
          proc_out.SCmdAccept <= '0';
        end if;
      end if;

      if vld_slt='1' then
        dma_entry <= dma_rdata;
      else
        dma_entry <= (others=>'0');
      end if;

    when others =>
      dma_waddr <= (others => '0');
      dma_wdata <= (others => '0');
      dma_wen <= (others => '0');
      dma_raddr <= (others => '0');
      dma_ren <= (others => '0');
      proc_out.SCmdAccept <= '0';
    end case;
  end process;

  -- DMA control 0 decode dma entry
  -- valid dma entry and transfer not done yet
  dma_ctrl <= dma_entry(DMA_WIDTH-1) and (not dma_entry(DMA_WIDTH
      -2));
  dma_cnt <= unsigned(dma_entry(61 downto 48));

  -- update dma entry fields
  dma_cnt_new <= dma_cnt - 2;
  dma_rp_new <= unsigned(dma_entry(SPM_ADDR_WIDTH-1+32 downto 32))
      + 2;
  dma_wp_new <= unsigned(dma_entry(SPM_ADDR_WIDTH-1+16 downto 16))
      + 2;
  done <= '1' when dma_cnt_new=0
      else '0';
```

```
  done_new <= dma_entry(DMA_WIDTH-1) and done;
  dma_ctrl_new <= dma_entry(DMA_WIDTH-1) & done_new;

  -- updated dma entry
  dma_entry_updated <= (dma_ctrl_new &
      std_logic_vector(dma_cnt_new) &
      "0000000" & std_logic_vector(dma_rp_new) &
      "0000000" & std_logic_vector(dma_wp_new) &
      dma_entry(15 downto 0)) when dma_ctrl='1' else
      dma_entry;

  -- control FSM - just counter
  val <= state_cnt + 1;
  process (na_reset, na_clk)
  begin
    if na_reset = '1' then
      state_cnt <= (others=>'0') after PDELAY;
    elsif rising_edge(na_clk) then
      if state_cnt="10" then
        state_cnt <= (others=>'0') after PDELAY;
      else
        state_cnt <= val after PDELAY;
      end if;
    end if;
  end process;

  reg_control : process(state_cnt)
  begin
    dOutreg_ld <= '0';
    adreg_ld <= '0';
    dInreg_ld <= '0';
    ctrlOutreg_ld <= '0';
    sc_en <= '0';
    if state_cnt="00" then
      -- ld dataOut_reg
      dOutreg_ld <= '1';
    elsif state_cnt="01" then
      -- ld addr_reg
      adreg_ld <= '1';
    elsif state_cnt="10" then
      -- load dataIn_reg
      dInreg_ld <= '1';
      ctrlOutreg_ld <= '1';
      -- update slt_cnt
      sc_en <= '1';
    else
      dOutreg_ld <= '0';
      adreg_ld <= '0';
      dInreg_ld <= '0';
      ctrlOutreg_ld <= '0';
```

```
      sc_en <= '0';
    end if;
  end process;

  -- registers
  registers : process(na_clk, na_reset) begin
    if na_reset='1' then
      dma_ctrl_reg <= '0' after PDELAY;
      address <= (others=>'0') after PDELAY;
      dIn_h <= (others=>'0') after PDELAY;
      dOut_l <= (others=>'0') after PDELAY;
      phitIn <= (others=>'0') after PDELAY;
      pkt_out <= (others=>'0') after PDELAY;
      config_reg <= (others=>'0');
    elsif rising_edge(na_clk) then
      if ctrlOutreg_ld='1' then
        dma_ctrl_reg <= dma_ctrl after PDELAY;
      end if;
      if adreg_ld='1' then
        address <= phitIn(SPM_ADDR_WIDTH-1+16 downto 16) after
            PDELAY;
      end if;
      if dInreg_ld='1' then
        dIn_h <= phitIn(DATA_WIDTH-1 downto 0) after PDELAY;
      end if;
      if dOutreg_ld='1' then
        dOut_l <= spm_in.SData(DATA_WIDTH-1 downto 0) after PDELAY;
      end if;
      phitIn <= pkt_in after PDELAY;
      pkt_out <= phitOut after PDELAY;
      config_reg <= proc_in.MCmd(1) & config;
    end if;
  end process;

end rtl;
```
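
As a reading aid for the DMA bookkeeping at the end of the listing above, the following Python sketch models how `dma_entry_updated` is derived from `dma_entry`. It is a behavioural model only and not part of the hardware design. The values `DMA_WIDTH = 64` and `SPM_ADDR_WIDTH = 9` are assumptions, chosen so that the fields and the two 7-bit zero pads in the listing fill exactly one entry; the helper names `bits`, `make_entry` and `update_dma_entry` are hypothetical.

```python
DMA_WIDTH = 64       # assumption: width of one DMA table entry
SPM_ADDR_WIDTH = 9   # assumption: width of the read/write pointers
CNT_MASK = (1 << 14) - 1              # count field is bits 61..48
PTR_MASK = (1 << SPM_ADDR_WIDTH) - 1


def bits(value, hi, lo):
    """Return value(hi downto lo) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def make_entry(cnt, rp, wp, low, valid=1, done=0):
    """Pack the fields into one entry (layout taken from the listing)."""
    return ((valid << (DMA_WIDTH - 1)) | (done << (DMA_WIDTH - 2))
            | ((cnt & CNT_MASK) << 48) | ((rp & PTR_MASK) << 32)
            | ((wp & PTR_MASK) << 16) | (low & 0xFFFF))


def update_dma_entry(entry):
    """Model of dma_entry_updated: one update per active time slot."""
    valid = bits(entry, DMA_WIDTH - 1, DMA_WIDTH - 1)
    done = bits(entry, DMA_WIDTH - 2, DMA_WIDTH - 2)
    dma_ctrl = valid and not done     # valid entry, transfer not finished
    if not dma_ctrl:
        return entry                  # entry is written back unchanged
    cnt_new = (bits(entry, 61, 48) - 2) & CNT_MASK
    rp_new = (bits(entry, SPM_ADDR_WIDTH - 1 + 32, 32) + 2) & PTR_MASK
    wp_new = (bits(entry, SPM_ADDR_WIDTH - 1 + 16, 16) + 2) & PTR_MASK
    done_new = 1 if cnt_new == 0 else 0
    return make_entry(cnt_new, rp_new, wp_new, bits(entry, 15, 0),
                      valid=valid, done=done_new)
```

Under these assumptions, an entry with a count of 4 needs two updates before its done bit is set, after which further updates leave it unchanged.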


### Appendix C

## **JOP** infrastructure

This appendix contains the following files:

- **jopcpu.vhd**: this file comes from the JOP project and describes the core of the JOP processor. We modified it to pull the SimpCon interface out of the SPM so that it can be connected to the T-CREST NoC platform. See Listing C.1.
- **jopmul\_512x32.vhd**: this file comes from the JOP project and describes the top-level component of a multicore JOP platform. We modified it to instantiate the T-CREST NoC platform and connect it to the JOP cores. See Listing C.2.
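
The effect of the jopcpu.vhd modification can be summarised by how Listing C.1 routes each SimpCon access: the two most significant address bits select the NoC (`"10"`), the IO devices (`"11"`), or main memory (all other values). A minimal Python sketch of that decoding follows; `SC_ADDR_SIZE = 23` is an assumed value and `select_target` is a hypothetical helper name, neither taken from the listing.

```python
SC_ADDR_SIZE = 23  # assumption: width of the SimpCon address in bits


def select_target(address):
    """Model the 'select' case statement in jopcpu.vhd.

    Returns which SimpCon port would see the rd/wr strobe for a given address.
    """
    top = (address >> (SC_ADDR_SIZE - 2)) & 0b11
    if top == 0b10:
        return "noc"   # formerly the scratchpad region, now the NoC
    if top == 0b11:
        return "io"
    return "mem"       # "00" and "01" both map to main memory
```

For example, under the assumed width, `select_target(0b10 << 21)` yields `"noc"`, while any address with a zero top bit yields `"mem"`.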

#### Listing C.1: jopcpu.vhd

```
--
--  This file is a part of JOP, the Java Optimized Processor
--
--  Copyright (C) 2001-2008, Martin Schoeberl (martin@jopdesign.com)
--
--  This program is free software: you can redistribute it and/or modify
--  it under the terms of the GNU General Public License as published by
--  the Free Software Foundation, either version 3 of the License, or
--  (at your option) any later version.
--
--  This program is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
--  GNU General Public License for more details.
--
--  You should have received a copy of the GNU General Public License
--  along with this program. If not, see <http://www.gnu.org/licenses/>.
--
--
--  jopcpu.vhd
--
--  The JOP CPU
--
--  2007-03-16  creation
--  2007-04-13  Changed memory connection to records
--  2008-02-20  memory - I/O muxing after the memory controller (mem_sc)
--  2008-03-03  added scratchpad RAM
--  2008-03-04  correct MUX selection
--  2009-11-15  include extension code
--
--  todo: clean up: substitute all signals by records
--
--  comments from former extension.vhd
--
--  2004-09-11  first version
--  2005-04-05  Reserve negative addresses for wishbone interface
--  2005-04-07  generate bsy from delayed wr or'ed with mem_out.bsy
--  2005-05-30  added wishbone interface
--  2005-11-28  Substitute WB interface by the SimpCon IO interface
--              All IO devices are now memory mapped
--  2007-04-13  Changed memory connection to records
--              New array instructions
--  2007-12-22  Correction of data MUX bug for array read access
--  2008-02-20  Removed memory - I/O muxing
--  2009-11-22  move MMU decode from jopcpu/extension to decode
--

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.jop_types.all;
use work.sc_pack.all;
```

```
entity jopcpu is

generic (
  jpc_width : integer;     -- address bits of java bytecode pc = cache size
  block_bits : integer;    -- 2*block_bits is number of cache blocks
  spm_width : integer := 0 -- size of scratchpad RAM (in number of address bits)
);

port (
  clk : in std_logic;
  reset : in std_logic;

  -- SimpCon memory interface
  sc_mem_out : out sc_out_type;
  sc_mem_in : in sc_in_type;

  -- SimpCon IO interface
  sc_io_out : out sc_out_type;
  sc_io_in : in sc_in_type;

  -- SimpCon DMA interface
  sc_noc_out : out sc_out_type;
  sc_noc_in : in sc_in_type;

  -- Interrupts from sc_sys
  irq_in : in irq_bcf_type;
  irq_out : out irq_ack_type;
  exc_req : out exception_type;

  -- TM exception
  exc_tm_rollback : in std_logic := '0'
);
end jopcpu;

architecture rtl of jopcpu is

--
-- Signals
--
```

```
  signal stack_tos : std_logic_vector(31 downto 0);
  signal stack_nos : std_logic_vector(31 downto 0);
  signal rd, wr : std_logic;
  signal mmu_instr : std_logic_vector(MMU_WIDTH-1 downto 0);
  signal stack_din : std_logic_vector(31 downto 0);

  -- extension/mem interface
  signal mem_in : mem_in_type;
  signal mem_out : mem_out_type;

  signal sc_ctrl_mem_out : sc_out_type;
  signal sc_ctrl_mem_in : sc_in_type;
  signal sc_scratch_out : sc_out_type;
  signal sc_scratch_in : sc_in_type;

  signal next_mux_mem : std_logic_vector(1 downto 0);
  signal dly_mux_mem : std_logic_vector(1 downto 0);
  signal mux_mem : std_logic_vector(1 downto 0);
  signal is_pipelined : std_logic;

  signal mem_access : std_logic;
  signal noc_access : std_logic;
  signal io_access : std_logic;

  signal bsy : std_logic;

  signal bc_wr_addr : std_logic_vector(jpc_width-3 downto 0); -- address for jbc (in words!)
  signal bc_wr_data : std_logic_vector(31 downto 0); -- write data for jbc
  signal bc_wr_ena : std_logic;

  -- SimpCon io interface
  signal sp_ov : std_logic;

  -- ********** signals from extension ***************

  -- signals for multiplier
  signal mul_dout : std_logic_vector(31 downto 0);
  signal mul_wr : std_logic;

  signal wr_dly : std_logic; -- generate a bsy with delayed wr

  signal exr : std_logic_vector(31 downto 0); -- extension data register

begin
```

```
  exc_req.rollback <= exc_tm_rollback;

  --
  -- components of jop
  --
  core: entity work.core
    generic map(jpc_width)
    port map (
      clk => clk,
      reset => reset,
      bsy => bsy,
      din => stack_din,
      mem_in => mem_in,
      mmu_instr => mmu_instr,
      mul_wr => mul_wr,
      wr_dly => wr_dly,
      bc_wr_addr => bc_wr_addr,
      bc_wr_data => bc_wr_data,
      bc_wr_ena => bc_wr_ena,
      irq_in => irq_in,
      irq_out => irq_out,
      sp_ov => sp_ov,
      aout => stack_tos,
      bout => stack_nos
    );

  exc_req.spov <= sp_ov;

  mem: entity work.mem_sc
    generic map (
      jpc_width => jpc_width,
      block_bits => block_bits
    )
    port map (
      clk => clk,
      reset => reset,
      ain => stack_tos,
      bin => stack_nos,
      np_exc => exc_req.np,
      ab_exc => exc_req.ab,
      mem_in => mem_in,
      mem_out => mem_out,
      bc_wr_addr => bc_wr_addr,
      bc_wr_data => bc_wr_data,
      bc_wr_ena => bc_wr_ena,
      sc_mem_out => sc_ctrl_mem_out,
      sc_mem_in => sc_ctrl_mem_in
    );
```

```
  -- Generate scratchpad memory when size is != 0.
  -- Results in warnings when the size is 0.
  sc1: if spm_width /= 0 generate
    scm: entity work.sdpram
      generic map (
        width => 32,
        addr_width => spm_width
      )
      port map (
        wrclk => clk,
        data => sc_scratch_out.wr_data,
        wraddress => sc_scratch_out.address(spm_width-1 downto 0),
        wren => sc_scratch_out.wr,
        rdclk => clk,
        rdaddress => sc_scratch_out.address(spm_width-1 downto 0),
        rden => sc_scratch_out.rd,
        dout => sc_scratch_in.rd_data
      );
  end generate;
  sc_scratch_in.rdy_cnt <= (others => '0');

  -- Select for the read mux
  -- TODO: this mux selection works ONLY for two cycle pipelining!
  -- 25.3.2011: should now be ok - at least the bug with
  -- SPM, NoC IO, and TDMA arbiter disappeared
  -- TODO: should check more configurations
  process(clk, reset)
  begin
    if (reset = '1') then
      dly_mux_mem <= (others => '0');
      next_mux_mem <= (others => '0');
      is_pipelined <= '0';
    elsif rising_edge(clk) then
      if sc_ctrl_mem_out.rd='1' or sc_ctrl_mem_out.wr='1' then
      if sc_ctrl_mem_out.rd='1' then
        -- highest address bits decides between IO, memory, and on-
        -- chip memory
        -- save the mux selection on read or write
        next_mux_mem <= sc_ctrl_mem_out.address(SC_ADDR_SIZE-1 downto
            SC_ADDR_SIZE-2);
        -- a read or write with rdy_cnt of 1 means pipelining
        if sc_ctrl_mem_in.rdy_cnt = "01" then
          is_pipelined <= '1';
        end if;
        -- remember for the next mux selection in case of pipelining
        dly_mux_mem <= next_mux_mem;
```

```
      end if;
      -- delayed mux selection for pipelined access
      if sc_ctrl_mem_in.rdy_cnt(1) = '0' then
        dly_mux_mem <= next_mux_mem;
      end if;
      -- pipelining is over
      if sc_ctrl_mem_in.rdy_cnt = "00" then
        is_pipelined <= '0';
      end if;
    end if;
  end process;

  process (next_mux_mem, dly_mux_mem, sc_ctrl_mem_out, sc_ctrl_mem_in,
      sc_mem_in, sc_io_in, sc_noc_in, is_pipelined, mux_mem)
  begin
    mem_access <= '0';
    noc_access <= '0';
    io_access <= '0';

    -- for one cycle peripherals we need to set the mux from
    -- next_mux_mem
    mux_mem <= next_mux_mem;
    -- for pipelining we need to delay the mux selection
    if is_pipelined='1' then
      mux_mem <= dly_mux_mem;
    end if;

    -- read MUX
    case mux_mem is
      when "10" =>
        --sc_ctrl_mem_in <= sc_scratch_in;
        sc_ctrl_mem_in <= sc_noc_in;
      when "11" =>
        sc_ctrl_mem_in <= sc_io_in;
      when others =>
        sc_ctrl_mem_in <= sc_mem_in;
    end case;

    -- select
    case sc_ctrl_mem_out.address(SC_ADDR_SIZE-1 downto SC_ADDR_SIZE
        -2) is
      when "10" =>
        noc_access <= '1';
      when "11" =>
        io_access <= '1';
      when others =>
        mem_access <= '1';
    end case;
  end process;

  sc_mem_out.address <= sc_ctrl_mem_out.address;
  sc_mem_out.wr_data <= sc_ctrl_mem_out.wr_data;
```

```
  sc_mem_out.wr <= sc_ctrl_mem_out.wr and mem_access;
  sc_mem_out.rd <= sc_ctrl_mem_out.rd and mem_access;
  sc_mem_out.atomic <= sc_ctrl_mem_out.atomic;
  sc_mem_out.cache <= sc_ctrl_mem_out.cache;
  sc_mem_out.cinval <= sc_ctrl_mem_out.cinval;
  sc_mem_out.tm_cache <= sc_ctrl_mem_out.tm_cache;

  --sc_scratch_out.address <= sc_ctrl_mem_out.address;
  --sc_scratch_out.wr_data <= sc_ctrl_mem_out.wr_data;
  --sc_scratch_out.wr <= sc_ctrl_mem_out.wr and scratch_access;
  --sc_scratch_out.rd <= sc_ctrl_mem_out.rd and scratch_access;
  --sc_scratch_out.atomic <= sc_ctrl_mem_out.atomic;
  --sc_scratch_out.cinval <= sc_ctrl_mem_out.cinval;
  --sc_scratch_out.cache <= sc_ctrl_mem_out.cache;

  sc_noc_out.address <= sc_ctrl_mem_out.address;
  sc_noc_out.wr_data <= sc_ctrl_mem_out.wr_data;
  sc_noc_out.wr <= sc_ctrl_mem_out.wr and noc_access;
  sc_noc_out.rd <= sc_ctrl_mem_out.rd and noc_access;
  sc_noc_out.atomic <= sc_ctrl_mem_out.atomic;
  sc_noc_out.cinval <= sc_ctrl_mem_out.cinval;
  sc_noc_out.cache <= sc_ctrl_mem_out.cache;

  sc_io_out.address <= sc_ctrl_mem_out.address;
  sc_io_out.wr_data <= sc_ctrl_mem_out.wr_data;
  sc_io_out.wr <= sc_ctrl_mem_out.wr and io_access;
  sc_io_out.rd <= sc_ctrl_mem_out.rd and io_access;
  sc_io_out.atomic <= sc_ctrl_mem_out.atomic;
  sc_io_out.cinval <= sc_ctrl_mem_out.cinval;
  sc_io_out.cache <= sc_ctrl_mem_out.cache;

  ml : entity work.mul
    port map (
      clk => clk,
      ain => stack_tos,
      bin => stack_nos,
      wr => mul_wr,
      dout => mul_dout
    );

  stack_din <= exr;

  --
  -- TODO: the following code is degenerated to decode functions
  -- should probably go to decode.vhd
  --

  --
  -- read
  --
  -- TODO: the read MUX could be set by using the
  -- according wr/mmu_instr from JOP and not the
  -- following rd/mmu_instr
```

```
  -- Then no intermixing of mul/mem and io operations
  -- is allowed. But we are not using interleaved mul/mem/io
  -- operations in jvm.asm anyway.
  --
  -- TAKE CARE when mem_out.bcstart is read!
  --
  -- ** bcstart is also read without a mem_bc_rd JOP wr !!! ***
  -- => a combinatorial mux select on rd and ext_adr==7!
  --
  -- The rest could be set with JOP wr start transaction
  -- Is this also true for io data?
  --
  -- 29.11.2005 evening: I think this solution driving the exr
  -- mux from mmu_instr is quite ok. The pipelining from rd/ext_adr
  -- to A is fixed.
  --
  process (clk, reset)
  begin
    if (reset = '1') then
      exr <= (others => '0');
    elsif rising_edge(clk) then
      if (mmu_instr=LDMRD) then
        exr <= mem_out.dout;
      elsif (mmu_instr=LDMUL) then
        exr <= mul_dout;
      -- elsif (mmu_instr=LDBCSTART) then
      else
        exr <= mem_out.bcstart;
      end if;
    end if;
  end process;

  -- a JOP wr generates the first bsy cycle
  -- the following are generated by the memory
  -- system or the SimpCon device
  bsy <= wr_dly or mem_out.bsy;

end rtl;
```

#### Listing C.2: jopmul\_512x32.vhd

```
--
--  This file is a part of JOP, the Java Optimized Processor
--
--  Copyright (C) 2001-2008, Martin Schoeberl (martin@jopdesign.com)
--
--  This program is free software: you can redistribute it and/or modify
--  it under the terms of the GNU General Public License as published by
--  the Free Software Foundation, either version 3 of the License, or
--  (at your option) any later version.
--
--  This program is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
--  GNU General Public License for more details.
--
--  You should have received a copy of the GNU General Public License
--  along with this program. If not, see <http://www.gnu.org/licenses/>.
--
--
--  jopmul_512x32.vhd
--
--  top level for a 512x32 SSRAM board (e.g. Altera DE2-70 board)
--
--  2006-08-06  adapted from jopcyc.vhd
--  2007-06-04  Use jopcpu and change component interface to records
--  2010-06-25  Working version with SSRAM
--

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.jop_types.all;
use work.sc_pack.all;
use work.sc_arbiter_pack.all;
use work.jop_config.all;
use work.defs.all;

entity jop is

generic (
  ram_cnt : integer := 3;    -- clock cycles for external ram
--  rom_cnt : integer := 3;  -- clock cycles for external rom OK for 20 MHz
  rom_cnt : integer := 15;   -- clock cycles for external rom for 100 MHz
  jpc_width : integer := 12; -- address bits of java bytecode pc = cache size
  block_bits : integer := 5; -- 2*block_bits is number of cache blocks
  spm_width : integer := 7;  -- size of scratchpad RAM (in number of address bits for 32-bit words)
  cpu_cnt : integer := 9     -- number of cpus
);
```

```
port (
  clk : in std_logic;

  -- serial interface
  ser_txd : out std_logic;
  ser_rxd : in std_logic;
  oUART_CTS : in std_logic;
  iUART_RTS : out std_logic;

  -- watchdog
  wd : out std_logic;

  -- LEDs
  oLEDR : out std_logic_vector(17 downto 0);
  oLEDG : out std_logic_vector(7 downto 0);

  -- Switches
  iSW : in std_logic_vector(17 downto 0);

  -- only one ram bank
  oSRAM_A : out std_logic_vector(18 downto 0); -- edit
  SRAM_DQ : inout std_logic_vector(31 downto 0); -- edit
  oSRAM_CE1_N : out std_logic;
  oSRAM_OE_N : out std_logic;
  oSRAM_BE_N : out std_logic_vector(3 downto 0);
  oSRAM_WE_N : out std_logic;
  oSRAM_GW_N : out std_logic;
  oSRAM_CLK : out std_logic;
  oSRAM_ADSC_N : out std_logic;
  oSRAM_ADSP_N : out std_logic;
  oSRAM_ADV_N : out std_logic;
  oSRAM_CE2 : out std_logic;
  oSRAM_CE3_N : out std_logic
);
end jop;

architecture rtl of jop is

--
-- components:
--

component pll is

```
generic (multiply_by : natural; divide_by : natural);
port (
  inclk0 : in std_logic;
  c0 : out std_logic;
  c1 : out std_logic;
  locked : out std_logic
);
end component;

--
-- Signals
--
  signal clk_int : std_logic;
  signal clk_int_inv : std_logic;
  signal pll_lock : std_logic;

  signal int_res : std_logic;
  signal res_cnt : unsigned(2 downto 0) := "000"; -- for the simulation

  attribute altera_attribute : string;
  attribute altera_attribute of res_cnt : signal is "POWER_UP_LEVEL=LOW";

  -- jopcpu connections
  signal sc_arb_out : arb_out_type(0 to cpu_cnt-1);
  signal sc_arb_in : arb_in_type(0 to cpu_cnt-1);

  signal sc_mem_out : sc_out_type;
  signal sc_mem_in : sc_in_type;

  signal sc_local_mem_out : sc_out_array_type(0 to cpu_cnt-1);
  signal sc_local_mem_in : sc_in_array_type(0 to cpu_cnt-1);
  signal sc_io_out : sc_out_array_type(0 to cpu_cnt-1);
  signal sc_io_in : sc_in_array_type(0 to cpu_cnt-1);
  signal sc_noc_out : sc_out_array_type(0 to cpu_cnt-1);
  signal sc_noc_in : sc_in_array_type(0 to cpu_cnt-1);
  signal irq_in : irq_in_array_type(0 to cpu_cnt-1);
  signal irq_out : irq_out_array_type(0 to cpu_cnt-1);
  signal exc_req : exception_array_type(0 to cpu_cnt-1);

  -- IO interface
  signal ser_in : ser_in_type;
  signal ser_out : ser_out_type;

  type wd_out_array is array (0 to cpu_cnt-1) of std_logic;
  signal wd_out : wd_out_array;

  -- for generation of internal reset

  -- memory interface
```

```
  signal ram_addr    : std_logic_vector(18 downto 0);
  signal ram_dout    : std_logic_vector(31 downto 0);
  signal ram_din     : std_logic_vector(31 downto 0);
  signal ram_dout_en : std_logic;
  signal ram_clk     : std_logic;
  signal ram_nsc     : std_logic;
  signal ram_ncs     : std_logic;
  signal ram_noe     : std_logic;
  signal ram_nwe     : std_logic;

  -- cmpsync
  signal sync_in_array  : sync_in_array_type(0 to cpu_cnt-1);
  signal sync_out_array : sync_out_array_type(0 to cpu_cnt-1);

  -- not available at this board:
  signal ser_ncts : std_logic;
  signal ser_nrts : std_logic;

  -- remove the comment for RAM access counting
  -- signal ram_count : std_logic;

  -- NoC signals
  signal procM : procMasters;
  signal procS : procSlaves;

begin

  ser_ncts <= '0';

  --
  --  intern reset
  --  no extern reset, epm7064 has too less pins
  --  should also use PLL lock signal
  --
  process(clk_int)
  begin
    if rising_edge(clk_int) then
      if (res_cnt /= "111") then
        res_cnt <= res_cnt + 1;
      end if;
      int_res <= not res_cnt(0) or not res_cnt(1) or not res_cnt(2);
    end if;
  end process;
```

components of jop


```
      inclk0 => clk,
      c0     => clk_int,
      c1     => clk_int_inv,
      locked => pll_lock
    );
  -- clk_int <= clk;

  -- process(wd_out)
  --   variable wd_help : std_logic;
  -- begin
  --   wd_help := '0';
  --   for i in 0 to cpu_cnt-1 loop
  --     wd_help := wd_help or wd_out(i);
  --   end loop;
  --   wd <= wd_help;
  -- end process;

  wd <= wd_out(0);

  gen_cpu: for i in 0 to cpu_cnt-1 generate
    cpu: entity work.jopcpu
      generic map(
        jpc_width  => jpc_width,
        block_bits => block_bits,
        spm_width  => spm_width
      )
      port map(clk_int, int_res,
        sc_arb_out(i), sc_arb_in(i),
        sc_io_out(i), sc_io_in(i),
        sc_noc_out(i), sc_noc_in(i),
        irq_in(i), irq_out(i), exc_req(i));
  end generate;

  sc_noc : entity work.sc2ocp_noc
    port map(
      clk        => clk_int,
      reset      => int_res,
      sc_noc_out => sc_noc_out,
      sc_noc_in  => sc_noc_in
    );

  arbiter: entity work.arbiter
    generic map(
      addr_bits   => SC_ADDR_SIZE,
      cpu_cnt     => cpu_cnt,
      write_gap   => 2,
      read_gap    => 2,
      slot_length => 3
    )
    port map(clk_int, int_res,
      sc_arb_out, sc_arb_in,
      sc_mem_out, sc_mem_in);

  -- io for processor 0
```

```
  io: entity work.scio generic map (
      cpu_id  => 0,
      cpu_cnt => cpu_cnt
    )
    port map (clk_int, int_res,
      sc_io_out(0), sc_io_in(0),
      irq_in(0), irq_out(0), exc_req(0),

      sync_out => sync_out_array(0),
      sync_in  => sync_in_array(0),

      txd   => ser_txd,
      rxd   => ser_rxd,
      ncts  => oUART_CTS,
      nrts  => iUART_RTS,
      oLEDR => oLEDR,
      oLEDG => oLEDG,
      iSW   => iSW,
      wd    => wd_out(0),
      l     => open,
      r     => open,
      t     => open,
      b     => open
      -- remove the comment for RAM access counting
      -- ram_cnt => ram_count
    );

  -- io for processors with only sc_sys
  gen_io: for i in 1 to cpu_cnt-1 generate
    io2: entity work.sc_sys generic map (
        addr_bits => 4,
        clk_freq  => clk_freq,
        cpu_id    => i,
        cpu_cnt   => cpu_cnt
      )
      port map(
        clk      => clk_int,
        reset    => int_res,
        address  => sc_io_out(i).address(3 downto 0),
        wr_data  => sc_io_out(i).wr_data,
        rd       => sc_io_out(i).rd,
        wr       => sc_io_out(i).wr,
        rd_data  => sc_io_in(i).rd_data,
        rdy_cnt  => sc_io_in(i).rdy_cnt,

        irq_in   => irq_in(i),
        irq_out  => irq_out(i),
        exc_req  => exc_req(i),

        sync_out => sync_out_array(i),
        sync_in  => sync_in_array(i),
        wd       => wd_out(i)
        -- remove the comment for RAM access counting
```

```
        -- ram_count => ram_count
      );
  end generate;

  scm: entity work.sc_mem_if
    generic map (
      ram_ws    => ram_cnt-1,
      addr_bits => 19
    )
    port map (clk_int, int_res,
      clk_int_inv,
      sc_mem_out, sc_mem_in,

      ram_addr    => ram_addr,
      ram_dout    => ram_dout,
      ram_din     => ram_din,
      ram_dout_en => ram_dout_en,
      ram_clk     => ram_clk,
      ram_nsc     => ram_nsc,
      ram_ncs     => ram_ncs,
      ram_noe     => ram_noe,
      ram_nwe     => ram_nwe
    );

  -- synchronization of processors
  sync: entity work.cmpsync generic map (
      cpu_cnt => cpu_cnt)
    port map
    (
      clk            => clk_int,
      reset          => int_res,
      sync_in_array  => sync_in_array,
      sync_out_array => sync_out_array
    );

  process(ram_dout_en, ram_dout)
  begin
    if ram_dout_en='1' then
      SRAM_DQ <= ram_dout;
    else
      SRAM_DQ <= (others => 'Z');
    end if;
  end process;

  ram_din <= SRAM_DQ;

  -- remove the comment for RAM access counting
  -- ram_count <= ram_ncs;

  --
  --  To put this RAM address in an output register
  --  we have to make an assignment (FAST_OUTPUT_REGISTER)
  --
```

```
  oSRAM_A      <= ram_addr;
  oSRAM_CE1_N  <= ram_ncs;
  oSRAM_OE_N   <= ram_noe;
  oSRAM_WE_N   <= ram_nwe;
  oSRAM_BE_N   <= (others => '0');
  oSRAM_GW_N   <= '1';
  oSRAM_CLK    <= ram_clk;

  oSRAM_ADSC_N <= ram_nsc;
  oSRAM_ADSP_N <= '1';
  oSRAM_ADV_N  <= '1';

  oSRAM_CE2    <= not ram_ncs;
  oSRAM_CE3_N  <= ram_ncs;

end rtl;
```

# Appendix D

## TDM scheduler source code

This appendix contains the following files:

- **IOutput.h** declares the abstract class that defines the interface of an Output class; the file starts on page 112.
- **xmlOutput.h** is the header file of the Output class that writes the XML output; the file starts on page 112.
- **xmlOutput.cpp** implements the Output class that writes the XML output; the file starts on page 113.
- **vhdlOutput.h** is the header file of the Output class that writes the VHDL output; the file starts on page 117.
- **vhdlOutput.cpp** implements the Output class that writes the VHDL output; the file starts on page 118.
- **ScheduleConverter.java** is the main class of the schedule converter, which converts the XML file to a Java array; the file starts on page 125.
- **SchedulePrinter.java** is the class that formats and prints the Java file; the file starts on page 128.
- **TileCoord.java** is a class for storing the coordinates of a tile; the file starts on page 131.

## Listing D.1: IOutput.h

```
/*
 * File:   IOutput.h
 * Author: Rasmus Bo Soerensen
 *
 * Created on 6. august 2012, 11:07
 */

#ifndef IOUTPUT_H
#define IOUTPUT_H

#include "schedule.hpp"

class IOutput
{
public:
    virtual bool output_schedule(const network_t& n) = 0;
};

#endif  /* IOUTPUT_H */
```
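
The interface in Listing D.1 lets the scheduler emit its result without knowing the output format. The following is a minimal sketch of that pattern, not code from the project: `network_t` is reduced to a hypothetical stand-in holding only the schedule length, `textOutput` and `emit` are invented names, and the virtual destructor is an addition of this sketch.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-in for network_t from schedule.hpp; only the
// schedule length `best` is modelled here.
struct network_t {
    int best;
};

// Same pure virtual method as in Listing D.1. The virtual destructor is
// an addition of this sketch, so deleting through the base class is safe.
class IOutput {
public:
    virtual ~IOutput() {}
    virtual bool output_schedule(const network_t& n) = 0;
};

// Hypothetical back end: records the schedule length as a line of text.
class textOutput : public IOutput {
public:
    std::string text;

    bool output_schedule(const network_t& n) override {
        std::ostringstream os;
        os << "schedule length=" << n.best;
        text = os.str();
        return true;
    }
};

// The caller depends only on the interface, not on a concrete back end.
bool emit(IOutput& out, const network_t& n) {
    return out.output_schedule(n);
}
```

Under these assumptions, a caller constructs any class derived from `IOutput` and passes it to `emit`; the XML and VHDL writers in the listings below plug into the scheduler in the same way.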

## Listing D.2: xmlOutput.h

```
/*
 * File:   xmlOutput.h
 * Author: T410s
 *
 * Created on 6. august 2012, 11:13
 */

#ifndef XMLOUTPUT_H
#define XMLOUTPUT_H

#include <iostream>
#include <fstream>
#include <string>
#include <math.h>
#include <stdlib.h>
#include <unordered_map>
#include <cstdio>
#include "IOutput.h"
#include "lex_cast.h"
#include "pugixml.hpp"

class xmlOutput: public IOutput {
private:
    string output_dir;

    char p2c(port_id p);
    void print_coord(const pair<int, int> r, char* co, const size_t buffer_size);

public:
    xmlOutput(string _output_dir);
    ~xmlOutput();
    bool output_schedule(const network_t& n);
};

#endif  /* XMLOUTPUT_H */
```

## Listing D.3: xmlOutput.cpp

```
/*
 * File:   xmlOutput.cpp
 * Author: Rasmus
 *
 * Created on 6. august 2012, 11:13
 */

#include <string.h>

#include "xmlOutput.h"

bool xmlOutput::output_schedule(const network_t& n)
{
  int numOfNodes = n.routers().size();
  int countWidth = ceil(log2(n.best));
  xml_document doc;
  xml_node schedule = doc.append_child("schedule");
  schedule.append_attribute("length").set_value(n.best);

  for (vector<router_t*>::const_iterator r = n.routers().begin(); r != n.routers().end(); r++) {
    // For each router, write Network Adapter Table and Router Table
    // New xml tile
    xml_node tile = schedule.append_child("tile");
    char co[10];
    print_coord((*r)->address, co, sizeof(co));
    tile.append_attribute("id") = co;
    // Vector for saving data to calculate Worst-Case Latencies
    vector<router_id> destinations(n.best, (*r)->address);

    for (timeslot t = 0; t < n.best; t++) { // Write table row for each timeslot
      // New timeslot
      xml_node ts = tile.append_child("timeslot");
      ts.append_attribute("value") = t;

      int t0 = t-1;
      int t1 = t;
      if (t == 0) {
        t0 = n.best - 1;
        t1 = n.best;
      }
      // Write row in Network Adapter table
      router_id dest_id = (*r)->address;
```

```
      router_id src_id = (*r)->address;
      if ((*r)->local_in_best_schedule.has((t+2)%n.best)) {
        dest_id = (*r)->local_in_best_schedule.get((t+2)%n.best)->to;
      }
      if ((*r)->local_out_best_schedule.has(t1)) {
        src_id = (*r)->local_out_best_schedule.get(t1)->from;
      }
      destinations[t] = dest_id;
      // New na
      xml_node na = ts.append_child("na");
      print_coord(src_id, co, sizeof(co));
      na.append_attribute("rx") = co;
      print_coord(dest_id, co, sizeof(co));
      na.append_attribute("tx") = co;
      // Write row in Router table
      port_id ports[5] = {__NUM_PORTS, __NUM_PORTS, __NUM_PORTS,
                          __NUM_PORTS, __NUM_PORTS};
      // New router
      xml_node router = ts.append_child("router");
      for (int out_p = 0; out_p < __NUM_PORTS-1; out_p++) {
        // For all 4 output ports not being the local port.
        if (!(*r)->out((port_id)out_p).has_link()) {
          continue; // No outgoing link from the port.
        }
        if (!(*r)->out((port_id)out_p).link().best_schedule.has(t)) {
          ports[(port_id)out_p] = __NUM_PORTS; // No outgoing channel on link
          continue;
        }
        // If there is a channel coming out of the port, find the
        // input port from which the channel is coming.
        const channel* out_c = (*r)->out((port_id)out_p).link().best_schedule.get(t);
        for (int in_p = 0; in_p < __NUM_PORTS-1; in_p++) {
          // For all 4 input ports not being the local port.
          if (!(*r)->in((port_id)in_p).has_link())
            continue; // No link into this port
          if ((*r)->in((port_id)in_p).link().best_schedule.has(t0)) {
            const channel* in_c = (*r)->in((port_id)in_p).link().best_schedule.get(t0);
            if (out_c == in_c) {
              // The correct link found
              ports[(port_id)out_p] = (port_id)in_p;
              break;
            }
          }
        }
        if (ports[(port_id)out_p] != __NUM_PORTS) {
          continue; // Channel was found on one of the input ports.
        }
        // It should be on the local in port, but we test it anyway
```

```
        if ((*r)->local_in_best_schedule.has(t)) {
          const channel* in_c = (*r)->local_in_best_schedule.get(t);
          if (out_c == in_c) {
            // The correct link found
            ports[(port_id)out_p] = L;
          } else {
            cout << "Failure: Channel rose from nothing like a fenix." << endl;
          }
        }
      }
      if ((*r)->local_out_best_schedule.has(t1)) { // For the local out port.
        const channel* out_c = (*r)->local_out_best_schedule.get(t1);
        for (int in_p = 0; in_p < __NUM_PORTS-1; in_p++) {
          // For all 4 input ports not being the local port.
          if (!(*r)->in((port_id)in_p).has_link())
            continue; // No link into this port
          if ((*r)->in((port_id)in_p).link().best_schedule.has(t0)) {
            const channel* in_c = (*r)->in((port_id)in_p).link().best_schedule.get(t0);
            if (out_c == in_c) {
              // The correct link found
              ports[L] = (port_id)in_p;
              break;
            }
          }
        }
        if (ports[L] == __NUM_PORTS) {
          // If channel was not found on any of the 4 input ports.
          // It should be on the local in port, but we test it anyway
          cout << "Failure: Not allowed to route back in to local." << endl;
        }
        // and so on...
      }
      for (int p = 0; p < __NUM_PORTS; p++) {
        // New output
        xml_node output = router.append_child("output");
        // snprintf bounds the write to the size of co, avoiding buffer overflow
        snprintf(co, sizeof(co), "%c", p2c((port_id)p));
        output.append_attribute("id") = co;
        snprintf(co, sizeof(co), "%c", p2c(ports[(port_id)p]));
        output.append_attribute("input") = co;
      }
    }
    xml_node latency = tile.append_child("latency");
```

```
    // The following for loop is slow and unnecessary, can be changed to improve runtime
    for_each(n.channels(), [&](const channel& c) {
      if (c.from != (*r)->address) {
        return; // Channel not from router
      }
      // For each channel from router
      int WCL = 0;
      int late = 0;
      int inlate = 0;
      bool init = true;
      for (int i = 0; i < n.best; i++) {
        if (c.to != destinations[i]) {
          // Increment latency
          late++;
          continue;
        }
        // Correct destination
        if (init) {
          init = false;
          inlate = late;
        }
        if (late > WCL) {
          WCL = late;
        }
        late = 0;
      }
      late += inlate;
      if (late > WCL) {
        WCL = late;
      }
      // Analyze the latency
      xml_node destination = latency.append_child("destination");
      print_coord(c.to, co, sizeof(co));
      destination.append_attribute("id") = co;
      destination.append_attribute("WCL") = WCL;
    });
  }
  char co[500];
  // snprintf bounds the write to the size of co, avoiding buffer overflow
  snprintf(co, sizeof(co), "%soutput.xml", output_dir.c_str());
  doc.save_file(co);
  delete this;
  return true;
}

void xmlOutput::print_coord(const pair<int, int> r, char* co, const size_t buffer_size) {
  // snprintf bounds the write to buffer_size, avoiding buffer overflow
  snprintf(co, buffer_size, "(%i,%i)", r.first, r.second);
}
```

```
char xmlOutput::p2c(port_id p) {
  char c;
  if (p == N) c = 'N';
  if (p == E) c = 'E';
  if (p == S) c = 'S';
  if (p == W) c = 'W';
  if (p == L) c = 'L';
  if (p == __NUM_PORTS) c = 'D';
  return c;
}

xmlOutput::xmlOutput(string _output_dir) : output_dir(_output_dir) {
}

xmlOutput::~xmlOutput() {
}
```
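
The latency loop near the end of `output_schedule` in Listing D.3 can be read in isolation: for one channel it finds the longest run of consecutive time slots that do not serve the channel's destination, treating the schedule as cyclic so that the run at the end of the table joins the run at the start. A minimal sketch of that computation follows. The function name `worst_case_wait` and the use of plain `int` identifiers in place of `router_id` are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// Longest cyclic run of consecutive slots whose destination differs from
// `dest`. slot_dest[i] is the destination served in time slot i.
// Hypothetical helper mirroring the WCL loop of Listing D.3.
int worst_case_wait(const std::vector<int>& slot_dest, int dest) {
    int worst = 0;      // longest gap seen so far
    int gap = 0;        // length of the gap currently being counted
    int leading = 0;    // gap before the first slot that serves dest
    bool seen = false;  // whether any slot has served dest yet

    for (int d : slot_dest) {
        if (d != dest) {
            ++gap;
            continue;
        }
        if (!seen) {
            seen = true;
            leading = gap;
        }
        if (gap > worst) {
            worst = gap;
        }
        gap = 0;
    }

    // The schedule repeats, so the trailing gap continues into the leading one.
    gap += leading;
    if (gap > worst) {
        worst = gap;
    }
    return worst;
}
```

If no slot serves the destination, the function returns the schedule length, which matches the behaviour of the loop in the listing.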

## Listing D.4: vhdlOutput.h

```
/*
 * File:   vhdlOutput.h
 * Author: T410s
 *
 * Created on 6. august 2012, 11:13
 */

#ifndef VHDLOUTPUT_H
#define VHDLOUTPUT_H

#include <iostream>
#include <fstream>
#include <string>
#include <math.h>
#include <unordered_map>
#include "IOutput.h"
#include "lex_cast.h"

class vhdlOutput: public IOutput {
private:
  enum port {North, East, South, West, Local, DC};

  class STslot {
  public:
    port ports[5];
    int x_dest;
    int y_dest;
    int x_src;
    int y_src;
    STslot() {
      for (int i = 0; i < 5; i++) {
        ports[i] = DC;
      }
```

```
      x_dest = 0;
      y_dest = 0;
      x_src = 0;
      y_src = 0;
    }
  };

  ofstream niST;
  ofstream routerST;
  string numOfNodesStr;

  string bin(int val, int bits);
  char p2c(port_id p);
  void startST(int num, ofstream* ST);
  void writeHeaderRouter(int countWidth);
  void endArchRouter();
  void writeSlotRouter(int slotNum, int countWidth, port_id* ports);
  void writeHeaderNI(int countWidth, int numOfNodes);
  void writeSlotNIDest(int slotNum, int countWidth, int dest);
  void writeSlotNISrc(int src);
  void startniST(int num);
  void startrouterST(int num);
  void endniST(int num);
  void endrouterST(int num);
  void endArchNI();

public:
  vhdlOutput(string output_dir);
  ~vhdlOutput();
  // bool output_schedule(network_t& n);
  bool output_schedule(const network_t& n);
};

#endif /* VHDLOUTPUT_H */
```

## Listing D.5: vhdlOutput.cpp

```
/*
 * File:   vhdlOutput.cpp
 * Author: T410s
 *
 * Created on 6. august 2012, 11:13
 */

#include "vhdlOutput.h"

bool vhdlOutput::output_schedule(const network_t& n)
{
```

```
  int numOfNodes = n.routers().size();
  numOfNodesStr = ::lex_cast<string>(numOfNodes);
  int countWidth = ceil(log2(n.best));
  this->writeHeaderRouter(countWidth);
  this->writeHeaderNI(countWidth, numOfNodes);

  for (vector<router_t*>::const_iterator r = n.routers().begin(); r != n.routers().end(); r++) {
    // For each router, write Network Adapter Table and Router Table
    int r_id = (*r)->address.first + (*r)->address.second * n.cols();
    this->startniST(r_id);
    this->startrouterST(r_id);
    for (timeslot t = 0; t < n.best; t++) { // Write table row for each timeslot
      int t0 = t-1;
      int t1 = t;
      if (t == 0) {
        t0 = n.best - 1;
        t1 = n.best;
      }
      // Write row in Network Adapter table
      router_id dest_id = (*r)->address;
      router_id src_id = (*r)->address;
      if ((*r)->local_in_best_schedule.has((t+2)%n.best)) {
        dest_id = (*r)->local_in_best_schedule.get((t+2)%n.best)->to;
      }
      if ((*r)->local_out_best_schedule.has(t1)) {
        src_id = (*r)->local_out_best_schedule.get(t1)->from;
      }
      int dest = dest_id.first + dest_id.second * n.cols();
      int src = src_id.first + src_id.second * n.cols();
      this->writeSlotNIDest(t, countWidth, dest);
      this->writeSlotNISrc(src);

      // Write row in Router table
      //this->writeSlotRouter(t,countWidth,ports);
      port_id ports[5] = {__NUM_PORTS, __NUM_PORTS, __NUM_PORTS,
                          __NUM_PORTS, __NUM_PORTS};
      for (int out_p = 0; out_p < __NUM_PORTS-1; out_p++) {
        // For all 4 output ports not being the local port.
        if (!(*r)->out((port_id)out_p).has_link()) {
          continue; // No outgoing channel from the port.
        }
        if ((*r)->out((port_id)out_p).link().best_schedule.has(t)) {
          // If there is a channel coming out of the port, find the
          // input port from which the channel is coming.
          const channel* out_c = (*r)->out((port_id)out_p).link().best_schedule.get(t);
          for (int in_p = 0; in_p < __NUM_PORTS-1; in_p++) {
            // For all 4 input ports not being the local port.
```

```
            if (!(*r)->in((port_id)in_p).has_link())
              continue; // No link into this port
            if ((*r)->in((port_id)in_p).link().best_schedule.has(t0)) { // REMEMBER: Change back t-1 -> t
              const channel* in_c = (*r)->in((port_id)in_p).link().best_schedule.get(t0);
              if (out_c == in_c) {
                // The correct link found
                ports[(port_id)out_p] = (port_id)in_p;
                break;
              }
            }
          }
          if (ports[(port_id)out_p] == __NUM_PORTS) {
            // If channel was not found on any of the 4 input ports
            // It should be on the local in port, but we test it anyway.
            if ((*r)->local_in_best_schedule.has(t)) {
              const channel* in_c = (*r)->local_in_best_schedule.get(t);
              if (out_c == in_c) {
                // The correct link found
                ports[(port_id)out_p] = L;
              } else {
                cout << "Failure: Channel rose from nothing like a fenix." << endl;
              }
            }
          }
        } else {
          ports[(port_id)out_p] = __NUM_PORTS;
        }
        // ports[N] = ;
      }
      if ((*r)->local_out_best_schedule.has(t1)) { // For the local out port.
        const channel* out_c = (*r)->local_out_best_schedule.get(t1);
        for (int in_p = 0; in_p < __NUM_PORTS-1; in_p++) {
          // For all 4 input ports not being the local port.
          if (!(*r)->in((port_id)in_p).has_link())
            continue; // No link into this port
          if ((*r)->in((port_id)in_p).link().best_schedule.has(t0)) {
            const channel* in_c = (*r)->in((port_id)in_p).link().best_schedule.get(t0);
            if (out_c == in_c) {
              // The correct link found
              ports[L] = (port_id)in_p;
              break;
            }
          }
        }
        if (ports[L] == __NUM_PORTS) {
```

```
          // If channel was not found on any of the 4 input ports.
          // It should be on the local in port, but we test it anyway.
          cout << "Failure: Not allowed to route back in to local." << endl;
        }
        // and so on...
      }
      this->writeSlotRouter(t, countWidth, ports);
    }
    this->endniST(r_id);
    this->endrouterST(r_id);
  }
  // n.router(e)->next;
  //n.routers().at(n).local_out_best_schedule.get(t).from

  this->endArchRouter();
  this->endArchNI();
  delete this;
  return true;
}

string vhdlOutput::bin(int val, int bits) {
  int max = (int) pow(2.0, bits - 1);
  string s = "";
  for (int i = 0; i < bits; i++) {
    if (val/max >= 1) {
      val -= max;
      s += "1";
    } else {
      s += "0";
    }
    max = max / 2;
  }
  return s;
}

char vhdlOutput::p2c(port_id p) {
  char c;
  if (p == N) c = 'N';
  if (p == E) c = 'E';
  if (p == S) c = 'S';
  if (p == W) c = 'W';
  if (p == L) c = 'L';
  if (p == __NUM_PORTS) c = 'D';
  return c;
}

vhdlOutput::vhdlOutput(string output_dir) {
  niST.open(output_dir + "ni_ST_" + numOfNodesStr + ".vhd", ios::trunc);
```

```
  routerST.open(output_dir + "router_ST_" + numOfNodesStr + ".vhd", ios::trunc);
  if (!niST.good()) {
    niST.close();
    string new_file = output_dir + ::lex_cast<string>((int)time(NULL)) + "ni_ST_" + numOfNodesStr + ".vhd";
    cout << "Warning: Output failure, new output name: " + new_file << endl;
    niST.open(new_file, ios::trunc);
  }
  if (!routerST.good()) {
    routerST.close();
    string new_file = output_dir + ::lex_cast<string>((int)time(NULL)) + "router_ST_" + numOfNodesStr + ".vhd";
    cout << "Warning: Output failure, new output name: " + new_file << endl;
    routerST.open(new_file, ios::trunc);
  }
  // TODO: Error handling + Specify output file name
}

vhdlOutput::~vhdlOutput() {
  niST.close();
  routerST.close();
}

void vhdlOutput::writeHeaderRouter(int countWidth) {
  routerST << "--------------------------------------------------------------------------------\n";
  routerST << "-- router_ST_" << numOfNodesStr << ".vhd\n";
  routerST << "-- This is an auto generated file, do not edit by hand.\n";
  routerST << "-- These tables were generated from an application specific\n";
  routerST << "-- schedule by the SNTs project.\n";
  routerST << "-- https://github.com/rbscloud/SNTs\n";
  routerST << "--------------------------------------------------------------------------------\n";
  routerST << "library ieee;\n";
  routerST << "use ieee.std_logic_1164.all;\n";
  routerST << "use ieee.numeric_std.all;\n\n";
  routerST << "use work.noc_types.all;\n\n";
  routerST << "entity router_ST_" << numOfNodesStr << " is \n";
  routerST << "\tgeneric (\n";
  routerST << "\t\tNI_NUM\t: natural);\n";
  routerST << "\tport (\n";
  routerST << "\t\tcount\t: in unsigned(" << countWidth-1 << " downto 0);\n";
  routerST << "\t\tsels\t: out select_signals\n";
  routerST << "\t\t);\n";
  routerST << "end router_ST_" << numOfNodesStr << ";\n\n";

```
  routerST << "architecture data of router_ST_" << numOfNodesStr << " is \n";
  routerST << "begin -- data\n\n";
}

void vhdlOutput::endArchRouter() {
  routerST << "end data;\n";
}

void vhdlOutput::writeSlotRouter(int slotNum, int countWidth, port_id* ports) {
  routerST << "\t\twhen \"" << bin(slotNum,countWidth) << "\" =>\n";
  routerST << "\t\t\tsels(N) <= " << p2c(ports[N]) << ";\n";
  routerST << "\t\t\tsels(E) <= " << p2c(ports[E]) << ";\n";
  routerST << "\t\t\tsels(S) <= " << p2c(ports[S]) << ";\n";
  routerST << "\t\t\tsels(W) <= " << p2c(ports[W]) << ";\n";
  routerST << "\t\t\tsels(L) <= " << p2c(ports[L]) << ";\n";
}

void vhdlOutput::writeHeaderNI(int countWidth, int numOfNodes) {
  niST << "--------------------------------------------------------------------------------\n";
  niST << "-- ni_ST_" << numOfNodesStr << ".vhd\n";
  niST << "-- This is an auto generated file, do not edit by hand.\n";
  niST << "-- These tables were generated from an application specific\n";
  niST << "-- schedule by the SNTs project.\n";
  niST << "-- https://github.com/rbscloud/SNTs\n";
  niST << "--------------------------------------------------------------------------------\n";
  niST << "library ieee;\n";
  niST << "use ieee.std_logic_1164.all;\n";
  niST << "use ieee.numeric_std.all;\n\n";
  niST << "use work.noc_types.all;\n\n";
  niST << "entity ni_ST_" << numOfNodesStr << " is \n";
  niST << "\tgeneric (\n";
  niST << "\t\tNI_NUM\t: natural);\n";
  niST << "\tport (\n";
  niST << "\t\tcount\t: in unsigned(" << countWidth-1 << " downto 0);\n";
  niST << "\t\tdest\t: out integer range 0 to " << numOfNodes-1 << ";\n";
  niST << "\t\tsrc\t: out integer range 0 to " << numOfNodes-1 << "\n";
  niST << "\t\t);\n";
  niST << "end ni_ST_" << numOfNodesStr << ";\n\n";
```

```
  niST << "architecture data of ni_ST_" << numOfNodesStr << " is \n";
  niST << "begin -- data\n\n";
}

void vhdlOutput::startniST(int num) {
  startST(num, &this->niST);
}

void vhdlOutput::startrouterST(int num) {
  startST(num, &this->routerST);
}

void vhdlOutput::startST(int num, ofstream* ST) {
  *ST << "\tNI_NUM" << num << " : if NI_NUM = " << num << " generate\n";
  *ST << "\tprocess(count) begin\n\n";
  *ST << "\t\tcase count is\n";
}

void vhdlOutput::writeSlotNIDest(int slotNum, int countWidth, int dest) {
  niST << "\t\twhen \"" << bin(slotNum, countWidth) << "\" =>\n";
  niST << "\t\t\tdest <= " << dest << ";\n";
}

void vhdlOutput::writeSlotNISrc(int src) {
  niST << "\t\t\tsrc <= " << src << ";\n";
}

void vhdlOutput::endrouterST(int num) {
  routerST << "\t\twhen others =>\n";
  routerST << "\t\t\tsels(N) <= D;\n";
  routerST << "\t\t\tsels(E) <= D;\n";
  routerST << "\t\t\tsels(S) <= D;\n";
  routerST << "\t\t\tsels(W) <= D;\n";
  routerST << "\t\t\tsels(L) <= D;\n";
  routerST << "\t\tend case;\n";
  routerST << "\tend process;\n\n";
  routerST << "\tend generate NI_NUM" << num << ";\n\n";
}

void vhdlOutput::endniST(int num) {
  niST << "\t\twhen others =>\n";
  niST << "\t\t\tdest <= " << num << ";\n";
  niST << "\t\t\tsrc <= " << num << ";\n";
  niST << "\t\tend case;\n";
  niST << "\tend process;\n\n";
  niST << "\tend generate NI_NUM" << num << ";\n\n";
}

void vhdlOutput::endArchNI() {
  niST << "end data;\n";
}
```
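
Listing D.5 derives the width of the slot counter as `ceil(log2(n.best))` and uses `bin` to print each slot number as a fixed-width binary literal for the generated `case` statement. As a hedged sketch, both steps can be written with integer operations only. The names `count_width` and `to_binary` are hypothetical, and the result is meant to agree with `bin` only for values that fit in the given number of bits.

```cpp
#include <cassert>
#include <string>

// Smallest width w such that 2^w >= slots, which equals ceil(log2(slots))
// for slots >= 1. Hypothetical helper, not part of the listing.
int count_width(unsigned slots) {
    assert(slots >= 1 && slots <= (1u << 31));
    int w = 0;
    while ((1u << w) < slots) {
        ++w;
    }
    return w;
}

// Fixed-width binary string of val, most significant bit first.
// Bits of val above position bits-1 are ignored.
std::string to_binary(unsigned val, int bits) {
    assert(bits >= 0 && bits < 32);
    std::string s;
    for (int i = bits - 1; i >= 0; --i) {
        s += ((val >> i) & 1u) ? '1' : '0';
    }
    return s;
}
```

Note that a schedule of length 1 gives a width of 0 and therefore an empty literal, an edge case a generator built on this sketch would need to handle separately.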

## Listing D.6: ScheduleConverter.java

```
package dk.rbscloud.tcrest.SNTs;

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.*;
import org.w3c.dom.*;

/**
 * A converter from xml format to Setup of DMA tables
 * @author Rasmus
 */
public class ScheduleConverter {
  private enum Port {N, E, S, W, L, D}
  private static List<List<List<Integer>>> initArray;
  private static final int SLOT_TABLE = 0;
  private static final int ROUTE_TABLE = 1;
  private static Document doc;
  private static NodeList tList;

  public static void main(String[] args) {
    if (args.length < 1) {
      System.out.println("No input file specified!");
      return;
    }
    parseXml(args[0]);
    try {
      initializeArray(tList.getLength());
      int numOfNodes = tList.getLength();
      new TileCoord(0, 0, (int)Math.sqrt(numOfNodes)); // Initializing the static sideLength variable in TileCoord

      /* For each tile, slot table and route table is written. */
      for (int tileIdx = 0; tileIdx < tList.getLength(); tileIdx++) {
        Node tile = tList.item(tileIdx);
        if (tile.getNodeType() == Node.ELEMENT_NODE) {
          Element tileE = (Element) tile;
          TileCoord tileCoord = getTileCoord(tile);
          /* For each tile write the slot table */
          NodeList slotList = tileE.getElementsByTagName("timeslot");
          int slotTableWidth = (int)Math.ceil(Math.log(numOfNodes)/Math.log(2));
          /* For each time slot, slot table and route table is written. */
          for (int slotIdx = 0; slotIdx < slotList.getLength(); slotIdx++) {
            Node slot = slotList.item(slotIdx);
            if (slot.getNodeType() == Node.ELEMENT_NODE) {
              Element slotE = (Element) slot;
```

```
              // Get the coordinates of the receiver for this timeslot
              TileCoord destCoord = getDestCoord(slotE);
              // Write the destination ID in the slot table.
              int slotVal = destCoord.getTileId() | (1 << slotTableWidth);
              initArray.get(tileCoord.getTileId()).get(SLOT_TABLE)
                  .add(slotIdx,slotVal);

              /* For each transmission slot write an entry in the route table */
              String binRoute = "";
              if (destCoord.getTileId() != tileCoord.getTileId()){
                TileCoord tempTileCoord = new TileCoord(tileCoord.x, tileCoord.y);
                char inPort = 'L';
                for(int i = 0; tempTileCoord.getTileId() != destCoord.getTileId(); i++){
                  NodeList ports = getPorts(tempTileCoord, slotIdx+i);
                  char outPort = findOutputPort(ports, inPort);
                  binRoute = port2bin(outPort) + binRoute;
                  inPort = oppositPort(outPort);
                  nextTile(tempTileCoord,outPort);
                }
                // Route to local port
                binRoute = port2bin(inPort) + binRoute;
              }
              int route = Integer.parseInt("0" + binRoute, 2);
              // Write the route to the route table in the initArray
              initArray.get(tileCoord.getTileId()).get(ROUTE_TABLE)
                  .set(destCoord.getTileId(), route);
            }
          }
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
    SchedulePrinter printer = new SchedulePrinter();
    printer.printData(initArray);
    printer.printFooter();
  }

  private static NodeList getPorts(TileCoord tileCoord, int slotIdx){
    Element tileE = (Element) getTile(tileCoord);
    NodeList sList = tileE.getElementsByTagName("timeslot");
    // The schedule takes pipelining into account.
    // Shifting by 2 gets rid of the pipelining.
    slotIdx = (slotIdx + 2) % sList.getLength();
```

```
    Element slotE = (Element) sList.item(slotIdx);
    NodeList rList = slotE.getElementsByTagName("router");
    Node router = rList.item(0);
    Element routerE = (Element) router;
    return routerE.getElementsByTagName("output");
  }

  private static char findOutputPort(NodeList ports, char inPort){
    char outPort = ' ';
    for (int nodeIdx = 0; nodeIdx < ports.getLength(); nodeIdx++){
      if (ports.item(nodeIdx).getAttributes().getNamedItem("input")
          .getNodeValue().charAt(0) == inPort)
        outPort = ports.item(nodeIdx).getAttributes().getNamedItem("id")
            .getNodeValue().charAt(0);
    }
    return outPort;
  }

  private static char oppositPort(char p){
    if(p == 'N'){p = 'S';}
    else if(p == 'E'){p = 'W';}
    else if(p == 'S'){p = 'N';}
    else if(p == 'W'){p = 'E';}
    else{p = 'L';}
    return p;
  }

  private static void nextTile(TileCoord tileCoord, char outPort){
    if(outPort == 'N'){tileCoord.moveNorth();}
    else if(outPort == 'E'){tileCoord.moveEast();}
    else if(outPort == 'S'){tileCoord.moveSouth();}
    else if(outPort == 'W'){tileCoord.moveWest();}
    // If local port do nothing
  }

  private static String port2bin(char p){
    String bin;
    if(p == 'N'){bin = "10";}
    else if(p == 'E'){bin = "11";}
    else if(p == 'S'){bin = "00";}
    else if(p == 'W'){bin = "01";}
    else{bin = "00";}
    return bin;
  }

  private static Node getTile(TileCoord tileCoord){
    for (int tileIdx = 0; tileIdx < tList.getLength(); tileIdx++) { // For each tile
      Node tile = tList.item(tileIdx);
      if (tile.getNodeType() == Node.ELEMENT_NODE) {
        if (tileCoord.getTileId() == getTileCoord(tile).getTileId())
          return tile;
      }
```

```
    }
    return tList.item(0);
  }

  private static TileCoord getTileCoord(Node tile){
    String[] tileCoord = tile.getAttributes().getNamedItem("id")
        .getNodeValue().split("\\D");
    TileCoord tileId = new TileCoord(Integer.parseInt(tileCoord[1]),
        Integer.parseInt(tileCoord[2]));
    return tileId;
  }

  private static TileCoord getDestCoord(Element slotE) {
    String[] coord = slotE.getElementsByTagName("na").item(0)
        .getAttributes().getNamedItem("tx").getNodeValue().split("\\D");
    TileCoord destCoord = new TileCoord(Integer.parseInt(coord[1]),
        Integer.parseInt(coord[2]));
    return destCoord;
  }

  private static void parseXml(String inputFile){
    try{
      File fXmlFile = new File(inputFile);
      DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
      DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
      doc = dBuilder.parse(fXmlFile);
      doc.getDocumentElement().normalize();
      tList = doc.getElementsByTagName("tile");
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  private static void initializeArray(int nrCpu){
    initArray = new ArrayList<List<List<Integer>>>(nrCpu);
    for (int i = 0; i < nrCpu; i++) {
      initArray.add(new ArrayList<List<Integer>>(2));
      initArray.get(i).add(new ArrayList<Integer>());
      initArray.get(i).add(new ArrayList<Integer>(nrCpu));
      for (int j = 0; j < nrCpu; j++){
        initArray.get(i).get(ROUTE_TABLE).add(0);
      }
    }
  }
}
```
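The route construction in `main` can be traced in isolation. The following is a minimal sketch, not part of the tool chain: the class name `RouteEncodingSketch` and the method `encodeRoute` are hypothetical, and the list of output ports is passed in directly instead of being read from the schedule XML. The two-bit port codes are copied from `port2bin`. Each hop is prepended to the bit string, so the first hop ends up in the least significant bits, and the final entry is the code of the port through which the packet enters the destination router.

```java
class RouteEncodingSketch {
    // Two-bit port codes, copied from port2bin in the listing.
    static String port2bin(char p) {
        switch (p) {
            case 'N': return "10";
            case 'E': return "11";
            case 'S': return "00";
            case 'W': return "01";
            default:  return "00";
        }
    }

    // Same mapping as oppositPort in the listing.
    static char oppositePort(char p) {
        switch (p) {
            case 'N': return 'S';
            case 'E': return 'W';
            case 'S': return 'N';
            case 'W': return 'E';
            default:  return 'L';
        }
    }

    // Hypothetical helper: encodes a path given as the output port taken at each router.
    static int encodeRoute(char[] outPorts) {
        if (outPorts.length == 0) {
            return 0; // sender and receiver are the same tile, no route is built
        }
        String binRoute = "";
        char inPort = 'L';
        for (char outPort : outPorts) {
            binRoute = port2bin(outPort) + binRoute; // prepend: first hop stays in the low bits
            inPort = oppositePort(outPort);
        }
        binRoute = port2bin(inPort) + binRoute;      // entry appended after the last hop
        return Integer.parseInt("0" + binRoute, 2);
    }

    public static void main(String[] args) {
        System.out.println(encodeRoute(new char[] {'E'}));      // prints 7
        System.out.println(encodeRoute(new char[] {'N', 'W'})); // prints 54
    }
}
```

Under these assumptions, a single hop east encodes as binary `0111` and a north-then-west path as binary `110110`; both values also occur in the example route tables of Listing E.3.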

Listing D.7: SchedulePrinter.java

```
package dk.rbscloud.tcrest.SNTs;
import java.io.*;
import java.util.List;
```

```
public class SchedulePrinter{
  private static FileWriter ofile;
  private static int indent = 0;
  private static final int SLOT_TABLE = 0;
  private static final int ROUTE_TABLE = 1;

  public SchedulePrinter(){
    try {
      ofile = new FileWriter(new File("../Tables.java"));
      String str = ind() + "/**"
          + ind() + " * AUTO-Generated file DO NOT EDIT !!! "
          + ind() + " * Loads the pre calculated schedule into the Slot and Route tables."
          + ind() + " * @author package dk.rbscloud.tcrest.SNTs"
          + ind() + " */"
          + ind() + "package dk.rbscloud.tcrest.API;"
          + ind() + "import com.jopdesign.sys.Native;"
          + ind() + ""
          + ind() + "public class Tables" + openBrac()
          + ind() + "public static final int[][][] initArray = " + openBrac()
          + ind() + "";
      /* Layout of the generated array:
         public static final int[][][] initArray = {
           {
             {1, 2, 3, 4, 5},
             {1, 2, 3}
           },
           {
             {1, 2, 3, 4, 5},
             {1, 2, 3}
           }
         };
      */
      ofile.write(str);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public void printFooter(){
    try {
      String str = closeBrac() + ";"
          + ind() + "private static int[] getSlotTable(int cpuId)" + openBrac()
          + ind() + "return initArray[cpuId][0];"
          + closeBrac()
          + ind() + ""
          + ind() + "private static int[] getDmaTable(int cpuId)" + openBrac()
          + ind() + "return initArray[cpuId][1];"
          + closeBrac()
          + ind() + ""
```

```
          + ind() + "public static void load(int cpuId)" + openBrac()
          + ind() + "// Loading the slot table"
          + ind() + "int[] slotTable = Tables.getSlotTable(cpuId);"
          + ind() + "for(int i = 0; i < slotTable.length; i++)" + openBrac()
          + ind() + "Native.wrMem(slotTable[i], Const.SLOT_TBL_BASE+i);"
          + closeBrac()
          + ind() + "// Loading the dma table"
          + ind() + "int[] dmaTable = Tables.getDmaTable(cpuId);"
          + ind() + "for(int i = 0; i < dmaTable.length; i++)" + openBrac()
          + ind() + "Native.wrMem(dmaTable[i], Const.DMA_P_BASE+i);"
          + closeBrac()
          + closeBrac()
          + ind() + ""
          + ind() + "public static boolean verify(int cpuId)" + openBrac()
          + ind() + "// Reading and verifying the dma table"
          + ind() + "int[] dmaTable = getDmaTable(cpuId);"
          + ind() + "for(int i = 0; i < dmaTable.length; i++)" + openBrac()
          + ind() + "int dmaData = Native.rdMem(Const.DMA_P_BASE+i);"
          + ind() + "if (dmaData != dmaTable[i])" + openBrac()
          + ind() + "System.out.println(\"DMA_P_BASE faliure\");"
          + ind() + "return false;"
          + closeBrac()
          + closeBrac()
          + ind() + "return true;"
          + closeBrac()
          + closeBrac()
          + ind();
      ofile.write(str);
      ofile.close();
    } catch(Exception e) {
      e.printStackTrace();
    }
  }

  public void printData(List<List<List<Integer>>> initArray){
    String str = "";
    for(List<List<Integer>> initCpu: initArray){
      str += ind() + openBrac()
          + ind() + "{";
      for(int slot : initCpu.get(SLOT_TABLE)){
        str += slot + ", ";
      }
      str = str.substring(0, str.length()-1);
      str += "}," + ind() + "{";
      for(int route: initCpu.get(ROUTE_TABLE)){
```

```
        str += route + ",";
      }
      str = str.substring(0, str.length()-1);
      str += "}" + closeBrac() + ",";
    }
    str = str.substring(0, str.length()-1);
    try {
      ofile.write(str);
    } catch(Exception e){
      e.printStackTrace();
    }
  }

  private String ind(){
    String str = "\n";
    if (indent < 0) {
      return "";
    }
    for (int i = 0; i < indent; i++) {
      str += "\t";
    }
    return str;
  }

  private String openBrac(){
    indent++;
    return "{";
  }

  private String closeBrac(int b){
    String str = "";
    for (int i = 0; i < b; i++) {
      str += closeBrac();
    }
    return str;
  }

  private String closeBrac(){
    if (indent > 0) {
      indent--;
    }
    String str = ind();
    return str + "}";
  }
}
```

#### Listing D.8: TileCoord.java

```
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package dk.rbscloud.tcrest.SNTs;
```

```
/**
 *
 * @author Rasmus
 *
 */
public class TileCoord {
  public int x, y;
  private static int sideLength;

  public TileCoord(int x, int y, int sideLength){
    this.x = x;
    this.y = y;
    this.sideLength = sideLength;
  }

  public TileCoord(int x, int y){
    this.x = x;
    this.y = y;
  }

  public int getTileId(){
    return x+y*sideLength;
  }

  public void moveNorth(){
    if(this.y == 0){
      this.y = sideLength - 1;
    } else {
      this.y--;
    }
  }

  public void moveSouth(){
    if(this.y == sideLength - 1){
      this.y = 0;
    } else {
      this.y++;
    }
  }

  public void moveEast(){
    if(this.x == sideLength - 1){
      this.x = 0;
    } else {
      this.x++;
    }
  }

  public void moveWest(){
    if(this.x == 0){
      this.x = sideLength - 1;
    } else {
      this.x--;
    }
  }
}
```
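The move methods in `TileCoord` wrap around at the edges of the grid, and `getTileId` numbers the tiles row by row. The sketch below restates that arithmetic in a stateless form so it can be checked on its own. The class `TorusSketch` and its methods are hypothetical, and `Math.floorMod` is used here in place of the explicit edge tests in the listing.

```java
class TorusSketch {
    // Row-major tile id, as in TileCoord.getTileId().
    static int tileId(int x, int y, int side) {
        return x + y * side;
    }

    // One step through the given output port on a side x side torus.
    // Returns the new {x, y}; any other port (e.g. 'L') leaves the position unchanged.
    static int[] step(int x, int y, char port, int side) {
        switch (port) {
            case 'N': y = Math.floorMod(y - 1, side); break;
            case 'S': y = Math.floorMod(y + 1, side); break;
            case 'E': x = Math.floorMod(x + 1, side); break;
            case 'W': x = Math.floorMod(x - 1, side); break;
            default: break;
        }
        return new int[] {x, y};
    }

    public static void main(String[] args) {
        int side = 3; // 3x3 platform, i.e. 9 tiles
        int[] p = step(0, 0, 'N', side);
        System.out.println(tileId(p[0], p[1], side)); // prints 6: north of the top row wraps to the bottom row
        p = step(2, 1, 'E', side);
        System.out.println(tileId(p[0], p[1], side)); // prints 3: east of the last column wraps to column 0
    }
}
```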



## Appendix E

# MPI source code

This appendix contains the following files:

- **NoC.java** is the static Java class that implements the communication primitives.
- **Const.java** is a Java file with constants that describe the address space of the MPI.
- **Tables.java** is an example file generated by the schedule converter.

Listing E.1: NoC.java

```
    while (!doneDMA(destDMA));
    int txBufAddr = getTxBuf(cpuId, destCpuId);
//  System.out.println("txBufAddr: " + txBufAddr);
    copyInMsg(msg,txBufAddr);
    int localChanBuf = getChanBufAddr(cpuId, destCpuId);
    int destChanBuf = getChanBufAddr(destCpuId,cpuId);
    int writePointer = destChanBuf*Const.CHANNEL_BUF_SIZE
        + swapBuf(Const.TX_ACT_BUF, localChanBuf);
    int readPointer = localChanBuf*Const.CHANNEL_BUF_SIZE + Const.TX_BUF;
//  System.out.println("SRC: "+cpuId+" DEST: "+destCpuId+" WritePointer: " + writePointer);
//  System.out.println("readPointer: " + readPointer);
//  System.out.println("destDMA: " + destDMA);
    setupDMA(msg,readPointer,writePointer,destDMA);
    return true;
  }

  public static boolean sendRdy(int destCpuId){
    int destDMA = Const.DMA_BASE+(destCpuId << 1);
    return doneDMA(destDMA);
  }

  public static boolean recv(int[] msg, int srcCpuId, int cpuId){
    if (srcCpuId == cpuId) {
      return false;
    }
    int rxBufAddr = getRxBuf(cpuId, srcCpuId);
    while (!doneRecv(rxBufAddr));
    copyOutMsg(msg,rxBufAddr);
    int localChanBuf = getChanBufAddr(cpuId, srcCpuId);
    swapBuf(Const.RX_ACT_BUF, localChanBuf);
    //System.out.println("Swap: "+);
    //if(msg.length > 8){ return false;}
    return true;
  }

  public static boolean recvRdy(int srcCpuId, int cpuId){
    if (srcCpuId == cpuId) {
      return false;
    }
    int rxBufAddr = getRxBuf(cpuId, srcCpuId);
    return doneRecv(rxBufAddr);
  }

  private static boolean doneRecv(int addr){
    int length = Native.rdMem(addr);
    if(length != 0){
      if (Native.rdMem(addr+length+1) == -1){ return true;}
    }
    return false;
  }

  private static void setupDMA(int[] msg, int txBufAddr, int rxBufAddr,
      int addrDMA){
```

```
    Native.wrMem((txBufAddr << 16) | rxBufAddr, addrDMA + 1);
    if((msg.length & 1) == 0){
      Native.wrMem(msg.length+2 | 32768, addrDMA);
    } else{
      Native.wrMem(msg.length+3 | 32768, addrDMA);
    }
  }

  private static boolean doneDMA(int addrDMA){
    int DMA = Native.rdMem(addrDMA);
    if((DMA & 32768) != 0){
      if((DMA & 16384) == 0){ return false;}
    }
    return true;
  }

  private static void copyInMsg(int[] msg, int addr){
    Native.wrMem(msg.length, addr);
    for (int i = 1; i < msg.length+1; i++){
      Native.wrMem(msg[i-1], addr+i);
    }
    Native.wrMem(-1, addr+msg.length+1);
  }

  private static void copyOutMsg(int[] msg, int addr){
    int length = Native.rdMem(addr);
    //Native.wrMem(0, addr);
    for (int i = 1; i < length+1; i++){
      msg[i-1] = Native.rdMem(addr+i);
//    Native.wrMem(0, addr+i);
    }
    //Native.wrMem(0, addr+length+1);
    for (int i = 0; i < Const.BUFFER_SIZE; i++){
      Native.wrMem(0, addr+i);
    }
  }

  public static int getTxBuf(int cpuId, int destCpuId){
    int bufAddr = getChanBufAddr(cpuId, destCpuId);
    return Const.NI_BASE + (bufAddr * Const.CHANNEL_BUF_SIZE) + Const.TX_BUF;
  }

  public static int getRxBuf(int cpuId, int srcCpuId){
    int bufAddr = getChanBufAddr(cpuId, srcCpuId);
    return Const.NI_BASE + (bufAddr * Const.CHANNEL_BUF_SIZE)
        + getActBuf(bufAddr);
  }

  private static int getChanBufAddr(int cpuId, int channId){
    if (channId == Const.NUMBER_OF_CORES-1) {
      channId = cpuId;
    }
    return channId;
  }
```

```
  private static int getActBuf(int bufAddr){
    int actBufAddr = Const.NI_BASE + (bufAddr * Const.CHANNEL_BUF_SIZE)
        + Const.RX_ACT_BUF;
    int actBuf = Native.rdMem(actBufAddr);
    if ((actBuf & 4) == 0) { return Const.RX_BUF_1;}
    else { return Const.RX_BUF_2;}
  }

  private static int swapBuf(int statusAddr, int bufAddr){
    int actBufAddr = Const.NI_BASE + (bufAddr * Const.CHANNEL_BUF_SIZE)
        + statusAddr;
    int actBuf = Native.rdMem(actBufAddr);
//  System.out.println("actBuf: "+actBuf);
    int newActBuf = actBuf ^ 4;
//  System.out.println("newActBuf: "+newActBuf);
//  System.out.println("actBufAddr: "+actBufAddr);
    Native.wr(newActBuf, actBufAddr);
//  Native.wr(10, 4194432);
//  int j = 0;
//  for (int i = 0; i < 1000; i++){
//    j = i + 2;
//  }
//  int ret = Native.rd(4194432);
//  System.out.println("Read from mem: " + ret);
//  System.out.println("newActBuf: "+Native.rdMem(actBufAddr));
//  System.out.println("j: "+j);
    if ((actBuf & 4) == 0) { return Const.RX_BUF_1;}
    else { return Const.RX_BUF_2;}
  }

  public static void checkSPM(){
    int verified = 0;
    for (int i = 0; i < Const.COM_SPM_SIZE; i++){
      Native.wr(i, i+Const.COM_SPM);
      if (Native.rd(i+Const.COM_SPM) != i) {
        System.out.println("SPM error at address: " + i);
      } else {
        verified++;
      }
    }
    System.out.println(verified + " Addresses verified out of "
        + Const.COM_SPM_SIZE + " Addresses.");
    for (int i = 0; i < Const.COM_SPM_SIZE; i++){
      Native.wr(0, i+Const.COM_SPM);
    }
  }
}
```
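The message framing used by `copyInMsg`, `doneRecv` and `copyOutMsg` can be tried without the hardware. The following is a minimal sketch under stated assumptions: the class `MessageFramingSketch` is hypothetical, a plain `int` array stands in for the communication scratchpad that the listing accesses through `Native.wrMem` and `Native.rdMem`, and `copyOutMsg` clears only the frame it read, whereas the listing clears `Const.BUFFER_SIZE` words. A frame consists of a length word, the payload, and a terminating `-1`; the receiver treats a frame as complete once the length is non-zero and the terminator is in place.

```java
import java.util.Arrays;

class MessageFramingSketch {
    // Stand-in for the communication scratchpad (assumption: 32 words is enough for the demo).
    static final int[] mem = new int[32];

    // Frame layout at addr: [length][payload ...][-1]
    static void copyInMsg(int[] msg, int addr) {
        mem[addr] = msg.length;
        for (int i = 0; i < msg.length; i++) {
            mem[addr + 1 + i] = msg[i];
        }
        mem[addr + msg.length + 1] = -1;
    }

    // A frame is complete when the length is non-zero and the terminator has arrived.
    static boolean doneRecv(int addr) {
        int length = mem[addr];
        return length != 0 && mem[addr + length + 1] == -1;
    }

    // Reads the payload and clears the frame so that the buffer can be reused.
    static int[] copyOutMsg(int addr) {
        int length = mem[addr];
        int[] msg = Arrays.copyOfRange(mem, addr + 1, addr + 1 + length);
        Arrays.fill(mem, addr, addr + length + 2, 0);
        return msg;
    }

    public static void main(String[] args) {
        System.out.println(doneRecv(2));                    // prints false
        copyInMsg(new int[] {5, 6, 7}, 2);
        System.out.println(doneRecv(2));                    // prints true
        System.out.println(Arrays.toString(copyOutMsg(2))); // prints [5, 6, 7]
        System.out.println(doneRecv(2));                    // prints false
    }
}
```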



Listing E.2: Const.java

```
package dk.rbscloud.tcrest.API;
/**
 * Constants for the T-CREST DMA Network Interface
 * @author Rasmus Bo Soerensen
 */
class Const {
  /* DMA Network Interface addresses */
  public static final int NI_BASE = 0x400000;
  public static final int COM_SPM = NI_BASE;
  public static final int DMA_BASE = NI_BASE + 0x80000;
  public static final int DMA_P_BASE = NI_BASE + 0x100000;
  public static final int SLOT_TBL_BASE = NI_BASE + 0x180000;
  public static final int CONFIG_DONE = NI_BASE + 0x100;

  // Channel buffer constants
  public static final int COM_SPM_SIZE = 256;
  public static final int NUMBER_OF_CORES = 9;
  public static final int CHANNEL_BUF_SIZE = COM_SPM_SIZE/(NUMBER_OF_CORES-1);
  public static final int BUFFER_SIZE = (CHANNEL_BUF_SIZE-2)/3;
  public static final int RX_ACT_BUF = 0;
  public static final int TX_ACT_BUF = 1;
  public static final int RX_BUF_1 = 2 + BUFFER_SIZE*0;
  public static final int RX_BUF_2 = 2 + BUFFER_SIZE*1;
  public static final int TX_BUF = 2 + BUFFER_SIZE*2;
}
```
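The channel buffer constants follow from a little integer arithmetic, which the sketch below spells out. The class `BufferLayoutSketch` and the helper `txBufAddr` are hypothetical; the constant values are copied from `Const`, and the address formula is the one used by `getTxBuf` in `NoC.java`. The maximum payload of eight words is an inference from the framing in `NoC.java` (one length word and one terminator per buffer), not a figure stated in the listing.

```java
class BufferLayoutSketch {
    static final int NI_BASE = 0x400000;
    static final int COM_SPM_SIZE = 256;
    static final int NUMBER_OF_CORES = 9;

    // 256 words shared by 8 channels: 32 words per channel.
    static final int CHANNEL_BUF_SIZE = COM_SPM_SIZE / (NUMBER_OF_CORES - 1);
    // Two status words, then three equally sized buffers: (32 - 2) / 3 = 10 words each.
    static final int BUFFER_SIZE = (CHANNEL_BUF_SIZE - 2) / 3;

    static final int RX_BUF_1 = 2;                   // offset 2
    static final int RX_BUF_2 = 2 + BUFFER_SIZE;     // offset 12
    static final int TX_BUF = 2 + BUFFER_SIZE * 2;   // offset 22

    // Inferred: a frame needs a length word and a terminator next to the payload.
    static final int MAX_PAYLOAD = BUFFER_SIZE - 2;

    // Absolute address of the transmit buffer of a channel, as computed in getTxBuf.
    static int txBufAddr(int channel) {
        return NI_BASE + channel * CHANNEL_BUF_SIZE + TX_BUF;
    }

    public static void main(String[] args) {
        System.out.println(CHANNEL_BUF_SIZE);                  // prints 32
        System.out.println(BUFFER_SIZE);                       // prints 10
        System.out.println(MAX_PAYLOAD);                       // prints 8
        System.out.println(Integer.toHexString(txBufAddr(1))); // prints 400036
    }
}
```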

Listing E.3: Tables.java

```
/**
 * AUTO-Generated file DO NOT EDIT !!!
 * Loads the pre calculated schedule into the Slot and Route tables.
 * @author package dk.rbscloud.tcrest.SNTs
 */
package dk.rbscloud.tcrest.API;
import com.jopdesign.sys.Native;

public class Tables {
  public static final int[][][] initArray = {
    {
      {17, 18, 19, 20, 24},
      {0, 7, 13, 8, 28, 0, 0, 0, 54}
    },
    {
      {17, 16, 17, 17, 17},
      {13, 0, 0, 0, 0, 0, 0, 0, 0, 0}
    },
    {
      {18, 18, 18, 18, 16, 17},
      {7, 13, 0, 0, 0, 0, 0, 0, 0, 0}
    },
    {
      {16, 19, 19, 18, 19},
      {2, 0, 54, 0, 0, 0, 0, 0, 0, 0}
```

```
    },
    {
      {20, 20, 20, 16, 19},
      {54, 0, 0, 13, 0, 0, 0, 0, 0}
    },
    {
      {21, 21, 21, 21, 20, 21},
      {0, 0, 0, 0, 0, 13, 0, 0, 0, 0}
    },
    {
      {22, 22, 22, 22, 21, 22},
      {0, 0, 0, 0, 0, 0, 54, 0, 0, 0}
    },
    {
      {23, 23, 23, 23, 22, 23},
      {0, 0, 0, 0, 0, 0, 0, 13, 0, 0}
    },
    {
      {24, 24, 24, 23, 24},
      {0, 0, 0, 0, 0, 0, 0, 0, 13, 0}
    }
  };

  private static int[] getSlotTable(int cpuId){
    return initArray[cpuId][0];
  }

  private static int[] getDmaTable(int cpuId){
    return initArray[cpuId][1];
  }

  public static void load(int cpuId){
    // Loading the slot table
    int[] slotTable = Tables.getSlotTable(cpuId);
    for(int i = 0; i < slotTable.length; i++){
      Native.wrMem(slotTable[i], Const.SLOT_TBL_BASE+i);
    }
    // Loading the dma table
    int[] dmaTable = Tables.getDmaTable(cpuId);
    for(int i = 0; i < dmaTable.length; i++){
      Native.wrMem(dmaTable[i], Const.DMA_P_BASE+i);
    }
  }

  public static boolean verify(int cpuId){
    // Reading and verifying the dma table
    int[] dmaTable = getDmaTable(cpuId);
    for(int i = 0; i < dmaTable.length; i++){
      int dmaData = Native.rdMem(Const.DMA_P_BASE+i);
      if (dmaData != dmaTable[i]){
        System.out.println("DMA_P_BASE faliure");
        return false;
      }
    }
    return true;
  }
}
```

## Appendix F

# Test and benchmark source code

This appendix contains the following files:

- **HelloDMA.java** is the Hello World program, which sends a message between all processors before writing Hello World to the console.
- **DMABench.java** is the microbenchmark program for the T-CREST NoC platform.
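The microbenchmark uses one measurement pattern throughout: read a cycle counter twice back to back to find the cost of reading it, then subtract that overhead from each timed operation and average over a fixed number of iterations. The sketch below shows the pattern in isolation. The class `OverheadTimingSketch` and the method `averageCycles` are hypothetical, and the counter is passed in as an `IntSupplier` so the example runs without the hardware; on the platform the counter is read with `Native.rdMem(Const.IO_CNT)`.

```java
import java.util.function.IntSupplier;

class OverheadTimingSketch {
    // Average cost of op in counter ticks, with the cost of reading the counter subtracted.
    static int averageCycles(IntSupplier counter, Runnable op, int iterations) {
        int t0 = counter.getAsInt();
        int t1 = counter.getAsInt();
        int overhead = t1 - t0;            // two back-to-back reads give the timing overhead
        int sum = 0;
        for (int i = 0; i < iterations; i++) {
            t0 = counter.getAsInt();
            op.run();
            t1 = counter.getAsInt();
            sum += t1 - t0 - overhead;
        }
        return sum / iterations;
    }

    public static void main(String[] args) {
        // Simulated clock (assumption): each counter read costs 1 tick, the operation costs 5.
        int[] clock = {0};
        IntSupplier counter = () -> ++clock[0];
        Runnable op = () -> clock[0] += 5;
        System.out.println(averageCycles(counter, op, 10)); // prints 5
    }
}
```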

Listing F.1: HelloDMA.java

```
/*
  This file is part of JOP, the Java Optimized Processor
    see <http://www.jopdesign.com/>

  Copyright (C) 2005-2008, Martin Schoeberl (martin@jopdesign.com)

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

/**
 *
 */
package cmp;

import java.util.Vector;

import joprt.RtThread;

import com.jopdesign.io.IOFactory;
import com.jopdesign.io.SysDevice;
import com.jopdesign.sys.Const;
import com.jopdesign.sys.Native;
import com.jopdesign.sys.Startup;
import dk.rbscloud.tcrest.API.*;

/**
 * A CMP version of Hello World
 *
 * @author Rasmus
 *
 */
public class HelloDMA implements Runnable {

  int id;

  static Vector msg;

  public HelloDMA(int i) {
    id = i;
  }

  /**
   * @param args
   */
  public static void main(String[] args) {
    Tables.load(0);
    SysDevice sys = IOFactory.getFactory().getSysDevice();
    msg = new Vector();
    System.out.println("Core 0 started");
    for (int i=0; i<sys.nrCpu-1; ++i) {
      Runnable r = new HelloDMA(i+1);
      Startup.setRunnable(r, i);
    }
```

```
    // start the other CPUs
    sys.signal = 1;
    int[] message = {0, 1, 2, 3, 4, 5, 6, 7};
    int[] rmessage = {0, 0, 0, 0, 0, 0, 0, 0, 0};
    NoC.send(message, sys.nrCpu-1, 0);
    for (;;) {
      int size = msg.size();
      if (size!=0) {
        StringBuffer sb = (StringBuffer) msg.remove(0);
        System.out.println(sb);
      }
      if(NoC.recvRdy(1,0)){
        NoC.recv(rmessage, 1, 0);
        for (int i = 0; i < message.length; i++){
          if (message[i] != rmessage[i]) { System.exit(1);}
        }
        System.out.println("Hello World!");
      }
    }
  }

  public void run() {
    Tables.load(id);
    StringBuffer sb = new StringBuffer();
    sb.append("Core ").append(id).append(" started");
    RtThread.sleepMs(300*id);
    msg.addElement(sb);
    int[] message = {0, 0, 0, 0, 0, 0, 0, 0, 0};
    int src = id+1;
    if(id == 8){ src = 0;}
    for (;;) {
      NoC.recv(message, src, id);
      NoC.send(message, id-1, id);
    }
  }
}
```

Listing F.2: DMABench.java

```
/*
  This file is part of JOP, the Java Optimized Processor
    see <http://www.jopdesign.com/>

  Copyright (C) 2005-2008, Martin Schoeberl (martin@jopdesign.com)

  This program is free software: you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation, either version 3 of the License, or
  (at your option) any later version.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

/**
 *
 */
package cmp;

import java.util.Vector;

import joprt.RtThread;

import com.jopdesign.io.IOFactory;
import com.jopdesign.io.SysDevice;
import com.jopdesign.sys.Const;
import com.jopdesign.sys.Native;
import com.jopdesign.sys.Startup;
import dk.rbscloud.tcrest.API.*;

/**
 * A DMA Benchmark
 *
 * @author Rasmus
 *
 */
public class DMABench implements Runnable {

  int id;

  static Vector msg;

  public DMABench(int i) {
    id = i;
  }

  /**
   * @param args
   */
  public static void main(String[] args) {
    msg = new Vector();
    System.out.println("Core 0 started");
    Tables.load(0);
    SysDevice sys = IOFactory.getFactory().getSysDevice();
    int[] message = IOFactory.getFactory().getScratchpadMemory();
    for (int i = 0; i < message.length; i++){
      message[i] = i;
    }
//  int [] message = {4, 4, 7, 8768, 456, 34, 6, 27};  // 8
```

```
//  int [] message = {4, 4, 7, 8768, 456, 34, 6};      // 7
//  int [] message = {4, 4, 7, 8768, 456, 34};         // 6
//  int [] message = {4, 4, 7, 8768, 456};             // 5
//  int [] message = {4, 4, 7, 8768};                  // 4
//  int [] message = {4, 4, 7};                        // 3
//  int [] message = {4, 4};                           // 2
//  int [] message = {4};                              // 1

    for (int i=0; i<sys.nrCpu-1; ++i) {
      Runnable r = new DMABench(i+1);
      Startup.setRunnable(r, i);
    }

    // start the other CPUs
    sys.signal = 1;

    // Print out start messages from other cores
    for (int printOut = 0; printOut < sys.nrCpu-1;) {
      int size = msg.size();
      if (size!=0) {
        StringBuffer sb = (StringBuffer) msg.remove(0);
        System.out.println(sb);
        printOut++;
      }
    }

    int tRead, t0, t1;
    int time, sum, res;
    int i;
    int iterations = 10;

    // Find timing overhead
    t0 = Native.rdMem(Const.IO_CNT);
    t1 = Native.rdMem(Const.IO_CNT);
    tRead = t1 - t0;
    System.out.println("Timing overhead\t= " + tRead);

    // Find average read time
    sum = 0;
    for (i = 0; i < iterations; i++){
      t0 = Native.rdMem(Const.IO_CNT);
      res = Native.rdMem(0x400000);
      t1 = Native.rdMem(Const.IO_CNT);
      time = t1-t0-tRead;
//    System.out.println("Read time = " + time);
      sum += time;
    }
    time = sum/iterations;
    System.out.println("Avg read time\t= " + time);

    // Find average write time
    sum = 0;
    for (i = 0; i < iterations; i++){
      t0 = Native.rdMem(Const.IO_CNT);
      Native.wrMem(0, 0x40000A);
      t1 = Native.rdMem(Const.IO_CNT);
      time = t1-t0-tRead;
//    System.out.println("Write time = " + time);
```

```
      sum += time;
    }
    time = sum/iterations;
    System.out.println("Avg write time\t= " + time);

    // Time for single calculation
    t0 = Native.rdMem(Const.IO_CNT);
    message[0] = message[0] + 1;
//  i = i + 1;
    t1 = Native.rdMem(Const.IO_CNT);
    time = t1-t0-tRead;
    System.out.println("Message[0]++\t\t= " + time);

    // Find round trip time
    sum = 0;
    for (i = 0; i < iterations; i++){
      t0 = Native.rdMem(Const.IO_CNT);
      NoC.send(message, sys.nrCpu-1, 0);
      NoC.recv(message, 1, 0);
      t1 = Native.rdMem(Const.IO_CNT);
      time = t1-t0-tRead;
//    System.out.println("Write time = " + time);
      sum += time;
    }
    time = sum/iterations;
    System.out.println("Avg roundtrip time\t= " + time);

    // Find interleaved round trip time
    sum = 0;
    for (i = 0; i < iterations; i++){
      t0 = Native.rdMem(Const.IO_CNT);
      NoC.send(message, sys.nrCpu-1, 0);
      NoC.send(message, sys.nrCpu-1, 0);
      NoC.recv(message, 1, 0);
      NoC.recv(message, 1, 0);
      t1 = Native.rdMem(Const.IO_CNT);
      time = t1-t0-tRead;
//    System.out.println("Write time = " + time);
      sum += time;
    }
    time = sum/iterations;
    System.out.println("Avg interleaved roundtrip time\t= " + time);

    // Find Echo time
    sum = 0;
    for (i = 0; i < iterations; i++){
      t0 = Native.rdMem(Const.IO_CNT);
      NoC.send(message, 1, 0);
      NoC.recv(message, 1, 0);
      t1 = Native.rdMem(Const.IO_CNT);
      time = t1-t0-tRead;
//    System.out.println("Write time = " + time);
      sum += time;
    }
    time = sum/iterations;
```

```
System.out.println("Avg echo timet = " + time);
178
        // Find interleaved echo time
        sum = 0;
        for (i = 0; i < iterations; i++) {
          t0 = Native.rdMem(Const.IO_CNT);
          NoC.send(message, 2, 0);
          NoC.send(message, 2, 0);
          NoC.recv(message, 2, 0);
          NoC.recv(message, 2, 0);
          t1 = Native.rdMem(Const.IO_CNT);
          time = t1 - t0 - tRead;
          System.out.println("Write time = " + time);
          sum += time;
        }
        time = sum / iterations;
        System.out.println("Avg interleaved echo time\t= " + time);
        // Find time for send
        sum = 0;
        for (i = 0; i < iterations; i++) {
          t0 = Native.rdMem(Const.IO_CNT);
          NoC.send(message, 3, 0);
          t1 = Native.rdMem(Const.IO_CNT);
          NoC.recv(message, 3, 0);
          time = t1 - t0 - tRead;
          System.out.println("Write time = " + time);
          sum += time;
        }
        time = sum / iterations;
        System.out.println("Avg send time\t= " + time);
        // Find time for recv
        sum = 0;
        for (i = 0; i < iterations; i++) {
          NoC.send(message, 4, 0);
          // Busy-wait until a message is ready, so only recv itself is timed
          while (!NoC.recvRdy(4, 0));
          t0 = Native.rdMem(Const.IO_CNT);
          NoC.recv(message, 4, 0);
          t1 = Native.rdMem(Const.IO_CNT);
          time = t1 - t0 - tRead;
          System.out.println("Write time = " + time);
          sum += time;
        }
        time = sum / iterations;
        System.out.println("Avg recv time\t= " + time);
        for (;;) {
          if (NoC.recvRdy(1, 0)) {
            t0 = Native.rdMem(Const.IO_CNT);
            NoC.recv(message, 3, 0);
            t1 = Native.rdMem(Const.IO_CNT);
            time = t1 - t0 - tRead;
            System.out.println("Receive time = " + time);
            t0 = Native.rdMem(Const.IO_CNT);
            NoC.send(message, 3, 0);
            t1 = Native.rdMem(Const.IO_CNT);
            time = t1 - t0 - tRead;
            System.out.println("Send time = " + time);
            message[0] = message[0] + 1;
          }
        }
      }
      public void run() {
        Tables.load(id);
        //NoC.checkSPM();
        StringBuffer sb = new StringBuffer();
        if (!Tables.verify(id)) {
          sb.append("Schedule failure: CPU ").append(id).append("\n");
        }
        sb.append("Core ").append(id).append(" started");
        RtThread.sleepMs(300 * id);
        msg.addElement(sb);
        //sb.delete(0, sb.length());
        int[] message = IOFactory.getFactory().getScratchpadMemory();
        for (int i = 0; i < message.length; i++) {
          message[i] = i;
        }
        // int[] message = {0*id, 1*id, 2*id, 3*id, 4*id, 5*id, 6*id, 7*id}; // 8
        // int[] message = {0*id, 1*id, 2*id, 3*id, 4*id, 5*id, 6*id}; // 7
        // int[] message = {0*id, 1*id, 2*id, 3*id, 4*id, 5*id}; // 6
        // int[] message = {0*id, 1*id, 2*id, 3*id, 4*id}; // 5
        // int[] message = {0*id, 1*id, 2*id, 3*id}; // 4
        // int[] message = {0*id, 1*id, 2*id};
        // int[] message = {0*id, 1*id};
        // int[] message = {0*id};
        int iterations = 10;
        int src = id + 1;
        if (id == 8) {
          src = 0;
        }
        // Roundtrip measurements
        for (int i = 0; i < iterations * 3; i++) {
          NoC.recv(message, src, id);
          message[0] = message[0] + 1;
          NoC.send(message, id - 1, id);
        }
        // Echo measurements
        for (int i = 0; i < iterations * 2; i++) {
          NoC.recv(message, 0, id);
          message[0] = message[0] + 1;
          NoC.send(message, 0, id);
        }
      }
```

Test and benchmark source code

#### Bibliography

- [1] Rasmus Bo Sørensen, Martin Schoeberl, and Jens Sparsø. A light-weight statically scheduled network-on-chip. In *30th NorChip Conference*, 2012.
- [2] Scott Hansen. T-CREST project. Project webpage http://t-crest.org, 2012.
- [3] Martin Schoeberl, Pascal Schleuniger, Wolfgang Puffitsch, Florian Brandner, Christian W. Probst, Sven Karlsson, and Tommy Thorn. Towards a Time-predictable Dual-Issue Microprocessor: The Patmos Approach. In *Bringing Theory to Practice: Predictability and Performance in Embedded Systems*, volume 18 of *OASIcs*, pages 11–21. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Mar 2011.
- [4] Martin Schoeberl. JOP Reference Handbook: Building Embedded Systems with a Java Processor. Number ISBN 978-1438239699. CreateSpace, August 2009. Available at http://www.jopdesign.com/doc/handbook.pdf.
- [5] K. Goossens and A. Hansson. The Aethereal network on chip after ten years: Goals, evolution, lessons, and future. In *Design Automation Conference (DAC), 2010 47th ACM/IEEE*, pages 306–311, Jun 2010.
- [6] T. Bjerregaard and J. Sparsø. A Router Architecture for Connection-Oriented Service Guarantees in the MANGO Clockless Network-on-Chip. In *Proceedings of Design, Automation and Test in Europe (DATE)*, pages 1226–1231. IEEE Computer Society Press, 2005.
- [7] M. Millberg, E. Nilsson, R. Thid, and A. Jantsch. Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip. In *Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings*, volume 2, pages 890–895, Feb 2004.

- [8] Martin Schoeberl, Florian Brandner, Jens Sparsø, and Evangelia Kasapaki. A statically scheduled time-division-multiplexed network-on-chip for real-time systems. In *Proceedings of the 6th International Symposium on Networks-on-Chip (NOCS)*, Lyngby, Denmark, May 2012. IEEE.
- [9] Martin Schoeberl. Leros: A tiny microcontroller for FPGAs. In Proceedings of the 21st International Conference on Field Programmable Logic and Applications (FPL 2011), Chania, Crete, Greece, September 2011. IEEE Computer Society.
- [10] Jens Sparsø, Evangelia Kasapaki, and Martin Schoeberl. An area-efficient network adaptor for a TDM-based network-on-chip. In *Design, Automation and Test in Europe Conference and Exhibition 2013*, 2013. Accepted to DATE'13.
- [11] Martin Schoeberl. SimpCon - a simple and efficient SoC interconnect. In *Proceedings of the 15th Austrian Workshop on Microelectronics, Austrochip 2007*, Graz, Austria, October 2007.
- [12] OCP-IP. Open core protocol specification. Technical report, 2012. Available at http://www.ocpip.org/uploads/dynamic\_areas/Xu4qydXgbYWof7Ihz3Uh/947/Open%20Core%20Protocol%20Specification%203.0.pdf.
- [13] Maurizio Palesi, Shashi Kumar, and Rickard Holsmark. A method for router table compression for application specific routing in mesh topology NoC architectures. In Stamatis Vassiliadis, Stephan Wong, and Timo D. Hämäläinen, editors, *Embedded Computer Systems: Architectures, Modeling, and Simulation*, volume 4017 of *Lecture Notes in Computer Science*, pages 373–384. Springer Berlin Heidelberg, 2006.
- [14] Evgeny Bolotin, Israel Cidon, Ran Ginosar, and Avinoam Kolodny. Routing table minimization for irregular mesh NoCs. In *Proceedings of the Conference on Design, Automation and Test in Europe*, DATE '07, pages 942–947, San Jose, CA, USA, 2007. EDA Consortium.
- [15] Wikipedia.org. Rencontres numbers, 2012. http://en.wikipedia.org/wiki/Rencontres\_numbers.
- [16] S. Even, A. Itai, and A. Shamir. On the complexity of time table and multicommodity flow problems. In *Foundations of Computer Science, 1975., 16th Annual Symposium on*, pages 184–193, Oct 1975.
- [17] Florian Brandner and Martin Schoeberl. Static routing in symmetric real-time network-on-chips. In *Proceedings of the 20th International Conference on Real-Time and Network Systems*, RTNS '12, pages 61–70, New York, NY, USA, 2012. ACM.

- [18] K. Goossens, A. Radulescu, and A. Hansson. A unified approach to constrained mapping and routing on network-on-chip architectures. In *Hardware/Software Codesign and System Synthesis, 2005. CODES+ISSS '05. Third IEEE/ACM/IFIP International Conference on*, pages 75–80, Sept 2005.
- [19] Mark Ruvald Pedersen, Jaspur Højgaard, and Rasmus Bo Sørensen. Scheduling in a real-time network-on-chip. Technical report, Department of Informatics and Mathematical Modelling, Technical University of Denmark, 2012.
- [20] Stefan Ropke and David Pisinger. An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. *Transportation Science*, 40(4):455–472, November 2006.
- [21] Thomas A. Feo and Mauricio G.C. Resende. Greedy randomized adaptive search procedures. *Journal of Global Optimization*, 6:109–133, March 1995.
- [22] pugixml.org. Light-weight, simple and fast XML parser for C++ with XPath support, 2012.
- [23] C. A. R. Hoare. Communicating sequential processes. *Communications of the ACM*, 21(8):666–677, Aug 1978. Reprinted in "Distributed Computing: Concepts and Implementations" edited by McEntire, O'Reilly and Larson, IEEE, 1984.
- [24] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, *Information processing*, pages 471–475, Stockholm, Sweden, Aug 1974. North Holland, Amsterdam.
- [25] MPI-forum. MPI: A Message-Passing Interface Standard Version 3.0. MPI-forum, 2012. Available at http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
- [26] open-mpi.org. Open MPI: Open source high performance computing, 2012. Available at http://www.open-mpi.org/.
- [27] Cristian Grecu, André Ivanov, Partha Pande, Axel Jantsch, Erno Salminen, Umit Ogras, and Radu Marculescu. An initiative towards open network-on-chip benchmarks. Technical report, OCP-IP, 2007.