Project NIO: Hardware-Software Co-Design with a NIOS II Processor

This project is a compulsory part of the examination for the System-on-Chip Design course at the University of Twente. The goals of this project are:

To become familiar with a simple microprocessor-based system configured with platform-based design tools.
To gain some experience with the interaction between software running on a microprocessor and peripheral hardware.
To learn how to make choices between hardware and software implementations of desired functionality.

This project uses two sets of files to be stored in two different directories.

Files for Part 1

The first part of the project has to be carried out in the same directory as projects SYN and DAT. Once you have logged in, execute the command

get-module sec syn

to get the first set of files in a subdirectory syn (assuming syn is the name of the directory you used for project SYN).

Fixed-Point Arithmetic

Consider a bit vector of, say, 5 positions. Such a bit vector can be used to represent a signed integer following the 2's-complement notation. Such a vector can then encode all numbers from -16 ("10000") to +15 ("01111"). -1, 0 and +1 are respectively encoded as "11111", "00000" and "00001".

If one wants to deal with fractional numbers, one can agree to have an imaginary binary point in the bit vector. The number of bits at the right of the binary point determines the accuracy of the number system. With one bit, the accuracy is 1/2, with two it is 1/4, etc. (compare with the decimal notation system where the first position after the decimal point indicates 1/10, the second 1/100, etc.). The bits left of the binary point are called the integer bits and those at its right the fractional bits. For n fractional bits, the represented value is obtained by interpreting the bit vector as a 2's-complement integer and dividing that integer by 2 to the power of n.

So a 5-bit vector with 3 integer and 2 fractional bits, can represent all numbers between -4 ("100.00") and +15/4 ("011.11") in steps of 1/4. If all 5 bits were fractional, the represented range would be from -1/2 (".10000") to +15/32 (".01111") in steps of 1/32.
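
The interpretation rule above can be sketched in C as follows. This is only an illustration, not part of the project files; the function names sign_extend and fixed_to_real are made up for this example, and the vector width is assumed to be smaller than 32.

```c
#include <stdint.h>

/* Interpret the lowest `width` bits of `raw` as a 2's-complement integer.
   Assumes 1 <= width < 32. */
static int32_t sign_extend(uint32_t raw, unsigned width)
{
    uint32_t sign_bit = 1u << (width - 1);
    uint32_t mask = (sign_bit << 1) - 1u;
    raw &= mask;
    /* Flipping the sign bit and subtracting its weight gives the signed value. */
    return (int32_t)(raw ^ sign_bit) - (int32_t)sign_bit;
}

/* Real value of a `width`-bit fixed-point vector with `frac_bits` fractional bits:
   the 2's-complement integer divided by 2^frac_bits. */
static double fixed_to_real(uint32_t raw, unsigned width, unsigned frac_bits)
{
    return (double)sign_extend(raw, width) / (double)(1u << frac_bits);
}
```

For example, fixed_to_real(0x10, 5, 2) corresponds to "100.00" and fixed_to_real(0x0F, 5, 2) to "011.11".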

A number with an associated binary point at some position is called a fixed-point number. An alternative way to deal with fractional numbers is to use floating-point numbers where part of the bit vector is a fraction and the other part an exponent with which the fraction should be scaled. Floating point numbers are not considered here any further as they need complex hardware for their implementation. The type of fixed-point number presented here is the signed variant. Also an unsigned variant exists.

The hardware needed for fixed-point numbers also differs somewhat from hardware for integer arithmetic. Additions of numbers with unequal positions of the binary point, for example, require that the binary points are aligned before adding them. In multiplications, the position of the binary point in the result depends on the binary-point positions of the two operands.
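
A minimal C sketch of these two rules is given below. The type fixed_t and the functions fixed_add and fixed_mul are hypothetical names introduced for this illustration; word-length limits and overflow are deliberately ignored.

```c
#include <stdint.h>

/* A fixed-point value: the integer `raw` scaled by 2^-frac_bits. */
typedef struct {
    int32_t raw;
    unsigned frac_bits;
} fixed_t;

/* Addition: first align both operands to the larger number of fractional
   bits, then add the integers. Multiplying by a power of two is used
   instead of a left shift so that negative values are handled portably. */
static fixed_t fixed_add(fixed_t a, fixed_t b)
{
    unsigned frac = a.frac_bits > b.frac_bits ? a.frac_bits : b.frac_bits;
    fixed_t sum;
    sum.raw = a.raw * (1 << (frac - a.frac_bits))
            + b.raw * (1 << (frac - b.frac_bits));
    sum.frac_bits = frac;
    return sum;
}

/* Multiplication: the integers are multiplied and the numbers of
   fractional bits of the operands add up in the result. */
static fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    fixed_t product;
    product.raw = a.raw * b.raw;
    product.frac_bits = a.frac_bits + b.frac_bits;
    return product;
}
```

With a = 1.5 (raw 6, 2 fractional bits) and b = 0.5 (raw 1, 1 fractional bit), the sum has raw value 8 with 2 fractional bits (2.0) and the product has raw value 6 with 3 fractional bits (0.75).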

Special VHDL packages exist to represent fixed-point numbers. They are not used in this exercise. Instead, the signed data type is used and issues related to the fixed-point representation (especially when multiplying) are handled by explicitly coding them in VHDL.

The Filter

The figure below gives the data-flow graph (DFG) and z-domain description of the filter to be designed. The DFG represents the so-called transposed form of a second-order infinite impulse response (IIR) filter. The delay element (T0) before the output has been introduced for implementation reasons; in this way, the output of the filter corresponds to the contents of a register.

The filter will be implemented as a new architecture for entity siso_gen. During the first 5 clock cycles after reset, the system will load the coefficients. After the 5th clock cycle, the input stream will be interpreted as data and the corresponding filtered output stream will be produced.

The computations in the filter are done by means of fixed-point arithmetic. In order to keep things simple, all signals have the same data type: two integer bits and word_length - 2 fractional bits where word_length is the generic parameter of the siso_gen entity. No distinction is made between the word lengths of data and coefficients, as opposed to real-life implementations.
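
As a rough reference model, a transposed-form second-order section with an output register can be sketched in C as shown below. Several points are assumptions and should be checked against the figure and the VHDL code: the sign convention of the feedback coefficients a1 and a2, the rescaling of each product by truncation, and the names biquad_t, qmul and biquad_step, which are invented for this example. The wrap-around at word_length bits is not modeled.

```c
#include <stdint.h>

#define FRAC_BITS 8  /* word_length - 2, for word_length = 10 */

typedef struct {
    int32_t b0, b1, b2, a1, a2; /* coefficients with FRAC_BITS fractional bits */
    int32_t t1, t2;             /* internal delay registers */
    int32_t t0;                 /* output register */
} biquad_t;

/* Product rescaled to FRAC_BITS fractional bits. Assumes that >> on a
   negative int is an arithmetic shift, as it is with GCC. */
static int32_t qmul(int32_t a, int32_t b)
{
    return (a * b) >> FRAC_BITS;
}

/* One clock cycle: take input sample x, update all registers and return
   the new contents of the output register.
   Assumed convention: y = b0*x + b1*x' + b2*x'' - a1*y' - a2*y''. */
static int32_t biquad_step(biquad_t *f, int32_t x)
{
    int32_t y = qmul(f->b0, x) + f->t1;
    f->t1 = qmul(f->b1, x) - qmul(f->a1, y) + f->t2; /* uses the old t2 */
    f->t2 = qmul(f->b2, x) - qmul(f->a2, y);
    f->t0 = y;
    return f->t0;
}
```

With b0 = 256 (1.0), a1 = 128 (0.5) and all other coefficients zero, a Dirac pulse of 256 produces the sequence 256, -128, 64 under the assumed sign convention.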

New Files

The following new files are needed for the first part of this project:

siso_gen_sec_par_arch.vhd: a fully parallel behavioral implementation of the filter in VHDL.
conf_tb_siso_sec_par.vhd: the configuration to simulate the above filter.
sec.in: an input data stream.
sec.ref: a reference output data stream containing the response of the filter to sec.in.

Exercise NIO-1: Code Inspection and Data-Path Extraction

Inspect the code given in the file siso_gen_sec_par_arch.vhd. Which signals represent registers? The coefficients for the multiplications (a1, a2, b0, b1 and b2) are stored in a memory. Which address is used for which coefficient? How many arithmetic units (multipliers and adders) are instantiated? Use this information to draw a sketch of the data path (hand-made drawings are sufficient). Identify the critical path!

Exercise NIO-2: Filter Simulation

Simulate the code with the given configuration to get familiar with the design. In the given configuration, the word length equals 10 which means that all signals have 2 integer and 8 fractional bits. The real value of a signal is obtained by interpreting it as a 2's-complement number and then dividing by 256. So, the range of signal values is from -2.0 to +511/256. No measures are taken to keep the signals within these boundaries. One should avoid overflow by limiting the amplitude of the input signal.
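
The effect of an overflow can be illustrated with a small C model of a 10-bit signal. This is a sketch only; the names wrap_word and word_to_real are hypothetical and do not occur in the project files.

```c
#include <stdint.h>

#define WORD_LENGTH 10
#define FRAC_BITS   (WORD_LENGTH - 2)

/* Keep only WORD_LENGTH bits of v and reinterpret them as a
   2's-complement number, as a WORD_LENGTH-bit signed signal would. */
static int32_t wrap_word(int32_t v)
{
    uint32_t mask = (1u << WORD_LENGTH) - 1u;
    uint32_t sign = 1u << (WORD_LENGTH - 1);
    uint32_t u = (uint32_t)v & mask;
    return (int32_t)(u ^ sign) - (int32_t)sign;
}

/* Real value of a signal: the wrapped integer divided by 2^FRAC_BITS (256). */
static double word_to_real(int32_t raw)
{
    return (double)wrap_word(raw) / (double)(1 << FRAC_BITS);
}
```

Adding 1.5 (384) and 1.0 (256) gives the integer 640, which does not fit in 10 bits; wrap_word(640) yields -384, so the signal would represent -1.5 instead of 2.5.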

The coefficients given in the first five positions of sec.in implement a high-pass filter. The data stream which follows, consists of three parts:

A Dirac pulse (a 1 (256) followed by 0's).
A high-frequency "sine wave" signal.
A low-frequency "sine wave" signal.

The expected output stream will consist of the impulse response, a sine wave with almost no attenuation and a strongly attenuated sine wave.

A reference output sec.ref contains the expected output. You can verify the correctness of your implementation by comparing the generated output file sec.out with sec.ref (use e.g. the Unix diff command to compare files).

A useful feature of Modelsim is to display a bit vector as an "analog" signal (see the manual). Make use of this feature. You should be able to e.g. see the sine shapes.

Files for Part 2

The rest of this project has to be carried out in a new directory, say nio. Once you have logged in, execute the command

get-module nio nio

to get the files for this project in a subdirectory nio. You can also use this command to recover files that you lost for some reason or other.

The NIOS II Processor

NIOS II is a soft-core processor developed by Altera for its FPGAs. It exists in many variants. The variant used in this exercise does not have a hardware multiplier. As has been explained in the theory lecture on platform-based design, Altera has tooling (Qsys) that makes it easy to configure a system with one or more microprocessors and all kinds of peripherals. This exercise should provide some familiarity with a system created by Qsys.

In the final integration project at the end of the SoC Design course, Qsys will be used to create a system tailored to the project's needs. In the current HDL-Based SoC Design module, however, Qsys itself will not be used. Instead, a "frozen" design created by Qsys will be the basis for the exercises. The focus will be on the peripheral gp_custom for which you will design alternative architectures.

NIOS II is a 32-bit processor. The system used in this exercise has the following address space:

Component	Base address	End address	Size
On-chip RAM	0x00010000	0x0001FFFF	64 kByte
NIOS II debug unit	0x00020800	0x00020FFF	2 kByte
gp_custom	0x00021000	0x000210FF	256 bytes
JTAG UART	0x00021100	0x00021107	8 bytes

The NIOS II has been configured for a reset vector pointing at address 0x00010000, corresponding to the first address of the on-chip RAM. That means that the processor will start executing instructions from that address after reset.
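
The address space can be written down as a small C table, which also makes it easy to check that each size equals end address minus base address plus one. The names region_t, address_map, region_size and decode are invented for this sketch; in the real system the address decoding is done by the interconnect generated by Qsys, not by software.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t base;
    uint32_t end;   /* last address of the region (inclusive) */
} region_t;

static const region_t address_map[] = {
    { "On-chip RAM",        0x00010000u, 0x0001FFFFu },
    { "NIOS II debug unit", 0x00020800u, 0x00020FFFu },
    { "gp_custom",          0x00021000u, 0x000210FFu },
    { "JTAG UART",          0x00021100u, 0x00021107u },
};

/* Size of a region in bytes. */
static uint32_t region_size(const region_t *r)
{
    return r->end - r->base + 1u;
}

/* Return the region containing byte address addr, or NULL if the
   address is not mapped to any component. */
static const region_t *decode(uint32_t addr)
{
    size_t n = sizeof address_map / sizeof address_map[0];
    for (size_t i = 0; i < n; i++) {
        if (addr >= address_map[i].base && addr <= address_map[i].end)
            return &address_map[i];
    }
    return NULL;
}
```

The reset vector 0x00010000 decodes to the on-chip RAM, while an address such as 0x00020000 lies in a gap between the components.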

In a practical design, the JTAG UART is a UART peripheral mapped on the FPGA's JTAG controller. This controller is used to communicate with a PC. In the current exercise, the JTAG UART is mainly used as an output port to which the characters generated by a print command in the C code are sent. In simulation, the characters will be displayed in the Questasim transcript window.

The gp_custom Peripheral

The peripheral hardware block called gp_custom is the focus of this exercise. It has two interfaces. At one side, it is connected to the Avalon I/O bus for communication with the NIOS II. At its other side, it uses the SISO-style of interface as used in earlier exercises, to communicate with the TVC. This makes it possible to connect to a testbench that takes input from a file and stores output in a file.

In this exercise, the architecture simple for gp_custom is provided as a starting point for developing more complicated behavior. The simple architecture has the capability to copy 16 words of 16-bit data from the outside world (via the SISO interface) into its internal input buffer or to copy 16 words from its output buffer to the outside world. The NIOS II can read from the input buffer or write to the output buffer and can also initiate data transfers. As the data bus of the NIOS II has a width of 32 bits, two words of the internal buffer are combined for transfer.
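
Combining two 16-bit words into one 32-bit bus word can be sketched in C as follows. Which of the two buffer words ends up in the lower half of the bus word is an assumption made here for illustration only; the actual order has to be read from gp_custom_simple_arch.vhd. The function names are hypothetical.

```c
#include <stdint.h>

/* Combine two 16-bit buffer words into one 32-bit bus word.
   ASSUMPTION: the first word occupies bits 15..0, the second bits 31..16. */
static uint32_t pack_pair(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

/* Extract the word stored in bits 15..0. */
static uint16_t unpack_first(uint32_t bus_word)
{
    return (uint16_t)(bus_word & 0xFFFFu);
}

/* Extract the word stored in bits 31..16. */
static uint16_t unpack_second(uint32_t bus_word)
{
    return (uint16_t)(bus_word >> 16);
}
```

With this convention, 16 buffer words need 8 bus transfers.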

Exercise NIO-3: I/O System and Testbench Inspection

Study the following VHDL files:

altera_avalon_clock_source.vhd
altera_avalon_reset_source.vhd
gp_custom_ent.vhd
gp_custom_simple_arch.vhd
nios_siso.vhd
tvc_nios_siso.vhd
tb_nios_siso.vhd
conf_tb_nios_siso_copy.vhd
conf_tb_nios_siso_sec_soft.vhd

Draw a block diagram of the testbench and the NIOS II system descending the hierarchy down to the components instantiated by the mentioned files. Mention the names of the most relevant entities and signals.

To which write addresses does gp_custom react? And to which read addresses? Answer these questions with a table that contains a row for each address and tells per row what happens when reading and writing.

Mention two methods to intentionally stop the simulation in the given VHDL framework (hardware description and testbench).

Software Development

In this project, user programs will be written in the C programming language. Because of the setup of the software development environment, a few restrictions need to be observed:

The top-level function is called main.
The memory space allocated to the gp_custom peripheral can be accessed by overlaying it with an array, as follows:
volatile unsigned int *IO_CUSTOM = (volatile unsigned int *)GP_CUSTOM_0_BASE;
Then, IO_CUSTOM[0] will access the first address in the peripheral, IO_CUSTOM[1] the second address, etc.
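
The overlay technique can be tried out on a PC with the sketch below, in which an ordinary array stands in for the peripheral. On the NIOS II, GP_CUSTOM_0_BASE is expected to come from the software environment; the array fake_peripheral and the helper functions io_write and io_read exist only in this example. The volatile qualifier tells the compiler that every access must really be performed, because a read or write of a peripheral address may have side effects in hardware.

```c
/* Host-side stand-in for the 256-byte address range of gp_custom
   (hypothetical, for experimenting without the NIOS II system). */
static unsigned int fake_peripheral[64];
#define GP_CUSTOM_0_BASE fake_peripheral

/* Overlay the peripheral's address range with an array of 32-bit words. */
static volatile unsigned int *const IO_CUSTOM =
    (volatile unsigned int *)GP_CUSTOM_0_BASE;

/* Write one 32-bit word to the peripheral; index counts words, so
   index i corresponds to byte offset 4*i from the base address. */
static void io_write(unsigned index, unsigned int value)
{
    IO_CUSTOM[index] = value;
}

/* Read one 32-bit word from the peripheral. */
static unsigned int io_read(unsigned index)
{
    return IO_CUSTOM[index];
}
```

What a read or write at a given index actually does in gp_custom is the subject of Exercise NIO-3 and is not modeled here.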

The compiled user program is stored in RAM. In practice, the JTAG UART would be used to transfer the contents of the RAM from the PC. In simulation, the RAM is initialized from a file (so the right contents are already in place at time zero). The file used for initialization has the so-called hex format and has the extension .hex.

The development environment for software is found in a subdirectory called my_software. The compilation from C code to the bit patterns in RAM has been automated by means of a makefile called Makefile. Such a file lists dependencies between files that are created from each other and the commands that take care of the transformations. You do not need to understand the makefile; you only need to declare the name of your C file, e.g. foo.c in one of the first lines of the file. Typing make at the Unix command prompt will compile the file. If no errors are found, the final result will be a file called foo.hex. In a VHDL configuration, the file foo.hex will be connected to the RAM model by means of a generic map.

If a C program that you have written does not behave as you want, you can modify it, run make and then restart your simulation in Modelsim without quitting. The new object code will be used in the next run.

Exercise NIO-4: Data-Copy Application

In this exercise, a simple application should be simulated on the combination of NIOS II and gp_custom. The goal is to copy 16 words from the testbench to the NIOS II, reverse their order and copy them back to the testbench. This behavior is implemented by the C-program copy.c in the subdirectory my_software. Run make in that directory to create copy.hex.
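
The reversal step itself can be sketched on plain buffers as shown below. This is not the code of copy.c; the names swap_halves and reverse_block are made up, and the access to gp_custom is left out. Because two 16-bit words travel in one 32-bit word, reversing the order of 16 words means reversing the order of the 8 bus words and swapping the two halves of each of them. This holds regardless of which half carries the first word of a pair.

```c
#include <stdint.h>

#define N_WORDS32 8   /* 16 words of 16 bits, two per 32-bit word */

/* Exchange the upper and lower 16 bits of a 32-bit word. */
static uint32_t swap_halves(uint32_t w)
{
    return (w << 16) | (w >> 16);
}

/* Reverse the order of the sixteen 16-bit words stored pairwise in `in`
   and store the result in `out`. The two buffers must not overlap. */
static void reverse_block(const uint32_t in[N_WORDS32], uint32_t out[N_WORDS32])
{
    for (int i = 0; i < N_WORDS32; i++)
        out[i] = swap_halves(in[N_WORDS32 - 1 - i]);
}
```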

Start Modelsim in the main directory nio and create a new project. When asked whether to use the system or the current version of modelsim.ini, choose current. Make sure that the resolution is set to 1ps in the .mpf belonging to the project. Add the files mentioned in Exercise NIO-3 to the project and compile them in the given order. The VHDL sources of NIOS II itself have been precompiled in a separate library made available to you. Which configuration should you simulate in order to execute the code in copy.hex?

Note: due to the way that Altera Qsys has generated the VHDL, you cannot simulate the configuration in the usual way. Instead, issue the following command in the transcript window:

start_sim <configuration name>

Monitor the signals of gp_custom. How many clock cycles are needed to transfer the 16 data words from file to the CPU, reverse their order, and transfer them back to file? Pay attention as well to the print statements in the code. How many clock cycles per character does the execution of the printf function need? (Hint: in the wave window, set the format for signal av_writedata to ASCII.)

Exercise NIO-5: Implementation in Software of Second-Order IIR Filter

The topic of this exercise is the same filter as in NIO-1/NIO-2, with the difference that all arithmetic is now executed on NIOS II. The NIOS II data path has a word length of 32 bits. For the sake of filter computations, 8 of these bits are assumed to act as fractional bits (see the implementation of the multiplication). The code is given in the C-program sec_soft.c. The program filters input samples in groups of 16 (8 pairs) and then sends the output samples back to the testbench.
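
The sketch below illustrates, under stated assumptions, what a fixed-point multiplication with 8 fractional bits involves on a processor without a hardware multiplier. It is not the code of sec_soft.c. The function mul_shift_add shows the classical shift-and-add method, which is the kind of work that a software multiplication routine has to perform; the routine actually used by the compiler may differ. The names are hypothetical.

```c
#include <stdint.h>

#define FRAC_BITS 8

/* Shift-and-add multiplication, result modulo 2^32: for each set bit of b,
   a suitably shifted copy of a is added. The loop runs once per bit of b
   up to its most significant set bit. */
static uint32_t mul_shift_add(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1u)
            product += a;
        a <<= 1;
        b >>= 1;
    }
    return product;
}

/* Fixed-point product of two numbers with FRAC_BITS fractional bits.
   The raw product has 2*FRAC_BITS fractional bits and is shifted back.
   Assumes 2's-complement conversion and arithmetic right shift of
   negative values, as provided by GCC. */
static int32_t fixed_mul(int32_t a, int32_t b)
{
    int32_t raw = (int32_t)mul_shift_add((uint32_t)a, (uint32_t)b);
    return raw >> FRAC_BITS;
}
```

For example, fixed_mul(384, 128) computes 1.5 times 0.5 and returns 192, which represents 0.75.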

Which VHDL configuration should you use to execute this program? The input file for the simulation is sec_soft.in and its output is collected in sec_soft.out. sec_soft.ref contains a reference output stream.

Simulate the filter. It takes longer than usual. For this reason, a short message is printed after each block. Verify that the output in sec_soft.out coincides with the one in sec_soft.ref. Use Modelsim's Wave window to estimate the average number of clock cycles needed to calculate one block of samples including data transfer to and from the testbench. Do all blocks need the same number of clock cycles? How many clock cycles are roughly needed for one filter sample? To how many clock cycles per multiplication does this correspond if all other computations in the filter can be neglected?

In subdirectory my_software, you will find a file sec_soft.s which contains the assembly code that is the result of the compilation. Rename this file to sec_soft.s-keep for the purpose of future comparison (use the Unix command mv). Now, clean the directory with the command make clean and then edit Makefile to use compiler optimization -O3 (uncomment the appropriate line). This option instructs the compiler to spend more compilation time with the goal of generating a more efficient executable. Run make to compile the software with the new options.

Finally, resimulate the filter in Modelsim. How many cycles are now roughly needed for one block? And for one multiplication? Do all blocks need approximately the same number of clock cycles? Comment on the results.

Exercise NIO-6A (for students of SoC Design): More Efficient Implementation of Second-Order IIR Filter

Synthesize the design (only gp_custom, not the NIOS II; switch off scan-path creation and use the special value none for the word length in generate-design). Ignore the fact that NIOS II has been designed for an FPGA implementation while generate-design is meant for an ASIC design flow. The goal of synthesis is to make sure that the design that you will deliver is synthesizable and also to obtain some data on the size and speed of the gp_custom block. Note: pay attention to the right case of the clock signal; even though VHDL is case-insensitive, Synopsys is case-sensitive and does not consider "CLK" to be equal to "clk".

Perform a post-synthesis simulation using command:

start_sim_post <configuration name> <SDF file name>

where the SDF file name is the file name as found in the synopsys_out directory. Example:

start_sim_post conf_tb_nios_siso_copy_post gp_custom_simple_none_5_flat.sdf

assuming that your configuration and SDF files have the names used in the example.

Modify both the hardware and the software to improve the efficiency of the second-order IIR filter. For each variant that you implement, create a new architecture for the entity gp_custom and a new C-program. Use the VHDL configuration mechanism to simulate the new hardware-software combination. The co-processing hardware should have 2 integer and 8 fractional bits.

Think carefully about the software/hardware partitioning. One idea is, for example, to have programmable filter coefficients.

Here are some suggestions for efficiency improvement. You can also propose other types of improvement. Original designs displaying creativity will be especially appreciated. Think of a solution, specify the memory-mapped I/O needed for that solution and the controller-data-path combination in the hardware. Discuss your proposal with the assistant and then implement it. Continue making improvements until no more time is left.

Implement a bypass mode in which the filtering is done entirely in gp_custom. When in bypass mode, data collected from the data_in port is directly filtered and sent out via data_out. The task of NIOS II is only to turn on and off the bypass mode; it cannot access the filtered data because they are not buffered.
Implement a buffered bypass mode. In this case, the incoming data is filtered and then stored in blocks of 16 in the input buffer. NIOS II can fetch this data for further processing. As you do not really need to process the data further, you can use NIOS II to send back the data to gp_custom.
Extend gp_custom with only a multiplier (for numbers with two integer and 8 fractional bits) or a multiply-accumulate unit and let NIOS II send the operands and fetch the result.
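
To reason about the last suggestion before writing VHDL, the behavior of a multiply-accumulate unit can be modeled in C. The sketch below is one possible model, not a prescribed design: the names mac_t, mac_clear, mac_step and mac_result are hypothetical, and the choice to accumulate full-precision products and rescale only once at the end is an assumption.

```c
#include <stdint.h>

#define FRAC_BITS 8

/* Behavioral model of a multiply-accumulate unit. Operands have
   FRAC_BITS fractional bits; the accumulator holds sums of raw
   products and therefore has 2*FRAC_BITS fractional bits. */
typedef struct {
    int32_t acc;
} mac_t;

/* Reset the accumulator, e.g. at the start of a new output sample. */
static void mac_clear(mac_t *m)
{
    m->acc = 0;
}

/* Multiply two operands and add the raw product to the accumulator. */
static void mac_step(mac_t *m, int32_t a, int32_t b)
{
    m->acc += a * b;
}

/* Result rescaled to FRAC_BITS fractional bits. Assumes an arithmetic
   right shift for negative values, as provided by GCC. */
static int32_t mac_result(const mac_t *m)
{
    return m->acc >> FRAC_BITS;
}
```

Note that rescaling once after accumulation can give a result that differs in the least significant bit from rescaling every product separately. Since the output file has to be identical to the reference file, it is worth checking which of the two the reference implementation does before fixing the data path.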

Simulate each variant that you implement, and check that the output file is identical to the reference file. If you are satisfied with your design, synthesize gp_custom and perform a post-synthesis simulation.

Exercise NIO-6B (for students of SoC Design for Embedded Systems): More Efficient Implementation of Second-Order IIR Filter, Restricted Version

In principle the exercise is similar to NIO-6A (read its description above). However, you do not need to spend time on reviewing different design alternatives. One alternative has been chosen for you, viz.:

Extend gp_custom with a multiply-accumulate unit and use it to accelerate the software implementation of the second-order IIR filter.

To save time, you do not need to perform any synthesis or post-synthesis simulation.

Deliverables

The drawing of the data path of Exercise NIO-1 with some explanation.
A plot of the simulation of Exercise NIO-2, displaying the input and output data as analog waveforms.
The block diagram of Exercise NIO-3 and the answers to the other questions.
The number of clock cycles of Exercise NIO-4 and an explanation of how you have computed them.
The same for Exercise NIO-5 complemented with your comments on the influence of the compiler on the performance.
The VHDL code (including configurations) and C-code developed for Exercise NIO-6. Some proof that the pre-synthesis and post-synthesis simulations were successful (e.g. suitable waveforms). Data about synthesis results (slack, area, number of flipflops). For each solution, explain your design considerations, present the memory map used and draw the data path implemented in gp_custom. No synthesis-related deliverables are necessary for Exercise NIO-6B.

Grading

At most 1 point can be earned with NIO-1.
At most 0.5 points can be earned with NIO-2.
At most 1 point can be earned with NIO-3.
At most 0.5 points can be earned with NIO-4.
At most 1 point can be earned with NIO-5.
At most 6 points can be earned with NIO-6A. The maximum can be reached with a single complete design, provided that both the hardware and software are the result of sound design decisions and are significantly different from what is already available. Multiple designs can compensate for missed points. Implementing only the bypass mode limits the maximum number of points for this exercise to 3.5.
At most 4 points can be earned with NIO-6B. The total score of students following SoC Design for Embedded Systems will be multiplied by a factor 10/8 in order to make possible a maximum score of 10.


Last update on: Sun Sep 11 01:22:24 CEST 2016 by Sabih Gerez.
