Due April 2, 2004.
In this assignment, we will explore pipelining and the single-cycle architecture from Assignment 3.
For pipelining, you can use the provided implementations of a pipeline register and a single-bit pipeline register. These are simple registers that load on each rising edge of the clock. [The single-bit version is necessary for std_logic signals that need to be pipelined; the regular pipeline register is used for std_logic_vector signals.]
Below is an overview of the construction of the pipelined CPU: [figure in PDF]
Not shown here are the pipeline registers—every time a signal crosses a stage boundary, a pipelining register must be added. See also the diagram on p. 452 of the text (Figure 8-24).
Hint: When creating the pipeline, you will end up with several signals for the same value at different stages of the pipe. Keep yourself sane by coming up with a consistent naming scheme. For example, append a number indicating which stage each signal is in. So, in the control unit, you might have signals named DA2 (for the DOF stage DA signal), DA3 (for the EX stage DA signal) and DA (the output port, needed in the WB stage).
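As a minimal sketch of that naming scheme, the following carries DA from the DOF stage to the WB stage. The entity name da_pipe_sketch is hypothetical, and because the port names of the provided pipeline register are not listed here, the two registers are written as a clocked process. In your structural description, each assignment inside the process would instead be one instance of the provided pipeline register.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration only; not one of the provided files.
entity da_pipe_sketch is
  port (
    clock : in  std_logic;
    DA2   : in  std_logic_vector(2 downto 0);  -- DA in the DOF stage (decoder output)
    DA    : out std_logic_vector(2 downto 0)   -- DA as needed in the WB stage
  );
end da_pipe_sketch;

architecture behavioural of da_pipe_sketch is
  signal DA3 : std_logic_vector(2 downto 0);   -- DA in the EX stage
begin
  process (clock)
  begin
    if rising_edge(clock) then
      DA3 <= DA2;  -- register on the DOF/EX boundary
      DA  <= DA3;  -- register on the EX/WB boundary
    end if;
  end process;
end behavioural;
```

With the stage number in each name, it is easy to see at a glance that a value crosses exactly one stage boundary per register.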
Along with the solutions to previous assignments, there are many other files provided for this assignment.
You can download all of them in a TAR file if you like. This command will compile all of the provided files (so you can make a Makefile a little faster):
vhdlan instrrom pipereg pipebit extend clock pc id bc regfile psr mux binput logic alu fu
(Pay attention to which instruction ROM file you want to compile when running that command.)
On all assignments, there will be marks allocated for the style of your code. You should just make sure you use appropriate variable/signal names, comment hard-to-understand parts, etc.
Pipelined Control
In a file named pcontrol.vhd, create a structural description of the control unit for the example architecture, with a four-stage pipeline, as described above and in the text. You should use this entity declaration:

    entity pcontrol is
      port (
        clock, V, C, N, Z : in  std_logic;
        DA, AA, BA, const : out std_logic_vector(2 downto 0);
        FS                : out std_logic_vector(4 downto 0);
        MB, MD, RW, MW    : out std_logic
      );
    end pcontrol;
You can (and probably should) start with the control unit from assignment 3.
Note that the provided instruction decoder is slightly different from the one used in previous assignments. The instructions that were undefined have been converted to NOPs (no-operation instructions). We will need the NOPs to avoid hazards. You can download an updated description of the instruction set: [PS] [PDF].
We also need to be a little more careful with the branch control. Until the pipeline is filled, the branch control's inputs will be uninitialized ('U'). We need to make sure the branch control increments the program counter in this case. The provided branch control does this.
The signals for branch control aren't detailed in the text. All of the control inputs to the branch control should be pipelined to stage 3 (EX). The status bits should be piped to stage 4 (WB)—the status bits will be coming from the previous instruction (the instruction just before the branch), but the other signals come from the branch instruction. Since those instructions are in different stages when the branch control does its job, the inputs must come from the different stages.
Note that there are several instruction ROMs provided:
- instrrom.vhd: An instruction ROM that contains a few simple instructions, just to see if things are working.
- instrrom-sum1.vhd: Does the sum 1...10, with plenty of space between instructions to avoid hazards. Note that the conditional branch must come immediately after the instruction that creates the status bits it is examining. Also note that the branch must target two instructions before you might think, because by the time the branch happens, the PC has incremented twice.
- instrrom-sum2.vhd: Does the sum 1...10, with NOPs between instructions only where necessary to avoid hazards.
- instrrom-haz.vhd: Tries to do the sum 1...10, but the pipeline is ignored so hazards are left to happen. Doesn't work (or even come close).
In all of the instruction ROMs, the default value for a memory location is a NOP instruction. So, any addresses that aren't explicitly assigned a different value are NOPs.
Pipelined Datapath
In a file named pdp.vhd, create a structural description of the datapath unit for the example architecture, with a four-stage pipeline, as described above and in the text. You should use this entity declaration:

    entity pdp is
      port (
        clock, RW, MB, MD  : in  std_logic;
        DA, AA, BA         : in  std_logic_vector(2 downto 0);
        FS                 : in  std_logic_vector(4 downto 0);
        const_in, data_in  : in  std_logic_vector(15 downto 0);
        V, C, N, Z         : out std_logic;
        addr_out, data_out : out std_logic_vector(15 downto 0)
      );
    end pdp;
Note that the register file has been modified for the pipelined CPU. It is now a "read-after-write" register file. This is necessary to allow the WB and DOF stages to work on the same register in the same cycle. This implementation actually writes to the register file on the falling edge of the previous clock cycle. See p. 549 (second paragraph) for more details.
We must also make sure that the register file does not write when the RW signal is uninitialized. The provided register file does this properly.
Pipelined CPU
From your control and datapath, create a pipelined CPU in a file named pcpu.vhd with this entity declaration:

    entity pcpu is
      port (
        data_in      : in  std_logic_vector(15 downto 0);
        A_out, B_out : out std_logic_vector(15 downto 0);
        MW           : out std_logic
      );
    end pcpu;
You should be able to use the same CPU from assignment 3, with the entity names changed.
Programming the Pipelined CPU
There is no circuitry in the processor we have constructed here to deal with hazards (neither data nor control). So, when programming it, you must take hazards into account yourself.
The problem here will be to write a program that does unsigned integer multiplication. The values to multiply will be read from "memory". We won't really connect a memory unit; it will be simulated by a testbench. The provided test bench will give the appearance that M[0]=5 and M[1]=7. (You can, and probably should, change this while testing.)
The values from these memory locations will be loaded into the register file, multiplied, and stored back to memory location 0. A skeleton instruction ROM has been provided for you to start with. It does the memory accesses; you have to fill in the multiplication. (Leave the file name as instrrom-mult.vhd and the entity name as instrrom.)
How you do this is entirely up to you. This section of the assignment will be worth more than the others. Half of the marks will be given for completing the multiplication; the other half for the speed of the code. You can assume that the multiplication doesn't overflow—that the result of the multiplication will fit in 16 bits.
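When you change the test values, it helps to know what result to expect in the trace. The following is a minimal sketch of a reference calculation, not a multiplication algorithm for the CPU. The entity name mult_ref_sketch and the function ref_product are hypothetical, and the sketch assumes the ieee.numeric_std package is available in your simulator.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical illustration only; not one of the provided files.
entity mult_ref_sketch is
end mult_ref_sketch;

architecture sim of mult_ref_sketch is
  -- Unsigned product of two 16-bit values, keeping only the low 16 bits
  -- (the assignment lets you assume the product fits in 16 bits).
  function ref_product (a, b : std_logic_vector(15 downto 0))
    return std_logic_vector is
  begin
    return std_logic_vector(resize(unsigned(a) * unsigned(b), 16));
  end function;
begin
  process
    -- The values the provided test bench presents as M[0] and M[1].
    constant m0 : std_logic_vector(15 downto 0) := std_logic_vector(to_unsigned(5, 16));
    constant m1 : std_logic_vector(15 downto 0) := std_logic_vector(to_unsigned(7, 16));
  begin
    assert ref_product(m0, m1) = std_logic_vector(to_unsigned(35, 16))
      report "reference product mismatch" severity failure;
    wait;  -- stop this process; nothing else to simulate
  end process;
end sim;
```

For the default test bench values, the value your program stores back to memory should therefore be 35.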
Create a text file named about.txt and describe your multiplication algorithm and how the code works. You can also indicate what you did to speed it up.
Doing the multiplication in under 180 cycles (in the worst case) is good. Less than 100 cycles is possible (but not easy).
Please also submit a sim.scr file that traces at least these signals when simulating the tb_mult entity:
- the clock signal
- the current PC value
- the ports from the CPU
- the values in relevant registers
You can probably use the provided sim.scr file, at least as a start, for this.
Clock Speed
Create a file clock.txt
and answer these questions in it:
- What is the minimum clock period that could be used in the CPU without pipelining (as created in Assignment 3)? How do you know this? ["I experimented until it stopped working" is probably not a good answer.]
- What is the minimum clock period that can be used in the pipelined CPU created here? [You have to be a little careful here of the read-after-write register file used in the datapath. The writing inputs (DA, RW, D) must get to the register file by the falling edge of the clock.]
- What are the propagation delays in the various stages? Which one is limiting the clock speed?
Note: If t is the clock period (in nanoseconds), then the clock frequency (in megahertz) is 1000/t.
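The conversion follows from f = 1/t with the units written out. As a worked instance, the period of 20 ns below is an illustrative value only, not a measurement of either CPU:

```latex
f = \frac{1}{t \times 10^{-9}\,\mathrm{s}}
  = \frac{10^{9}}{t}\,\mathrm{Hz}
  = \frac{1000}{t}\,\mathrm{MHz},
\qquad
t = 20\,\mathrm{ns} \;\Rightarrow\; f = \frac{1000}{20}\,\mathrm{MHz} = 50\,\mathrm{MHz}
```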
Submitting
You have to use the Submission server to submit your work. You should submit the files pdp.vhd, pcontrol.vhd, pcpu.vhd, about.txt, sim.scr, instrrom-mult.vhd, and clock.txt.
You can do this by typing these commands (change into your assignment directory if you haven't already):
tar cvf a4.tar pdp.vhd pcontrol.vhd pcpu.vhd about.txt sim.scr clock.txt instrrom-mult.vhd
gzip a4.tar
Then, submit the file a4.tar.gz. If you want to submit a ZIP file instead, you can do that, but figuring out how is your problem.