Gitlab : https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p2.git
git clone https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p2.git
In this lab you’ll complete a simple HLS-based deep learning accelerator, a simplified educational version of a deep learning accelerator design currently in development at UW by the SAML group.
As part of this lab you’ll learn how to:
output.txt should include the output of the hardware test:
=====================================================================================
INFO - Load/Store test: batch=128, out_channels=128
INFO - Synchronization time: XXXms
INFO - Throughput: XXXGbs/s
INFO - Load/Store test successful!
=====================================================================================
INFO - Reset test: batch=128, out_channels=128
INFO - Synchronization time: XXXms
INFO - Reset test successful!
=====================================================================================
INFO - FC test: batch=64, in_channels=128, out_channels=128
INFO - Synchronization time: XXXms
INFO - Throughput: XXXGOPs/s
INFO - FC test successful!
=====================================================================================
INFO - 2D Convolution test: batch=1, height=9, width=9
kheight=3, kwidth=3
in_channels=32, out_channels=64
INFO - Synchronization time: XXXms
INFO - Throughput: XXXGOPs/s
INFO - 2D Convolution test successful!
Let’s first go over the prerequisites for this lab.
Follow the HLS tutorial instructions to get the Xilinx toolchains installed on your work machine.
This will also familiarize you with the process of using Vivado HLS to design and optimize hardware.
It’s essential that you return your FPGA kit the same way you found it. Do not discard the packaging: it keeps fragile electronics safely protected.
The CSE599s development kit consists of:
The PYNQ board website is available here. Follow the Getting Started tutorial to get your Pynq board set up (please read the notes below first).
SD card flashing notes:
Board setup notes:
Ethernet connection notes:
Connecting to Jupyter notes:
Try one of the iPython notebook examples available out-of-the-box on your PYNQ board to make sure that it works as intended!
We provide an overview of the repository structure below.
.
|-- driver
|   |-- pynq_driver.cc   # Pynq board FPGA device driver source file
|   |-- pynq_driver.h    # Pynq board FPGA device driver header file
|-- scripts
|   |-- hls.tcl          # HLS Tcl script: runs simulation, synthesis and IP generation
|   |-- hsi.tcl          # HSI Tcl script: generates ARM device drivers
|   |-- vivado.tcl       # Vivado Tcl script: runs logic synthesis, place & route and bitstream generation
|-- src
|   |-- hw_spec.h        # VTA hardware spec
|   |-- test_lib.cc      # Test library source file (for both simulation and hardware)
|   |-- test_lib.h       # Test library header file (for both simulation and hardware)
|   |-- vta.cc           # VTA HLS-based hardware source file
|   |-- vta.h            # VTA HLS-based hardware header file
|   |-- vta_test.cc      # Test harness for simulation and hardware
|-- Makefile             # Makefile for simulation, hardware compilation and device tests
The VTA design is a simple accelerator that can accelerate dense linear algebra operators commonly found in conventional deep learning architectures, including 2D convolutions and fully connected layers (matrix multiplication).
The VTA design can execute one of three commands:
VTA is centered around a GEMM core, or matrix multiplication core, that can perform one matrix-matrix multiplication per cycle. In this lab, you will implement this core in HLS.
The VTA design presents the following local SRAM buffers to the GEMM core:
- `inp_mem`: read-only input storage. Should provide one `(VTA_BATCH, VTA_BLOCK_IN)` input matrix read per cycle.
- `wgt_mem`: read-only weight storage. Should provide one `(VTA_BLOCK_OUT, VTA_BLOCK_IN)` weight matrix read per cycle.
- `acc_mem`: read/write accumulator storage. Should provide one `(VTA_BATCH, VTA_BLOCK_OUT)` accumulator matrix read per cycle.
- `uop_mem`: read-only micro-op instruction storage. Every cycle, a micro-op is read from the micro-op memory `uop_mem`.
The GEMM micro-op is composed of three fields:
- `dst_idx`: the 0th-dimension index into the accumulator memory `acc_mem`.
- `src_idx`: the 0th-dimension index into the input memory `inp_mem`.
- `wgt_idx`: the 0th-dimension index into the weight memory `wgt_mem`.

The matrix multiplication operation specified by the micro-op performs the following operation:

acc_mem[dst_idx] += GEMM(inp_mem[src_idx], wgt_mem[wgt_idx])
You are tasked to implement VTA’s GEMM core, which performs multiplication between a `(VTA_BATCH, VTA_BLOCK_IN)` input matrix and a `(VTA_BLOCK_OUT, VTA_BLOCK_IN)` weight matrix (the latter is stored transposed) at a rate of one matrix-matrix multiplication per cycle.
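To make the semantics concrete, here is a software-only sketch of what one GEMM micro-op computes. This is plain C++ with small hypothetical dimensions standing in for `VTA_BATCH`, `VTA_BLOCK_IN`, and `VTA_BLOCK_OUT`; the real design uses `ap_int` vector types and takes its indices from `uop_mem`, so treat this only as a behavioral reference, not as the HLS implementation you are asked to write:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the VTA block dimensions (the real values
// live in hw_spec.h).
constexpr int BATCH = 2, BLOCK_IN = 4, BLOCK_OUT = 3;

// One GEMM micro-op: acc[dst_idx] += inp[src_idx] * wgt[wgt_idx]^T.
// The weight matrix is stored transposed as (BLOCK_OUT, BLOCK_IN), so a
// row of wgt lines up with a row of inp for the dot product.
void gemm_uop(int32_t acc[BATCH][BLOCK_OUT],
              const int8_t inp[BATCH][BLOCK_IN],
              const int8_t wgt[BLOCK_OUT][BLOCK_IN],
              bool reset_acc) {
  for (int i = 0; i < BATCH; ++i) {
    for (int j = 0; j < BLOCK_OUT; ++j) {
      if (reset_acc) {
        // reset_acc zeroes the addressed accumulator tile (Part 1 hint).
        acc[i][j] = 0;
      } else {
        int32_t sum = acc[i][j];
        for (int k = 0; k < BLOCK_IN; ++k)
          sum += inp[i][k] * wgt[j][k];
        acc[i][j] = sum;
      }
    }
  }
}
```

Note that each micro-op updates one full `(VTA_BATCH, VTA_BLOCK_OUT)` accumulator tile; the hardware is expected to do all of this work in a single cycle.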
In this part you will complete the `vta.cc` HLS source file to implement the matrix multiply core of the VTA accelerator.
The part of the file you are expected to edit is marked with a `TODO Part 1.1a` comment.
To check the correctness of your implementation, the simulation tests should all pass. You can launch simulation with `make sim`.
HINTS
- In total you should need around 20 lines of code.
- We leverage the HLS-specific `ap_int` arbitrary-precision integer data type across the VTA design. This allows us to perform data packing/unpacking tricks using the range selection operator on the vector-like data types `inp_vec_T`, `out_vec_T`, `acc_vec_T`. More info on how the range selection operator works is on page 633 of the Vivado HLS manual.
- When the `reset_acc` flag is `true`, the accumulator memory at index `dst_idx` (specified by the GEMM micro-op) should be set to 0.
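The range selection trick is essentially bit-slicing a wide word into fixed-width lanes. Here is a plain-C++ analogy using shifts and masks on a `uint32_t`; the lane width and count are illustrative only, and in the actual design you would use `ap_uint`’s range selection (e.g. `word.range(hi, lo)`) rather than manual masking:

```cpp
#include <cassert>
#include <cstdint>

// Analogy only: pack/unpack four 8-bit lanes in a 32-bit word, the way
// ap_int's range selection slices lanes out of a wide vector word.
// Lane width (8 bits) and count (4) are illustrative, not VTA's parameters.
constexpr unsigned LANE_BITS = 8;

// Read lane i, like word.range(i*8 + 7, i*8) in ap_uint.
uint8_t get_lane(uint32_t word, unsigned i) {
  return static_cast<uint8_t>(word >> (i * LANE_BITS));
}

// Write lane i, like word.range(i*8 + 7, i*8) = v in ap_uint.
uint32_t set_lane(uint32_t word, unsigned i, uint8_t v) {
  uint32_t mask = 0xFFu << (i * LANE_BITS);
  return (word & ~mask) | (static_cast<uint32_t>(v) << (i * LANE_BITS));
}
```

The advantage of `ap_int` over manual masking is that range selection works on arbitrary bit widths (not just powers of two) and can appear on the left-hand side of an assignment.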
In this part you will insert HLS pragmas in the `vta.cc` HLS source file to achieve an II of 1 on the loop labeled `READ_GEMM_UOP`.
The parts of the file you are expected to edit are marked with `TODO Part 1.1b` comments.
To check that you’ve achieved an II of 1, launch HLS synthesis with `make rpt`. You can then access the synthesis report under `build/hls/vta/solution0/syn/report/vta_csynth.rpt`.
Success is determined if:
- The `READ_GEMM_UOP` initiation interval is equal to 1.
HINTS
- In total you should include 5 pragmas (4 of those have been covered in the HLS tutorials and will get you to an II of 2).
- In order to achieve an II of 1, you will have to tell the compiler to ignore false dependences on `acc_mem`. Explanation: FPGA SRAMs require at least two cycles between when a value is written at address A and when that new value will appear at address A. Consequently, if a read at address A is performed one cycle after it is written, the update won’t be visible. The compiler assumes that this pattern may occur, and therefore conservatively reduces the II of the `READ_GEMM_UOP` loop to 2. More details in “Removing False Dependencies to Improve Loop Pipelining” on page 150 of the Vivado HLS manual.
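For reference, the manual section above describes the `DEPENDENCE` pragma. The sketch below shows the syntax on an illustrative read-modify-write loop (the array name, loop body, and sizes are made up for this example, not the actual `vta.cc` code); a regular C++ compiler simply ignores the unknown pragmas, so the function still runs in software:

```cpp
#include <cstdint>

// Illustrative pipelined accumulation loop. In HLS, the tool assumes
// iteration i+1 might read the BRAM address iteration i just wrote, which
// forces II=2. The DEPENDENCE pragma waives that inter-iteration dependence
// when you can guarantee the addressed indices never collide back-to-back.
void accumulate(int32_t acc[64], const int32_t add[64], int n) {
  for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS DEPENDENCE variable=acc inter false
    acc[i] += add[i];
  }
}
```

The waiver is a promise to the tool, not a checked property: if the micro-op stream ever does write and then read the same `acc_mem` address on consecutive cycles, the hardware will silently compute stale values.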
In this part you will generate a VTA micro-op program that implements 2D convolution.
You will need to complete the `getConv2dUops()` function that populates a micro-coded 2D convolution kernel.
We’ve provided the scaffolding for you; all you need to do is compute the `dst_idx`, `src_idx`, and `wgt_idx` indices of each micro-op in your kernel.
To check the correctness of your implementation, you’ll need to run the simulation tests again with `make sim`. So far, we’ve commented the 2D convolution unit test out of the `vta_test.cc` file. To make sure that your implementation is correct, you’ll need to comment the test back in.
HINTS
- The input feature map has a shape of (b, h, w, ic), the output feature map has a shape of (b, h, w, oc), and the kernel has a shape of (oc, kh, kw, ic) where b, h, w, kh, kw, ic, oc denote batch size, feature map spatial height, feature map spatial width, kernel spatial height, kernel spatial width, input channel depth, and output channel depth respectively.
In this part you will push your design through place and route to obtain a bitstream that you will test on your Pynq dev kit. By the end, you should successfully run the VTA test harness in hardware and report performance metrics.
- Run `make`. This will run your design through synthesis, placement, routing and bitstream generation. The process will take about 20-40 minutes depending on how powerful your machine is.
- Locate the `build/vivado/export/vta.bit` file. This is the bitstream file that will implement your design on the FPGA.
- Check the timing report under `build/vivado/vta.runs/impl_1/vta_wrapper_timing_summary_routed.rpt`. You should hopefully see a positive `WNS` under the Design Timing Summary table.
- Check resource utilization under `build/vivado/vta.runs/impl_1/vta_wrapper_utilization_placed.rpt`.
- SSH into your Pynq board with `ssh xilinx@192.168.2.99`. The username and password are the same.
- Mount the Pynq file system locally with `sshfs xilinx@192.168.2.99:/home/xilinx/ <pynq_fs_dir>`.
- Clone the repository onto the board: `cd <pynq_fs_dir>; git clone https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p2.git`.
- Copy the bitstream over: `cp <root>/build/vivado/export/vta.bit <pynq_fs_dir>/hls-p2/`.
- Copy your sources over: `cp <root>/src/vta_test.cc <pynq_fs_dir>/hls-p2/src/.` and `cp <root>/src/test_lib.cc <pynq_fs_dir>/hls-p2/src/.`
- On the board, become root with `su`. The password is `xilinx`.
- Build the test binary with the `make exe` command.
- Run `./vta`. This will launch the binary that programs your FPGA and runs the test harness on your design.
HINTS
- Simulation is supposed to faithfully reflect hardware behavior. In other words, if a bug were to be introduced in hardware, it should have been caught in simulation. However, you probably ran into a bug in hardware that wasn’t caught by our simulation tests. This can be explained by the false dependence elimination optimization introduced in Part 2: we told the compiler that we wouldn’t perform a worst-case consecutive write and read on the same address of `acc_mem`. If that doesn’t hold true, HLS simulation has no way to know we violated this promise dynamically (the simulation doesn’t contain dynamic checks on array access patterns). It’s likely that the micro-op kernel generated in Part 2 is at fault. Find a way to change `getConv2dUops()` to fix the error you see in hardware. Don’t forget to copy the modified `vta_test.cc` to the Pynq `<pynq_fs_dir>` if you are not editing on the Pynq directly. In addition, don’t forget to recompile your test binary with `make exe`. Note: none of this requires you to recompile the bitstream!