CS 885: Computer Architecture


Discussion Board and Mailing List

  • Username: parallel-systems-sfu@googlegroups.com (accept the invitation in your email before you begin posting).
  • SFU mailing list: cmpt-431@sfu.ca

Creating logins for machines (you should have received an email with instructions)

Reservations: Google Appointment Slots (http://cs431.youcanbook.me).

  • Go to the reservation calendar.
  • Create a new event (you can book only 1-hour slots).
  • Specify your group name or ID.
  • Usage mode: Ex (for timed runs) or Non-Ex (for debugging).

cs-amoeba-n3.cs.surrey.sfu.ca (ssh into the cs-amoeba head node first; from there you can ssh into the n1 node)

Amoeba Calendar

  • AMD server with 24 cores: 2 sockets, each with a 12-core CPU. Access requires a reservation.
  • Has two chips. Each chip has four cores. Each core has private L1 instruction and data caches (64 KB each) and a private L2 cache (512 KB). The 18 MB L3 cache is shared among the cores on the same chip. Run $ cat /proc/cpuinfo for more details.
  • Runs Ubuntu 12.04. Available compilers include gcc. To use Intel icc: bash$ source /p/amoeba/shared/lib/lib64/intel/stm/bin/iccvars.csh intel64. Compile only on the amoeba head node, not on the n1 nodes.
  • Performance counter infrastructure: based on perf stat. The machine runs a 3.5.0+ kernel, so the perf tool is perf_3.5.0-45. How to use perf [Slides]

Binding threads/processes to processors

When you run experiments, you will often want to bind a thread or a process to a particular processor, core, or thread context. Here is an explanation of how this can be done, along with some code samples, and another example with pthreads.
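As an illustration only (not one of the linked samples), here is a minimal sketch of pinning pthreads to cores on Linux using the GNU affinity extension; the file name and the core numbers 0 and 12 are placeholder assumptions.

    // bind_demo.cpp -- illustrative sketch: pin worker threads to fixed cores.
    // Build (hypothetical file name): g++ -O2 bind_demo.cpp -o bind_demo -lpthread
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>

    // Bind the calling thread to core `core_id`; returns 0 on success.
    // (g++ defines _GNU_SOURCE by default, which these GNU extensions require.)
    static int bind_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        // To bind a whole process instead, use sched_setaffinity() with its pid.
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }

    static void *worker(void *arg) {
        int core = *static_cast<int *>(arg);
        if (bind_to_core(core) != 0)
            std::fprintf(stderr, "failed to bind to core %d\n", core);
        // ... timed work goes here; the thread now stays on `core` ...
        return NULL;
    }

    int main() {
        int cores[2] = {0, 12};          // placeholder core ids, e.g. one per socket
        pthread_t threads[2];
        for (int i = 0; i < 2; ++i)
            pthread_create(&threads[i], NULL, worker, &cores[i]);
        for (int i = 0; i < 2; ++i)
            pthread_join(threads[i], NULL);
        return 0;
    }

You can verify the binding while the program runs with tools such as top (press f and enable the "last used CPU" field) before taking timed measurements.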



Tools useful for final project

Given below is a list of tools that might be useful for the final project.

Macsim

Macsim is a heterogeneous-architecture timing-model simulator. It is lightweight compared to Gem5 and can be used if you only want to simulate the timing model. A detailed manual can be found here.

Gem5

Gem5 is a modular, discrete-event computer system simulation platform. A tutorial for Gem5 can be found here. This tool can be used to design new microarchitectures and run functional simulations to check performance and feasibility. A step-by-step guide to setting up an ARM simulator with Gem5 is available here.

CACTI

CACTI is an integrated model of cache and memory access time, cycle time, area, leakage, and dynamic power. This tool can be used to study various parameters of a cache design at different process technologies (such as 22 nm).

McPAT

McPAT (Multicore Power, Area, and Timing) is an integrated power, area, and timing modeling framework for multithreaded, multicore, and manycore architectures. It models power, area, and timing simultaneously and consistently, and supports comprehensive early-stage design-space exploration for multicore and manycore processor configurations ranging from 90 nm to 22 nm and beyond. The source code can be downloaded here.

Pin - A Dynamic Binary Instrumentation Tool

Pin is a dynamic binary instrumentation framework for Intel architectures; it can be used to instrument Intel x86 instructions as the program runs. It is useful for obtaining memory read/write traces, memory footprints, and so on. Detailed documentation and examples are given in the User's Manual.
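To give a flavor of the API, here is a minimal memory-tracing pintool sketch modeled on the classic pinatrace example shipped with the Pin kit; the file and output names are placeholder assumptions, and the tool is built with the kit's standard makefiles and run as pin -t <tool>.so -- <application>.

    // memtrace.cpp -- sketch of a pintool that records memory read/write addresses,
    // modeled on the pinatrace example from the Pin kit.
    #include "pin.H"
    #include <fstream>

    static std::ofstream TraceFile;

    VOID RecordMemRead(VOID *ip, VOID *addr)  { TraceFile << "R " << addr << "\n"; }
    VOID RecordMemWrite(VOID *ip, VOID *addr) { TraceFile << "W " << addr << "\n"; }

    // Called once per static instruction; inserts analysis calls before each
    // memory operand so the effective address is logged at run time.
    VOID Instruction(INS ins, VOID *v) {
        UINT32 memOperands = INS_MemoryOperandCount(ins);
        for (UINT32 memOp = 0; memOp < memOperands; memOp++) {
            if (INS_MemoryOperandIsRead(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemRead,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
            if (INS_MemoryOperandIsWritten(ins, memOp))
                INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)RecordMemWrite,
                                         IARG_INST_PTR, IARG_MEMORYOP_EA, memOp, IARG_END);
        }
    }

    VOID Fini(INT32 code, VOID *v) { TraceFile.close(); }

    int main(int argc, char *argv[]) {
        if (PIN_Init(argc, argv)) return 1;        // parse Pin's command line
        TraceFile.open("memtrace.out");            // placeholder output file
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();                        // never returns
        return 0;
    }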

LLVM

LLVM is a collection of modular and reusable compiler and toolchain technologies. A brief introduction to LLVM can be found here, and this is its official site. LLVM uses the Clang compiler to generate an IR (which is human readable). Its IR is a kind of generic assembly language, and one can use static and dynamic techniques to do program analysis and transformations.
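For orientation, here is a minimal analysis pass in the style of the "Writing an LLVM Pass" tutorial (legacy pass manager, as used in LLVM 3.5); the pass name bbcount and the printed statistic are illustrative assumptions, not part of the demo project.

    // BBCount.cpp -- illustrative analysis pass (legacy pass manager, LLVM 3.5 era):
    // prints the number of basic blocks in every function it visits.
    #include "llvm/Pass.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    namespace {
    struct BBCount : public FunctionPass {
        static char ID;
        BBCount() : FunctionPass(ID) {}

        bool runOnFunction(Function &F) override {
            errs() << F.getName() << ": " << F.size() << " basic blocks\n";
            return false;   // analysis only -- the IR is not modified
        }
    };
    }

    char BBCount::ID = 0;
    // Register the pass so opt can run it via the -bbcount flag.
    static RegisterPass<BBCount> X("bbcount", "Count basic blocks per function",
                                   false, false);

Once built into a shared library, a pass like this is typically run with opt -load <path to library> -bbcount on a bitcode or .ll file.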

LLVM generally has compatibility issues across versions, so if you are building on top of an existing project, use the LLVM version against which that project was built. You can use LLVM 3.5 for running a demo. The demo contains an example of a static and a dynamic analysis (see CMPT 886). The instructions for building it are given in the docs/BUILDING file. Before that you will have to build LLVM 3.5, which is available from the LLVM download page.

Option 1) Building using autoconf

First, build LLVM as described here, skipping steps 4, 5, 6, and 7. Make sure to run configure with the flags --disable-optimized --enable-debug-runtime --enable-assertions. Next, follow the instructions for building LLVM projects:
cd autoconf
./AutoRegen.sh
Enter the paths to the LLVM source and build directories.
From a separate build directory, run:
<path to project source>/configure --with-llvmsrc=<path to llvm src dir> --with-llvmobj=<path to llvm build dir> --disable-optimized --enable-debug-runtime --enable-assertions
Then you can simply run make to build the project within the build directory. The callgrapher program will reside at Debug+Asserts/bin/callgrapher in the build directory.

Option 2) Building using cmake

First, build LLVM as described here. Make sure to run cmake with -DCMAKE_BUILD_TYPE=Debug. You can use -DCMAKE_INSTALL_PREFIX=<path to local install directory> to fully install LLVM to a local path. From a separate build directory for the project, run:
CC=clang CXX=clang++ cmake -DLLVM_DIR=<installation or build path of LLVM>/share/llvm/cmake <path to project source>
Then you can simply run make to build the project within the build directory. The callgrapher program will reside at tools/callgrapher/callgrapher in the build directory.

Given below are the names of a few research papers that use the LLVM framework.

Tools for Approximate Computing

ApproxBench

ApproxBench aims to make approximate-computing evaluations easier and better by providing:

  • A community-organized selection of existing benchmarks selected for their approximability.
  • Standard quality metrics that take the guesswork out of quantifying precision.
  • Initial program annotations that can help techniques decide where and how to apply approximations.

NPiler

NPiler (NPU compiler) is a compilation workflow that automatically converts annotated regions of imperative code to a neural-network representation. It uses a learning-based approach to accelerating approximate programs. Details of the process can be found here.

Accelerators

Due to the breakdown of Dennard scaling, the percentage of a silicon chip that can switch at full frequency is dropping exponentially with each process generation. The study shows that, regardless of chip organization and topology, multicore scaling is power-limited to a degree not widely appreciated by the computing community. Even at 22 nm, 21% of a fixed-size chip must be powered off, and at 8 nm this number grows to more than 50%. To get a full picture of the problem, read this article (highly recommended). Thus, specialization in the form of accelerators becomes important.

Dynamically Specialized Execution Resource (DySER)

DySER takes iterative code (a very simple example would be a for loop) and dynamically maps it onto an accelerator. The whole set of instructions previously used for the iterative code is thus reduced to a single instruction, saving a lot of energy on fetch, decode, issue, and the register files. The tool does much more than this. An open-source version of the tool can be found here. It may be complicated to understand, but if you are interested in working with accelerators, you can contact me.