## Introduction to Computer Architecture

### **Lecture 1: Introduction and Basics**

Pooyan Jamshidi

Week 1: January 9 & 11, 2024



CSCE 212: Introduction to Computer Architecture | Spring 2024 | <u>https://pooyanjamshidi.github.io/csce212/</u> [Slides are primarily based on those of Onur Mutlu for the Computer Architecture Course at CMU]

## Brief Self Introduction

### Pooyan Jamshidi

- Assistant Professor @ USC (CSE), since August 2018
- Postdoc 2 @ Carnegie Mellon University (US), 2016-2018
- Postdoc 1 @ Imperial College London (UK), 2014-2016
- Ph.D. from Dublin City University (Ireland), 2010-2014
- M.Sc. from Amirkabir University of Technology (Iran), 2006
- B.Sc. from Amirkabir University of Technology (Iran), 2003
- Worked with Google and NASA
- Homepage: <u>https://pooyanjamshidi.github.io/</u>
- Email: pjamshid@cse.sc.edu
- Research and Teaching in:
  - Machine Learning Systems = AI/ML + Computer Systems
  - Autonomous Robots = AI/ML + Robotics
  - Causal AI = Causal Inference, Causal Representation Learning, Transfer Learning
  - Neural Architectures + Hardware Accelerators
  - AI/ML Systems (See CSCE 585)
  - Autonomous and Adaptive Systems (NASA Autonomous Space Lander)



### Artificial Intelligence and Systems Laboratory (AISys)

https://pooyanjamshidi.github.io/AISys/

#### **Research Areas:**

- Causal AI
- ML for Systems
- Systems for ML
- Adversarial ML
- Robot Learning

**Sponsors:** 



**Members:**

- Hamed Damirchi (PhD student)
- Kimia Noorbakhsh (RA)
- Mahdi Sharifi (PhD student)
- Saeid Ghafouri
- Sonam Kharde
- M.A. Javidian
- Fatemeh Ghofrani
- Nasrin Imanpour (Postdoc)
- Shahriar Iqbal (PhD student)
- Abir Hossen (PhD student)
- Morteza Maleki (RA)







## Key Research Directions in Computer Architecture at AISys

- Low-latency and Energy-efficient Neural Architectures
  - Software-Hardware Co-Design
  - Neural Architecture Search
  - Hardware Accelerators
- Domain-Specific Architectures
  - Architectures for AI/ML
  - Hardware Accelerator for LLMs

## Course Information

- Course Website: <u>https://pooyanjamshidi.github.io/csce212/</u>
- **Piazza** (Communications):
  - Discussion boards for each assignment and the course overall
    - Please post all questions on Piazza so that others can benefit from your questions
    - Answer others' questions if you know the answer ;-)
    - Learn from others' questions and answers
- GradeScope (Assignments)
- Teaching Assistant
  - Rasool Sharifi
  - Homepage: <u>https://rasool-sharifi.github.io/</u>
  - Email: <u>ASHARIFI@email.sc.edu</u>



## Textbook (Harris & Harris)



- Chapter 6 (Architecture)
- Chapter 7 (Microarchitecture)
- Chapter 8 (Memory?)

### Evaluation (subject to minor changes)

| Assignments | 50% |
|-------------|-----|
| Midterm     | 25% |
| Final       | 25% |

# Please volunteer to present a topic related to architecture if you are excited about it!



## Basic Goals & Structure of the Computer Architecture Course

### What Will We Learn in This Course?

## How Computers Work (from the ground up)

### We Will Study How Something Like This Works



## Major High-Level Goals of This Course

- In Introduction to Computer Architecture
- Understand the basics
- Understand the principles (of design)
- Understand the precedents
- Based on such understanding:
  - learn how a modern computer works underneath
  - evaluate tradeoffs of different designs and ideas
  - implement a principled design (a simple microprocessor)
  - learn to systematically debug increasingly complex systems
  - Hopefully enable you to develop novel, out-of-the-box designs
- The focus is on basics, principles, precedents, and how to use them to create/implement good designs

## Why These Goals?

- Because you are here for a Computer Science degree!
- Regardless of your future direction, learning the principles of computer architecture will be useful to
  - design better hardware
  - design better software
  - design better systems
  - make better tradeoffs in design
  - understand why computers behave the way they do
  - solve problems better
  - think "in parallel"
  - think critically
  - ...

### Course Components

- Lectures (understanding concepts)
- Readings (reinforcing & going deeper)
- Homework (problem-solving & preparation)
- **Project** (hands-on experience in some concepts)
- Exam (test of understanding)
- In all, you have the freedom to adapt to your learning style
- My advice: Focus on learning & scholarship & understanding

We will enable you to learn + prepare you for the exam

- My suggestions:
  - focus on understanding, learning, mastering the material
    - lectures, readings, labs, HWs all enable this and prepare you
  - reinforce problem solving skills with homeworks
  - do not worry about the exam while listening to lectures
    - most of you will pass this course (historically >80%)

We will release a lot of material to help you with the exam

- Problem solving sessions
- Exam guidance
- All past exams (and basic solutions) are already online



Learning is for life (never ends)

# Focus on learning and scholarship

## How to Approach This Course

- Learning experience
- Long-term tradeoff analysis
- Critical thinking & decision making

## How to Approach This Course

Your mindset will determine what you get out of the course.

Find and choose the learning style that works best for you.

## What Will We Learn in This Course?



## How Computers Work (from the ground up)

# And Why We Care

## Why Do We Have Computers?

## Why Do We Do Computing?



## **To Solve Problems**

# To Gain Insight

Hamming, "Numerical Methods for Scientists and Engineers," 1962.

# To Enable a Better Life & Future

## How Does a Computer Solve Problems?



# Orchestrating Electrons

In today's dominant technologies

## How Do Problems Get Solved by Electrons?

## So, I Hope You Are Here for This



CSCE 145/206

- What happens in-between?
- How is a computer designed using logic gates and wires to satisfy specific goals?

CSCE 211

"C" as a model of computation

Programmer's view of how a computer system works

Architect/microarchitect's view: How to design a computer that meets system design goals. Choices critically affect both the SW programmer and the HW designer

HW designer's view of how a computer system works

Digital logic as a model of computation

### The Transformation Hierarchy

Computer Architecture (expanded view)



Computer Architecture (narrow view)

## Levels of Transformation

"The purpose of computing is [to gain] insight" (*Richard Hamming*)

We gain and generate insight by solving problems.

How do we ensure problems are solved by electrons?

#### Algorithm

Step-by-step procedure that is guaranteed to terminate where each step is precisely stated and can be carried out by a computer

- Finiteness
- Definiteness
- Effective computability

Many algorithms for the same problem
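To illustrate "many algorithms for the same problem", the sketch below gives two procedures for the greatest common divisor. The function names are hypothetical and chosen only for this example. On positive integers both terminate (finiteness), every step is precisely stated (definiteness), and every step is something a computer can carry out (effective computability).

```python
def gcd_subtract(a: int, b: int) -> int:
    """GCD by repeated subtraction (many steps when a is much larger than b)."""
    if a <= 0 or b <= 0:
        raise ValueError("inputs must be positive integers")
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
    return a


def gcd_euclid(a: int, b: int) -> int:
    """GCD by Euclid's remainder method (far fewer steps for the same answer)."""
    if a <= 0 or b <= 0:
        raise ValueError("inputs must be positive integers")
    while b != 0:
        a, b = b, a % b
    return a


if __name__ == "__main__":
    print(gcd_subtract(1071, 462), gcd_euclid(1071, 462))
```

Both functions solve the same problem; they differ only in how many steps they take, which is exactly the kind of tradeoff the algorithm level of the hierarchy exposes.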

Microarchitecture: An implementation of the ISA

- Problem
- Algorithm
- Program/Language
- Runtime System (VM, OS, MM)
- ISA (Architecture)
- Microarchitecture
- Logic
- Devices
- Electrons



ISA (Instruction Set Architecture): Interface/contract between SW and HW. What the programmer assumes hardware will satisfy.
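One way to picture the ISA as a contract is a toy machine with two interchangeable adders. Everything in the sketch below is an assumption made for illustration: the two opcodes (`ADD`, `ADDI`), the four 8-bit registers, and all function names are hypothetical and do not belong to any real ISA. The software only sees register values; how the addition is carried out underneath (the microarchitecture and logic) can change without the program noticing.

```python
from typing import Callable, List, Tuple

MASK = 0xFF  # assumed 8-bit registers for this toy ISA

# A program is a list of (opcode, dest, src, src2_or_immediate) tuples.
Instr = Tuple[str, int, int, int]


def add_builtin(x: int, y: int) -> int:
    """Implementation A: use the host's integer adder and wrap to 8 bits."""
    return (x + y) & MASK


def add_ripple_carry(x: int, y: int) -> int:
    """Implementation B: 8-bit ripple-carry adder built from gate-level operations."""
    result, carry = 0, 0
    for i in range(8):
        a, b = (x >> i) & 1, (y >> i) & 1
        s = a ^ b ^ carry                     # sum bit of a full adder
        carry = (a & b) | (carry & (a ^ b))   # carry-out of a full adder
        result |= s << i
    return result


def run(program: List[Instr], adder: Callable[[int, int], int]) -> List[int]:
    """Execute the toy ISA; only the register values are architecturally visible."""
    regs = [0] * 4
    for op, rd, rs, x in program:
        if op == "ADDI":      # rd = rs + immediate
            regs[rd] = adder(regs[rs], x & MASK)
        elif op == "ADD":     # rd = rs + regs[x]
            regs[rd] = adder(regs[rs], regs[x])
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs


program = [("ADDI", 0, 0, 200), ("ADDI", 1, 1, 100), ("ADD", 2, 0, 1)]
print(run(program, add_builtin))
print(run(program, add_ripple_carry))
```

Both runs print the same register file, because both adders satisfy the same contract (8-bit wraparound addition). That is the sense in which a microarchitecture is "an implementation of the ISA".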

### **Digital logic circuits**

Building blocks of micro-arch (e.g., gates)

## Computer Architecture

- is the science and art of designing computing platforms (hardware, interface, system SW, and programming model)
- to achieve a set of design goals
  - E.g., highest performance on earth on workloads X, Y, Z
  - E.g., longest battery life at a form factor that fits in your pocket with cost < \$\$\$ CHF
  - E.g., best average performance across all known workloads at the best performance/cost ratio
  - ...

Designing a supercomputer is different from designing a smartphone, but many fundamental principles are similar.



To achieve the highest energy efficiency and performance, we must take the expanded view of computer architecture.



### Different Platforms, Different Goals


















**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

#### Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.





Tesla Dojo Chip & System

D1 Chip

- 362 TFLOPs BF16/CFP8, 22.6 TFLOPs FP32
- 10 TBps/dir. on-chip bandwidth, 4 TBps/edge off-chip bandwidth
- 400 W TDP
- 645 mm<sup>2</sup>, 7 nm technology
- 50 billion transistors
- 11+ miles of wires

https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s




NVIDIA is claiming a **7x improvement** in dynamic programming algorithm (**DPX instructions**) performance on a single H100 versus naïve execution on an A100.

# Up to 7X Higher Performance for HPC Applications

https://www.nvidia.com/en-us/data-center/h100/



 The largest ML accelerator chip (2021)

850,000 cores



- Cerebras WSE-2: 2.6 trillion transistors, 46,225 mm<sup>2</sup>
- Largest GPU (NVIDIA Ampere GA100): 54.2 billion transistors, 826 mm<sup>2</sup>
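The comparison between the two chips can be made concrete with a few ratios. The sketch below uses only the transistor counts and die areas quoted above; the variable names are illustrative, and the per-mm<sup>2</sup> density is a derived quantity, not a figure from the slide.

```python
# Figures quoted on the slide.
wse2_transistors = 2.6e12
wse2_area_mm2 = 46_225
ga100_transistors = 54.2e9
ga100_area_mm2 = 826

transistor_ratio = wse2_transistors / ga100_transistors
area_ratio = wse2_area_mm2 / ga100_area_mm2

# Derived: average transistors per square millimetre on each chip.
wse2_density = wse2_transistors / wse2_area_mm2
ga100_density = ga100_transistors / ga100_area_mm2

print(f"transistors: {transistor_ratio:.1f}x, area: {area_ratio:.1f}x")
print(f"density (M/mm^2): WSE-2 {wse2_density / 1e6:.1f}, GA100 {ga100_density / 1e6:.1f}")
```

Under these numbers the wafer-scale chip has many more transistors because it is much larger, not because it packs them more densely.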

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu <u>"Accelerating Genome Analysis: A Primer on an Ongoing Journey"</u> IEEE Micro, August 2020.



#### FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT

SmidgION from ONT



Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as *processing-in-memory (PIM)*.

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.



https://arxiv.org/pdf/2105.03814.pdf



To achieve the highest energy efficiency and performance, we must take the expanded view of computer architecture.



#### What is Computer Architecture?

The science and art of designing, selecting, and interconnecting hardware components and designing the hardware/software interface to create a computing system that meets functional, performance, energy consumption, cost, and other specific goals.

#### Why Study Computer Architecture?

- Enable better systems: make computers faster, cheaper, smaller, more reliable, ...
  - By exploiting advances and changes in underlying technology/circuits
- Enable new applications
  - Life-like 3D visualization 20 years ago? Virtual reality?
  - Self-driving cars?
  - Personalized genomics? Personalized medicine?
- Enable better solutions to problems
  - Software innovation is built on trends and changes in computer architecture
    - > 50% performance improvement per year has enabled this innovation
- Understand why computers work the way they do
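To see why a sustained improvement of more than 50% per year matters so much for software innovation, it helps to compound it. The sketch below is only arithmetic on that 50% figure; the function name and the chosen time horizons are illustrative assumptions, not data from the lecture.

```python
def compound_growth(rate_per_year: float, years: int) -> float:
    """Total speedup after `years` of steady improvement at `rate_per_year`."""
    return (1.0 + rate_per_year) ** years


for years in (1, 5, 10, 20):
    print(f"{years:2d} years at 50%/year -> {compound_growth(0.5, years):8.1f}x")
```

Even a modest-sounding yearly rate turns into orders of magnitude within a decade or two, which is why software could rely on hardware to get faster underneath it.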

### Computer Architecture Today (I)

Today is a very exciting time to study computer architecture

- Industry is in a large paradigm shift (to novel architectures)
  - many different potential system designs possible
- Many difficult problems *motivating* and *caused by* the shift
  - Huge hunger for data and new data-intensive applications
  - Power/energy/thermal constraints
  - Complexity of design
  - Difficulties in technology scaling
  - Memory bottleneck
  - Reliability problems
  - Programmability problems
  - Security and privacy issues
- No clear, definitive answers to these problems

### Computer Architecture Today (II)

- Computing landscape is very different from 10-20 years ago
- Applications and technology both demand novel architectures



General Purpose GPUs

### There's Plenty of Room at the Bottom

From Wikipedia, the free encyclopedia

"There's Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics" was a lecture given by physicist Richard Feynman at the annual American Physical Society meeting at Caltech on December 29, 1959. Feynman considered the possibility of direct manipulation of individual atoms as a more powerful form of synthetic chemistry than those used at the time. Although versions of the talk were reprinted in a few popular magazines, it went largely unnoticed and did not inspire the conceptual beginnings of the field. Beginning in the 1980s, nanotechnology advocates cited it to establish the scientific credibility of their work.

## Historical: Opportunities at the Bottom (II)

## There's Plenty of Room at the Bottom

From Wikipedia, the free encyclopedia

Feynman considered some ramifications of a general ability to manipulate matter on an atomic scale. He was particularly interested in the possibilities of denser computer circuitry, and microscopes that could see things much smaller than is possible with scanning electron microscopes. These ideas were later realized by the use of the scanning tunneling microscope, the atomic force microscope and other examples of scanning probe microscopy and storage systems such as Millipede, created by researchers at IBM.

Feynman also suggested that it should be possible, in principle, to make nanoscale machines that "arrange the atoms the way we want", and do chemical synthesis by mechanical manipulation.

He also presented the possibility of "swallowing the doctor", an idea that he credited in the essay to his friend and graduate student Albert Hibbs. This concept involved building a tiny, swallowable surgical robot.

## Historical: Opportunities at the Top

#### REVIEW

# There's plenty of room at the Top: What will drive computer performance after Moore's law?

Charles E. Leiserson, Neil C. Thompson, Joel S. Emer, Bradley C. Kuszmaul, Butler W. Lampson, et al.

Science 05 Jun 2020: Vol. 368, Issue 6495, eaam9744 DOI: 10.1126/science.aam9744

Much of the improvement in computer performance comes from decades of miniaturization of computer components, a trend that was foreseen by the Nobel Prize-winning physicist Richard Feynman in his 1959 address, "There's Plenty of Room at the Bottom," to the American Physical Society. In 1975, Intel founder Gordon Moore predicted the regularity of this miniaturization trend, now called Moore's law, which, until recently, doubled the number of transistors on computer chips every 2 years.

Unfortunately, semiconductor miniaturization is running out of steam as a viable way to grow computer performance—there isn't much more room at the "Bottom." If growth in computing power stalls, practically all industries will face challenges to their productivity. Nevertheless, opportunities for growth in computing performance will still be available, especially at the "Top" of the computing-technology stack: software, algorithms, and hardware architecture.

Axiom, Revisited

There is plenty of room both at the top and at the bottom, but much more so when you communicate well between and optimize across the top and the bottom.

#### Hence the Expanded View

Computer Architecture (expanded view)

| Problem            |
|--------------------|
| Algorithm          |
| Program/Language   |
| System Software    |
| SW/HW Interface    |
| Micro-architecture |
| Logic              |
| Devices            |
| Electrons          |

# Computer Architecture: Why Is It So Exciting Today?

## Performance, Energy Efficiency, Sustainability

## Reliability, Safety, Security, Privacy

## **More Demanding Workloads**

## New (Device) Technologies

## Performance, Energy Efficiency, Sustainability

#### Do We Want This?



#### Or This?



# High Performance, Energy Efficient, Sustainable

### Many Difficult Problems: Climate



### Many Difficult Problems: Intelligence



### Many Difficult Problems: Intelligence

Rob Toews, "Deep Learning's Carbon Emissions Problem," Forbes, June 17, 2020.

Source: http://spectrum.ieee.org/image/MjYzMzAyMg.jpeg

Source: https://www.forbes.com/sites/robtoews/2020/06/17/deep-learnings-climate-change-problem/



### Many Difficult Problems: Congestion



### Many Difficult Problems: Public Health



Source: https://blog.wego.com/7-crowded-places-and-events-that-you-will-love/

### Many Difficult Problems: Genome Analysis



http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped


### Exponential Growth of Neural Networks

Memory and compute requirements

[Figure: total training compute (PFLOP-days) versus model memory requirement (GB), 2018-2020. Models shown: BERT Base (110M), BERT Large (340M), GPT-2 (1.5B), Megatron-LM (8B), T5 (11B), T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T). Annotation: "Tomorrow, multi-trillion parameter models."]

Source: https://youtu.be/Bh13Idwcb0Q?t=283





1800x more compute

### Computation vs. Data Storage Dichotomy



Apple M1 Ultra System (2022)

### Data Movement vs. Computation Energy



### A memory access consumes ~100-1000X the energy of a complex addition

### Data Movement vs. Computation Energy



A memory access consumes 6400X the energy of a simple integer addition
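A back-of-the-envelope model shows what such a ratio implies for a whole computation. The sketch below takes the 6400X figure quoted above and normalizes the energy of one integer addition to 1 unit; the function name, the operation counts, and the "high reuse" kernel are hypothetical, and no absolute energy values are modeled.

```python
ENERGY_RATIO = 6400.0  # memory access vs. simple integer add (figure quoted above)
ADD_ENERGY = 1.0       # normalized unit; absolute joules are deliberately not modeled


def movement_fraction(memory_accesses: int, additions: int) -> float:
    """Fraction of total energy spent on memory accesses in this simple model."""
    movement = memory_accesses * ENERGY_RATIO * ADD_ENERGY
    compute = additions * ADD_ENERGY
    return movement / (movement + compute)


# Summing N values with no reuse: one memory access per addition.
no_reuse = movement_fraction(memory_accesses=1_000, additions=1_000)

# Hypothetical kernel that performs 6400 additions per value fetched.
high_reuse = movement_fraction(memory_accesses=1_000, additions=6_400_000)

print(f"no reuse:   {no_reuse:.4%} of energy is data movement")
print(f"high reuse: {high_reuse:.4%} of energy is data movement")
```

In this model, computation only matches data movement in energy when each fetched value is reused thousands of times, which motivates architectures that move less data in the first place.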

# Computing Architectures with

## Minimal Data Movement

### UPMEM Processing-in-DRAM Engine (2019)

### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process



Large amounts of compute & memory bandwidth



https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

### **UPMEM Memory Modules**

- E19: 8 chips DIMM (1 rank). DPUs @ 267 MHz
- P21: 16 chips DIMM (2 ranks). DPUs @ 250 MHz
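The module figures quoted for UPMEM (8GB and 128 DPUs on 16 PIM chips per DIMM) determine a few per-DPU quantities. The sketch below is simple arithmetic on those figures; reading "8GB" as 8 GiB is an assumption, the variable names are illustrative, and the system size is the 2,560-DPU system named in this lecture.

```python
DIMM_CAPACITY_MIB = 8 * 1024   # "8GB" per DIMM, read here as 8 GiB (assumption)
DPUS_PER_DIMM = 128
PIM_CHIPS_PER_DIMM = 16
SYSTEM_DPUS = 2560             # size of the 2,560-DPU system named in this lecture

dpus_per_chip = DPUS_PER_DIMM // PIM_CHIPS_PER_DIMM
memory_per_dpu_mib = DIMM_CAPACITY_MIB // DPUS_PER_DIMM
dimms_in_system = SYSTEM_DPUS // DPUS_PER_DIMM
system_memory_gib = dimms_in_system * DIMM_CAPACITY_MIB // 1024

print(f"{dpus_per_chip} DPUs per chip, {memory_per_dpu_mib} MiB of DRAM per DPU")
print(f"{dimms_in_system} DIMMs and {system_memory_gib} GiB for {SYSTEM_DPUS} DPUs")
```

Each DPU sits next to its own small slice of DRAM, so total compute and memory bandwidth grow together as DIMMs are added.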



www.upmem.com

### 2,560-DPU Processing-in-Memory System



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland; IZZAT EL HAJJ, American University of Beirut, Lebanon; IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain; CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece; GERALDO F. OLIVEIRA, ETH Zürich, Switzerland; ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.


https://arxiv.org/pdf/2105.03814.pdf

### FPGA-based Processing Near Memory

 Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu, "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications" <u>IEEE Micro</u> (IEEE MICRO), to appear, 2021.

# FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh<sup>◇</sup>, Mohammed Alser<sup>◇</sup>, Damla Senol Cali<sup>⋈</sup>, Dionysios Diamantopoulos<sup>∇</sup>, Juan Gómez-Luna<sup>◇</sup>, Henk Corporaal<sup>★</sup>, Onur Mutlu<sup>◇ ⋈</sup>

<sup>◇</sup>ETH Zürich, <sup>⋈</sup>Carnegie Mellon University, <sup>★</sup>Eindhoven University of Technology, <sup>∇</sup>IBM Research Europe

Samsung Newsroom


#### Samsung Develops Industry's First High Bandwidth Memory with AI Processing Power

Korea on February 17, 2021

#### The new architecture will deliver over twice the system performance and reduce energy consumption by more than 70%

Samsung Electronics, the world leader in advanced memory technology, today announced that it has developed the industry's first High Bandwidth Memory (HBM) integrated with artificial intelligence (AI) processing power – the HBM-PIM. The new processing-in-memory (PIM) architecture brings powerful AI computing capabilities inside high-performance memory, to accelerate large-scale processing in data centers, high performance computing (HPC) systems and AI-enabled mobile applications.

Kwangil Park, senior vice president of Memory Product Planning at Samsung Electronics stated, "Our groundbreaking HBM-PIM is the industry's first programmable PIM solution tailored for diverse AI-driven workloads such as HPC, training and inference. We plan to build upon this breakthrough by further collaborating with AI solution providers for even more advanced PIM-powered applications."





#### [3D Chip Structure of HBM with FIMDRAM]

ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon, Suk Han Lee, Jaehoon Lee, Sang-Hyuk Kwon, Je Min Ryu, Jong-Pil Son, Seongil O, Hak-Soo Yu, Haesuk Lee, Soo Young Kim, Youngmin Cho, Jin Guk Kim, Jongyoon Choi, Hyun-Sung Shin, Jin Kim, BengSeng Phuah, HyoungMin Kim, Myeong Jun Song, Ahn Choi, Daeho Kim, SooYoung Kim, Eun-Bong Kim, David Wang, Shinhaeng Kang, Yuhwan Ro, Seungwoo Seo, JoonHo Song, Jaeyoun Youn, Kyomin Sohn, Nam Sung Kim

<sup>1</sup>Samsung Electronics, Hwaseong, Korea <sup>2</sup>Samsung Electronics, San Jose, CA <sup>3</sup>Samsung Electronics, Suwon, Korea

### **Programmable Computing Unit**

- Configuration of PCU block
  - Interface unit to control data flow
  - Execution unit to perform operations
  - Register group
    - 32 entries of CRF for instruction memory
    - 16 GRF for weight and accumulation
    - 16 SRF to store constants for MAC operations



[Block diagram of PCU in FIMDRAM]


#### [Available instruction list for FIM operation]

| Type           | CMD  | Description                 |
|----------------|------|-----------------------------|
| Floating Point | ADD  | FP16 addition               |
| Floating Point | MUL  | FP16 multiplication         |
| Floating Point | MAC  | FP16 multiply-accumulate    |
| Floating Point | MAD  | FP16 multiply and add       |
| Data Path      | MOVE | Load or store data          |
| Data Path      | FILL | Copy data from bank to GRFs |
| Control Path   | NOP  | Do nothing                  |
| Control Path   | JUMP | Jump instruction            |
| Control Path   | EXIT | Exit instruction            |
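The floating-point commands in the table can be mimicked in software to get a feel for FP16 arithmetic. The sketch below is not Samsung's implementation: the function names are hypothetical, the MAC semantics (`acc + a * b`, rounded once to FP16) are an assumption, and MAD is left out because the table does not say how its operands differ from MAC. Rounding uses the half-precision `"e"` format of Python's standard `struct` module.

```python
import struct


def fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fim_add(a: float, b: float) -> float:
    """ADD: FP16 addition."""
    return fp16(fp16(a) + fp16(b))


def fim_mul(a: float, b: float) -> float:
    """MUL: FP16 multiplication."""
    return fp16(fp16(a) * fp16(b))


def fim_mac(acc: float, a: float, b: float) -> float:
    """MAC (assumed semantics): acc + a * b, rounded once to FP16."""
    return fp16(fp16(acc) + fp16(a) * fp16(b))


def dot_fp16(weights, inputs) -> float:
    """Dot product as a chain of MAC operations on one accumulator."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        acc = fim_mac(acc, w, x)
    return acc


print(dot_fp16([0.5, 0.25, 2.0], [4.0, 8.0, 1.5]))
print(fp16(0.1))
```

A chain of MACs on one accumulator is the core of a dot product, which is why a multiply-accumulate command is the natural primitive for the machine learning workloads this DRAM targets.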


### **Chip Implementation**

- Mixed design methodology to implement FIMDRAM
  - Full-custom + Digital RTL



[Digital RTL design for PCU block]


| Cell array<br>for bank0                             | Cell array<br>for bank4                              | Cell array<br>for bank0                             | Cell array<br>for bank4                              | Pseudo       | Pseudo    |
|-----------------------------------------------------|------------------------------------------------------|-----------------------------------------------------|------------------------------------------------------|--------------|-----------|
| PCU block<br>for bank0 & 1                          | PCU block<br>for bank4 & 5                           | PCU block<br>for bank0 & 1                          | PCU block<br>for bank4 & 5                           | channel-0    | channel-1 |
| Cell array<br>for bank1<br>Cell array<br>for bank2  | Cell array<br>for bank5<br>Cell array<br>for bank6   | Cell array<br>for bank1<br>Cell array<br>for bank2  | Cell array<br>for bank5<br>Cell array<br>for bank6   |              |           |
| PCU block<br>for bank2 & 3                          | PCU block<br>for bank6 & 7                           | PCU block<br>for bank2 & 3                          | PCU block<br>for bank6 & 7                           |              |           |
| Cell array<br>for bank3                             | Cell array<br>for bank7                              | Cell array<br>for bank3                             | Cell array<br>for bank7                              |              |           |
| Cell array<br>for bank11                            | Cell array<br>for bank15                             | Cell array<br>for bank11                            | Cell array<br>for bank15                             | Control Block |           |
| PCU block<br>for bank10 & 11                        | PCU block<br>for bank14 & 15                         | PCU block<br>for bank10 & 11                        | PCU block<br>for bank14 & 15                         |              |           |
| Cell array<br>for bank10<br>Cell array<br>for bank9 | Cell array<br>for bank14<br>Cell array<br>for bank13 | Cell array<br>for bank10<br>Cell array<br>for bank9 | Cell array<br>for bank14<br>Cell array<br>for bank13 |              |           |
| PCU block<br>for bank8 & 9                          | PCU block<br>for bank12 & 13                         | PCU block<br>for bank8 & 9                          | PCU block<br>for bank12 & 13                         | Pseudo       | Pseudo    |
| Cell array<br>for bank8                             | Cell array<br>for bank12                             | Cell array<br>for bank8                             | Cell array<br>for bank12                             | channel-0    | channel-1 |

### Samsung AxDIMM (2021)



Ke et al. "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM", IEEE Micro (2021)

### SK Hynix Accelerator-in-Memory (2022)


#### SK hynix Develops PIM, Next-Generation AI Accelerator

February 16, 2022

#### Seoul, February 16, 2022

SK hynix (or "the Company", www.skhynix.com) announced on February 16 that it has developed PIM\*, a next-generation memory chip with computing capabilities.

\*PIM(Processing In Memory): A next-generation technology that provides a solution for data congestion issues for AI and big data by adding computational functions to semiconductor memory

It has been generally accepted that memory chips store data and CPU or GPU, like human brain, process data. SK hynix, following its challenge to such notion and efforts to pursue innovation in the next-generation smart memory, has found a breakthrough solution with the development of the latest technology.

SK hynix plans to showcase its PIM development at the world's most prestigious semiconductor conference, 2022 ISSCC\*, in San Francisco at the end of this month. The company expects continued efforts for innovation of this technology to bring the memory-centric computing, in which semiconductor memory plays a central role, a step closer to the reality in devices such as smartphones.

\*ISSCC: The International Solid-State Circuits Conference will be held virtually from Feb. 20 to Feb. 24 this year with a theme of "Intelligent Silicon for a Sustainable World"

For the first product that adopts the PIM technology, SK hynix has developed a sample of GDDR6-AiM (Accelerator\* in memory). The GDDR6-AiM adds computational functions to GDDR6\* memory chips, which process data at 16Gbps. A combination of GDDR6-AiM with CPU or GPU instead of a typical DRAM makes certain computation speed 16 times faster. GDDR6-AiM is widely expected to be adopted for machine learning, high-performance computing, and big data computation and storage.

#### 11.1 A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep-Learning Applications

Seongju Lee, SK hynix, Icheon, Korea

In Paper 11.1, SK Hynix describes a 1ynm, GDDR6-based accelerator-in-memory with a command set for deep-learning operation. The 8Gb design achieves a peak throughput of 1TFLOPS with 1GHz MAC operations and supports major activation functions to improve accuracy.

https://news.skhynix.com/sk-hynix-develops-pim-next-generation-ai-accelerator/

### AliBaba PIM Recommendation System (2022)

DRAM Die Photo (36Gb)





Neural Engine Region

Figure 29.1.7: Die micrographs of DRAM die, NE and ME. Detailed specifications of DRAM die and logic die.

#### 29.1 184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System

Dimin Niu<sup>1</sup>, Shuangchen Li<sup>1</sup>, Yuhao Wang<sup>1</sup>, Wei Han<sup>1</sup>, Zhe Zhang<sup>2</sup>, Yijin Guan<sup>2</sup>, Tianchan Guan<sup>3</sup>, Fei Sun<sup>1</sup>, Fei Xue<sup>1</sup>, Lide Duan<sup>1</sup>, Yuanwei Fang<sup>1</sup>, Hongzhong Zheng<sup>1</sup>, Xiping Jiang<sup>4</sup>, Song Wang<sup>4</sup>, Fengguo Zuo<sup>4</sup>, Yubing Wang<sup>4</sup>, Bing Yu<sup>4</sup>, Qiwei Ren<sup>4</sup>, Yuan Xie<sup>1</sup>

### PIM Review and Open Problems

### A Modern Primer on Processing in Memory

Onur Mutlu<sup>a,b</sup>, Saugata Ghose<sup>b,c</sup>, Juan Gómez-Luna<sup>a</sup>, Rachata Ausavarungnirun<sup>d</sup>

SAFARI Research Group

<sup>a</sup>ETH Zürich <sup>b</sup>Carnegie Mellon University <sup>c</sup>University of Illinois at Urbana-Champaign <sup>d</sup>King Mongkut's University of Technology North Bangkok

Onur Mutlu, Saugata Ghose, Juan Gomez-Luna, and Rachata Ausavarungnirun, "A Modern Primer on Processing in Memory" Invited Book Chapter in <u>Emerging Computing: From Devices to Systems -</u> Looking Beyond Moore and Von Neumann, Springer, to be published in 2021.

### Cerebras's Wafer Scale ML Engine (2019)



 The largest ML accelerator chip

400,000 cores



Cerebras WSE 1.2 Trillion transistors 46,225 mm<sup>2</sup>

Largest GPU 21.1 Billion transistors 815 mm<sup>2</sup>

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

### Cerebras's Wafer Scale ML Engine-2 (2021)



 The largest ML accelerator chip (2021)

850,000 cores



Cerebras WSE-2 2.6 Trillion transistors 46,225 mm<sup>2</sup>

Largest GPU (NVIDIA Ampere GA100) 54.2 Billion transistors 826 mm<sup>2</sup>

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/
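As a quick check on the figures quoted on the two Cerebras slides above, the sketch below computes how many times larger each wafer-scale engine is than the largest GPU of its year, both in transistor count and in die area. The dictionary layout and the function name `ratio` are illustrative choices, not something taken from the slides.

```python
# Figures copied from the slides: (transistor count, die area in mm^2).
CHIPS = {
    "WSE (2019)": (1.2e12, 46_225),
    "Largest GPU (2019)": (21.1e9, 815),
    "WSE-2 (2021)": (2.6e12, 46_225),
    "Largest GPU (2021, GA100)": (54.2e9, 826),
}


def ratio(big: str, small: str) -> tuple[float, float]:
    """Return (transistor ratio, area ratio) of chip `big` over chip `small`."""
    t_big, a_big = CHIPS[big]
    t_small, a_small = CHIPS[small]
    return t_big / t_small, a_big / a_small


for wse, gpu in [("WSE (2019)", "Largest GPU (2019)"),
                 ("WSE-2 (2021)", "Largest GPU (2021, GA100)")]:
    t, a = ratio(wse, gpu)
    print(f"{wse} vs {gpu}: {t:.1f}x transistors, {a:.1f}x area")
```

In both generations the transistor ratio and the area ratio come out at a similar size (roughly 48x to 57x), which suggests the advantage comes mostly from the much larger piece of silicon rather than from denser transistors.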

# Computing Architectures with Minimal Data Movement

Fundamentally **Energy-Efficient** (Data-Centric) **Computing Architectures** 

Fundamentally **High-Performance** (Data-Centric) **Computing Architectures**

Many Interesting Things Are Happening Today in Computer Architecture

Performance, Energy Efficiency, Sustainability, Specialized Accelerators

### Apple M1 System on Chip (2021)



Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested

### Apple M1 Max System on Chip (2021)



Source: <u>https://www.anandtech.com/show/17024/apple-m1-max-performance-review</u>

### Bigger and More Powerful Systems (2021)



### Bigger and More Powerful Systems (2022)







https://www.anandtech.com/show/17431/apple-announces-m2-soc-apple-silicon-updated-for-2022

### Google's Video Coding Unit (2021)

Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild





Figure 5: Pictures of the VCU. (a) Chip floorplan. (b) Two chips on a PCBA.

Source: https://dl.acm.org/doi/pdf/10.1145/3445814.3446723

## Google's Video Coding Unit (2021)

#### **ars** TECHNICA

#### I WONDER IF NETFLIX WANTS TO BUY SOME —

#### YouTube is now building its own video-transcoding chips

Google throws custom silicon at YouTube's massive video-transcoding workload.

RON AMADEO - 4/22/2021, 8:24 PM



A Google Argos VCU. It transcodes video very quickly.



Google has decided that YouTube demands such a huge transcoding workload that it needs to build its own server chips. The company detailed its new "Argos" chips in a YouTube blog post, a CNET interview, and in a paper for ASPLOS, the Architectural Support for Programming Languages and Operating Systems Conference. Just as there are GPUs for graphics workloads and Google's TPU (tensor processing unit) for AI workloads, the YouTube infrastructure team says it has created the "VCU" or "Video (trans)Coding Unit," which helps YouTube transcode a single video into over a dozen versions that it needs to provide a smooth.

#### Table 1: Offline two-pass single output (SOT) throughput in VCU vs. CPU and GPU systems

| System      | Throughput [Mpix/s], H.264 | Throughput [Mpix/s], VP9 | Perf/TCO, H.264 | Perf/TCO, VP9 |
|-------------|----------------------------|--------------------------|-----------------|---------------|
| Skylake     | 714                        | 154                      | 1.0x            | 1.0x          |
| 4xNvidia T4 | 2,484                      | -                        | 1.5x            | -             |
| 8xVCU       | 5,973                      | 6,122                    | 4.4x            | 20.8x         |
| 20xVCU      | 14,932                     | 15,306                   | 7.0x            | 33.3x         |

**Encoding Throughput:** Table 1 shows throughput and perf/TCO (performance per total cost of ownership) for the four systems and is normalized to the perf/TCO of the CPU system. The performance is shown for offline two-pass SOT encoding for H.264 and VP9. For H.264, the GPU has 3.5x higher throughput, and the 8xVCU and 20xVCU provide 8.4x and 20.9x more throughput, respectively. For VP9, the 20xVCU system has 99.4x the throughput of the CPU baseline. The two orders of magnitude increase in performance clearly demonstrates the benefits of our VCU system.
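The speedups quoted in the paragraph above can be reproduced from the throughput columns of Table 1. The following is a minimal sketch; the dictionary and the helper name `speedup` are illustrative, and only the Mpix/s values come from the table.

```python
# Offline two-pass SOT throughput in Mpix/s, copied from Table 1.
# The GPU system has no VP9 entry in the table, so it is left out.
THROUGHPUT = {
    "Skylake": {"H.264": 714, "VP9": 154},
    "4xNvidia T4": {"H.264": 2484},
    "8xVCU": {"H.264": 5973, "VP9": 6122},
    "20xVCU": {"H.264": 14932, "VP9": 15306},
}


def speedup(system: str, codec: str, baseline: str = "Skylake") -> float:
    """Throughput of `system` relative to the CPU baseline for one codec."""
    return THROUGHPUT[system][codec] / THROUGHPUT[baseline][codec]


for system, codecs in THROUGHPUT.items():
    for codec in codecs:
        print(f"{system:12s} {codec:6s} {speedup(system, codec):5.1f}x")
```

Rounded to one decimal place, this gives 3.5x, 8.4x, and 20.9x for H.264 and 99.4x for the 20xVCU system on VP9, matching the text. Note that these are raw throughput ratios; the Perf/TCO columns also account for cost and cannot be derived from throughput alone.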

# TESLA Full Self-Driving Computer (2019)

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.





# Tesla Dojo ML Training Chip (2021)

#### Tesla Dojo Chip



D1 Chip

- 362 TFLOPs BF16/CFP8, 22.6 TFLOPs FP32
- 10 TBps/dir. on-chip bandwidth, 4 TBps/edge off-chip bandwidth
- 400 W TDP
- 50 billion transistors
- 11+ miles of wires




#### https://www.youtube.com/watch?v=j0z4FweCy4M&t=6340s





## Google Tensor Processing Unit (~2016)





**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

#### Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.

# Google TPU Generation II (2017)



https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/

- 4 TPU chips vs 1 chip in TPU1
- High Bandwidth Memory vs DDR3
- Floating point operations vs FP16
- 45 TFLOPS per chip vs 23 TOPS
- Designed for training and inference vs only inference

## Google TPU Generation III





TPU v2 - 4 chips, 2 cores per chip



TPU v3 - 4 chips, 2 cores per chip

- More High Bandwidth Memory
- More Systolic Arrays

### Google TPU Generation IV (2021)



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests

## An Example Modern Systolic Array: TPU (II)

As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer [Kun80][Ram91][Ovt15b]. Figure 4 shows that data flows in from the left, and the weights are loaded from the top. A given 256-element multiply-accumulate operation moves through the matrix as a diagonal wavefront. The weights are preloaded, and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion that the 256 inputs are read at once, and that they instantly update one location of each of 256 accumulators. From a correctness perspective, software is unaware of the systolic nature of the matrix unit, but for performance, it does worry about the latency of the unit.



Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.
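The systolic execution described above can be illustrated with a small cycle-by-cycle simulation. This is a toy sketch under stated assumptions, not the TPU's actual microarchitecture: the array is tiny instead of 256x256, every cell holds one preloaded weight plus one activation register and one partial-sum register, inputs enter from the left skewed by one cycle per row, and partial sums flow downward. The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(X, W):
    """Compute Y = X @ W on a simulated K x N weight-stationary systolic array.

    X is M x K (one input vector per row); W is K x N (preloaded weights).
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]   # activation register in each cell
    psum = [[0] * N for _ in range(K)]  # partial-sum register in each cell
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):      # cycles until the last result drains
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    # Left edge: row k receives element k of input vector
                    # m = t - k, so inputs enter skewed by one cycle per row.
                    m = t - k
                    a_in = X[m][k] if 0 <= m < M else 0
                else:
                    a_in = act[k][n - 1]               # from the left neighbor
                p_in = psum[k - 1][n] if k > 0 else 0  # from the cell above
                new_act[k][n] = a_in
                new_psum[k][n] = p_in + a_in * W[k][n]
        act, psum = new_act, new_psum
        # The bottom row emits finished dot products along a diagonal wavefront.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y


X = [[1, 2], [3, 4], [5, 6]]   # three input vectors of length 2
W = [[1, 0, 2], [0, 1, 3]]     # 2 x 3 weight matrix
print(systolic_matmul(X, W))
```

Each input value is read from the buffer once and then passed from cell to cell, which is the property the text credits for saving energy. The caller sees an ordinary matrix product; only the number of cycles (here `M + K + N - 2`) reveals the pipelined wavefront.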

## An Example Modern Systolic Array: TPU (III)



**Figure 1.** TPU Block Diagram. The main computation part is the yellow Matrix Multiply unit in the upper right hand corner. Its inputs are the blue Weight FIFO and the blue Unified Buffer (UB) and its output is the blue Accumulators (Acc). The yellow Activation Unit performs the nonlinear functions on the Acc, which go to the UB.

# Many (Other) AI/ML Chips

- Alibaba
- Amazon
- Facebook
- Google
- Huawei
- Intel
- Microsoft
- NVIDIA
- Tesla
- Many Others and Many Startups...

### Many More to Come...

# Many (Other) AI/ML Chips (2021)



All information contained within this infographic is gathered from the internet and periodically updated, no guarantee is given that the information provided is correct, complete, and up-to-date.

#### https://basicmi.github.io/AI-Chip/

## Recall Our Axiom

To achieve the highest energy efficiency and performance:

**we must take the expanded view of computer architecture**



Many Interesting Things Are Happening Today in Computer Architecture

Reliability, Safety, Security, Privacy

## Collapse of the "Galloping Gertie"



### Another View



Source: http://www.seattlepi.com/science/article/A-Tacoma-Narrows-Galloping-Gertie-bridge-6617030.php Source: AP

### How Secure Are These People?



#### Security is about preventing unforeseen consequences

Source: https://s-media-cache-ak0.pinimg.com/originals/48/09/54/4809543a9c7700246a0cf8acdae27abf.jpg

### How Safe & Secure Is This Platform?



## Security: RowHammer (2014)



## The Story of RowHammer

- One can predictably induce bit flips in commodity DRAM chips
  - All tested DRAM chips are vulnerable
- First example of how a simple hardware failure mechanism can create a widespread system security vulnerability



### Modern DRAM is Prone to Disturbance Errors



Repeatedly reading a row enough times (before memory gets refreshed) induces disturbance errors in adjacent rows in most real DRAM chips you can buy today

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)
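The access pattern described above can be pictured with a deliberately simplified software model. Everything in this sketch is an assumption made for illustration: the class name `ToyDram`, the one-byte rows, the fixed activation threshold, and the deterministic flip of a single bit. Real disturbance errors depend on the chip, the data pattern, and the individual cell, and this model does not touch real hardware.

```python
class ToyDram:
    """Toy model: once a row has been read `threshold` times between two
    refreshes, one bit flips in each physically adjacent row."""

    def __init__(self, num_rows: int, threshold: int):
        self.rows = [0xFF] * num_rows      # one byte of data per row
        self.activations = [0] * num_rows  # reads since the last refresh
        self.threshold = threshold         # hypothetical, not a real chip parameter

    def read(self, row: int) -> int:
        self.activations[row] += 1
        if self.activations[row] == self.threshold:
            for victim in (row - 1, row + 1):
                if 0 <= victim < len(self.rows):
                    self.rows[victim] ^= 0x01  # disturbance error in a neighbor
        return self.rows[row]

    def refresh(self) -> None:
        """Refreshing restores the cells, so the counters start over."""
        self.activations = [0] * len(self.rows)


dram = ToyDram(num_rows=4, threshold=1000)
for _ in range(1000):
    dram.read(1)        # only row 1 is ever accessed
print([hex(r) for r in dram.rows])
```

In the model, rows 0 and 2 change even though the program never accessed them, while the row that was read stays intact. Calling `refresh()` before the threshold is reached prevents the flips, which mirrors the "before memory gets refreshed" condition in the slide.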

# Most DRAM Modules Are Vulnerable



Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors, (Kim et al., ISCA 2014)

### One Can Take Over an Otherwise-Secure System

### Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors

Abstract. Memory isolation is a key property of a reliable and secure computing system — an access to one memory address should not have unintended side effects on data stored in other addresses. However, as DRAM process technology

<u>Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors</u> (Kim et al., ISCA 2014)

# Project Zero

News and updates from the Project Zero team at Google

Monday, March 9, 2015

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn+, 2015)

## Security: RowHammer (2014)



It's like breaking into an apartment by repeatedly slamming a neighbor's door until the vibrations open the door you were after

## More Security Implications (II)

"Can gain control of a smart phone deterministically"

# Hammer And Root Millions of Androids

Drammer: Deterministic Rowhammer Attacks on Mobile Platforms, CCS'16

Source: https://fossbytes.com/drammer-rowhammer-attack-android-root-devices/

## More Security Implications (VI)

IEEE S&P 2020

### RAMBleed: Reading Bits in Memory Without Accessing Them

Andrew Kwong University of Michigan ankwong@umich.edu Daniel Genkin University of Michigan genkin@umich.edu Daniel Gruss Graz University of Technology daniel.gruss@iaik.tugraz.at Yuval Yarom University of Adelaide and Data61 yval@cs.adelaide.edu.au

#### Terminal Brain Damage: Exposing the Graceless Degradation in Deep Neural Networks Under Hardware Fault Attacks

Sanghyun Hong, Pietro Frigo<sup>†</sup>, Yiğitcan Kaya, Cristiano Giuffrida<sup>†</sup>, Tudor Dumitraș

University of Maryland, College Park <sup>†</sup>Vrije Universiteit Amsterdam



A Single Bit-flip Can Cause Terminal Brain Damage to DNNs

One specific bit-flip in a DNN's representation leads to accuracy drop over 90%

Our research found that a specific bit-flip in a DNN's bitwise representation can cause the accuracy loss up to 90%, and the DNN has 40-50% parameters, on average, that can lead to the accuracy drop over 10% when individually subjected to such single bitwise corruptions...
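To see why one flipped bit can be so damaging, it helps to look at a single number. The sketch below assumes a parameter stored as an IEEE 754 single-precision value (the excerpt does not state the storage format) and uses a hypothetical helper, `flip_bit`, to toggle one bit of its 32-bit encoding.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0 = least significant) of its
    float32 encoding inverted."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped


weight = 0.5
print(flip_bit(weight, 0))    # lowest mantissa bit: a negligible change
print(flip_bit(weight, 31))   # sign bit: the value becomes -0.5
print(flip_bit(weight, 30))   # top exponent bit: the value becomes 2**127
```

Flipping a low mantissa bit barely moves the value, but flipping the most significant exponent bit turns 0.5 into 2^127, about 1.7e38. A parameter of that size can dominate every computation it feeds into, which is consistent with the accuracy collapse reported above.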


## More Security Implications (VIII)

#### DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips

Fan Yao University of Central Florida fan.yao@ucf.edu Adnan Siraj Rakin, Deliang Fan Arizona State University asrakin@asu.edu, dfan@asu.edu

Degrade the **inference accuracy** to the level of **Random Guess** 

Example: ResNet-20 for CIFAR-10, 10 output classes

Before attack, Accuracy: 90.2%

After attack, Accuracy: ~10% (1/10)



## Can We Truly Depend on Computers?



Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg

## Security: Meltdown and Spectre (2018)



Source: J. Masters, Redhat, FOSDEM 2018 keynote talk.

# Silent Data Corruption In-the-Field (2021)



HotOS 2021: Cores That Don't Count (Fun Hardware)

https://www.youtube.com/watch?v=QMF3rqhjYuM

## Silent Data Corruption In-the-Field (2021)

#### **Silent Data Corruptions at Scale**

Harish Dattatraya Dixit Facebook, Inc. hdd@fb.com Sneha Pendharkar Facebook, Inc. spendharkar@fb.com Matt Beadon Facebook, Inc. mbeadon@fb.com Chris Mason Facebook, Inc. clm@fb.com

Tejasvi Chakravarthy Facebook, Inc. teju@fb.com Bharath Muthiah Facebook, Inc. bharathm@fb.com Sriram Sankar Facebook Inc. sriramsankar@fb.com

### Cores that don't count

Peter H. Hochschild Paul Turner Jeffrey C. Mogul Google Sunnyvale, CA, US Rama Govindaraju Parthasarathy Ranganathan Google Sunnyvale, CA, US David E. Culler Amin Vahdat Google Sunnyvale, CA, US


Many Interesting Things Are Happening Today in Computer Architecture

# **More Demanding Workloads**

#### Huge Demand for Performance & Efficiency

**Exponential Growth of Neural Networks**

Memory and compute requirements

[Figure: total training compute (PFLOP-days) versus model memory requirement (GB), 2018 to 2020. Models plotted: BERT Base (110M), BERT Large (340M), GPT-2 (1.5B), Megatron-LM (8B), T5 (11B), T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T). Annotations: "1800x more compute", "In just 2 years", "Tomorrow, multi-trillion parameter models".]

Source: https://youtu.be/Bh13Idwcb0Q?t=283

## Increasingly Demanding Applications

Dream... and, they will come

As applications push boundaries, computing platforms will become increasingly strained.

## New Genome Sequencing Technologies

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Damla Senol Cali, Jeremie S Kim, Saugata Ghose, Can Alkan, Onur Mutlu

Briefings in Bioinformatics, bby017, https://doi.org/10.1093/bib/bby017 Published: 02 April 2018



Oxford Nanopore MinION

## Data → performance & energy bottleneck

## Why Do We Care? An Example

200 Oxford Nanopore sequencers have left UK for China, to support rapid, near-sample coronavirus sequencing for outbreak surveillance

#### Fri 31st January 2020

Following extensive support of, and collaboration with, public health professionals in China, Oxford Nanopore has shipped an additional 200 MinION sequencers and related consumables to China. These will be used to support the ongoing surveillance of the current coronavirus outbreak, adding to a large number of the devices already installed in the country.







700Kg of Oxford Nanopore sequencers and consumables are on their way for use by Chinese scientists in understanding the current coronavirus outbreak.

## Population-Scale Microbiome Profiling



https://blog.wego.com/7-crowded-places-and-events-that-you-will-love/

#### City-Scale Microbiome Profiling

[Figure: sample collection in three steps (1. Swab (3 min), 2. Annotate, 3. GPS-tag/timestamp), followed by the processing workflow: extract DNA (n=1,457 samples), Illumina and Qiagen library prep, HiSeq2500 125x125 sequencing, quality trim (Q20), MegaBLAST-LCA alignment and MetaPhlAn classification. A chart breaks the sequences down into bacteria, unknown organisms, eukaryota, viruses, archaea, plasmids, and ambiguous.]

Afshinnekoo+, "Geospatial Resolution of Human and Bacterial Diversity with City-Scale Metagenomics", Cell Systems, 2015

#### Figure 1. The Metagenome of New York City

(A) The five boroughs of NYC include (1) Manhattan (green)

(B) The collection from the 466 subway stations of NYC across the 24 subway lines involved three main steps: (1) collection with Copan Elution swabs, (2) data entry into the database, and (3) uploading of the data. An image is shown of the current collection database, taken from http://pathomap.giscloud.com. (C) Workflow for sample DNA extraction, library preparation, sequencing, quality trimming of the FASTQ files, and alignment with MegaBLAST and MetaPhlAn to discern taxa present

#### Example: Rapid Surveillance of Ebola Outbreak

#### Figure 1: Deployment of the portable genome surveillance system in Guinea.



Quick+, "Real-time, portable genome sequencing for Ebola surveillance", Nature, 2016

## High-Throughput Genome Sequencers



Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu <u>"Accelerating Genome Analysis: A Primer on an Ongoing Journey"</u> IEEE Micro, August 2020.



#### FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT

SmidgION from ONT

#### The Genomic Era



http://www.economist.com/news/21631808-so-much-genetic-data-so-many-uses-genes-unzipped



#### Data → performance & energy bottleneck

| read4: | COULTCOAT |
|--------|-----------|
| read5: | CCATGACGO |
| read6: | TTCCATGAC |

#### 3 Variant Calling



#### **Scientific Discovery 4**

#### We Need Faster & Scalable Genome Analysis



Understanding genetic variations, species, evolution, ...



Rapid surveillance of **disease outbreaks** 



Predicting the presence and relative abundance of **microbes** in a sample



Developing personalized medicine

And, many, many other applications ...

## Our Dream (circa 2007)

- An embedded device that can perform comprehensive genome analysis in real time (within a minute)
  - Which of these DNAs does this DNA segment match with?
  - What is the likely genetic disposition of this patient to this drug?
  - What disease/condition might this particular DNA/RNA piece be associated with?
  - What potential viruses & variants might be lurking around?
  - ...

#### Software Acceleration: Eliminate Useless Work

Download the source code and try for yourself
 <u>Download link to FastHASH</u>

Xin et al. BMC Genomics 2013, **14**(Suppl 1):S13 http://www.biomedcentral.com/1471-2164/14/S1/S13

PROCEEDINGS

## Accelerating read mapping with FastHASH

Hongyi Xin<sup>1</sup>, Donghyuk Lee<sup>1</sup>, Farhad Hormozdiari<sup>2</sup>, Samihan Yedkar<sup>1</sup>, Onur Mutlu<sup>1\*</sup>, Can Alkan<sup>3\*</sup>

*From* The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013) Vancouver, Canada. 21-24 January 2013



**Open Access** 

#### Hardware Acceleration: Vectorizable Algorithms

#### https://github.com/CMU-SAFARI/Shifted-Hamming-Distance

*Bioinformatics*, 31(10), 2015, 1553–1560 doi: 10.1093/bioinformatics/btu856 Advance Access Publication Date: 10 January 2015 Original Paper

Sequence analysis

# Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping

Hongyi Xin<sup>1,\*</sup>, John Greth<sup>2</sup>, John Emmons<sup>2</sup>, Gennady Pekhimenko<sup>1</sup>, Carl Kingsford<sup>3</sup>, Can Alkan<sup>4,\*</sup> and Onur Mutlu<sup>2,\*</sup>

Xin+, <u>"Shifted Hamming Distance: A Fast and Accurate SIMD-friendly Filter</u> to Accelerate Alignment Verification in Read Mapping", Bioinformatics 2015.
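The idea of a pre-alignment filter can be sketched in a few lines. The code below is a simplified illustration loosely in the spirit of a shift-tolerant Hamming filter; it is not the authors' implementation, and it omits the SIMD bit-vector operations and the refinements of the published method. The function name `may_align` and the equal-length restriction are assumptions made for this example. A read position counts as matched if the same base appears in the reference within `max_edits` positions to either side, and a pair is rejected only if more than `max_edits` positions remain unmatched.

```python
def may_align(read: str, ref: str, max_edits: int) -> bool:
    """Cheap filter run before expensive alignment.

    Returns False only when the pair cannot align within `max_edits` edits;
    a True result still has to be verified by a real aligner.
    """
    if len(read) != len(ref):
        raise ValueError("this sketch expects equal-length sequences")
    unmatched = 0
    for i, base in enumerate(read):
        lo = max(0, i - max_edits)
        hi = min(len(ref), i + max_edits + 1)
        # Does the base match the reference under any shift in [-e, +e]?
        if base not in ref[lo:hi]:
            unmatched += 1
            if unmatched > max_edits:
                return False   # safe to skip alignment for this pair
    return True


print(may_align("ACGTACGT", "ACGTACGT", 0))   # identical pair passes
print(may_align("AAAAAAAA", "CCCCCCCC", 2))   # hopeless pair is rejected
print(may_align("CGTACGTA", "ACGTACGT", 2))   # shifted pair passes
```

The filter errs on the side of accepting: in an alignment with at most `max_edits` edits, every matched read base lines up with a reference base at most `max_edits` positions away, so such a pair is never rejected. Pairs that pass may still fail full alignment, which is the price of a test this cheap.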

#### GateKeeper: FPGA-Based Acceleration



 Mohammed Alser, Hasan Hassan, Hongyi Xin, Oguz Ergin, Onur Mutlu, and Can Alkan
 "GateKeeper: A New Hardware Architecture for Accelerating Pre-Alignment in DNA Short Read Mapping" *Bioinformatics*, [published online, May 31], 2017.
 [Source Code]
 [Online link at Bioinformatics Journal]

## GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping

Mohammed Alser, Hasan Hassan, Hongyi Xin, Oğuz Ergin, Onur Mutlu, Can Alkan

*Bioinformatics*, Volume 33, Issue 21, 1 November 2017, Pages 3355–3363, https://doi.org/10.1093/bioinformatics/btx342

Published: 31 May 2017

## In-Memory DNA Sequence Analysis

Jeremie S. Kim, Damla Senol Cali, Hongyi Xin, Donghyuk Lee, Saugata Ghose, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu,

 "GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies"
 <u>BMC Genomics</u>, 2018.

 Proceedings of the <u>16th Asia Pacific Bioinformatics Conference</u> (APBC), Yokohama, Japan, January 2018.
 [Slides (pptx) (pdf)]
 [Source Code]
 [arxiv.org Version (pdf)]
 [Talk Video at AACBB 2019]

### GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies

Jeremie S. Kim<sup>1,6\*</sup>, Damla Senol Cali<sup>1</sup>, Hongyi Xin<sup>2</sup>, Donghyuk Lee<sup>3</sup>, Saugata Ghose<sup>1</sup>, Mohammed Alser<sup>4</sup>, Hasan Hassan<sup>6</sup>, Oguz Ergin<sup>5</sup>, Can Alkan<sup>4\*</sup> and Onur Mutlu<sup>6,1\*</sup>

*From* The Sixteenth Asia Pacific Bioinformatics Conference 2018 Yokohama, Japan. 15-17 January 2018

## Shouji (障子) [Alser+, Bioinformatics 2019]

Mohammed Alser, Hasan Hassan, Akash Kumar, Onur Mutlu, and Can Alkan, "Shouji: A Fast and Efficient Pre-Alignment Filter for Sequence Alignment" *Bioinformatics*, [published online, March 28], 2019. [Source Code] [Online link at Bioinformatics Journal]

Bioinformatics, 2019, 1–9 doi: 10.1093/bioinformatics/btz234 Advance Access Publication Date: 28 March 2019 Original Paper

Sequence alignment

## Shouji: a fast and efficient pre-alignment filter for sequence alignment

#### Mohammed Alser<sup>1,2,3,\*</sup>, Hasan Hassan<sup>1</sup>, Akash Kumar<sup>2</sup>, Onur Mutlu<sup>1,3,\*</sup> and Can Alkan<sup>3,\*</sup>

<sup>1</sup>Computer Science Department, ETH Zürich, Zürich 8092, Switzerland, <sup>2</sup>Chair for Processor Design, Center For Advancing Electronics Dresden, Institute of Computer Engineering, Technische Universität Dresden, 01062 Dresden, Germany and <sup>3</sup>Computer Engineering Department, Bilkent University, 06800 Ankara, Turkey

\*To whom correspondence should be addressed.

Associate Editor: Inanc Birol

Received on September 13, 2018; revised on February 27, 2019; editorial decision on March 7, 2019; accepted on March 27, 2019

#### SneakySnake [Alser+, Bioinformatics 2020]

Mohammed Alser, Taha Shahroodi, Juan Gómez-Luna, Can Alkan, and Onur Mutlu, "SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs" *Bioinformatics*, to appear in 2020. [Source Code] [Online link at Bioinformatics Journal]

#### SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter for CPUs, GPUs, and FPGAs

Mohammed Alser<sup>1,2,\*</sup>, Taha Shahroodi<sup>1</sup>, Juan Gómez-Luna<sup>1,2</sup>, Can Alkan<sup>4,\*</sup>, and Onur Mutlu<sup>1,2,3,4,\*</sup>

<sup>1</sup>Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland

<sup>2</sup>Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8006, Switzerland

<sup>3</sup>Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh 15213, PA, USA

<sup>4</sup>Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey

## GenASM Framework [MICRO 2020]

Damla Senol Cali, Gurpreet S. Kalsi, Zulal Bingol, Can Firtina, Lavanya Subramanian, Jeremie S. Kim, Rachata Ausavarungnirun, Mohammed Alser, Juan Gomez-Luna, Amirali Boroumand, Anant Nori, Allison Scibisz, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu, "GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis" *Proceedings of the <u>53rd International Symposium on Microarchitecture</u> (MICRO)*, Virtual, October 2020.
 [Lightning Talk Video (1.5 minutes)]
 [Lightning Talk Slides (pptx) (pdf)]
 [Slides (pptx) (pdf)]

#### GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis

Damla Senol Cali<sup>†</sup><sup>™</sup> Gurpreet S. Kalsi<sup>™</sup> Zülal Bingöl<sup>▽</sup> Can Firtina<sup>◊</sup> Lavanya Subramanian<sup>‡</sup> Jeremie S. Kim<sup>◊†</sup> Rachata Ausavarungnirun<sup>⊙</sup> Mohammed Alser<sup>◊</sup> Juan Gomez-Luna<sup>◊</sup> Amirali Boroumand<sup>†</sup> Anant Nori<sup>™</sup> Allison Scibisz<sup>†</sup> Sreenivas Subramoney<sup>™</sup> Can Alkan<sup>▽</sup> Saugata Ghose<sup>\*†</sup> Onur Mutlu<sup>◊†▽</sup>
 <sup>†</sup>Carnegie Mellon University <sup>™</sup>Processor Architecture Research Lab, Intel Labs <sup>▽</sup>Bilkent University <sup>◊</sup>ETH Zürich
 <sup>‡</sup>Facebook <sup>⊙</sup>King Mongkut's University of Technology North Bangkok <sup>\*</sup>University of Illinois at Urbana–Champaign

## SeGraM Framework [ISCA 2022]

Damla Senol Cali, Konstantinos Kanellopoulos, Joel Lindegger, Zulal Bingol, Gurpreet S. Kalsi, Ziyi Zuo, Can Firtina, Meryem Banu Cavlak, Jeremie Kim, Nika MansouriGhiasi, Gagandeep Singh, Juan Gomez-Luna, Nour Almadhoun Alserr, Mohammed Alser, Sreenivas Subramoney, Can Alkan, Saugata Ghose, and Onur Mutlu,
 "SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping"
 Proceedings of the <u>49th International Symposium on Computer Architecture</u> (ISCA), New York, June 2022.

arXiv version

#### SeGraM: A Universal Hardware Accelerator for Genomic Sequence-to-Graph and Sequence-to-Sequence Mapping

Damla Senol Cali<sup>1</sup> Konstantinos Kanellopoulos<sup>2</sup> Joël Lindegger<sup>2</sup> Zülal Bingöl<sup>3</sup> Gurpreet S. Kalsi<sup>4</sup> Ziyi Zuo<sup>5</sup> Can Firtina<sup>2</sup> Meryem Banu Cavlak<sup>2</sup> Jeremie Kim<sup>2</sup> Nika Mansouri Ghiasi<sup>2</sup> Gagandeep Singh<sup>2</sup> Juan Gómez-Luna<sup>2</sup> Nour Almadhoun Alserr<sup>2</sup> Mohammed Alser<sup>2</sup> Sreenivas Subramoney<sup>4</sup> Can Alkan<sup>3</sup> Saugata Ghose<sup>6</sup> Onur Mutlu<sup>2</sup>

<sup>1</sup>Bionano Genomics <sup>2</sup>ETH Zürich <sup>3</sup>Bilkent University <sup>4</sup>Intel Labs <sup>5</sup>Carnegie Mellon University <sup>6</sup>University of Illinois Urbana-Champaign

#### https://arxiv.org/pdf/2205.05883.pdf

## FPGA-based Near-Memory Analytics

 Gagandeep Singh, Mohammed Alser, Damla Senol Cali, Dionysios Diamantopoulos, Juan Gómez-Luna, Henk Corporaal, and Onur Mutlu, "FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications" IEEE Micro (IEEE MICRO), 2021.

# FPGA-based Near-Memory Acceleration of Modern Data-Intensive Applications

Gagandeep Singh<sup>◊</sup> Mohammed Alser<sup>◊</sup> Damla Senol Cali<sup>⋈</sup>

**Dionysios Diamantopoulos**<sup>∇</sup> **Juan Gómez-Luna**<sup>◊</sup>

Henk Corporaal<sup>★</sup> Onur Mutlu<sup>◊ ⋈</sup>

<sup>◊</sup>ETH Zürich <sup>⋈</sup>Carnegie Mellon University
 <sup>★</sup>Eindhoven University of Technology <sup>∇</sup>IBM Research Europe

## In-Storage Genome Filtering [ASPLOS 2022]

Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, and Onur Mutlu,
 "GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis"
 Proceedings of the <u>27th International Conference on Architectural Support for Programming Languages and Operating Systems</u> (ASPLOS), Virtual, February-March 2022.
 [Lightning Talk Slides (pptx) (pdf)]

#### GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis

Nika Mansouri Ghiasi<sup>1</sup> Jisung Park<sup>1</sup> Harun Mustafa<sup>1</sup> Jeremie Kim<sup>1</sup> Ataberk Olgun<sup>1</sup> Arvid Gollwitzer<sup>1</sup> Damla Senol Cali<sup>2</sup> Can Firtina<sup>1</sup> Haiyu Mao<sup>1</sup> Nour Almadhoun Alserr<sup>1</sup> Rachata Ausavarungnirun<sup>3</sup> Nandita Vijaykumar<sup>4</sup> Mohammed Alser<sup>1</sup> Onur Mutlu<sup>1</sup>

<sup>1</sup>ETH Zürich <sup>2</sup>Bionano Genomics <sup>3</sup>KMUTNB <sup>4</sup>University of Toronto

## Future of Genome Sequencing & Analysis

Mohammed Alser, Zülal Bingöl, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, Onur Mutlu <u>"Accelerating Genome Analysis: A Primer on an Ongoing Journey"</u> IEEE Micro, August 2020.



#### FPGA-Based Near-Memory Acceleration of Modern Data-Intensive Applications

July-Aug. 2021, pp. 39-48, vol. 41 DOI Bookmark: 10.1109/MM.2021.3088396

MinION from ONT

SmidgION from ONT

## COVID-19 Nanopore Sequencing (I)

#### SARS-CoV-2 Whole genome sequencing



From ONT (<u>https://nanoporetech.com/covid-19/overview</u>)

## COVID-19 Nanopore Sequencing (II)

How are scientists using nanopore sequencing to research COVID-19?




From ONT (<u>https://nanoporetech.com/covid-19/overview</u>)

## Accelerating Genome Analysis: Overview

 Mohammed Alser, Zulal Bingol, Damla Senol Cali, Jeremie Kim, Saugata Ghose, Can Alkan, and Onur Mutlu,
 "Accelerating Genome Analysis: A Primer on an Ongoing Journey" IEEE Micro (IEEE MICRO), Vol. 40, No. 5, pages 65-75, September/October 2020.
 [Slides (pptx)(pdf)]
 [Talk Video (1 hour 2 minutes)]

## Accelerating Genome Analysis: A Primer on an Ongoing Journey

Mohammed Alser ETH Zürich

Zülal Bingöl Bilkent University

Damla Senol Cali Carnegie Mellon University

Jeremie Kim ETH Zurich and Carnegie Mellon University Saugata Ghose University of Illinois at Urbana–Champaign and Carnegie Mellon University

Can Alkan Bilkent University

**Onur Mutlu** ETH Zurich, Carnegie Mellon University, and Bilkent University

## Beginner Reading on Genome Analysis

Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu "From Molecules to Genomic Variations to Scientific Discovery: Intelligent Algorithms and Architectures for Intelligent Genome Analysis" Computational and Structural Biotechnology Journal, 2022 [Source code]



Review

From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures



Mohammed Alser\*, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, Onur Mutlu\*

ETH Zurich, Gloriastrasse 35, 8092 Zürich, Switzerland

#### https://arxiv.org/pdf/2205.07957.pdf

Many Interesting Things Are Happening Today in Computer Architecture

## **More Demanding Workloads**



# Computing is Bottlenecked by Data

## Data is Key for AI, ML, Genomics, ...

Important workloads are all data intensive

 They require rapid and efficient processing of large amounts of data

- Data is increasing
  - We can generate more than we can process

#### Data is Key for Future Workloads



#### **In-memory Databases**

[Mao+, EuroSys'12; Clapp+ (**Intel**), IISWC'15]



#### **In-Memory Data Analytics**

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Graph/Tree Processing** [Xu+, IISWC'12; Umuroglu+, FPL'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

#### Data Overwhelms Modern Machines





#### **In-memory Databases**

#### **Graph/Tree Processing**

#### Data → performance & energy bottleneck



#### In-Memory Data Analytics

[Clapp+ (**Intel**), IISWC'15; Awan+, BDCloud'15]



**Datacenter Workloads** [Kanev+ (**Google**), ISCA'15]

#### Data is Key for Future Workloads



Chrome

**Google's web browser** 



#### **TensorFlow Mobile**

Google's machine learning framework



**Google's video codec** 



#### SAFARI

#### Data Overwhelms Modern Machines






#### Data Movement Overwhelms Modern Machines

Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, Allan Knies, Parthasarathy Ranganathan, and Onur Mutlu, "Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks" Proceedings of the <u>23rd International Conference on Architectural Support for Programming</u> <u>Languages and Operating Systems</u> (ASPLOS), Williamsburg, VA, USA, March 2018.

#### 62.7% of the total system energy is spent on data movement

#### Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks

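Amirali Boroumand<sup>1</sup> Saugata Ghose<sup>1</sup> Youngsok Kim<sup>2</sup> Rachata Ausavarungnirun<sup>1</sup> Eric Shiu<sup>3</sup> Rahul Thakur<sup>3</sup> Daehyun Kim<sup>4,3</sup> Aki Kuusela<sup>3</sup> Allan Knies<sup>3</sup> Parthasarathy Ranganathan<sup>3</sup> Onur Mutlu<sup>5,1</sup>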

#### Data Movement Overwhelms Accelerators

 Amirali Boroumand, Saugata Ghose, Berkin Akin, Ravi Narayanaswami, Geraldo F. Oliveira, Xiaoyu Ma, Eric Shiu, and Onur Mutlu,
 "Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks"
 Proceedings of the <u>30th International Conference on Parallel Architectures and Compilation</u> <u>Techniques</u> (PACT), Virtual, September 2021.
 [Slides (pptx) (pdf)]
 [Talk Video (14 minutes)]

#### > 90% of the total system energy is spent on memory in large ML models

#### **Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks**

Amirali Boroumand<sup>†</sup> Saugata Ghose<sup>‡</sup> Berkin Akin<sup>§</sup> Ravi Narayanaswami<sup>§</sup> Geraldo F. Oliveira<sup>★</sup> Xiaoyu Ma<sup>§</sup> Eric Shiu<sup>§</sup> Onur Mutlu<sup>★†</sup>

<sup>†</sup>Carnegie Mellon Univ. <sup>•</sup>Stanford Univ. <sup>‡</sup>Univ. of Illinois Urbana-Champaign <sup>§</sup>Google <sup>★</sup>ETH Zürich

#### Data Movement vs. Computation Energy



#### A memory access consumes ~100-1000X the energy of a complex addition

#### Data Movement vs. Computation Energy



A memory access consumes 6400X the energy of a simple integer addition

Many Interesting Things Are Happening Today in Computer Architecture

#### Many Novel Concepts Investigated Today

- New Computing Paradigms (Rethinking the Full Stack)
  - Processing in Memory, Processing Near Data
  - Neuromorphic Computing, Quantum Computing
  - Fundamentally Secure and Dependable Computers
- New Accelerators & Systems (Algorithm-Hardware Co-Designs)
  - Artificial Intelligence & Machine Learning
  - Graph & Data Analytics, Vision, Video
  - Genome Analysis
- New Memories, Storage Systems, Interconnects, Devices
  - Non-Volatile Main Memory, Intelligent Memory Systems, Quantum
  - High-Speed Interconnects, Disaggregated Systems

#### Increasingly Demanding Applications

# Dream

# and, they will come

As applications push boundaries, computing platforms will become increasingly strained.

#### Increasingly Diverging/Complex Tradeoffs



A memory access consumes 6400X the energy of a simple integer addition
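The 6400X figure can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the two per-operation energies (about 0.1 pJ for a 32-bit integer addition and about 640 pJ for a 32-bit off-chip DRAM read) are assumed values chosen to be consistent with the ratio on the slide, and the function names are hypothetical.

```python
# Assumed per-operation energies in picojoules (illustrative, not from the slide).
ADD_ENERGY_PJ = 0.1      # one 32-bit integer addition
DRAM_ACCESS_PJ = 640.0   # one 32-bit read from off-chip DRAM


def energy_ratio(memory_pj: float, compute_pj: float) -> float:
    """How many times more energy a memory access costs than a computation."""
    return memory_pj / compute_pj


def data_movement_share(num_adds: int, num_dram_accesses: int) -> float:
    """Fraction of total energy spent on DRAM accesses in a simple workload."""
    compute = num_adds * ADD_ENERGY_PJ
    movement = num_dram_accesses * DRAM_ACCESS_PJ
    return movement / (compute + movement)


ratio = energy_ratio(DRAM_ACCESS_PJ, ADD_ENERGY_PJ)
share = data_movement_share(num_adds=1000, num_dram_accesses=1)
print(f"DRAM access vs. integer add: {ratio:.0f}x")
print(f"Energy share of 1 DRAM access among 1000 adds: {share:.1%}")
```

Under these assumptions, a single DRAM access still dominates the energy of a thousand additions, which is the tradeoff the slide points at.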

#### Increasingly Complex Systems

#### Past systems



#### Increasingly Complex Systems



(General Purpose) GPUs

#### Increasingly Complex Systems on Chip



Source: https://www.anandtech.com/show/16252/mac-mini-apple-m1-tested

## Bigger and More Powerful Systems (2021)



#### Computer Architecture Today

- Computing landscape is very different from 10-20 years ago
- Applications and technology both demand novel architectures



General Purpose GPUs

#### Computer Architecture Today (II)

- You can revolutionize the way computers are built, if you understand both the hardware and the software (and change each accordingly)
- You can invent new paradigms for computation, communication, and storage
- Recommended book: Thomas Kuhn, "The Structure of Scientific Revolutions" (1962)
  - Pre-paradigm science: no clear consensus in the field
  - Normal science: dominant theory used to explain/improve things (business as usual); exceptions considered anomalies
  - Revolutionary science: underlying assumptions re-examined



#### Takeaways

- It is an exciting time to be understanding and designing computing architectures
- Many challenging and exciting problems in platform design
  - That no one has tackled (or thought about) before
  - That can have huge impact on the world's future
- Driven by huge hunger for data (Big Data), new applications (ML/AI, graph analytics, genomics), ever-greater realism, ...
  - We can easily collect more data than we can analyze/understand
- Driven by significant difficulties in keeping up with that hunger at the technology layer
  - Five walls: Energy, reliability, complexity, security, scalability

## Let's Start with Some Puzzles

a.k.a. Computer Architecture resembles Building Architecture

#### What Is This?



#### What About This?



## What Do the Following Have in Common?

#### Gare do Oriente, Lisbon



Source: By Martín Gómez Tagle - Lisbon, Portugal, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=13764903

#### Milwaukee Art Museum



Source: By Andrew C. from Flagstaff, USA - Flickr, CC BY 2.0, https://commons.wikimedia.org/w/index.php?curid=379223

#### Athens Olympic Stadium



Source: By Spyrosdrakopoulos - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=16172519

#### City of Arts and Sciences, Valencia



#### Florida Polytechnic University (I)



Source: http://www.architectmagazine.com/design/buildings/florida-polytechnic-university-designed-by-santiago-calatrava\_o

#### Oculus, New York City



Source: https://www.dezeen.com/2016/08/29/santiago-calatrava-oculus-world-trade-center-transportation-hub-new-york-photographs-hufton-crow/

# What do All Those Have in Common with Bahnhof Stadelhofen?

#### Answer: All Designed by a Famous Architect

- ETH Alumnus, PhD Civil Engineering
- "The train station has several of the features that became signatures of his work; straight lines and right angles are rare."



Santiago Calatrava Valls (born 28 July 1951) is a Spanish architect, structural engineer, sculptor and painter, particularly known for his bridges supported by single leaning pylons, and his railway stations, stadiums, and museums, whose sculptural forms often resemble living organisms.<sup>[1]</sup> His best-known works include the Milwaukee Art Museum, the Turning Torso tower in Malmo, Sweden, the Margaret Hunt Hill Bridge in Dallas, Texas, and the Museum of Tomorrow in Rio de Janeiro,

#### Your First Comp. Architecture Assignment

- Go and find the closest Calatrava building to this classroom
  - For those who like a challenge, find the Calatrava-designed building that is furthest from this classroom
- Appreciate the beauty & out-of-the-box and creative thinking
- Think about tradeoffs in the design
  - Strengths, weaknesses, goals of design
- Derive principles on your own for good design and innovation
- Due date: Any time during or after this course
  - Later during the course is better
  - Apply what you have learned in this course
  - Think out-of-the-box

#### But First, Today's First Assignment

## Find The Differences of This and That

#### This



#### That



#### Many Tradeoffs Between Two Designs

• You can list them after you complete the first assignment...

#### Aside: Evaluation Criteria for the Designs

- Functionality (Does it meet the specification?)
- Reliability
- Space requirement
- Cost
- Expandability
- Comfort level of users
- Happiness level of users
- Aesthetics
- Security

- How to evaluate the goodness of a design is always a critical question → "Performance" evaluation and metrics

# A Key Question

- How was Calatrava able to design his buildings, especially his key ones?
- Can have many guesses
  - (Very) hard work, perseverance, dedication (over decades)
  - Experience
  - Creativity, Out-of-the-box thinking
  - A good understanding of past designs
  - Good judgment and intuition
  - Strong skill combination (math, architecture, art, engineering, ...)
  - Funding (\$\$\$), luck, initiative, entrepreneurialism
  - Strong understanding of and commitment to fundamentals
  - Principled design
  - ...
  - You will be exposed to and hopefully develop/enhance many of these skills in this course

# Principled Design

- "To me, there are two overriding principles to be found in nature which are most appropriate for building:
  - one is the optimal use of material,
  - the other the capacity of organisms to change shape, to grow, and to move."
  - Santiago Calatrava

"Calatrava's constructions are inspired by natural forms like plants, bird wings, and the human body."

#### Gare do Oriente, Lisbon, Revisited



Source: By Martín Gómez Tagle - Lisbon, Portugal, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=13764903

Source: http://www.arcspace.com/exhibitions/unsorted/santiago-calatrava/

# A Principled Design

# Zoomorphic architecture

From Wikipedia, the free encyclopedia

**Zoomorphic architecture** is the practice of using animal forms as the inspirational basis and blueprint for architectural design. "While animal forms have always played a role adding some of the deepest layers of meaning in architecture, it is now becoming evident that a new strand of biomorphism is emerging where the meaning derives not from any specific representation but from a more general allusion to biological processes."<sup>[1]</sup>

Some well-known examples of Zoomorphic architecture can be found in the TWA Flight Center building in New York City, by Eero Saarinen, or the Milwaukee Art Museum by Santiago Calatrava, both inspired by the form of a bird's wings.<sup>[3]</sup>

#### What Does This Remind You Of?



Source: https://www.dezeen.com/2016/08/29/santiago-calatrava-oculus-world-trade-center-transportation-hub-new-york-photographs-hufton-crow/

#### Design

Calatrava said that the Oculus resembles a bird being released from a child's hand. The roof was originally designed to mechanically open to increase light and ventilation to the enclosed space. Herbert Muschamp, architecture critic of *The New York Times*, compared the design to the Bethesda Terrace and Fountain in Central Park, and wrote in 2004:

### Strengths and Praise

"Santiago Calatrava's design for the World Trade Center PATH station should satisfy those who believe that buildings planned for ground zero must aspire to a spiritual dimension. Over the years, many people have discerned a metaphysical element in Mr. Calatrava's work. I hope New Yorkers will detect its presence, too. With deep appreciation, I congratulate the Port Authority for commissioning Mr. Calatrava, the great Spanish architect and engineer, to design a building with the power to shape the future of New York. It is a pleasure to report, for once, that public officials are not overstating the case when they describe a design as breathtaking."<sup>[43]</sup>

### Design Constraints and Criticism

However, Calatrava's original soaring spike design was scaled back because of security issues. The *New York Times* observed in 2005:

"In the name of security, Santiago Calatrava's bird has grown a beak. Its ribs have doubled in number and its wings have lost their interstices of glass.... [T]he main transit hall, between Church and Greenwich Streets, will almost certainly lose some of its delicate quality, while gaining structural expressiveness. It may now evoke a slender stegosaurus more than it does a bird."<sup>[45]</sup>

#### Stegosaurus

From Wikipedia, the free encyclopedia


**Stegosaurus** (/ˌstɛɡəˈsɔːrəs/<sup>[1]</sup>) is a genus of armored dinosaur. Fossils of this genus date to the Late Jurassic period, where they are found in Kimmeridgian to early Tithonian aged strata, between 155 and 150 million years ago, in the western United States and Portugal. Several

Source: https://en.wikipedia.org/wiki/Stegosaurus

Susannah Maidment et al. & Natural History Museum, London - Maidment SCR, Brassey C, Barrett PM (2015) The Postcranial Skeleton of an Exceptionally Complete Individual of the Plated Dinosaur Stegosaurus stenops (Dinosauria: Thyreophora) from the Upper Jurassic Morrison Formation of Wyoming, U.S.A. PLoS ONE 10(10): e0138352. doi:10.1371/journal.pone.0138352

### Design Constraints: No One Is Immune

However, Calatrava's original soaring spike design was scaled back because of security issues. The *New York Times* observed in 2005:

"In the name of security, Santiago Calatrava's bird has grown a beak. Its ribs have doubled in number and its wings have lost their interstices of glass.... [T]he main transit hall, between Church and Greenwich Streets, will almost certainly lose some of its delicate quality, while gaining structural expressiveness. It may now evoke a slender stegosaurus more than it does a bird."<sup>[45]</sup>

The design was further modified in 2008 to eliminate the opening and closing roof mechanism because of budget and space constraints.<sup>[46]</sup>

The Transportation Hub has been dubbed "the world's most expensive transportation hub" for its massive cost for reconstruction—\$3.74 billion dollars.<sup>[48][58]</sup> By contrast, the proposed two-mile PATH extension

Source: https://en.wikipedia.org/wiki/World\_Trade\_Center\_station\_(PATH)


# The Lecture Was Slightly Different When I Was at CMU

#### What Is This?



#### Answer: Masterpiece of A Famous Architect

# Fallingwater

From Wikipedia, the free encyclopedia

**Fallingwater** or **Kaufmann Residence** is a house designed by architect Frank Lloyd Wright in 1935 in rural southwestern Pennsylvania, 43 miles (69 km) southeast of Pittsburgh.<sup>[4]</sup> The home was built partly over a waterfall on Bear Run in the Mill Run section of Stewart Township, Fayette County, Pennsylvania, in the Laurel Highlands of the Allegheny Mountains.

*Time* cited it after its completion as Wright's "most beautiful job";<sup>[5]</sup> it is listed among *Smithsonian's* Life List of 28 places "to visit before you die."<sup>[6]</sup> It was designated a National Historic Landmark in 1966.<sup>[3]</sup> In 1991, members of the American Institute of Architects named the house the "best all-time work of American architecture" and in 2007, it was ranked twenty-ninth on the list of America's Favorite Architecture according to the AIA.



# Find The Differences of This and That



#### This








# A Key Question

- How was Wright able to design his masterpiece?
- Can have many guesses
  - (Very) hard work, perseverance, dedication (over decades)
  - Experience
  - Creativity, Out-of-the-box thinking
  - A good understanding of past designs
  - Good judgment and intuition
  - Strong skill combination (math, architecture, art, engineering, ...)
  - Funding (\$\$\$), luck, initiative, entrepreneurialism
  - Strong understanding of and commitment to fundamentals
  - Principled design
  - ...
  - You will be exposed to and hopefully develop/enhance many of these skills in this course

### A Quote from The Architect Himself

"architecture [...] based upon principle, and not upon precedent"



# A Principled Design

# **Organic architecture**

From Wikipedia, the free encyclopedia

**Organic architecture** is a philosophy of architecture which promotes harmony between human habitation and the natural world through design approaches so sympathetic and well integrated with its site, that buildings, furnishings, and surroundings become part of a unified, interrelated composition.

A well-known example of organic architecture is Fallingwater, the residence Frank Lloyd Wright designed for the Kaufmann family in rural Pennsylvania. Wright had many choices to locate a home on this large site, but chose to place the home directly over the waterfall and creek creating a close, yet noisy dialog with the rushing water and the steep site. The horizontal striations of stone masonry with daring cantilevers of colored beige concrete blend with native rock outcroppings and the wooded environment.

# A Key Question

- How was Wright able to design his masterpiece?
- Can have many guesses
  - (Very) hard work, perseverance, dedication (over decades)
  - Experience
  - Creativity, Out-of-the-box thinking
  - A good understanding of past designs
  - Good judgment and intuition
  - Strong skill combination (math, architecture, art, engineering, ...)
  - Funding (\$\$\$), luck, initiative, entrepreneurialism
  - Strong understanding of and commitment to fundamentals
  - Principled design
  - ...
  - You will be exposed to and hopefully develop/enhance many of these skills in this course



- It all starts from the basic building blocks and design principles
  - And, knowledge of how to use, apply, enhance them
- Underlying technology might change (e.g., steel vs. wood)
  - but methods of taking advantage of technology bear resemblance
  - methods used for design depend on the principles employed

# The Same Applies to Processor Chips

There are basic building blocks and design principles



4 cores



Intel Core i7 8 cores



IBM Cell BE 8+1 cores



IBM POWER7 8 cores

Sun Niagara II 8 cores



Nvidia Fermi 448 "cores"



Intel SCC 48 cores, networked



Tilera TILE Gx 100 cores, networked

# The Same Applies to Computing Systems

There are basic building blocks and design principles



Source: http://www.sia-online.org (Semiconductor Industry Association)

# The Same Applies to Computing Systems

#### There are basic building blocks and design principles







Source: https://taxistartup.com/wp-content/uploads/2015/03/UK-Self-Driving-Cars.jpg







# Apple M1 Max System on Chip (2021)



Source: <u>https://www.anandtech.com/show/17024/apple-m1-max-performance-review</u>

### Google Tensor Processing Unit (~2016)





**Figure 3.** TPU Printed Circuit Board. It can be inserted in the slot for an SATA disk in a server, but the card uses PCIe Gen3 x16.

**Figure 4.** Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

#### Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", ISCA 2017.
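The "illusion" described in the Figure 4 caption can be sketched as a purely functional model: each input vector is applied to a fixed weight matrix, and the result updates one slot in every accumulator. This is a minimal sketch, not the TPU's implementation. It ignores the cycle-by-cycle systolic timing entirely, the function name `matrix_unit` is hypothetical, and a 2x2 unit stands in for the 256-wide unit in the caption.

```python
def matrix_unit(inputs, weights):
    """Functional model of the software-visible behavior of a matrix multiply unit.

    inputs:  list of input vectors, each of length n
    weights: n x n matrix held stationary inside the unit
    Returns accumulators[j][t]: the slot of accumulator j updated by input vector t.
    """
    n = len(weights)
    accumulators = [[0] * len(inputs) for _ in range(n)]
    for t, x in enumerate(inputs):      # one input vector per step
        for j in range(n):              # one accumulator per weight column
            for i in range(n):          # multiply-accumulate down the column
                accumulators[j][t] += x[i] * weights[i][j]
    return accumulators


weights = [[1, 2],
           [3, 4]]
inputs = [[1, 0], [0, 1], [1, 1]]
accumulators = matrix_unit(inputs, weights)
print(accumulators)
```

In hardware, the inner loops run in parallel and the data flows through the array in a staggered wavefront; the model above only captures what software observes.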

#### Google TPU Generation IV (2021)



#### New ML applications (vs. TPU3):

- Computer vision
- Natural Language Processing (NLP)
- Recommender system
- Reinforcement learning that plays Go

250 TFLOPS per chip in 2021 vs 90 TFLOPS in TPU3



https://spectrum.ieee.org/tech-talk/computing/hardware/heres-how-googles-tpu-v4-ai-chip-stacked-up-in-training-tests\_

# TESLA Full Self-Driving Computer (2019)

- ML accelerator: 260 mm<sup>2</sup>, 6 billion transistors, 600 GFLOPS GPU, 12 ARM 2.2 GHz CPUs.
- Two redundant chips for better safety.





# Cerebras's Wafer Scale ML Engine-2 (2021)



 The largest ML accelerator chip (2021)

850,000 cores



- Cerebras WSE-2: 2.6 trillion transistors, 46,225 mm<sup>2</sup>
- Largest GPU (NVIDIA Ampere GA100): 54.2 billion transistors, 826 mm<sup>2</sup>
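A quick comparison of the two chips, using only the transistor counts and die areas quoted on this slide, can be worked out as follows. The dictionary layout and the `density` helper are hypothetical names introduced for this sketch.

```python
# Figures taken from the slide: transistor count and die area for each chip.
chips = {
    "Cerebras WSE-2": {"transistors": 2.6e12, "area_mm2": 46_225},
    "NVIDIA Ampere GA100": {"transistors": 54.2e9, "area_mm2": 826},
}


def density(chip):
    """Transistors per square millimeter."""
    return chip["transistors"] / chip["area_mm2"]


wse = chips["Cerebras WSE-2"]
gpu = chips["NVIDIA Ampere GA100"]
transistor_ratio = wse["transistors"] / gpu["transistors"]
area_ratio = wse["area_mm2"] / gpu["area_mm2"]

print(f"Transistor ratio: {transistor_ratio:.1f}x")
print(f"Area ratio:       {area_ratio:.1f}x")
for name, chip in chips.items():
    print(f"{name}: {density(chip) / 1e6:.1f} M transistors per mm^2")
```

The two ratios come out close to each other, which suggests that the wafer-scale chip gains its transistor count mostly from sheer area rather than from packing transistors more densely.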

https://www.anandtech.com/show/14758/hot-chips-31-live-blogs-cerebras-wafer-scale-deep-learning

https://www.cerebras.net/cerebras-wafer-scale-engine-why-we-need-big-chips-for-deep-learning/

# Google's Video Coding Unit (2021)

Warehouse-Scale Video Acceleration: Co-design and Deployment in the Wild





**Figure 5.** Pictures of the VCU: (a) chip floorplan; (b) two chips on a PCBA.

Source: https://dl.acm.org/doi/pdf/10.1145/3445814.3446723

#### UPMEM Processing-in-DRAM Engine (2019)

#### Processing in DRAM Engine

 Includes standard DIMM modules, with a large number of DPU processors combined with DRAM chips.

#### Replaces standard DIMMs

- DDR4 R-DIMM modules
  - 8GB+128 DPUs (16 PIM chips)
  - Standard 2x-nm DRAM process
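The module figures above imply a simple per-DPU breakdown. The sketch below assumes that "8GB" means binary gigabytes and that memory and DPUs are split evenly across chips; both are assumptions for illustration, not statements from the slide.

```python
# Figures from the slide: one DDR4 R-DIMM module.
DIMM_CAPACITY_BYTES = 8 * 2**30   # 8 GB, read as binary gigabytes (assumption)
DPUS_PER_DIMM = 128
PIM_CHIPS_PER_DIMM = 16

# Assuming an even split across chips and DPUs.
dpus_per_chip = DPUS_PER_DIMM // PIM_CHIPS_PER_DIMM
bytes_per_dpu = DIMM_CAPACITY_BYTES // DPUS_PER_DIMM
mib_per_dpu = bytes_per_dpu // 2**20

print(f"DPUs per PIM chip: {dpus_per_chip}")
print(f"DRAM per DPU:      {mib_per_dpu} MiB")
```

Each DPU therefore works on its own small slice of DRAM, which is why aggregate compute and memory bandwidth grow with the number of modules installed.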



Large amounts of compute & memory bandwidth



https://www.upmem.com/video-upmem-presenting-its-true-processing-in-memory-solution-hot-chips-2019/

### Different Platforms, Different Goals



#### Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

JUAN GÓMEZ-LUNA, ETH Zürich, Switzerland

IZZAT EL HAJJ, American University of Beirut, Lebanon

IVAN FERNANDEZ, ETH Zürich, Switzerland and University of Malaga, Spain

CHRISTINA GIANNOULA, ETH Zürich, Switzerland and NTUA, Greece

GERALDO F. OLIVEIRA, ETH Zürich, Switzerland

ONUR MUTLU, ETH Zürich, Switzerland

Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM).

Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip.

This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM (*Processing-In-Memory benchmarks*), a benchmark suite of 16 workloads from different application domains (e.g., dense/sparse linear algebra, databases, data analytics, graph processing, neural networks, bioinformatics, image processing), which we identify as memory-bound. We evaluate the performance and scaling characteristics of PrIM benchmarks on the UPMEM PIM architecture, and compare their performance and energy consumption to their state-of-the-art CPU and GPU counterparts. Our extensive evaluation conducted on two real UPMEM-based PIM systems with 640 and 2,556 DPUs provides new insights about suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems.



https://arxiv.org/pdf/2105.03814.pdf

#### Samsung Function-in-Memory DRAM (2021)





[3D Chip Structure of HBM with FIMDRAM]

ISSCC 2021 / SESSION 25 / DRAM / 25.4

25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank-Level Parallelism, for Machine Learning Applications

Young-Cheon Kwon', Suk Han Lee', Jaehoon Lee', Sang-Hyuk Kwon', Je Min Ryu', Jong-Pil Son', Seongil O', Hak-Soo Yu', Haesuk Lee', Soo Young Kim', Youngmin Cho', Jin Guk Kim', Jongyoon Choi', Hyun-Sung Shin', Jin Kim', BengSeng Phuah', HyoungMin Kim', Myeong Jun Song', Ahn Choi', Daeho Kim', SooYoung Kim', Eun-Bong Kim', David Wang', Shinhaeng Kang', Yuhwan Ro<sup>3</sup>, Seungwoo Seo<sup>3</sup>, JoonHo Song<sup>3</sup>, Jaeyoun Youn', Kyomin Sohn', Nam Sung Kim'

<sup>1</sup>Samsung Electronics, Hwaseong, Korea <sup>2</sup>Samsung Electronics, San Jose, CA <sup>3</sup>Samsung Electronics, Suwon, Korea

#### Samsung AxDIMM (2021)



Ke et al. "Near-Memory Processing in Action: Accelerating Personalized Recommendation with AxDIMM", IEEE Micro (2021)

#### AliBaba PIM Recommendation System (2022)

DRAM Die Photo (36Gb)





Neural Engine Region


Figure 29.1.7: Die micrographs of DRAM die, NE and ME. Detailed specifications of DRAM die and logic die.

#### 29.1 184QPS/W 64Mb/mm<sup>2</sup> 3D Logic-to-DRAM Hybrid Bonding with Process-Near-Memory Engine for Recommendation System

Dimin Niu<sup>1</sup>, Shuangchen Li<sup>1</sup>, Yuhao Wang<sup>1</sup>, Wei Han<sup>1</sup>, Zhe Zhang<sup>2</sup>, Yijin Guan<sup>2</sup>, Tianchan Guan<sup>3</sup>, Fei Sun<sup>1</sup>, Fei Xue<sup>1</sup>, Lide Duan<sup>1</sup>, Yuanwei Fang<sup>1</sup>, Hongzhong Zheng<sup>1</sup>, Xiping Jiang<sup>4</sup>, Song Wang<sup>4</sup>, Fengguo Zuo<sup>4</sup>, Yubing Wang<sup>4</sup>, Bing Yu<sup>4</sup>, Qiwei Ren<sup>4</sup>, Yuan Xie<sup>1</sup>

#### Recall: Takeaways

- It all starts from the basic building blocks and design principles
  - And, knowledge of how to use, apply, enhance them
- Underlying technology might change (e.g., steel vs. wood)
  - but methods of taking advantage of technology bear resemblance
  - methods used for design depend on the principles employed

## Basic Building Blocks

- Electrons
- Transistors
- Logic Gates
- Combinational Logic Circuits
- Sequential Logic Circuits
  - Storage Elements and Memory
- ...
- Cores
- Caches
- Interconnect
- Memories
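To make the layering concrete, the sketch below builds upward from a single primitive: a NAND gate is modeled as a function, other gates are composed from it, a full adder is composed from those gates, and a multi-bit adder is composed from full adders. It is a software illustration of how combinational blocks stack, not a hardware description; all function names are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """The only primitive gate in this sketch; everything else is built from it."""
    return 1 - (a & b)


def not_(a: int) -> int:
    return nand(a, a)


def and_(a: int, b: int) -> int:
    return not_(nand(a, b))


def or_(a: int, b: int) -> int:
    return nand(not_(a), not_(b))


def xor(a: int, b: int) -> int:
    return and_(or_(a, b), nand(a, b))


def full_adder(a: int, b: int, carry_in: int):
    """One-bit adder: returns (sum bit, carry out)."""
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def ripple_carry_add(x_bits, y_bits):
    """Add two equal-length bit lists (least significant bit first)."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]


def to_bits(value: int, width: int):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits) -> int:
    return sum(bit << i for i, bit in enumerate(bits))


result = from_bits(ripple_carry_add(to_bits(5, 4), to_bits(6, 4)))
print(result)
```

The same pattern continues up the list: storage elements add state to these combinational blocks, and cores, caches, and memories are in turn assembled from both.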

## Reading Assignments for This Week

#### Chapter 1 in Harris & Harris







# Supplementary Lecture Slides on Binary Numbers

### Recall: High-Level Goals of This Course

- In Digital Design & Computer Architecture
- Understand the basics
- Understand the principles (of design)
- Understand the precedents
- Based on such understanding:
  - learn how a modern computer works underneath
  - evaluate tradeoffs of different designs and ideas
  - implement a principled design (a simple microprocessor)
  - learn to systematically debug increasingly complex systems
  - Hopefully enable you to develop novel, out-of-the-box designs
- The focus is on basics, principles, precedents, and how to use them to create/implement good designs

#### Recall: Why These Goals?

- Because you are here for a Computer Science degree
- Regardless of your future direction, learning the principles of digital design & computer architecture will be useful to
  - design better hardware
  - design better software
  - design better systems
  - make better tradeoffs in design
  - understand why computers behave the way they do
  - solve problems better
  - think "in parallel"
  - think critically
  - ...

## Course Info and Logistics

## If You Need Help

- Post your question on Moodle Q&A Forum
  - https://moodleapp2.let.ethz.ch/course/view.php?id=19395
  - We will create a forum on Moodle for each activity
  - Preferred for technical questions
- Write an e-mail to:
  - digitaltechnik@lists.inf.ethz.ch
  - The instructor and all assistants will receive this e-mail
- Come to office hours
  - We will provide office locations & Zoom links
  - **TBD**

## Where to Get Up-to-date Course Info?

- Website:
  - https://pooyanjamshidi.github.io/csce212/
  - Lecture slides and (videos)
  - Readings
  - Course schedule, handouts, FAQs
  - Software
  - Any other useful information for the course
  - Check frequently for announcements and due dates
  - This is your single point of access to all resources
- TA

## Reading Assignments for This Week

#### Chapter 1 in Harris & Harris

 Chapters 1-2 in Patt and Patel (encouraged)



### Reading Assignments for Next Week

- Combinational Logic chapters from both books
  - Harris and Harris, Chapter 2
  - Patt and Patel, Chapter 3
- Check the course website for all future readings
  - Required
  - Recommended
  - Mentioned

#### Future Lectures and Assignments

- You can also anticipate (and plan for) future lectures and assignments based on the Spring 2023 schedule:
  - https://pooyanjamshidi.github.io/csce212/lectures/

### Takeaways

- It is an exciting time to be understanding and designing computing architectures
- Many challenging and exciting problems in platform design
  - That no one has tackled (or thought about) before
  - That can have huge impact on the world's future
- Driven by huge hunger for data (Big Data), new applications (ML/AI, graph analytics, genomics), ever-greater realism, ...
  - We can easily collect more data than we can analyze/understand
- Driven by significant difficulties in keeping up with that hunger at the technology layer
  - Five walls: Energy, reliability, complexity, security, scalability

### Major High-Level Goals of This Course

In Computer Architecture

- Understand the basics
- Understand the principles (of design)
- Understand the precedents
- Based on such understanding:
  - learn how a modern computer works underneath
  - evaluate tradeoffs of different designs and ideas
  - implement a principled design (a simple microprocessor)
  - Hopefully enable you to develop novel, out-of-the-box designs
- The focus is on basics, principles, precedents, and how to use them to create/implement good designs. Tradeoffs are important!

#### Why These Goals?

- Because you are here for a Computer Science degree
- Regardless of your future direction, learning the principles of computer architecture will be useful to
  - design better systems (software + hardware)
  - make better tradeoffs in design
  - understand why computers behave the way they do
  - solve problems better
  - think "in parallel"
  - think critically
  - ...

### I presume you all know the number systems?

- Binary Numbers
- Hexadecimal Numbers
- Bits, Bytes, Words
- least significant bit (lsb), most significant bit (msb)
- Least Significant Byte (LSB), Most Significant Byte (MSB)
- KB, MB, GB, TB
- Binary Addition
- Signed Binary Numbers
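
As a quick refresher on the items listed above, here is a minimal Python sketch that touches each of them. The helper names (`to_binary`, `lsb`, `msb`, `split_bytes`, `add_binary`, `to_twos_complement`, `from_twos_complement`) are illustrative and do not come from the course materials. The sketch assumes an 8-bit width by default, the binary-prefix convention for KB (1 KB = 2^10 bytes), and two's complement for signed numbers; the slide itself does not say which signed representation is meant.

```python
def to_binary(value: int, width: int = 8) -> str:
    """Return the unsigned binary string of `value`, zero-padded to `width` bits."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} unsigned bits")
    return format(value, f"0{width}b")


def lsb(value: int) -> int:
    """Least significant bit: the bit with weight 2^0."""
    return value & 1


def msb(value: int, width: int = 8) -> int:
    """Most significant bit of a `width`-bit value: the bit with weight 2^(width-1)."""
    return (value >> (width - 1)) & 1


def split_bytes(value: int, num_bytes: int = 4) -> list[int]:
    """Split a word into bytes, ordered from Most Significant Byte to Least Significant Byte."""
    return list(value.to_bytes(num_bytes, "big"))


def add_binary(a: int, b: int, width: int = 8) -> tuple[int, int]:
    """Add two unsigned `width`-bit values; return (truncated sum, carry-out bit)."""
    total = a + b
    return total & ((1 << width) - 1), total >> width


def to_twos_complement(value: int, width: int = 8) -> int:
    """Encode a signed integer as a `width`-bit two's complement bit pattern (assumed encoding)."""
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError(f"{value} does not fit in {width} signed bits")
    return value & ((1 << width) - 1)


def from_twos_complement(bits: int, width: int = 8) -> int:
    """Decode a `width`-bit two's complement bit pattern back to a signed integer."""
    if bits & (1 << (width - 1)):      # msb set means the value is negative
        return bits - (1 << width)
    return bits


KB = 1 << 10  # assumption: binary prefixes, so 1 KB = 1024 bytes
MB = 1 << 20
GB = 1 << 30
TB = 1 << 40

if __name__ == "__main__":
    print(to_binary(13))                       # binary view of decimal 13
    print(hex(0b10101111))                     # one hex digit per 4 bits
    print(lsb(13), msb(13))                    # lowest and highest bit of an 8-bit value
    print([hex(b) for b in split_bytes(0x12345678)])
    print(add_binary(0b1011, 0b0110, width=4)) # 4-bit addition with carry-out
    print(to_binary(to_twos_complement(-5)))   # bit pattern of -5 in 8 bits
```

The carry-out returned by `add_binary` is what a fixed-width adder would drop, which is why the truncated sum and the carry are reported separately.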