### **Energy-Aware Computing Systems**

Energiebewusste Rechensysteme

V. Components and Subsystems

Timo Hönig

June 4, 2020



#### Preface: The Parts vs. The Whole

- "The Whole is Greater Than The Sum of Its Parts" (Aristotle)
  - synergy → working together
  - the purpose of individual parts (components) may be unrelated to the achieved whole (overall system)
- necessary preliminary work
  - construction of systems requires meaningful assembly of the individual parts
  - ...the sum of *parts* does not become a greater whole by accident...







Agenda

- Preface
- Terminology
- Operating Domains
  - Scopes and Frontiers
  - Monitoring and Control
- Components and Subsystems
  - Energy-Aware Processing Strategies
  - Data Processing and Computing (CPU)
  - Volatile Data (Uncore, Memory)
- Summary

©thoenig EASY (ST 2020, Lecture 5) Preface

### Abstract Concept: Components and Subsystems

- **components** and subsystems
  - component: constituent part or element
  - hardware components
  - $\hookrightarrow$  implementation of basic system functions
  - interactions between components implement subsystems...









  - overall systems are composed of subsystems
  - software subsystems
    - $\hookrightarrow$  hardware drivers and interaction  $\rightarrow$  logic
    - $\rightarrow$  local operation with a global scope
  - duty and high art of computing
    - drive functionalities of hardware components
    - $\hookrightarrow$  correctly
    - $\hookrightarrow$  efficiently (i.e., performance characteristics)
    - $\hookrightarrow$  with minimal effort (i.e., low energy demand)






### Monitoring and Control

- higher level monitoring
  - software tracks (global) system state
  - operating states of components (e.g., active, idle, standby, sleep)
- diversified control
  - components have varying characteristics  $\rightarrow$  different control mechanisms
  - subsystems that operate components are heterogeneous, and so are the energy-aware processing strategies
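The higher-level monitoring described above can be sketched as a small registry of component states. This is a minimal illustration, not the lecture's implementation: the class `SystemMonitor`, its methods, and the component names are hypothetical; only the four operating states come from the slide.

```python
from enum import Enum


class OpState(Enum):
    """Operating states named on the slide."""
    ACTIVE = "active"
    IDLE = "idle"
    STANDBY = "standby"
    SLEEP = "sleep"


class SystemMonitor:
    """Hypothetical software monitor that tracks the (global) system state."""

    def __init__(self):
        self._states = {}  # component name -> OpState

    def report(self, component, state):
        """Record the latest operating state reported for a component."""
        self._states[component] = state

    def state_of(self, component):
        """Return the last reported state of a component."""
        return self._states[component]

    def all_in(self, *states):
        """True if every tracked component is in one of the given states."""
        return all(s in states for s in self._states.values())


monitor = SystemMonitor()
monitor.report("cpu0", OpState.ACTIVE)   # component names are made up
monitor.report("dram", OpState.IDLE)
monitor.report("nic", OpState.SLEEP)
```

A control mechanism could query `all_in(OpState.IDLE, OpState.SLEEP)` before taking a global decision, while each component keeps its own, different control path.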

### Scopes and Frontiers

- considerations with regard to impact and scope
- local and global scope
  - fast path to deep sleep state (i.e., without query towards higher level abstractions)
  - may (unnecessarily) stall other components when functionality is needed (e.g., ramp-up delay)
- consider reordering of actions  $\rightarrow$  keep quality of service (e.g., performance) but reduce energy demand?
- runtime reordering (dynamic), programming reordering (static)






### **Energy-Aware Processing Strategies**

- all processing strategies depend on individual system components  $(\rightarrow \text{ hardware})$  and responsible **subsystems**  $(\rightarrow \text{ software})$
- 1. data processing and computing  $\rightarrow$  CPU
  - general purpose CPU cores as components
  - strategies to reduce energy demand under acceptance of moderate performance impacts
- 2. volatile data  $\rightarrow$  uncore, memory
  - uncore and memory as components
  - reduce energy demand of memory components under consideration of necessary performance (i.e., memory bandwidth)



### Data Processing and Computing

CPU

recap: **conflicting goals** for reducing the energy demand of computation-bound and memory-bound operations



naïve approach: run memory-bound threads at low clock speed and CPU-bound threads at high clock speed
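The naïve approach amounts to a fixed mapping from thread class to clock speed. A minimal sketch, assuming each thread already carries a static label and assuming a frequency range of 1.6 GHz to 2.4 GHz (the range of the quad-core processor used later in this lecture); the function name and labels are hypothetical:

```python
F_MIN_GHZ = 1.6  # assumed lowest clock speed
F_MAX_GHZ = 2.4  # assumed highest clock speed


def naive_frequency(thread_kind):
    """Pick a clock speed from a static thread label (naive policy)."""
    if thread_kind == "memory-bound":
        return F_MIN_GHZ  # the core mostly waits for memory anyway
    if thread_kind == "cpu-bound":
        return F_MAX_GHZ  # computation benefits from a high clock speed
    raise ValueError(f"unknown thread kind: {thread_kind!r}")
```

The sketch ignores that a thread's behaviour changes over time and that every frequency switch has a cost.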




- improved energy-aware processing strategies
  - 1. memory-aware scheduling (combining strategy)
  - 2. load/store and execute (sequencing strategy)
  - 3. thread assignment to heterogeneous cores (assigning strategy)


- considerations and problems of the naïve approach:
  - dynamic characteristics of workloads
  - simple system model (# cores, interlocked voltages, cache size)
  - input-dependent, variable size of working set
  - costs for frequency switching


### Memory-aware Scheduling (Combining)

CPU

- contention between cores as to resource demand (i.e., cache, memory)
- quad core processor (clock speed 1.6 GHz to 2.4 GHz)
- shared L2 cache by cores in pairs, memory shared by all cores



**Figure 1.** Normalized runtime of microbenchmarks running on the Core2 Quad [4, 5]

- aluadd: compute-bound
- stream{-fit2,-fit1}: memory-bound, varying size of working set




- penalty depends on contention ← process characteristics
- identification of memory-bound processes by number of memory transactions
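Identifying memory-bound processes from memory transactions can be sketched as a simple ratio test. The metric (transactions per retired instruction), the threshold value, and the function names are assumptions for illustration; a real system would read the counts from hardware performance counters.

```python
def memory_intensity(mem_transactions, instructions):
    """Memory transactions per retired instruction (assumed metric)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return mem_transactions / instructions


def is_memory_bound(mem_transactions, instructions, threshold=0.01):
    """Classify a process; the threshold is a hypothetical tuning knob."""
    return memory_intensity(mem_transactions, instructions) >= threshold
```

For example, a process with 50,000 memory transactions per 1,000,000 instructions has an intensity of 0.05 and would be classified as memory-bound under the assumed threshold.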




- proposed strategy: **combined scheduling** to **reduce contention**
- co-scheduling of compute-bound and memory-bound processes, based on the concept of gang scheduling [6]



Figure 4. Sorted scheduling. Bars correspond to memory intensity. [4, 5]

- scale to lowest frequency if no compute-bound processes are ready  $\rightarrow$  only memory-bound processes are ready
- scale to highest frequency if **at least one** compute-bound process is ready  $\rightarrow$  best results (i.e., lowest energy-delay product, EDP) [5]
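The frequency rule above fits in a few lines. A minimal sketch, assuming processes are already labelled and assuming the 1.6 GHz to 2.4 GHz range of the evaluated processor; the function names are hypothetical. The helper shows the energy-delay product (energy multiplied by runtime) used as the evaluation metric.

```python
F_MIN_GHZ = 1.6  # lowest clock speed of the evaluated processor
F_MAX_GHZ = 2.4  # highest clock speed of the evaluated processor


def select_frequency(ready_kinds, f_min=F_MIN_GHZ, f_max=F_MAX_GHZ):
    """Highest frequency if at least one compute-bound process is ready,
    lowest frequency otherwise (only memory-bound processes are ready)."""
    if any(kind == "compute-bound" for kind in ready_kinds):
        return f_max
    return f_min


def energy_delay_product(energy_joules, runtime_seconds):
    """EDP: lower is better, penalises both energy demand and runtime."""
    return energy_joules * runtime_seconds
```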


- group CPU cores into pairs
- run processes with complementary resource demands on each pair
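Pairing processes with complementary resource demands can be sketched by sorting on memory intensity and matching both ends of the sorted order. This is an illustration of the idea only, not the scheduler from [4, 5]; the function name, process names, and intensity values are hypothetical.

```python
def complementary_pairs(intensity):
    """Pair the least memory-intensive process with the most intensive one,
    the second least with the second most, and so on.

    intensity: dict mapping process name -> memory intensity.
    Returns (pairs, leftover); leftover is the unpaired middle process
    for an odd number of processes, else None.
    """
    ordered = sorted(intensity, key=intensity.get)
    pairs = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        pairs.append((ordered[lo], ordered[hi]))
        lo += 1
        hi -= 1
    leftover = ordered[lo] if lo == hi else None
    return pairs, leftover


# Hypothetical processes and intensities, for illustration only.
pairs, leftover = complementary_pairs({"A": 0.001, "B": 0.08, "C": 0.01, "D": 0.05})
# pairs == [("A", "B"), ("C", "D")], leftover is None
```

Each pair would then run on two cores that share an L2 cache, so that a compute-bound process rarely competes with its memory-bound partner.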




- limitations and considerations
  - interference with the scheduling strategy  $\rightarrow$  risk of priority inversion
  - scheduling policy only effective for specific sizes of working set
  - memory hierarchy and cache sizes must be considered

### Load/Store and Execute (Sequencing)

CPU

- proposed strategy: **sequenced execution** to extend phases of homogeneous operations
- fundamental idea based on a computer architecture that provides performance improvements with a decrease in complexity: Decoupled Access/Execute Computer Architectures (Smith 1982, [7])

create two streams for operations of the same kind



- decoupled execution: access phase (load/store), then execute phase (compute)
- access phase
  - prefetch data into caches, write intermediate results to memory
  - run with low clock speed
- execute phase
  - execute operations on data in hot caches (i.e., computations)
  - run with high clock speed
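The alternation of the two phases can be sketched as a chunked loop. This is a simplified illustration under several assumptions: `set_frequency` is a hypothetical callback (no real DVFS interface is used), copying a chunk stands in for prefetching it into the cache, and the chunk size would have to match the cache size in practice.

```python
def run_decoupled(data, chunk_size, compute, set_frequency):
    """Process data chunk by chunk in an access phase and an execute phase."""
    results = []
    for start in range(0, len(data), chunk_size):
        # Access phase: bring the working set close to the core (low clock speed).
        set_frequency("low")
        chunk = list(data[start:start + chunk_size])  # stand-in for prefetching

        # Execute phase: compute on data that is now "hot" (high clock speed).
        set_frequency("high")
        results.extend(compute(x) for x in chunk)
    return results


log = []  # records the requested frequency levels instead of changing hardware
squares = run_decoupled(list(range(5)), 2, lambda x: x * x, log.append)
# squares == [0, 1, 4, 9, 16]; log alternates "low", "high" once per chunk
```

Longer phases mean fewer frequency switches, which is why the strategy aims to extend phases of homogeneous operations.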


- gains and benefits (cf. [2])
  - reduce voltage and frequency thrashing
  - eliminate unnecessary CPU stalling and memory wait cycles
- limitations and considerations
  - compiler support  $\rightarrow$  open target system and components
  - synchronization efforts (i.e., branches)





### Thread Assignment to Heterogeneous Cores

CPU


- proposed strategy: assigning homogeneous operations to heterogeneous cores
- exploit characteristics at the hardware level (i.e., heterogeneous cores)
- application of previously proposed strategies (i.e., combining, sequencing) depends on
  - last level cache
  - memory interconnect
  - ...
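One possible assignment policy is sketched below. The slide does not state which class of thread goes to which core type, so the mapping used here (compute-bound threads to high-performance cores, all other threads to energy-efficient cores) is an assumption, as are the function name and the core and thread names.

```python
from itertools import cycle


def assign_threads(threads, big_cores, little_cores):
    """Map each thread to a core type by its class (assumed policy).

    threads: dict mapping thread name -> "compute-bound" or "memory-bound".
    Cores of each type are handed out round-robin.
    """
    if not big_cores or not little_cores:
        raise ValueError("need at least one core of each type")
    big, little = cycle(big_cores), cycle(little_cores)
    return {
        name: next(big) if kind == "compute-bound" else next(little)
        for name, kind in threads.items()
    }


placement = assign_threads(
    {"t0": "compute-bound", "t1": "memory-bound", "t2": "compute-bound"},
    big_cores=["big0", "big1"],
    little_cores=["little0"],
)
# placement == {"t0": "big0", "t1": "little0", "t2": "big1"}
```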




#### Volatile Data

Uncore, Memory

- CPU centric approaches (i.e., DVFS with general purpose CPU cores) influence only parts of a system's performance and energy demand
- fine-grained energy demand processing strategies must consider additional components
  - uncore (caches, memory and I/O controllers)
  - memory
  - (external) peripherals





Figure 1. Area breakdown of the OpenSPARC T2 SoC. [3]







### Volatile Data: Caches, Memory and I/O Controllers

Figure: die layouts of two Intel processors (cores, shared cache, queue, uncore, I/O, memory controller)

8-Core Intel® Core™ i7-5960X Processor Extreme Edition: transistor count 2.6 billion, die size 354 mm²

18-Core Intel® Xeon™ E5-2696 v3 Processor: transistor count 5.96 billion, die size 662 mm²

- until Sandy Bridge: linked core and uncore voltages and frequencies
- since Haswell: individual core and uncore voltages and frequencies




#### Considerations and Caveats

- subsystems control hardware at component level
  - implementation of complex software mechanisms
  - influence on multiple components  $\rightarrow$  multiple dimensions
- cross-component interferences
  - processor cores vs. uncore components vs. memory
  - ...plus external data paths (I/O, network)
- impact of strategies
  - overhead of energy-aware processing strategies
  - → state monitoring
  - → control algorithms
- upcoming challenges
  - non-volatile memory
  - power capping at component-level

### Volatile Data: Memory

- significant power demand of memory
- DDR memory can operate at multiple frequencies
- explore dynamic voltage and frequency scaling for memory
- apply classic DVFS approach
  - lower frequency directly reduces switching power
  - lower frequencies allow lower voltages
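Both effects of the classic DVFS approach follow from the standard first-order model of dynamic (switching) power in CMOS logic. The symbols are not given on the slide and are introduced here as assumptions: $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency.

$$P_{\text{dyn}} = \alpha \, C \, V^{2} f$$

Lowering $f$ alone reduces $P_{\text{dyn}}$ linearly. If the lower frequency also permits a lower $V$, the voltage term contributes quadratically. Static (leakage) power is not captured by this model.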





Figure 5: Memory latency as a function of channel bandwidth demand. [1]

### Subject Matter

- hardware components must be controlled by software subsystems
- achieve low energy demand of the overall system without sacrificing performance (too much)
- composition of components and subsystems determines the benefit of the overall approach  $\rightarrow$  "greater whole"
- reading list for Lecture 6:
  - Yuvraj Agarwal et al.: **Occupancy-Driven Energy Management for Smart Building Automation**. Proceedings of the ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys), 2010.



#### Reference List

- [1] DAVID, H.; FALLIN, C.; GORBATOV, E.; HANEBUTTE, U. R.; MUTLU, O.: Memory Power Management via Dynamic Voltage/Frequency Scaling. In: Proceedings of the 8th ACM International Conference on Autonomic Computing (ICAC'11), 2011, pp. 31–40
- [2] KOUKOS, K.; BLACK-SCHAFFER, D.; SPILIOPOULOS, V.; KAXIRAS, S.: Towards More Efficient Execution: A Decoupled Access-execute Approach. In: Proceedings of the 27th ACM International Conference on Supercomputing (ICS'13), 2013, pp. 253–262
- [3] LI, Y.; MUTLU, O.; GARDNER, D. S.; MITRA, S.: Concurrent Autonomous Self-test for Uncore Components in System-on-Chips. In: Proceedings of the 28th VLSI Test Symposium (VTS'10), IEEE, 2010, pp. 232–237
- [4] MERKEL, A.; BELLOSA, F.: Memory-aware Scheduling for Energy Efficiency on Multicore Processors. In: Proceedings of the Workshop on Power Aware Computing and Systems (HotPower'08), 2008, pp. 123–130
- [5] MERKEL, A.; STOESS, J.; BELLOSA, F.: Resource-conscious Scheduling for Energy Efficiency on Multicore Processors. In: Proceedings of the 2010 ACM SIGOPS European Conference on Computer Systems (EuroSys'10), 2010, pp. 153–166
- [6] OUSTERHOUT, J. K. et al.: Scheduling Techniques for Concurrent Systems. In: Proceedings of the 1982 International Conference on Distributed Computing Systems (ICDCS'82), vol. 82, 1982, pp. 22–30
- [7] SMITH, J. E.: Decoupled Access/Execute Computer Architectures. In: Proceedings of the 9th Annual Symposium on Computer Architecture (ISCA'82), 1982, pp. 112–119
- [8] WEISSEL, A.; BELLOSA, F.: Process Cruise Control: Event-Driven Clock Scaling for Dynamic Power Management. In: Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES'02), ACM, 2002, pp. 238–246

