# Unified Memory Protection with Multi-granular MAC and Integrity Tree for Heterogeneous Processors

**Sunho Lee**<sup>1</sup>, Seonjin Na<sup>2</sup>, Jeongwon Choi<sup>1</sup>, Jinwon Pyo<sup>1</sup>, Jaehyuk Huh<sup>1</sup>

KAIST<sup>1</sup>, Georgia Tech<sup>2</sup>



**CASYS** Computer Architecture and Systems Lab



- Heterogeneous processor: SoC with CPU, GPU, NPU
- Data <u>confidentiality</u> & <u>integrity</u> are essential



- Cold Boot Attack [1]
- Rowhammer Attack [2]
- DMA Attack [3]
- Replay Attack [4]

- [1] Lest We Remember: Cold-Boot Attacks on Encryption Keys (USENIX Security 2008)
- [2] RAMBleed: Reading Bits in Memory Without Accessing Them (S&P 2020)
- [3] Direct Memory Attack the Kernel (DEFCON 2016; PCILeech)
- [4] Handbook of Applied Cryptography (Menezes, Alfred J. et al.)

System-on-a-Chip (SoC) based Heterogeneous Processor

#### Memory protection is necessary for heterogeneous processors



- Existing memory protections are tailored to specific, individual access patterns
  - Common Counters [1] → GPU medium-grained pattern
  - Studies of S/W-based counters → NPU S/W-detected pattern
- Heterogeneous processor → diverse access pattern
  - A **unified** memory protection for all access patterns
  - Limitation of prior studies: Bypassing integrity tree optimization



This study constructs a <u>unified</u> memory protection scheme with <u>integrity tree optimization</u> for heterogeneous processors

- Critical factors of memory protection
  - Amount of counters and MACs: Granularity
  - Overhead of recursive validation: Height of integrity tree
- 34% delay with a 29% increase in data traffic

Significant overhead caused by the conventional 64B-granular protection with a full integrity tree



- Major access chunks (consecutive access blocks)
  - Fine-grained (64B): CPU
  - Medium-grained (512B, 4KB): GPU
  - Coarse-grained (32KB): NPU



Matching security granularity to access granularity
→ Requirement: <u>Multi-granularity</u> for MACs and counters
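As a rough illustration of why matching granularity matters, the number of MACs (or counters) covering a region shrinks in proportion to the protection granularity. The sketch below assumes exactly one metadata entry per protected chunk and a hypothetical 1 MiB region; neither figure comes from the slides.

```python
# Illustrative sketch: assumes exactly one MAC (or counter) per protected chunk.
GRANULARITIES = {
    "fine, 64B (CPU)": 64,
    "medium, 512B (GPU)": 512,
    "medium, 4KB (GPU)": 4 * 1024,
    "coarse, 32KB (NPU)": 32 * 1024,
}


def metadata_entries(region_bytes: int, granularity_bytes: int) -> int:
    """Number of chunks, and so of MACs or counters, needed to cover a region."""
    return -(-region_bytes // granularity_bytes)  # ceiling division


if __name__ == "__main__":
    region = 1 << 20  # hypothetical 1 MiB region
    for name, size in GRANULARITIES.items():
        print(f"{name}: {metadata_entries(region, size)} entries")
```

Under these assumptions, moving a region from 64B to 32KB protection cuts its metadata entries by a factor of 512.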



# Multi-granular MAC

[Figure: conventional fine-granular MACs and data]

Multi-granular MAC $\rightarrow$ Only managing the security granularity

What about multi-granular <u>counters</u>?

[1] Adaptive Security Support for Heterogeneous Memory on GPUs (HPCA 2022)

Because of the **integrity tree** overhead, prior studies bypass the integrity tree under specific conditions.

#### Common Counters [1]

#### Multi-granular counter integrity tree is necessary

- [1] Common Counters: Compressed Encryption Counters for Secure GPU Memory (HPCA 2021)
- [2] TNPU: Supporting Trusted Execution with Tree-less Integrity Protection for Neural Processing Unit (HPCA 2022)
- [3] MGX: Near-zero Overhead Memory Protection for Data-intensive Accelerators (ISCA 2022)
- [4] GuardNN: Secure Accelerator Architecture for Privacy-preserving Deep Learning (DAC 2022)
- [5] SoftVN: Efficient Memory Protection via Software-provided Version Numbers (ISCA 2022)







- Multi-granular tree
  - Counters w/ varying granularities are mapped to different levels
  - Fetches fewer counters
  - Shortens recursive validation path
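The effect of a multi-granular tree on the validation path can be sketched with a small calculation. The model below is a simplification under stated assumptions: one leaf counter per protected chunk, a uniform arity of 8, and a hypothetical 4 GiB memory. The function name `validation_path` is illustrative and does not come from the slides.

```python
BLOCK = 64  # bytes, the conventional fine granularity


def validation_path(memory_bytes: int, granularity_bytes: int = BLOCK, arity: int = 8) -> int:
    """Tree levels walked from a leaf counter up to a single root node."""
    nodes = -(-memory_bytes // granularity_bytes)  # one leaf counter per chunk (assumption)
    levels = 0
    while nodes > 1:
        nodes = -(-nodes // arity)  # each parent covers `arity` children
        levels += 1
    return levels


if __name__ == "__main__":
    memory = 4 << 30  # hypothetical 4 GiB of protected memory
    for granularity in (64, 512, 4 * 1024, 32 * 1024):
        print(granularity, validation_path(memory, granularity))
```

In this model each 8x step in granularity removes one level from the recursive validation path, which is one way to read "fetches fewer counters" and "shortens recursive validation path".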





- Multi-granular MAC&Tree
  - Dynamically supports multi-granular MACs and a counter tree
  - Key Idea: Merging MACs/counters & pruning a counter tree



**1.** How to dynamically detect granularity

**2.** How to switch granularity



- Access tracker
  - Records accessed addresses
  - Consecutive access bits are set
- Granularity detection engine
  - Computes a new granularity
  - Updates granularity table
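The tracker and detection engine above can be sketched in software as follows. This is a minimal model, not the hardware design: it assumes one access bit per 64B block within a 32KB region and a simple policy that picks the largest aligned chunk whose blocks have all been accessed. The class and function names are hypothetical.

```python
BLOCK = 64
GRANULARITIES = (32 * 1024, 4 * 1024, 512, 64)  # bytes, coarse to fine
REGION = GRANULARITIES[0]


class AccessTracker:
    """Hypothetical tracker: one access bit per 64B block of a 32KB region."""

    def __init__(self) -> None:
        self.bits = {}  # region base address -> integer bitmap

    def record(self, addr: int) -> None:
        base = addr - addr % REGION
        block = (addr % REGION) // BLOCK
        self.bits[base] = self.bits.get(base, 0) | (1 << block)

    def bitmap(self, addr: int) -> int:
        return self.bits.get(addr - addr % REGION, 0)


def detect_granularity(bitmap: int, addr: int) -> int:
    """Largest aligned chunk around addr whose access bits are all set (assumed policy)."""
    block = (addr % REGION) // BLOCK
    for granularity in GRANULARITIES:
        blocks = granularity // BLOCK
        first = block - block % blocks
        mask = ((1 << blocks) - 1) << first
        if (bitmap & mask) == mask:
            return granularity
    return BLOCK


def update_granularity_table(table: dict, tracker: AccessTracker, addr: int) -> int:
    """Store the detected granularity, keyed by the chunk's base address."""
    granularity = detect_granularity(tracker.bitmap(addr), addr)
    table[addr - addr % granularity] = granularity
    return granularity
```

For example, after recording accesses to all eight 64B blocks of an aligned 512B chunk, `detect_granularity` reports 512 for any address inside that chunk.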







# Granularity Switching (Fine $\rightarrow$ Coarse)

- Granularity switching engine
  - Loads additional data
    - $\rightarrow$  Old counters, MACs, data blocks
  - Computes counters, MACs
  - Re-encrypts old data
  - Updates & prunes integrity tree

#### Granularity switching requires significant overhead $\rightarrow$ Lazy switching



- ChampSim (CPU) + MGPUSim (GPU) + mNPUsim (NPU)
- Configuration: Similar to NVIDIA Orin
  - ARM Cortex CPU + Ampere GPU + 2 x NVDLA with LPDDR4

|                    | CPU (Jetson AGX<br>Orin ARM Cortex)  | GPU (Jetson AGX<br>Orin Ampere) | NPU (NVDLA)                           |
|--------------------|--------------------------------------|---------------------------------|---------------------------------------|
| Compute<br>Engine  | 8-core                               | 14 SMs                          | 45 x 45<br>Systolic Array             |
| On-chip<br>Storage | Cache<br>(L1: 64KB, L2: 2MB)         | Cache<br>(L1: 192 KB, L2: 4MB)  | Scratchpad Memory<br>(2.2MB in total) |
| Frequency          | 2.2GHz                               | 1GHz                            | 1GHz                                  |
| Memory<br>System   | 2.4GHz, 17GB/s, LPDDR4 Memory System |                                 |                                       |

- Workloads & Scenarios
  - 14 workloads, 250 scenarios (all combinations)
  - Access pattern: ff, f, c, cc (Fine to Coarse) | Diverse (d)
  - Traffic per cycle: Small (s), Medium (m), Large (l)

|     | Workloads (access pattern-traffic per cycle)                |
|-----|-------------------------------------------------------------|
| CPU | bw (ff-s), gcc (ff-s), mcf (ff-m), xal (f-m), ray (ff-s)    |
| GPU | syr2k (ff-m), pr (f-m), sten (c-l), mm (cc-m), floyd (d-s)  |
| NPU | ncf (c-s), dlrm (c-s), sfrnn (c-l), alex (cc-m)             |

- **14%** improvement with **11%** data reduction
- Combining prior subtree optimization [1-4]
  - Performance improvement:  $\underline{14\%} \rightarrow \underline{21\%}$
  - CTR-only (<u>7%</u>), +MAC (<u>7%</u>), +Prior tree optimization (<u>7%</u>)



- [1] Bonsai Merkle Forests: Efficiently Achieving Crash Consistency in Secure Persistent Memory (MICRO 2021)
- [2] Scalable Memory Protection in the PENGLAI Enclave (OSDI 2021)
- [3] Efficient Distributed Secure Memory with Migratable Merkle Tree (HPCA 2023)
- [4] Data Enclave: A Data-Centric Trusted Execution Environment (HPCA 2024)

## **Evaluation Result (Per Processing Unit)**

- Performance improvement of each processing unit
  - CPU (<u>24%</u>), GPU (<u>23%</u>), NPU (<u>10%</u>)



#### **Our Unified Memory Protection Scheme**

- Unified memory protection for heterogeneous processor
  - Multi-granular MAC & Integrity Tree
  - Challenge: Diverse access pattern
- Improvement: <u>14%</u> (w/o subtree opt.), <u>21%</u> (w/ subtree opt.)

## Thank you

#### **Backup Slide**

- <u>Scale-up</u> (Fine $\rightarrow$ Coarse)
- <u>Scale-down</u> (Coarse $\rightarrow$ Fine)
  - 1) Detect scale-down

Granularity switching requires significant overhead!  $\rightarrow$  Lazy switching

#### Lazy Switching Overhead by MAC


$$\text{Coarse MAC} = \text{HASH}(\text{Fine MACs})$$

- <u>Scale-up</u>













- **<u>97.2%</u>** of reqs $\rightarrow$ Hidden by **<u>lazy switching & R/O</u>**
  - Only <u>2.8%</u> of reqs incur moderate overhead (load data chunks)



Lazy switching considerably reduces switching overhead!!

Granul. detection $\rightarrow$ Store next granul. $\rightarrow$ Granul. switch after next access













- 91.2% of reqs $\rightarrow$ Hidden by lazy switching
  - Only **<u>8.2%</u>** of reqs incur low overhead (read req $\rightarrow$ write req)
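The lazy-switching flow (detect a granularity, store it as the next granularity, switch on the next access) can be sketched as a small state machine. This is a minimal sketch under assumptions: the `Entry` and `LazyGranularityTable` names are hypothetical, the default granularity is taken to be 64B, and the model tracks only when a switch fires, not its cost.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    current: int = 64              # granularity in use (bytes); 64B default is an assumption
    pending: Optional[int] = None  # detected granularity waiting to be applied


class LazyGranularityTable:
    """Hypothetical table that defers a switch until the region is next accessed."""

    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def store_next(self, region: int, granularity: int) -> None:
        """Called by detection: remember the new granularity, do not switch yet."""
        entry = self.entries.setdefault(region, Entry())
        entry.pending = granularity if granularity != entry.current else None

    def on_access(self, region: int) -> bool:
        """Called on a memory access: apply a pending switch, report whether one fired."""
        entry = self.entries.setdefault(region, Entry())
        if entry.pending is None:
            return False
        entry.current, entry.pending = entry.pending, None
        return True
```

Deferring the switch to the next access means the switching work coincides with a request that touches the region anyway, which is the intuition behind hiding the overhead.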



- Proper granularity → Reduce security metadata
- Wrong granularity → Data load penalty

Granularity-managed MAC & tree enables efficient memory protection

### **Prior Domain-specific Memory Protections**

- No prior study uses integrity tree pruning or multi-granular MAC&counter
  1. Dual-granular & GPU-optimized Counter [1]: dual CTRs, limited CTRs, CTR-only, device-specific

- [1] Common Counters: Compressed Encryption Counters for Secure GPU Memory (HPCA 2021)
- [2] Adaptive Security Support for Heterogeneous Memory on GPUs (HPCA 2022)
- [3] TNPU: Supporting Trusted Execution with Tree-less Integrity Protection for Neural Processing Unit (HPCA 2022)
- [4] MGX: Near-zero Overhead Memory Protection for Data-intensive Accelerators (ISCA 2022)
- [5] GuardNN: Secure Accelerator Architecture for Privacy-preserving Deep Learning (DAC 2022)
- [6] SoftVN: Efficient Memory Protection via Software-provided Version Numbers (ISCA 2022)

[Figure: CTR-mode encryption; secure on-chip side, unsecure off-chip side holding ciphertext]

- CTR-mode encryption: confidentiality
- MAC authentication: value-integrity
- Freshness validation: replay-attack
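The three mechanisms above can be illustrated together in a short sketch. It is not the hardware scheme: HMAC-SHA256 from the Python standard library stands in for the AES block cipher of real CTR mode, the 8-byte MAC length is an assumption, and the counter is passed in as a trusted value (in the real design its freshness is what the integrity tree guarantees). All function names are hypothetical.

```python
import hashlib
import hmac


def _pad(key: bytes, addr: int, counter: int, length: int) -> bytes:
    """Keystream from (address, counter); HMAC-SHA256 stands in for AES here."""
    out = b""
    i = 0
    while len(out) < length:
        seed = addr.to_bytes(8, "little") + counter.to_bytes(8, "little") + i.to_bytes(4, "little")
        out += hmac.new(key, seed, hashlib.sha256).digest()
        i += 1
    return out[:length]


def _mac(key: bytes, addr: int, counter: int, ciphertext: bytes) -> bytes:
    """MAC over ciphertext, address, and counter (truncated to 8 bytes, an assumption)."""
    msg = ciphertext + addr.to_bytes(8, "little") + counter.to_bytes(8, "little")
    return hmac.new(key, msg, hashlib.sha256).digest()[:8]


def write_block(enc_key: bytes, mac_key: bytes, addr: int, counter: int, plaintext: bytes):
    """Encrypt for confidentiality, then tag for value integrity."""
    pad = _pad(enc_key, addr, counter, len(plaintext))
    ciphertext = bytes(p ^ k for p, k in zip(plaintext, pad))
    return ciphertext, _mac(mac_key, addr, counter, ciphertext)


def read_block(enc_key: bytes, mac_key: bytes, addr: int, trusted_counter: int,
               ciphertext: bytes, tag: bytes) -> bytes:
    """Verify against the trusted (fresh) counter, then decrypt."""
    if not hmac.compare_digest(tag, _mac(mac_key, addr, trusted_counter, ciphertext)):
        raise ValueError("MAC mismatch: data was modified or replayed")
    pad = _pad(enc_key, addr, trusted_counter, len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, pad))
```

A replayed block carries a MAC computed under an older counter, so verification against the current trusted counter fails; that is the role freshness validation plays against replay attacks.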



#### **Prior Subtree Optimization**

- Prior hotness-based integrity tree optimization scheme (Subtree optimization) [1-4]
  - Caching highly used roots of subtrees
  - Pruning unused nodes

#### Multi-granular MAC&Tree further improves prior solutions!!

- [1] Bonsai Merkle Forests: Efficiently Achieving Crash Consistency in Secure Persistent Memory (MICRO 2021)
- [2] Scalable Memory Protection in the PENGLAI Enclave (OSDI 2021)
- [3] Efficient Distributed Secure Memory with Migratable Merkle Tree (HPCA 2023)
- [4] Data Enclave: A Data-Centric Trusted Execution Environment (HPCA 2024)

1. Granularity Switching
2. Granularity Detection
3. Multi-granularity Memory Protection

[Figure: system overview. Labels: MAC/CTR Merging & CTR Tree Pruning; Dynamic Access Tracking; Granularity-aware Protection; Access Tracker (Addr, Access Bits); Granul. Detection Engine; Granularity Table; CTR/MAC Addr. Compute Engine; CTR Tree, MAC, Data; Coarse-grained CTR Tree, MAC, Data]

### **Recent Memory Protection Studies**

| Study              | Target          | Multi<br>CTR | Int. Tree<br>Opt. | Multi<br>MAC | Dynamic<br>Update | Target<br>App. |
|--------------------|-----------------|--------------|-------------------|--------------|-------------------|----------------|
| SoftVN             | CPU             | O            | X                 | X            | X                 | ML-specific    |
| Common<br>Counters | GPU             | Dual         | X                 | X            | X                 | General        |
| Adaptive           | GPU             | X            | X                 | Dual         | O                 | General        |
| TNPU               | NPU             | O            | X                 | X            | X                 | ML-specific    |
| Tunable<br>Tree    | NPU             | O            | Sub<br>Optimal    | X            | X                 | General        |
| MGX                | NPU             | O            | X                 | O            | X                 | ML-specific    |
| GuardNN            | NPU             | O            | X                 | X            | X                 | ML-specific    |
| TensorTEE          | CPU+NPU         | O            | X                 | O            | O                 | ML-specific    |
| Ours               | CPU+GPU<br>+NPU | O            | Optimal           | O            | O                 | General        |

## **Prior Integrity Tree Optimization**

| Study                     | Target          | Multi<br>CTR | Int. Tree<br>Opt. | Multi<br>MAC | Dynamic<br>Update | Target<br>App. |
|---------------------------|-----------------|--------------|-------------------|--------------|-------------------|----------------|
| Bonsai Merkle<br>Forests  | CPU             | X            | Sub<br>Optimal    | X            | X                 | General        |
| PENGLAI                   | GPU             | X            | Sub<br>Optimal    | X            | X                 | General        |
| Migratable<br>Merkle Tree | GPU             | X            | Sub<br>Optimal    | X            | X                 | General        |
| Data Enclave              | NPU             | X            | Sub<br>Optimal    | X            | X                 | General        |
| Ours                      | CPU+GPU<br>+NPU | O            | Optimal           | O            | O                 | General        |

- Coarse-MAC & counter encryption, integrity validation
  - $CTR_{coarse(1 \cdots k)} = MAX(CTR_{fine 1}, CTR_{fine 2}, \cdots, CTR_{fine k})$
  - $MAC_{coarse(1 \cdots k)} = HASH(\cdots HASH(HASH(MAC_{fine 1}), MAC_{fine 2}), \cdots, MAC_{fine k})$



- Chunk-level index computation
- Recursive parent call from leaf counters
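A minimal sketch of these two steps, assuming a conventional fixed-arity integrity tree in which a node's parent index is its own index divided by the arity. The chunk size, arity, and function names below are hypothetical parameters for illustration, not values taken from the slides.

```python
def chunk_index(addr, chunk_bytes):
    """Index of the chunk that contains byte address `addr`."""
    return addr // chunk_bytes


def parent_path(leaf_index, arity, levels):
    """Recursively collect the parent index at each level above a leaf counter."""
    if levels == 0:
        return []
    parent = leaf_index // arity
    return [parent] + parent_path(parent, arity, levels - 1)


# Example with assumed parameters: 32 KiB chunks, 8-ary tree, 3 levels above the leaves.
idx = chunk_index(0x8000, 32 * 1024)
path = parent_path(100, arity=8, levels=3)
```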

(# of Parents) = sqrt{Arity}(Granularity)


### **Workload Analysis & Selected Scenarios**


- Workloads & Scenarios
  - 14 workloads, 250 scenarios (all combinations)
  - Access pattern: fine (ff, f) to coarse (c, cc), or diverse (d)
  - Traffic per cycle: small (s), medium (m), large (l)

| Device | Workloads (access pattern-traffic per cycle)                |
|--------|-------------------------------------------------------------|
| CPU    | bw (ff-s), gcc (ff-s), mcf (ff-m), xal (f-m), ray (ff-s)    |
| GPU    | syr2k (ff-m), pr (f-m), sten (c-l), mm (cc-m), floyd (d-s)  |
| NPU    | ncf (c-s), dlrm (c-s), sfrnn (c-l), alex (cc-m)             |

| ID | Selected scenarios (CPU, GPU, NPU1, NPU2)                                   |
|----|-----------------------------------------------------------------------------|
| ff | (bw, syr2k, ncf, dlrm), (mcf, syr2k, sfrnn, dlrm), (gcc, floyd, sfrnn, ncf) |
| f  | (xal, pr, sfrnn, ncf), (xal, pr, ncf, ncf)                                  |
| c  | (gcc, sten, alex, dlrm), (bw, sten, ncf, ncf), (mcf, sten, sfrnn, sfrnn)    |
| cc | (xal, mm, alex, dlrm), (ray, mm, alex, alex), (ray, floyd, alex, alex)      |
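The count of 250 scenarios can be reproduced from the workload lists. The sketch below assumes each scenario pairs one CPU workload, one GPU workload, and an unordered pair of NPU workloads in which the same workload may appear twice; that reading is an assumption, chosen because it matches the stated total.

```python
from itertools import combinations_with_replacement, product

CPU = ["bw", "gcc", "mcf", "xal", "ray"]
GPU = ["syr2k", "pr", "sten", "mm", "floyd"]
NPU = ["ncf", "dlrm", "sfrnn", "alex"]

# Unordered NPU pairs with repetition allowed (assumption): 10 pairs.
npu_pairs = list(combinations_with_replacement(NPU, 2))

# One scenario = (CPU, GPU, NPU1, NPU2).
scenarios = [(c, g, n1, n2) for c, g, (n1, n2) in product(CPU, GPU, npu_pairs)]

print(len(CPU) + len(GPU) + len(NPU))  # 14 workloads
print(len(scenarios))                  # 5 * 5 * 10 = 250 scenarios
```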

### **Rowhammer Attacks**



[1] Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors (ISCA 2014)

[2] RAMBleed: Reading Bits in Memory Without Accessing Them (S&P 2020)

[3] DeepHammer: Depleting the Intelligence of Deep Neural Networks through Targeted Chain of Bit Flips (USENIX Security 2020)

### More Design Descriptions in Our Paper

- Lazy-switching analysis
- Cacheline fragmentation issue
- CTR/MAC addressing for multi-granularity
- Coarse-grained memory protection engine using parallel counter sharing and nested MAC hashing
- Misprediction handler
- Efficient granularity representation
- Hardware overhead
- Comparison to prior subtree optimization schemes

### More Results in Our Paper

- The ratio of stream chunks
- Performance analysis of selected scenarios
- End-to-end performance
- Drawbacks of the per-device (static) granularity
- Performance comparison with dual-granularity
- Switching overhead measurement
- Security cache hit ratio improvement
- Hardware overhead

### **Research Objective**

Construct a **general** and efficient memory protection scheme for heterogeneous processors

- Challenge 1: Heterogeneous processors have **diverse access patterns**
- Challenge 2: Each prior protection scheme works **only for a specific access pattern**
  - For example, the GPU coarse-grained pattern and the NPU software-detected pattern
- → We unify prior studies with our novel multi-granular tree

[Figure: "Conventional Memory Protection: High Overhead" and "Our Unified Memory Protection Scheme" on a System-on-a-Chip (SoC) based heterogeneous processor]

### **Multi-granular MAC and Counter**

- Multi-granular MAC and counter
  - Fetches only a small number of MACs and counters for coarse-grained accesses
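As a rough illustration of why coarser granularity needs fewer metadata fetches, the sketch below counts how many protected blocks (and hence MAC/counter entries) a single access touches at a given granularity. The 512B granularity and the helper name `blocks_touched` are assumptions chosen for the example; only the 64B baseline comes from the slides.

```python
def blocks_touched(addr, size, granularity):
    """Number of protected blocks covered by the byte range [addr, addr + size)."""
    first = addr // granularity
    last = (addr + size - 1) // granularity
    return last - first + 1


# A 512B coarse-grained access under two protection granularities.
fine = blocks_touched(addr=0, size=512, granularity=64)     # one MAC/counter per 64B block
coarse = blocks_touched(addr=0, size=512, granularity=512)  # one MAC/counter for the whole range
```

Under these assumptions the same access needs eight fine-grained entries but a single coarse-grained one.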



Multi-granularity can reduce memory protection overhead. However, how do we maintain a counter integrity tree?

#### **Counter-mode Protection**
















#### **Conventional 64B-granular Protection**







#### **Prior Domain-specific Memory Protection**

1. Common Counters [1]
2. Dual-MAC [2]
3. Software-managed Granularity [3-4]

Dual CTRs, limited CTRs, CTR-only, device-specific

[Figure: normalized data traffic]

- [1] Common Counters: Compressed Encryption Counters for Secure GPU Memory (HPCA 2021)
- [2] Adaptive Security Support for Heterogeneous Memory on GPUs (HPCA 2022)
- [3] MGX: Near-zero Overhead Memory Protection for Data-intensive Accelerators (ISCA 2022)
- [4] GuardNN: Secure Accelerator Architecture for Privacy-preserving Deep Learning (DAC 2022)