

### Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory

Qing Wang, Youyou Lu, Jiwu Shu

Tsinghua University



#### Problem: low memory utilization in datacenters

< 65% in Google, Alibaba, and Snowflake<sup>[1,2,3]</sup>

#### Root Cause: imbalanced memory usages across servers

- Some servers are CPU-bound, but some are memory-bound
- Cannot use memory beyond a local server





#### Server 1 (CPU-bound)

Server 2 (memory-bound)

[1] Who Limits the Resource Efficiency of My Datacenter: An Analysis of Alibaba Datacenter Traces (IWQoS'19)

[2] Borg: the Next Generation (EuroSys'20)

[3] Memtrade: A Disaggregated-Memory Marketplace for Public Clouds (arXiv'21)

#### **Memory Disaggregation**

Physically separate CPU and memory into <u>network-attached</u> components

#### **Memory Disaggregation**

Physically separate CPU and memory into <u>network-attached</u> components



#### **Memory Disaggregation**

Physically separate CPU and memory into <u>network-attached</u> components



### **Memory Disaggregation**

Physically separate CPU and memory into <u>network-attached</u> components



larg Benefits:

- <sup>1-2</sup> ✓ Independently scaling memory and CPU
- Men ✓ Flexibly assembling resources for apps
  - ✓ Efficiently sharing memory between apps

### **Memory Disaggregation**

Physically separate CPU and memory into <u>network-attached</u> components



larg Benefits:

- <sup>1-2</sup> ✓ Independently scaling memory and CPU
- Men ✓ Flexibly assembling resources for apps
  - ✓ Efficiently sharing memory between apps

high memory utilization

#### Key Enabler: Remote Direct Memory Access (RDMA)

- High bandwidth: 100/200/400Gbps
- ✤ Low latency: RTT < 2us</p>
- Directly access remote memory: read, write, atomic (e.g., cas)



In this work, we explore how to design a high-performance tree index on disaggregated memory (DM)



Compute Servers (CSs) Memory Servers (MSs)

#### Reexamine Existing RDMA-based Tree Indexes

Using RPC to handle index write operations (i.e., insert/update/delete)
 EXAMPLE Cell [ATC'16], FaRM-Tree [SIGMOD'19]

<u>Issue</u>: Cannot be deployed on DM — near-zero computation power at memory-side

### Reexamine Existing RDMA-based Tree Indexes

- Using RPC to handle index write operations (i.e., insert/update/delete)
   EXAMPLE Cell [ATC'16], FaRM-Tree [SIGMOD'19]
   Issue: Cannot be deployed on DM near-zero computation power at memory-side
- One-sided approach: leveraging RDMA read/write/atomic for all index ops
   EXAMPLE FG [SIGMOD'19] Issue: Low write performance

6

#### **Reexamine Existing RDMA-based Tree Indexes**

- Using RPC to handle index write operations (i.e., insert/update/delete)
   EXAMPLE Cell [ATC'16], FaRM-Tree [SIGMOD'19]
   Issue: Cannot be deployed on DM near-zero computation power at memory-side
- 2. One-sided approach: leveraging RDMA read/write/atomic for all index ops

| EXAMPLE FG [SIGMOD'19]       | <b>FG</b> (Zipf 0.99) | Throughput (Mops) | 50th Lat. (us) | 99th Lat. (us) |
|------------------------------|-----------------------|-------------------|----------------|----------------|
| Issue: Low write performance | 5% Write              | 31.8              | 4.9            | 14.9           |
|                              | 50% Write             | 0.34              | 10             | 19890          |

### **Reexamine Existing RDMA-based Tree Indexes**

- Using RPC to handle index write operations (i.e., insert/update/delete)
   EXAMPLE Cell [ATC'16], FaRM-Tree [SIGMOD'19]
   Issue: Cannot be deployed on DM near-zero computation power at memory-side
- 2. One-sided approach: leveraging RDMA read/write/atomic for all index ops

**EXAMPLE** FG [SIGMOD'19] Issue: Low write performance

| <b>FG</b> (Zipf 0.99) | Throughput (Mops) | 50th Lat. (us) | 99th Lat. (us) |
|-----------------------|-------------------|----------------|----------------|
| 5% Write              | 31.8              | 4.9            | 14.9           |
| 50% Write 🗧           | 0.34              | 10             | 19890          |

Low throughput & High latency w/ 8 MSs and 8 CSs

### **Reexamine Existing RDMA-based Tree Indexes**

- I. Using RPC to handle index write operations (i.e., insert/update/delete) EXAMPLE Cell [ATC'16], FaRM-Tree [SIGMOD'19] Issue: Cannot be deployed on DM — near-zero computation power at memory-side
- One-sided approach: leveraging RDMA read/write/atomic for all index ops 2.

EXAMPLE FG [SIGMOD'19] Throughput (Mops) 50th Lat. (us) 99th Lat. (us) **FG** (Zipf 0.99) Issue: Low write performance 5% Write 14.9 31.8 4.9 50% Write 19890 0.34

3. Hardware modification or SmartNICs for offloading index opsow throughput & High latency w/ 8 MSs and 8 CSs EXAMPLE HT-Tree [HotOS'19] Issue: High TCO (total cost of ownership)

10

### Our Goal

Our Goal: building a tree index on disaggregated memory that can deliver high performance (for both read/write ops) with commodity RDMA NICs

(I) Excessive Round Trips

### (I) Excessive Round Trips

Four round trips when modifying a tree node (FG [SIGMOD'19])



cas = compare and swap ; faa = fetch and add

### (I) Excessive Round Trips

Four round trips when modifying a tree node (FG [SIGMOD'19])



cas = compare and swap ; faa = fetch and add

(2) Slow Synchronization Primitives — RDMA lock

#### (2) Slow Synchronization Primitives — RDMA lock



154 threads acquire/release 10240 locks in an MS, RDMA cas for lock acquisition and faa for release (FG)

#### (2) Slow Synchronization Primitives — RDMA lock



#### (2) Slow Synchronization Primitives — RDMA lock



when contention appears

I. Expensive in-NIC concurrency control
- NICs serialize atomic verbs w/ 2-PCle-txn critical path



#### (2) Slow Synchronization Primitives — RDMA lock



154 threads acquire/release 10240 locks in an MS,

RDM Performance of lock collapses <sup>e (FG)</sup> when contention appears I. Expensive in-NIC concurrency control
 NICs serialize atomic verbs w/ 2-PCIe-txn critical path



2. Unnecessary retries

- Lock retries consume limited RDMA throughput

#### (2) Slow Synchronization Primitives — RDMA lock



RDM Performance of lock collapses <sup>e (FG)</sup> when contention appears I. Expensive in-NIC concurrency control
NICs serialize atomic verbs w/ 2-PCle-txn critical path



#### 2. Unnecessary retries

- Lock retries consume limited RDMA throughput

#### 3. Lacking Fairness

- Do not consider fairness, starving some clients and further inducing high tail latency

(3) Write Amplification

### (3) Write Amplification

Lots of indexes use lock-free lookup to eliminate read locks:

- Issue RDMA read to fetch tree node
- Detect inconsistent data due to concurrent writes via checksum or versions
- Retry if data is inconsistent

### (3) Write Amplification

Lots of indexes use lock-free lookup to eliminate read locks:

- Issue RDMA read to fetch tree node
- Detect inconsistent data due to concurrent writes via checksum or versions
- Retry if data is inconsistent



#### **Checksum-based**

Writer: modify entries, checksum = crc(node)
Reader: if checksum == crc(node) ?

#### (3) Write Amplification

Lots of indexes use lock-free lookup to eliminate read locks:

- Issue RDMA read to fetch tree node
- Detect inconsistent data due to concurrent writes via checksum or versions
- Retry if data is inconsistent

#### **Checksum-based**

Writer: modify entries, checksum = crc(node) Reader: if checksum == crc(node) ?

**ver**<sub>a</sub> 
$$\langle K_1, V_1 \rangle$$
 ...  $\langle K_n, V_n \rangle$  **ver**<sub>b</sub>

#### Version-based

Writer: ver<sub>a</sub>++, modify entries, ver<sub>b</sub>++ Reader: if ver<sub>a</sub> == ver<sub>b</sub> ?

#### (3) Write Amplification

Lots of indexes use lock-free lookup to eliminate read locks:

- Issue RDMA read to fetch tree node
- Detect inconsistent data due to concurrent writes via checksum or versions
- Retry if data is inconsistent



Checksum-based

**ver**<sub>a</sub> 
$$\langle K_1, V_1 \rangle$$
  $\cdots$   $\langle K_n, V_n \rangle$  **ver**<sub>b</sub>

Version-based

In these two mechanisms, writers must write back the whole tree node, even when modifying an individual KV entry, inducing write amplification

### Outline

- Background & Motivation
- Sherman A Write-Optimized B+Tree on Disaggregated Memory
- Section 2
- Summary

### **Sherman Overview**

### Sherman is a B+Tree index on disaggregated memory

- B-link tree structure (sibling pointer)
- Tree nodes are across many MSs
- ✤ One-sided RDMA for all index ops
- Index cache at CSs
  - caching internal tree nodes
  - reducing remote accesses
- Concurrency control
  - write-write conflicts:
    - node-grained exclusive locks
  - read-write conflicts:
    - lock-free search w/ versions







Reducing round trips



Command combination









### **Command combination**

**Observation:** RDMA write commands are executed in order at receivers

### **Command combination**

Observation: RDMA write commands are executed in order at receivers Command combination: client threads issue dependent RDMA writes simultaneously

### **Command combination**

Observation: RDMA write commands are executed in order at receivers Command combination: client threads issue dependent RDMA writes simultaneously



Combine write-back and lock release

### **Command combination**

Observation: RDMA write commands are executed in order at receivers Command combination: client threads issue dependent RDMA writes simultaneously



Combine write-back and lock release

Checkout paper for other cases of combination

**Observation:** RDMA NICs can expose **on-chip memory (SRAM)** for usages

**Observation:** RDMA NICs can expose **on-chip memory (SRAM)** for usages



#### **Observation:** RDMA NICs can expose **on-chip memory (SRAM)** for usages

- Store locks in on-chip mem of MSs' NICs
  - an array called Global Lock Table (GLT)
  - hash [addr of tree node] => position in GLT
  - eliminate PCIe txn at MSs



#### **Observation:** RDMA NICs can expose **on-chip memory (SRAM)** for usages

- Store locks in on-chip mem of MSs' NICs
  - an array called Global Lock Table (GLT)
  - hash [addr of tree node] => position in GLT
  - eliminate PCIe txn at MSs
- ✤ Hierarchical structure
  - Maintain a mirror of GLT at each CS: Local Lock Table
  - first get local lock, then global one
  - avoid unnecessary across-network retries
  - bind a wait queue to each local lock, boosting fairness



19

#### **Observation:** RDMA NICs can expose **on-chip memory (SRAM)** for usages

- Store locks in on-chip mem of MSs' NICs
  - an array called Global Lock Table (GLT)
  - hash [addr of tree node] => position in GLT
  - eliminate PCIe txn at MSs
- ✤ Hierarchical structure
  - Maintain a mirror of GLT at each CS: Local Lock Table
  - first get local lock, then global one
  - avoid unnecessary across-network retries
  - bind a wait queue to each local lock, boosting fairness
- Handover mechanism
  - Hand over a lock from one thread to another locally
  - reduce one round trip



MSs

- Make entries in leaf nodes unsorted
  - avoid shift operation on insert/delete

- Make entries in leaf nodes unsorted
  - avoid shift operation on insert/delete
- Two-level version in leaf nodes
  - Inde-level version protects leaf nodes => increment when insert/update/delete KV
  - entry-level version protects KV entries => increment when nodes split/merge



- Make entries in leaf nodes unsorted
  - avoid shift operation on insert/delete
- Two-level version in leaf nodes
  - node-level version protects leaf nodes => increment when insert/update/delete KV
  - entry-level version protects KV entries => increment when nodes split/merge



- Make entries in leaf nodes unsorted
  - avoid shift operation on insert/delete
- Two-level version in leaf nodes
  - Inde-level version protects leaf nodes => increment when insert/update/delete KV
  - entry-level version protects KV entries => increment when nodes split/merge



#### Sherman tailors the B+Tree layout to mitigate write amplification

- Make entries in leaf nodes unsorted
  - avoid shift operation on insert/delete
- Two-level version in leaf nodes
  - Inde-level version protects leaf nodes => increment when insert/update/delete KV
  - entry-level version protects KV entries => increment when nodes split/merge



Whether two node-level vers are equal ? two entry-level vers are equal ? If no, retry

#### Outline

- Background & Motivation
- Sherman A Write-Optimized B+Tree on Disaggregated Memory
- Section 2
- Summary

#### Hardware Platform

Machine \* 8

| CPU | 2 Intel Xeon E5-2650 (12 core)                         |
|-----|--------------------------------------------------------|
| Mem | I28GB DRAM                                             |
| NIC | 100Gbps Mellanox ConnectX-5<br>w/ 256KB on-chip memory |
| OS  | CentOS 7.7 , Linux kernel 3.10.0                       |

#### Hardware Platform

Machine \* 8

| CPU | 2 Intel Xeon E5-2650 (12 core)                         |   |
|-----|--------------------------------------------------------|---|
| Mem | 128GB DRAM                                             | _ |
| NIC | 100Gbps Mellanox ConnectX-5<br>w/ 256KB on-chip memory | L |
| OS  | CentOS 7.7 , Linux kernel 3.10.0                       |   |

We emulate each machine as one MS and one CS

- MS: 64GB DRAM and 2 CPU cores
- CS: IGB DRAM and 22 CPU cores

#### Hardware Platform

Machine \* 8

| CPU | 2 Intel Xeon E5-2650 (12 core)                         | _ |
|-----|--------------------------------------------------------|---|
| Mem | 128GB DRAM                                             | _ |
| NIC | 100Gbps Mellanox ConnectX-5<br>w/ 256KB on-chip memory |   |
| OS  | CentOS 7.7 , Linux kernel 3.10.0                       | _ |

We emulate each machine as one MS and one CS

- MS: 64GB DRAM and 2 CPU cores
- CS: IGB DRAM and 22 CPU cores

#### Compared System: FG [SIGMOD'19]

- One-sided RDMA for all index ops, so it can be deployed on DM
- RDMA locks for write-write conflicts; checksum for read-write conflicts
- We add CS-side index cache for FG, for fair comparison

#### Hardware Platform

Machine \* 8

| CPU | 2 Intel Xeon E5-2650 (12 core)                         | _ |
|-----|--------------------------------------------------------|---|
| Mem | I 28GB DRAM                                            |   |
| NIC | 100Gbps Mellanox ConnectX-5<br>w/ 256KB on-chip memory |   |
| OS  | CentOS 7.7 , Linux kernel 3.10.0                       | _ |

We emulate each machine as one MS and one CS

- MS: 64GB DRAM and 2 CPU cores
- CS: IGB DRAM and 22 CPU cores

#### Compared System: FG [SIGMOD'19]

- One-sided RDMA for all index ops, so it can be deployed on DM
- RDMA locks for write-write conflicts; checksum for read-write conflicts
- We add CS-side index cache for FG, for fair comparison

Benchmark: YCSB, Zipfian 0.99; 8B key & 8B value, I billion KV; IKB node; 500MB index cache



Write-intensive(50% lookup, 50% update/insert)



I. Sherman improves throughput significantly under write-intensive workloads

2. All techniques contribute to the high write efficiency



- I. Sherman improves throughput significantly under write-intensive workloads
- 2. All techniques contribute to the high write efficiency



- I. Sherman improves throughput significantly under write-intensive workloads
- 2. All techniques contribute to the high write efficiency

#### 99th Percentile Latency (176 client threads)



Read-intensive (95% lookup, 5% update/insert)



### 99th Percentile Latency (176 client threads)



Sherman lowers tail latency by reducing round trips and boosting concurrency efficiency

#### Outline

- Background & Motivation
- Sherman A Write-Optimized B+Tree on Disaggregated Memory
- Section 2
- Summary

✤ Goal

Suilding a fast tree index on disaggregated memory with commodity RDMA NICs

✤ Goal

Suilding a fast tree index on disaggregated memory with commodity RDMA NICs

\* Key Idea

Combining RDMA hardware features with RDMA-friendly software techniques

\* Goal

Suilding a fast tree index on disaggregated memory with commodity RDMA NICs

\* Key Idea

Combining RDMA hardware features with RDMA-friendly software techniques

#### Techniques in Sherman

- Command combination Reducing round trips
- Hierarchical on-chip lock Accelerating concurrent accesses
- Two-level version layout Mitigating write amplification

#### \* Goal

Suilding a fast tree index on disaggregated memory with commodity RDMA NICs

#### \* Key Idea

Combining RDMA hardware features with RDMA-friendly software techniques

#### Techniques in Sherman

- Command combination Reducing round trips
- Hierarchical on-chip lock Accelerating concurrent accesses
- Two-level version layout Mitigating write amplification

#### Results

 Sherman improves throughput and 99th percentile latency by one order of magnitude on typical write-intensive workloads



# Thanks & QA

#### Sherman: A Write-Optimized Distributed B+Tree Index on Disaggregated Memory



Contact Information: q-wang18@mails.tsinghua.edu.cn