Paged attention kernel optimization(I)

Introduction

During recent interviews, I have been asked a question about my experience with optimizing
complex kernels or experience on architecture higher than sm89. Unfortunately, I don’t have much experience on this topic, but a few days ago, I have earned a H100x8 server for some reason, so I deside to spend a weekend to optimize my decode paged attention kernel in NanoPD. The post is the first part of the optimization process, which contains my reading notes on the vLLM and flash infer kernels.

VLLM’s paged attention kernel

First let us admire the vLLM’s paged attention kernel, which is implemented in this file. In the following part, I will line by line analyze the code and try to understand the optimization techniques used in this kernel.

Block sum kernel

Ahead of directly analysing the paged attention kernel, we need to understand the block sum kernel:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
template < int NUM_WARPS> 
inline __device__ float block_sum(float* red_smem, float sum) {
// Decompose the thread index into warp / lane.
int warp = threadIdx.x / WARP_SIZE;
int lane = threadIdx.x % WARP_SIZE;
// Compute the sum per warp.
#pragma unroll
for (int mask = WARP_SIZE / 2; mask > = 1; mask /= 2) {
sum += VLLM_SHFL_XOR_SYNC(sum, mask);
}
// Warp leaders store the data to shared memory.
if (lane == 0) {
red_smem[warp] = sum;
}
// Make sure the data is in shared memory.
__syncthreads();
// The warps compute the final sums.
if (lane < NUM_WARPS) {
sum = red_smem[lane];
}
// Parallel reduction inside the warp.
#pragma unroll
for (int mask = NUM_WARPS / 2; mask > = 1; mask /= 2) {
sum += VLLM_SHFL_XOR_SYNC(sum, mask);
}
// Broadcast to other threads.
return VLLM_SHFL_SYNC(sum, 0);
}

First reduce in a warp, then store the result in shm, then reduce in the warp again, which is a common reduction pattern. No bank conflict as Warp Size is 32 or less, and the shared memory is float array, so each element is 4 bytes, which means each warp will write to a different bank. The reduction in the warp is done by shuffle instructions, which can be very efficient. But warp divergence can happen in the reduction, but it is not a big problem because the reduction is done in log2(NUM_WARPS) steps, which is small. The final result is broadcasted to all threads in the block, which can be efficient if NUM_WARPS is small.

Paged attention kernel

Now the main dish comes, first let us look at the kernel signature:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// Grid: (num_heads, num_seqs, max_num_partitions).
template < typename scalar_t, typename cache_t, int HEAD_SIZE, int BLOCK_SIZE,
int NUM_THREADS, vllm: :Fp8KVCacheDataType KV_DTYPE,
bool IS_BLOCK_SPARSE,
int PARTITION_SIZE = 0> // Zero means no partitioning.
__device__ void paged_attention_kernel(
float* __restrict__ exp_sums, // [num_seqs, num_heads, max_num_partitions]
float* __restrict__ max_logits, // [num_seqs, num_heads,
// max_num_partitions]
scalar_t* __restrict__ out, // [num_seqs, num_heads, max_num_partitions,
// head_size]
const scalar_t* __restrict__ q, // [num_seqs, num_heads, head_size]
const cache_t* __restrict__ k_cache, // [num_blocks, num_kv_heads,
// head_size/x, block_size, x]
const cache_t* __restrict__ v_cache, // [num_blocks, num_kv_heads,
// head_size, block_size]
const int num_kv_heads, // [num_heads]
const float scale,
const int* __restrict__ block_tables, // [num_seqs, max_num_blocks_per_seq]
const int* __restrict__ seq_lens, // [num_seqs]
const int max_num_blocks_per_seq,
const float* __restrict__ alibi_slopes, // [num_heads]
const int q_stride, const int kv_block_stride, const int kv_head_stride,
const float* k_scale, const float* v_scale, const int tp_rank,
const int blocksparse_local_blocks, const int blocksparse_vert_stride,
const int blocksparse_block_size, const int blocksparse_head_sliding_step)

First consider the template parameters:

1
2
3
4
5
6
7
8
typename scalar_t; // Q/K/V data type, can be float16, bfloat16, int8, etc.
typename cache_t; // KV cache store data type, can be float16, bfloat16, used for quant
int HEAD_SIZE; // head size, can be 64, 128, 256, etc.
int BLOCK_SIZE; // the block size of paged attention, can be 128, 256, etc.
int NUM_THREADS; // the number of threads per block, can be 128, 256
vllm: :Fp8KVCacheDataType KV_DTYPE; // kv quantization data type, can be int8, int4, etc.
bool IS_BLOCK_SPARSE; // whether the attention is block sparse, if true, the block_tables will be used to determine which blocks are valid.
int PARTITION_SIZE; // the partition size of the attention.

Partition attention:
At the decode stage, the query may be fewer than the prefill stage, but we still need to tackle a sequence-long kv cache. So the idea is to cut the kv cache into multiple partitions, and each partition will be processed by one kernel. Thus we achieved better parallelism and memory access pattern. The partition size can be tuned for better performance, and it is usually set to be a multiple of the block size.(from Tri Daos’ 2023 work Flash-Decoding for long-context inference)

Then consider the grid and block configuration:
The comment said Grid:(num_heads, num_seqs, max_num_partitions):

1
2
3
blockIdx.x: head index, range from 0 to num_heads-1
blockIdx.y: sequence index, range from 0 to num_seqs-1
blockIdx.z: partition index, range from 0 to max_num_partitions-1

Each cuda block will process one sequence of one head for one partition.

Output parameters:

1
2
3
float * __restrict__ exp_sums; // [num_seqs, num_heads, max_num_partitions]
float * __restrict__ max_logits; // [num_seqs, num_heads, max_num_partitions]
scalar_t * __restrict__ out; // [num_seqs, num_heads, max_num_partitions, head_size]

They are the middle results of Flash-Decoding. When PARTITION_SIZE > 0, each partition computes local softmax, expsum, weighted ouput and then reduce them to get the final output. PARTITION_SIZE = 0 means no partition, the kernel will compute the final output directly without reduction.

Input parameters:

1
2
3
4
5
6
7
8
9
10
11
12
13
scalar_t * __restrict__ q; // [num_seqs, num_heads, head_size]
cache_t * __restrict__ k_cache; // [num_blocks, num_kv_heads, head_size/x, block_size, x]
cache_t * __restrict__ v_cache; // [num_blocks, num_kv_heads, head_size, block_size]
int num_kv_heads;
float scale;
int * __restrict__ block_tables; // [num_seqs, max_num_blocks_per_seq]
int * __restrict__ seq_lens; // [num_seqs]
int max_num_blocks_per_seq;
float * __restrict__ alibi_slopes; // [num_heads]
int q_stride, kv_block_stride, kv_head_stride;
float *k_scale, *v_scale;
int tp_rank;
int blocksparse_local_blocks, blocksparse_vert_stride, blocksparse_block_size, blocksparse_head_sliding_step;
  • q: the query tensor, each sequence has one query vector for each head, so the shape is [num_seqs, num_heads, head_size].
  • k_cache and v_cache: the kv cache tensor, each sequence has multiple blocks of kv cache, each block has multiple kv heads, each kv head has multiple key/value vectors, so the shape is [num_blocks, num_kv_heads, head_size/x, block_size, x], where we split head_size into x parts to vectorize the memory access for better performance. The actual head size is head_size/x * x = head_size, and the actual block size is block_size * x.
  • num_kv_heads: the number of kv heads for each head.
  • scale: the scaling factor for the attention, usually set to be 1/sqrt(head_size).
  • block_tables: the block table for block sparse attention, each sequence has a block table to indicate which blocks are valid, so the shape is [num_seqs, max_num_blocks_per_seq].
  • seq_lens: the actual sequence length for each sequence, so the shape is [num_seqs].
  • max_num_blocks_per_seq: the maximum number of blocks for each sequence, used for block sparse attention.
  • alibi_slopes: the alibi slopes for each head when we don’t use RoPE, we do not discuss it here, so just ignore it.
  • q_stride, kv_block_stride, kv_head_stride: the stride for accessing the q, k_cache and v_cache tensors, avoid calculating the stride in the kernel for better performance.
  • k_scale and v_scale: the scaling factor for the quantized k and v, used for unquantization.
  • tp_rank: the tensor parallel rank of the current process, used for partitioning the kv cache for tensor parallelism.
  • blocksparse_local_blocks, blocksparse_vert_stride, blocksparse_block_size, blocksparse_head_sliding_step: the parameters for block sparse attention, used for calculating the valid blocks for each head.

Next comes the kernel body, first to decide the work space of now block:

1
2
3
4
5
const int seq_idx = blockIdx.y; 
const int partition_idx = blockIdx.z;
const int max_num_partitions = gridDim.z;
constexpr bool USE_PARTITIONING = PARTITION_SIZE > 0;
const int seq_len = seq_lens[seq_idx];

Read the sequence index and partition index from the block index, and read the sequence length from the seq_lens tensor.

Then calculate the start and end position of the current partition:

1
const int num_seq_blocks = DIVIDE_ROUND_UP(seq_len, BLOCK_SIZE); 

Calculate how many blocks are there for the current sequence, which is the sequence length divided by the block size, rounded up.

1
2
const int num_blocks_per_partition =
USE_PARTITIONING? PARTITION_SIZE / BLOCK_SIZE: num_seq_blocks;

Calculate how many blocks are there for each partition.

1
2
3
4
5
const int start_block_idx =
USE_PARTITIONING? partition_idx * num_blocks_per_partition: 0;
const int end_block_idx =
MIN(start_block_idx + num_blocks_per_partition, num_seq_blocks);
const int num_blocks = end_block_idx - start_block_idx;

Obviously, we can infer from the code itself.

Then convert the block range to the token range:

1
2
3
const int start_token_idx = start_block_idx * BLOCK_SIZE; 
const int end_token_idx = MIN(start_token_idx + num_blocks * BLOCK_SIZE, seq_len);
const int num_tokens = end_token_idx - start_token_idx;

Thread Group Design

1
constexpr int THREAD_GROUP_SIZE = MAX(WARP_SIZE / BLOCK_SIZE, 1); 

A thread group is a group of threads that work together to process the QK product for one token.

One warp has 32 threads and a KV block has BLOCK_SIZE tokens, so the design is to let threads balance the workload of one block. Each token receives WARP_SIZE / BLOCK_SIZE threads to compute the QK product.

1
constexpr int NUM_THREAD_GROUPS = NUM_THREADS / THREAD_GROUP_SIZE; 

Calculate how many thread groups are there in one block, which is the total number token can be processed in parallel.

1
constexpr int NUM_TOKENS_PER_THREAD_GROUP = DIVIDE_ROUND_UP(BLOCK_SIZE, WARP_SIZE); 

If BLOCK_SIZE > WARP_SIZE, each thread group will process multiple tokens.

1
2
3
4
constexpr int NUM_WARPS = NUM_THREADS / WARP_SIZE; 
const int thread_idx = threadIdx.x;
const int warp_idx = thread_idx / WARP_SIZE;
const int lane = thread_idx % WARP_SIZE;

Standard thread index calculation.

GQA head projection

1
2
3
4
const int head_idx = blockIdx.x; 
const int num_heads = gridDim.x;
const int num_queries_per_kv = num_heads / num_kv_heads;
const int kv_head_idx = head_idx / num_queries_per_kv;

GQA means multi Q head with shared K/V, num_queries_per_kv is the number of query heads that share the same kv head, so we can calculate the kv head index from the query head index.

Vector Type Definition

1
constexpr int VEC_SIZE = MAX(16 / (THREAD_GROUP_SIZE * sizeof(scalar_t)), 1); 

The goal is to vectorize the memory access in a thread group to read 16 bytes of data which targeting LDG.128 instruction.

For an example: THREAD_GROUP_SIZE = 4, scalar_t = float16, then VEC_SIZE = 16 / (4 * 2) = 2, which means each thread will read 2 float16 elements once, which is 4 bytes, and the whole thread group will read 16 bytes once.

Then we can define the vector type for the q/k/v:

1
2
3
using K_vec = typename Vec< scalar_t, VEC_SIZE> : :Type; 
using Q_vec = typename Vec< scalar_t, VEC_SIZE> : :Type;
using Quant_vec = typename Vec< cache_t, VEC_SIZE> : :Type;

Then we can calculate how many elements in one thread:

1
2
constexpr int NUM_ELEMS_PER_THREAD = HEAD_SIZE / THREAD_GROUP_SIZE; 
constexpr int NUM_VECS_PER_THREAD = NUM_ELEMS_PER_THREAD / VEC_SIZE;

HEAD_SIZE elements is assigned to one thread group, each thread process NUM_ELEMS_PER_THREAD elements. Then make groups according to the VEC_SIZE for vectorized memory access, each thread will process NUM_VECS_PER_THREAD vectors.

Coordinate in the Thread Group

1
2
const int thread_group_idx = thread_idx / THREAD_GROUP_SIZE; 
const int thread_group_offset = thread_idx % THREAD_GROUP_SIZE;

Calculate the thread group index and the offset of the thread in the thread group, which will be used for memory access and reduction.

Load the Query to registers.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
 // Load the query to registers.
// Each thread in a thread group has a different part of the query.
// For example, if the thread group size is 4, then the first thread in
// the group has 0, 4, 8, ... th vectors of the query, and the second thread
// has 1, 5, 9, ... th vectors of the query, and so on. NOTE(woosuk): Because
// q is split from a qkv tensor, it may not be contiguous.
const scalar_t* q_ptr = q + seq_idx * q_stride + head_idx * HEAD_SIZE;
__shared__ Q_vec q_vecs[THREAD_GROUP_SIZE][NUM_VECS_PER_THREAD];
#pragma unroll
for (int i = thread_group_idx; i < NUM_VECS_PER_THREAD;
i += NUM_THREAD_GROUPS) {
const int vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE;
q_vecs[thread_group_offset][i] =
*reinterpret_cast< const Q_vec*> (q_ptr + vec_idx * VEC_SIZE);
}
__syncthreads(); // TODO(naed90): possible speedup if this is replaced with a
// memory wall right before we use q_vecs

Q‘s shape is [num_seqs, num_heads, head_size] and the grid shape is [num_heads, num_seqs, max_num_partitions], so the query for the current block can be calculated by q +seq_idx * q_stride + head_idx * HEAD_SIZE. The shared memory q_vecs layout is suitable for the thread group design.

Then each thread in the thread group(assume it’s size is 4) will traverse a row of the shm, thread 0, 4, 8, 12 will traverse the first row, then their thread group idx is 0, 1, … NUM_THREAD_GROUPS - 1, we assign them to a row of the shm and repeat it. For example, if NUM_THREAD_GROUPS = 4, then thread 0, 4, 8, 12 will traverse(0, 1, 2, 3), (4, 5, 6, 7)…
The Q access model is computed by vec_idx = thread_group_offset + i * THREAD_GROUP_SIZE, meaning that the first row of shm stores (0, 4, 8, 12), the second row stores (1, 5, 9, 13), and so on. This access pattern can ensure coalesced memory access for the query.

Why we store the Query in shared memory instead of registers?
The query will be reused for multiple thread groups. After storing it into shm, each thread can access the query of other thread groups for free.

Shared Memory Plan

1
2
3
extern __shared__ char shared_mem[]; 
float* logits = reinterpret_cast< float*> (shared_mem);
__shared__ float red_smem[2 * NUM_WARPS];

The shared_mem is used to store the attention score of all the tokens and the red_smem is used for the reduction.

1
constexpr int x = 16 / sizeof(cache_t); 

K cache’s layout is [num_blocks, num_kv_heads, head_size/x, block_size, x], so we need to calculate the x for vectorized memory access.

1
float qk_max = -FLT_MAX; 

Each thread keep a local max used for online softmax.

1
const int* block_table = block_tables + seq_idx * max_num_blocks_per_seq; 

block_tables‘s layout is [num_seqs, max_num_blocks_per_seq], we get the block table information.

Here we would not consider the block sparse case, so just ignore the block sparse related code for now.

Main Loop

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
for (int block_idx = start_block_idx + warp_idx; block_idx < end_block_idx; 
block_idx += NUM_WARPS) {
// NOTE(woosuk): The block number is stored in int32. However, we cast it to
// int64 because int32 can lead to overflow when this variable is multiplied
// by large numbers (e.g., kv_block_stride).
// For blocksparse attention: skip computation on blocks that are not
// attended
if constexpr (IS_BLOCK_SPARSE) {
const int k_bs_block_id = block_idx * BLOCK_SIZE / blocksparse_block_size;
const bool is_remote =
((k_bs_block_id + bs_block_offset) % blocksparse_vert_stride == 0);
const bool is_local =
(k_bs_block_id > q_bs_block_id - blocksparse_local_blocks);
if (!is_remote & & !is_local) {
for (int i = 0; i < NUM_TOKENS_PER_THREAD_GROUP; i++) {
const int physical_block_offset =
(thread_group_idx + i * WARP_SIZE) % BLOCK_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;

if (thread_group_offset == 0) {
// NOTE(linxihui): assign very large number to skipped tokens to
// avoid contribution to the sumexp softmax normalizer. This will
// not be used at computing sum(softmax*v) as the blocks will be
// skipped.
logits[token_idx - start_token_idx] = -FLT_MAX;
}
}
continue;
}
}
const int64_t physical_block_number =
static_cast< int64_t> (block_table[block_idx]);

// Load a key to registers.
// Each thread in a thread group has a different part of the key.
// For example, if the thread group size is 4, then the first thread in
// the group has 0, 4, 8, ... th vectors of the key, and the second thread
// has 1, 5, 9, ... th vectors of the key, and so on.
for (int i = 0; i < NUM_TOKENS_PER_THREAD_GROUP; i++) {
const int physical_block_offset =
(thread_group_idx + i * WARP_SIZE) % BLOCK_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;
K_vec k_vecs[NUM_VECS_PER_THREAD];

#pragma unroll
for (int j = 0; j < NUM_VECS_PER_THREAD; j++) {
const cache_t* k_ptr =
k_cache + physical_block_number * kv_block_stride +
kv_head_idx * kv_head_stride + physical_block_offset * x;
const int vec_idx = thread_group_offset + j * THREAD_GROUP_SIZE;
const int offset1 = (vec_idx * VEC_SIZE) / x;
const int offset2 = (vec_idx * VEC_SIZE) % x;

if constexpr (KV_DTYPE == Fp8KVCacheDataType: :kAuto) {
k_vecs[j] = *reinterpret_cast< const K_vec*> (
k_ptr + offset1 * BLOCK_SIZE * x + offset2);
} else {
// Vector conversion from Quant_vec to K_vec.
Quant_vec k_vec_quant = *reinterpret_cast< const Quant_vec*> (
k_ptr + offset1 * BLOCK_SIZE * x + offset2);
k_vecs[j] = fp8: :scaled_convert< K_vec, Quant_vec, KV_DTYPE> (
k_vec_quant, *k_scale);
}
}

// Compute dot product.
// This includes a reduction across the threads in the same thread group.
float qk = scale * Qk_dot< scalar_t, THREAD_GROUP_SIZE> : :dot(
q_vecs[thread_group_offset], k_vecs);
// Add the ALiBi bias if slopes are given.
qk += (alibi_slope!= 0)? alibi_slope * (token_idx - seq_len + 1): 0;

if (thread_group_offset == 0) {
// Store the partial reductions to shared memory.
// NOTE(woosuk): It is required to zero out the masked logits.
const bool mask = token_idx > = seq_len;
logits[token_idx - start_token_idx] = mask? 0.f: qk;
// Update the max value.
qk_max = mask? qk_max: fmaxf(qk_max, qk);
}
}
}
KV block assign loop
1
2
for (int block_idx = start_block_idx + warp_idx; block_idx < end_block_idx; 
block_idx += NUM_WARPS)

Each warp tackle a few KV blocks, if there are 4 warps and 16 blocks, then the warp 1 will tackle block 1, 5, 9, 13.

Physical block location
1
2
const int64_t physical_block_number =
static_cast< int64_t> (block_table[block_idx]);

The block table is used to map the logical block index to the physical block index in the KV cache, which is used for block sparse attention.

Inner loop assign token in a warp
1
2
3
4
for (int i = 0; i < NUM_TOKENS_PER_THREAD_GROUP; i++) { 
const int physical_block_offset =
(thread_group_idx + i * WARP_SIZE) % BLOCK_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;

Usually, NUM_TOKENS_PER_THREAD_GROUP is 1, which means each thread group process one token.

Load K
1
2
3
4
5
6
7
8
9
K_vec k_vecs[NUM_VECS_PER_THREAD]; 

for (int j = 0; j < NUM_VECS_PER_THREAD; j++) {
const cache_t* k_ptr =
k_cache + physical_block_number * kv_block_stride +
kv_head_idx * kv_head_stride + physical_block_offset * x;
const int vec_idx = thread_group_offset + j * THREAD_GROUP_SIZE;
const int offset1 = (vec_idx * VEC_SIZE) / x;
const int offset2 = (vec_idx * VEC_SIZE) % x;

K cache’s layout is [num_blocks, num_kv_heads, head_size/x, block_size, x], so the base ptr is located to [physical_block_number, kv_head_idx, 0, physical_block_offset, 0]. This is the start position of the k vector for the current token. Then the vec_idx is used like the Q loading part to ensure thread_group_offset thread process the vec_idxth vector.

1
2
k_vecs[j] = *reinterpret_cast< const K_vec*> (
k_ptr + offset1 * BLOCK_SIZE * x + offset2);

Then we can load the k vector to registers.
Assume THREAD_GROUP_SIZE=4, VEC_SIZE=2, x=8, NUM_VECS_PER_THREAD=8, a thread group will process a token as below:

1
2
3
4
5
6
7
8
thread_group_offset=0, j=0: vec_idx=0, offset1=0, offset2=0 → [0,1]
thread_group_offset=1, j=0: vec_idx=1, offset1=0, offset2=2 → [2,3]
thread_group_offset=2, j=0: vec_idx=2, offset1=0, offset2=4 → [4,5]
thread_group_offset=3, j=0: vec_idx=3, offset1=0, offset2=6 → [6,7]

thread_group_offset=0, j=1: vec_idx=4, offset1=1, offset2=0 → [8,9]
thread_group_offset=1, j=1: vec_idx=5, offset1=1, offset2=2 → [10,11]
...

Dot product and store the logits

1
2
3
4
// Compute dot product.
// This includes a reduction across the threads in the same thread group.
float qk = scale * Qk_dot< scalar_t, THREAD_GROUP_SIZE> : :dot(
q_vecs[thread_group_offset], k_vecs);

Each thread compute the dot product and reduce in the thread group by Qk_dot, which is implemented by warp shuffle.

1
2
3
4
5
if (thread_group_offset == 0) { 
const bool mask = token_idx > = seq_len;
logits[token_idx - start_token_idx] = mask? 0.f: qk;
qk_max = mask? qk_max: fmaxf(qk_max, qk);
}

Then store the logits to shared memory, and update the local max for softmax.

Reduction for max logits

1
2
3
4
5
6
7
8
9
10
11
// Perform reduction across the threads in the same warp to get the
// max qk value for each " warp" (not across the thread block yet).
// The 0-th thread of each thread group already has its max qk value.
#pragma unroll
for (int mask = WARP_SIZE / 2; mask > = THREAD_GROUP_SIZE; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
if (lane == 0) {
red_smem[warp_idx] = qk_max;
}
__syncthreads();

Reduce all the kv block processed by a warp.

1
2
3
4
5
6
7
 qk_max = lane < NUM_WARPS? red_smem[lane]: -FLT_MAX; 
#pragma unroll
for (int mask = NUM_WARPS / 2; mask > = 1; mask /= 2) {
qk_max = fmaxf(qk_max, VLLM_SHFL_XOR_SYNC(qk_max, mask));
}
// Broadcast the max qk value to all threads.
qk_max = VLLM_SHFL_SYNC(qk_max, 0);

Then cross warps to get the max qk value.

Then compute the exp and the exp sums:

1
2
3
4
5
6
7
8
float exp_sum = 0.f; 
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
float val = __expf(logits[i] - qk_max);
logits[i] = val;
exp_sum += val;
}

exp_sum = block_sum< NUM_WARPS> (& red_smem[NUM_WARPS], exp_sum);

After this, we normalize the logits and get the softmax output:

1
2
3
4
5
const float inv_sum = __fdividef(1.f, exp_sum + 1e-6f); 
for (int i = thread_idx; i < num_tokens; i += NUM_THREADS) {
logits[i] *= inv_sum;
}
__syncthreads();

Load V into the memory

V cache’s layout is [num_blocks, num_kv_heads, head_size, block_size], which is different from K cache.

1
2
3
4
5
6
7
8
9
10
11
12
13
constexpr int V_VEC_SIZE = MIN(16 / sizeof(scalar_t), BLOCK_SIZE); 
using V_vec = typename Vec< scalar_t, V_VEC_SIZE> : :Type;
using L_vec = typename Vec< scalar_t, V_VEC_SIZE> : :Type;
using V_quant_vec = typename Vec< cache_t, V_VEC_SIZE> : :Type;
using Float_L_vec = typename FloatVec< L_vec> : :Type;

constexpr int NUM_V_VECS_PER_ROW = BLOCK_SIZE / V_VEC_SIZE;
constexpr int NUM_ROWS_PER_ITER = WARP_SIZE / NUM_V_VECS_PER_ROW;
constexpr int NUM_ROWS_PER_THREAD =
DIVIDE_ROUND_UP(HEAD_SIZE, NUM_ROWS_PER_ITER);

// NOTE(woosuk): We use FP32 for the accumulator for better accuracy.
float accs[NUM_ROWS_PER_THREAD];

There are BLOCK_SIZE elements in one row of V, split them in V_VEC_SIZE. A warp has WARP_SIZE threads, so one iteration can process NUM_ROWS_PER_ITER rows, and each thread will process NUM_ROWS_PER_THREAD rows.

Acc initialization

1
2
3
4
// Initialize the accumulators.
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
accs[i] = 0.f;
}

Then comes the second main part: the matrix multiplication between the softmax output and the V vectors.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
scalar_t zero_value; 
zero(zero_value);
for (int block_idx = start_block_idx + warp_idx; block_idx < end_block_idx;
block_idx += NUM_WARPS) {
// NOTE(woosuk): The block number is stored in int32. However, we cast it to
// int64 because int32 can lead to overflow when this variable is multiplied
// by large numbers (e.g., kv_block_stride).
// For blocksparse attention: skip computation on blocks that are not
// attended
if constexpr (IS_BLOCK_SPARSE) {
int v_bs_block_id = block_idx * BLOCK_SIZE / blocksparse_block_size;
if (!((v_bs_block_id + bs_block_offset) % blocksparse_vert_stride == 0) & &
!((v_bs_block_id > q_bs_block_id - blocksparse_local_blocks))) {
continue;
}
}
const int64_t physical_block_number =
static_cast< int64_t> (block_table[block_idx]);
const int physical_block_offset = (lane % NUM_V_VECS_PER_ROW) * V_VEC_SIZE;
const int token_idx = block_idx * BLOCK_SIZE + physical_block_offset;
L_vec logits_vec;
from_float(logits_vec, *reinterpret_cast< Float_L_vec*> (logits + token_idx -
start_token_idx));

const cache_t* v_ptr = v_cache + physical_block_number * kv_block_stride +
kv_head_idx * kv_head_stride;
#pragma unroll
for (int i = 0; i < NUM_ROWS_PER_THREAD; i++) {
const int row_idx = lane / NUM_V_VECS_PER_ROW + i * NUM_ROWS_PER_ITER;
if (row_idx < HEAD_SIZE) {
const int offset = row_idx * BLOCK_SIZE + physical_block_offset;
V_vec v_vec;

if constexpr (KV_DTYPE == Fp8KVCacheDataType: :kAuto) {
v_vec = *reinterpret_cast< const V_vec*> (v_ptr + offset);
} else {
V_quant_vec v_quant_vec =
*reinterpret_cast< const V_quant_vec*> (v_ptr + offset);
// Vector conversion from V_quant_vec to V_vec.
v_vec = fp8: :scaled_convert< V_vec, V_quant_vec, KV_DTYPE> (v_quant_vec,
*v_scale);
}
if (block_idx == num_seq_blocks - 1) {
// NOTE(woosuk): When v_vec contains the tokens that are out of the
// context, we should explicitly zero out the values since they may
// contain NaNs. See
// https://github.com/vllm-project/vllm/issues/641#issuecomment-1682544472
scalar_t* v_vec_ptr = reinterpret_cast< scalar_t*> (& v_vec);
#pragma unroll
for (int j = 0; j < V_VEC_SIZE; j++) {
v_vec_ptr[j] = token_idx + j < seq_len? v_vec_ptr[j]: zero_value;
}
}
accs[i] += dot(logits_vec, v_vec);
}
}
}

This code is similar to the previous loop for K.

Streams and Concurrency on CUDA

Introduction

I have learned CUDA kernel programming for a long time, but I have never learnt CUDA streams, only knowing that CUDA streams can be used to achieve concurrency. Today by reading the NVIDIA slides on CUDA streams, I have a better understanding of CUDA streams and concurrency.

Default stream

By default, all CUDA operations are issued into a single stream, called the default stream. Operations in the default stream are executed sequentially, and they are not concurrent with any other operations. The special behavior of the default stream is that it is wholely sync for host and device, which means each time we submit a operation to the default stream, the host will insert an implicit cudaDeviceSynchronize() after and before the operation. But there are several exceptions to this rule:

  • Kernel launches in the default stream are asynchronous with respect to the host, but they are still serialized with respect to other operations in the default stream.
  • cudaMemcpyAsync() and cudaMemsetAsync() operations in the default stream are asynchronous with respect to the host.
  • cudaMemcpy() in the same device.
  • cudaMemcpy() below 64KB between host and device.

Requirements for Concurrency

To achieve concurrency, we need to meet the following requirements:

  • Use non-default streams for concurrent operations.
  • cudaMemcpyAsync() with host from pinned memory.
  • sufficient resources must be available on the device to execute concurrent operations.

Some examples

1
2
3
4
5
6
7
8

cudaMalloc(& dev1, size);
double * host1 = (double *)malloc(& host, size);
...
cudaMemcpy(dev1, host1, size, cudaMemcpyHostToDevice);
kernel2<<< grid, block, 0>>> (..., dev2, ...);
kernel3<<< grid, block, 0>>> (..., dev3, ...);
cudaMemcpy(host4, dev4, size, cudaMemcpyDeviceToHost);

Above code will be executed synchronously, because all operations are issued into the default stream. Observing the nsys timeline, we can see that all operations are executed sequentially.

nsys timeline 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
cudaStream_t streams[NUM_STREAMS]; 
for(int i = 0; i < NUM_STREAMS; i ++){
cudaStreamCreate(& streams[i]);
}
for(int i = 0; i < NUM_STREAMS; i ++){
int offset = i * chunk;
cudaMemcpyAsync(dev1 + offset, host + offset, chunk_size, cudaMemcpyHostToDevice, streams[i]);
}
for(int i = 0; i < NUM_STREAMS; i ++){
int offset = i * chunk;
kernel1<<< (chunk + 255) / 256, 256, 0, streams[i]>>> (dev1 + offset, dev2 + offset, chunk);
}
for(int i = 0; i < NUM_STREAMS; i ++){
int offset = i * chunk;
cudaMemcpyAsync(host + offset, dev2 + offset, chunk_size,
cudaMemcpyDeviceToHost, streams[i]);
}

Above code will be executed concurrently, because we have issued operations into different streams. Observing the nsys timeline, we can see that all operations are executed concurrently.

nsys timeline 2

Another overlap example is as follows:

1
2
3
4
5
cudaMemcpy(dev1, host1, size, H2D); 
kernel2<<< grid, block>>> (dev2); // launch kernel is asynchronous with respect to the host.
some_CPU_method(); // overlap with kernel2
kernel3<<< grid, block>>> (dev3);
cudaMemcpy(host4, dev4, size, D2H);

In above code, kernel2 will be launched asynchronously with respect to the host, so some_CPU_method() can be executed concurrently with kernel2. However, kernel3 and cudaMemcpy() will be executed sequentially after kernel2, because they are issued into the default stream.

Explicit Synchronization

  • Synchronize everything: cudaDeviceSynchronize(): blocks host until all issued CUDA operations are completed.
  • Synchronize a stream: cudaStreamSynchronize(stream): blocks host until all operations in the specified stream are completed.
  • Synchronize using Events: cudaEventSynchronize(event): blocks host until the specified event is completed. Events can be used to measure the time between operations in different streams.

Some Event Using Examples

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
cudaEvent_t start, stop; 
cudaEventCreate(& start);
cudaEventCreate(& stop);

cudaMemcpyAsync(dev1, host1, size, H2D, stream1);
cudaEventRecord(start, stream1); // record start event after memcpy

cudaMemcpyAsync(host2, dev2, size, D2H, stream2);
cudaStreamWaitEvent(stream2, start, 0); // make stream2 wait for the start event
kernel<<< grid, block, 0, stream2>>> (...); // kernel will execute after the start event is recorded
cudaEventRecord(stop, stream2); // record stop event after kernel launch
cudaEventSynchronize(stop); // wait for the stop event to complete
float elapsedTime;
cudaEventElapsedTime(& elapsedTime, start, stop); // calculate elapsed time between start and stop events
printf(" Elapsed time: %f ms\n" , elapsedTime);

Implicit Synchronization

Some operations will cause implicit synchronization, without knowing it, we may introduce unexpected synchronization points in our code, which can lead to performance degradation. Some examples of implicit synchronization are as follows:

  • cudaMallocHost()/cudaFreeHost(): These functions will block the host until all previously issued CUDA operations are completed, because they need to ensure that the pinned memory is not being used by any ongoing CUDA operations.
  • cudaMalloc(): This function will block the host until all previously issued CUDA operations are completed, because it needs to ensure that there are sufficient resources available on the device to allocate the requested memory.
  • cudaMemcpy(): This function needs to ensure that the data transfer is not being interfered by any ongoing CUDA operations.
  • cudaDeviceSetCacheConfig(): This function needs to ensure that the cache configuration is not being changed while any ongoing CUDA operations are using the cache, so it will block the host until all previously issued CUDA operations are completed.

The right way to avoid implicit synchronization is to assign the memory allocation and deallocation in the beginning and the end of the program, and use cudaMemcpyAsync() instead of cudaMemcpy() for data transfer between host and device.

Stream Scheduling

Take Fermi architecture as an example, it has 3 queues: 1 compute engine queue, 2 copy engine queues (one for H2D and one for D2H).

The shedule rule is as follows:
CUDA operations are pushed into the target queue based on the type of operation in the launch order. One operation is issued only when the three conditions are met:

  • In the same stream, all previously issued operations have been completed.
  • Ahead of the operation in the same queue, there is no other operation that is still executing.
  • The resources required for the operation are available on the device.

One blocked operation can block the entire queue even there are other operations in the queue belonging to different streams. So the launch order of operations can affect the performance of the program.

An example of stream scheduling is as follows:

stream scheduling1
stream scheduling2

Concurrent Kernel Scheduling

Normally, a signal is inserted into the queues, after the operation is issued, to indicate the completion of the operation. But for the compute engine queues, when compute kernels are issued sequentially, the signal is not inserted until the kernel is completed. So if there are multiple kernels issued into the compute engine queue, they will be executed sequentially, even if they belong to different streams.

In some situations this delay of signals can block other queues.

Conclusion

Maybe the slides I read is a bit old, but it still gives me a good understanding of CUDA streams and concurrency. I will try to use CUDA streams in my future projects to improve the performance of my code.

Foundation of Reinforcement learning(V)

Introduction

In the previous post, we have introduced the estimate of value function: MC and TD. In this post, we will introduce two important algorithms for estimating the action value function: SARSA and Q-learning.

Looking back at our previous post, now we have known ‘What is the best state’: estimating the state value function , but we still don’t know ‘What is the best action’: . Here we don’t know the transition probability , so we can’t directly compute the optimal policy. So, we need to estimate the action value function .

SARSA

For any (state, action, reward, next state, next action) executated by the policy , we can update the action value function as follows:

Foundation of Reinforcement learning(IV)

Introduction

In the previous posts, we have introduced the MDP and its solution. But in practice, we often do not have the simulation of the environment, which means we cannot directly apply our knowledge of MDP to solve the problem. There is indeed a method to simulate the environment, which is called model-based Reinforcement learning. However, in this post, we will focus on the model-free Reinforcement learning, which does not require the simulation of the environment and the construction of MDPs.

Estimating Value Functions

In mode-based RL, value functions can be computed by DP methods as follows:

However, in model-free RL, we cannot directly access the and , but we have some ways to estimate the value function from episodes of experience.

Why we estimate the value function?
Because we can use the value function to derive the optimal policy, which is our ultimate goal. Besides value function can help us to reuse historical experience to make better decisions in the future, which is the essence of Reinforcement learning.

Here is a graph to introduce some methods to estimate the value function:

value estimation(Slide credit: David Silver)

Monte Carlo methods

Target: Learn from episodes of experience.

Review: accumulate reward function:

Review: value function is the expected return:
The Monte Carlo method use empirical mean cumulative reward instead of expected return to estimate the value function.

First-visit Monte Carlo method

The first-visit Monte Carlo method estimates the value function by averaging the returns following the first time a state is visited in an episode. The algorithm is as follows:

  1. Initialization:
    • For any ,, .
    • For any ,.
  2. Loop for each episode:
    • Generate an episode following policy : .
    • For t = T-1, T-2, …, 0:
      • If is the first time in the episode:
        • Append to .
        • .

The reason why we call it ‘first-visit’ is that we only update the value function for the first time we visit a state in an episode. Thus we can avoid the bias caused by multiple visits to the same state in an episode. However, this method may have high variance because it only uses one return for each state in an episode.

Incremental Monte Carlo method

The first-visit Monte Carlo method takes a lot of memory to store the returns for each state, which is not efficient. The incremental Monte Carlo method uses an incremental update rule to estimate the value function without storing all the returns. The algorithm is as follows:

  1. Initialization:
    • For any ,, , .
  2. Loop for each episode:
    • Generate an episode following policy : .
    • For t = T-1, T-2, …, 0:
      • If is the first time in the episode:
        • .

Interesting, online softmax also takes the same update rule as the incremental Monte Carlo method. Great job.

Besides, incremental MC provides more design space for us to tackle some problems in practice. For example, we can use a constant step size instead of to update the value function, which is called constant step size MC method. It is useful when the environment is non-stationary, which means the reward function and transition probability may change over time. In this case, we want to give more weight to recent returns than old returns, which can be achieved by using a constant step size :

Some properties of Monte Carlo methods

  1. Monte Carlo methods are model-free, which means they do not require the knowledge of the environment’s dynamics (transition probabilities and reward function).
  2. Monte Carlo methods take the simpliest approach to estimate the value function, which is to average the returns following the policy. However, this method may have high variance because it only uses one return for each state in an episode.
  3. One key to note is that Monte Carlo methods can only be applied to finite MDPs, which means the state space and action space must be finite.

Importance sampling

Let’s try to estimate a custom distribution ‘s expectation.
Then we reassign the importance sampling weight , we can rewrite the above equation as:

Off-policy Monte Carlo methods via Importance Sampling

We can use the cumulative reward function of policy to justify policy , and then weight the cumulative reward function by the importance ratio between and to estimate the value function of policy . The algorithm is as follows:

Every episode would be mutified by the importance sampling ratio:

So we then update the value function by:

Sample by importance sampling will significantly increase the variance of the return, which is because the importance sampling ratio can be very large when and are very different.

Temporal-Difference Learning

Temporal-Difference (TD) is a method combining the MC method and DP method, which name comes from the fact that it uses the diff of estimated value function at two consecutive time steps to update the value function. There are two key ideas in TD learning: TD error and TD target.

For state value funtion , after a transition from state to state with reward , the TD error is defined as:
The TD target is defined as:

As for the TD in Bellman expectation equation, the TD error is used in estimating the expect part.

Some details of TD learning

The simpliest TD learning algorithm is called TD(0), which updates the value function by the TD error at each time step. The key equation of TD(0) is as follows:

Why we update like this?
The Bellman expectation equation is rewritten as:


That’s all. We want to make the TD error as small as possible, which means we want to make the estimated value function as close as possible to the true value function. Thus, we can use the TD error to update the value function.

The TD method introduce the bootstrapping idea, which means we use the estimated value function to update the value function. This is different from the MC method, which uses the actual return to update the value function. The bootstrapping idea can significantly reduce the variance of the return, but it may introduce bias because we are using an estimated value function to update the value function.

Contrast between TD and MC methods

They have the same goal: Learn the value function from episodes of experience. However, they have different approaches to achieve this goal. The MC method uses the actual return to update the value function, which can have high variance but no bias. The TD method uses the estimated value function to update the value function, which can have low variance but may introduce bias.

TD method MC method
update value function like update value function like

The object of TD is , which is called TD target, while the object of MC is , which is the actual return. The TD method’s error is called TD error, which is defined as , while the MC method’s error is defined as .

The strengths and limitations of TD learning and MC learning

TD method can learn until the end of an episode:

  • After each step in an episode, TD method can update the value function use the former value function, which means it can learn until the end of an episode. However, MC method can only update the value function after the end of an episode, which means it cannot learn until the end of an episode.
  • TD method can learn from incomplete episodes, which means it can learn from episodes that are not terminated. However, MC method can only learn from complete episodes, which means it cannot learn from episodes that are not terminated.

Tradeoff between bias and variance

Estimator Bias Variance
MC Unbiased: Higher
TD (real) Unbiased: Lower
TD (actual) Biased: Lower

Note: The real TD target uses the true , which is unknown in practice. The actual TD target uses the current estimate , introducing bias. Despite the bias, TD typically has lower variance than MC because it bootstraps from a single step rather than a full trajectory.

Multi-step TD learning

The TD(0) method only uses the immediate reward and the estimated value of the next state to update the value function, which may not be sufficient to capture the long-term dependencies in the environment. The multi-step TD learning method uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. We will introduce it by leading into n-step cumulate reward function and n-step TD target.

N-step cumulate reward function

Consider the following n-step cumulate reward function:
It seems make sense to use the n-step cumulate reward function to update the value function, which is called n-step TD learning. The key equation of n-step TD learning is as follows:

N-step mean cumulate reward function

Can we take up the information of different n-step cumulate reward function to update the value function? The answer is yes.
We can use a weighted average of different n-step cumulate reward functions to update the value function, which is called n-step mean TD learning. The key weight figure of weighted average is as follows:

lambda weight

So the n-step mean cumulate reward function is defined as:
Then we can update the value function by:

This is called TD() method, which is a generalization of TD(0) and MC methods. When , TD() reduces to TD(0) method, and when , TD() reduces to MC method. Thus, by adjusting the value of , we can control the bias-variance tradeoff in the estimation of the value function.

Conclusion about TD(λ) method

  • Unless the is 0, TD() mothod is unbiased, because it’s a weighted average of unbiased n-step TD targets.

  • The variance of TD() method is lower than that of MC method, because it uses bootstrapping to update the value function, which can reduce the variance of the return. However, the variance of TD() method is higher than that of TD(0) method, because it uses more rewards and estimated values to update the value function, which can increase the variance of the return.

  • Empirically is not quite commom because fast credit assignment for a given action is preferred. So MC or TD(0) is more commonly used in practice. However, TD() method can be useful when we want to balance the bias-variance tradeoff in the estimation of the value function, which can be achieved by adjusting the value of .

So TD() use as variable while n-step TD use as variable. TD() is a generalization of n-step TD, which can be seen as a weighted average of infin-step TD targets. By adjusting the value of , we can control the bias-variance tradeoff in the estimation of the value function, which can be useful in practice when we want to balance the bias and variance in the estimation of the value function.

Conclusion

In this post, we have introduced the model-free Reinforcement learning, which does not require the simulation of the environment and the construction of MDPs. We have introduced two methods to estimate the value function from episodes of experience: Monte Carlo methods and Temporal-Difference learning. We have also introduced the n-step TD learning method, which uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. Finally, we have discussed the bias-variance tradeoff in the estimation of the value function, which can be controlled by adjusting the value of in TD() method.

Foundation of Reinforcement learning(III)

Introduction

In the previous post, we have introduced the Bellman equation and the linear programming formulation for MDP. In this post, we will discuss the model-based Reinforcement learning, which is a method to solve the MDP when we do not have the model of the environment.

The Settings of RL

Typically, RL is framed as MDP, exploring the enviroment and learning the optimal policy.
Generally, we can only observe the episodes and usually, we do not have the model of the environment.

So we need to introduce the model into our RL project. Model-based RL actually is a method to solve the MDP.

Dynamic Programming Based RL

Dynamic Programming for finite MDP

Our objective function is simple, just the expected return, which is defined as:

In the first episode of the series, we introduced Backward induction, which we can start from the last step and then recursively solve the problem. However, this method is not efficient for large state space, and it requires the state transition.

But meanwhile, we can also use the Bellman equation of value function to tackle the problem:

Optimal Value Function

For a state , we can define the optimal value function as the maximum value function over all policies:
So the optimal value function is as follows:

So the best policy can be derived from the optimal value function as:

for any state and policy , there is:

Obviously, the value function relates to the policy, so we can iterate the optimal value function and the optimal policy until convergence. They are called Value Iteration and Policy Iteration respectively.

Value Iteration

For an MDP which is finite in both state and action space, we can use the value iteration to solve the problem. The value iteration is as follows:

  1. Initialize for all
  2. For each state , update the value function as:
  3. Repeat step 2 until convergence.

NOTE: There isn’t any specific order to update the value function, we can update the value function in any order. But the convergence rate may be different.

Sync & Async Value Iteration

Sync value iteration need to store two copies of value funtion:

  1. For any state , we update the value function as:
  2. After updating all states, we copy the new value function to the old value function:

Async value iteration only need to store one copy of value function:

Policy Iteration

The assumption of MDP is the same as the value iteration, which is finite in both state and action space. The policy iteration is as follows:

  1. Randomly initialize a policy and a value function for all
  2. Repeat the following steps until convergence:
  3. Policy Evaluation: For each state , update the value function as:
  4. Policy Improvement: For each state , update the policy as:

Obviously, the Policy Iteration will be more expensive than the Value Iteration, since it needs to evaluate the policy in each iteration. However, the Policy Iteration can converge faster than the Value Iteration, since it can update the policy in each iteration.

Let’s contrast the two methods:

Method Value Iteration Policy Iteration
Update Value function Policy and value function
uses Bellman optimality equation Bellman expectation equation

NOTE:

  1. Value iteration is a greedy method, we always use the best.
  2. Update the value function by Bellman equation in Policy Iteration is expensive.
  3. For smaller space MDP, Policy Iteration is faster than Value Iteration, but for larger space MDP, Value Iteration is faster than Policy Iteration.
  4. If there isn’t any state transition circle, the value iteration is better.

Bellman operators

In fact, we have introduced the Bellman operators in the previous post, but we haven’t discussed it in detail.

Why Policy Iteration and Value Iteration can converge to the optimal value function? The key is that Bellman operators are contraction mappings.

Bellman operator is the collection of below functions:

  1. Bellman expectation operator, usually denoted as , which is defined as:

  2. Bellman optimality operator, usually denoted as or , which is defined as:

They can be used on the state value function and action value function:

  • expectation operator is used in the policy iteration, used for computing
    the value function of a given policy, while is the inner loop of the policy iteration.
  • optimality operator is used in the value iteration, used for computing the optimal value function, while is the main loop of the value iteration.

Both the Bellman expectation operator and the Bellman optimality operator can be defined on the action value function and the state value function:

  1. Bellman expectation operator on V-function:

  1. Bellman optimality operator on V-function:

  1. Bellman expectation operator on Q-function:

  1. Bellman optimality operator on Q-function:

Due to the contraction property of the Bellman operators, we can guarantee the convergence of the value iteration and policy iteration to the optimal value function.

Model RL

In the above sections, our objective environment is a known MDP, all our methods are based on the assumption that we have the model of the environment, which is the transition probability and the reward function. However, in many real-world scenarios, we do not have the model of the environment, so we need to learn the model from the data.

There are two basic thoughts to learn the model of the environment:

  1. learn the state transition probability:

    where is the number of times we have observed the transition from state to state when taking action , and is the number of times we have observed taking action in state .

  2. learn the reward function :

    where is the number of times we have observed taking action in state , and is the average reward we have observed when taking action in state .

The simple simulate algorithm is as follows:

  1. randomly initialize a policy
  2. repeat the following steps until convergence:
  3. collect data by executing the policy in the environment, and store the transition data in a replay buffer.
  4. learn the model of the environment from the replay buffer, which includes learning the state transition probability and the reward function.
  5. solve the MDP with the learned model to get the optimal policy .

Other method to solve this is not learning the MDP, instead we learn the value function directly from the data, which is called model-free RL, we will discuss it in the next post.

Conclusion

In this post, we have introduced the model-based Reinforcement learning, which is a method to solve the MDP when we do not have the model of the environment. We have discussed the value iteration and policy iteration, which are two basic methods to solve the MDP. We have also introduced the Bellman operators, which are the key to guarantee the convergence of the value iteration and policy iteration. Finally, we have introduced the simple simulate algorithm, which is a method to learn the model of the environment and solve the MDP with the learned model. In the next post, we will discuss the model-free Reinforcement learning, which is a method to learn the value function directly from the data without learning the model of the environment.

Foundation of Reinforcement learning(II)

Introduction

Given the former post where we have introduced the MDP and some basic properties, we are now ready to discuss the MDP-based Reinforcement learning. But first, we need to introduce the solution of MDP.

Bellman Equation

If we have learned the previous post, we will know that there are two types of value function, state value function and action value function. Their mathematical definitions are as follows:

Instinctively, the state value function is the expected return when we start from state and follow policy . Similarly, the action value function is defined as:

Here, it represents the expected return when we start from state , take action , and then follow policy thereafter.

On the other hand, we have a accumulate reward function, which is defined as:
It can be recursively defined as:

So, we can rewrite the state value function as:
Above is the Bellman expectation equation for the state value function. Now, our question is to choose the best policy to maximize the value function. We can define the optimal state value function as:

Due to the Principle of Optimality: Each stage of an optimal policy must be optimal for the remaining stages, we can derive the Bellman optimality equation for the state value function as:

Above is the Bellman optimality equation for the state value function.

Linear Programming for MDP

We have a key observation that the Bellman optimality equation can be rewritten as a linear programming problem. The linear programming formulation for MDP is as follows:

Proof:
The first part is to show that the optimal value function is a feasible solution to the above linear programming problem. We can see that for any state and action , we have:
Thus, satisfies the constraints of the linear programming problem.
The second part is to show that any feasible solution satisfying the constraints must be greater than or equal to .
Given that the optimal policy choose one action for each state , LP constraints imply that for any state :
Let’s write them in matrix form:
while the is the reward vector under policy , and is the transition matrix under policy . We can rearrange the above inequality as:
Since , we can conclude that is invertible, and we can get its inverse as:
Obviously, the above inverse is a non-negative matrix. Thus, we can multiply both sides of the inequality by to get:
Since , we can conclude that .
So, we have shown that any feasible solution is greater than or equal to , and the optimal value function is a feasible solution. Thus, the optimal solution to the linear programming problem is .

Given we have the , we can easily derive the optimal policy as:

Here I asked claude to give me a simple explanation of the equivalence between this greedy like policy decision and the optimal policy. The formal proof may be need to use the fixed point theorem, but it is beyond the scope of this post. We just need to remember that the optimal policy can be derived from the optimal value function by choosing the action that maximizes the expected return.

The Dual Linear Programming for MDP

The dual linear programming formulation for MDP is as follows:

If the initial state distribution for all , then let
, the target function means the expected accumulated reward represented by the occupancy measure, and the constraints means the flow conservation constraints.

Assume that the optimal solution is , we can derive the optimal policy from Theorem 2 as:

Comparison between the Primal and Dual Linear Programming for MDP

Dimension Primal LP Dual LP
variable state value function occupancy measure
objective minimize maximize
constraints Bellman optimality constraints flow conservation constraints
explanation state value action frequency

Summary

At the beginning of this post, I tried to introduce the MDP-based Reinforcement learning, but I found that the solution of MDP takes a lot of space, so I just introduce the Bellman equation and the linear programming formulation for MDP. In the next post, I will introduce the value iteration and policy iteration algorithms for solving MDP, which are based on the Bellman equation.

Foundation of Reinforcement learning(I)

The category of decision making problem

dimension single step multi step
one person optimization problem RL, to the best situation
multi person static game dynamic game, MARL.etc

Dynamic programming

Dynamic program is used to solve the Sequential decision making problem, feature of this problem is that it’s decision making process is sequential, and the decision at one step will affect the next step, and the reward is received at the end of decision making process, not at each step.

For an example, given a maze like problem below, the agent need to find a way from Position A to Position B, and the time of each way is different. Agent need to find the way with the least time. A simple way to solve this is to list all the possible paths, but if there is a circle, if the map is large, this will be unfeasible.

A better way to solve this is Backward induction, we start from the end point, and for evey point we calculate the time to reach the end point, then we regart the selected point as the new end point, and repeat this process until we reach the start point. This is a dynamic programming method, and it can solve the problem in polynomial time. But due to we need find a backward path, this method is only suitable for DAG, if there is a circle, this method will fail.

maze

The example is just a introduction, we can summarize the features of dynamic programming as follows:

  • it start from the end, and caculate the best action for each state.
  • it traverse all the states, and for each state, it calculate the best action, and the value of this state.
  • it need to define the state, path(state transition), time(online reward)

So it lead to the Principle of Optimality:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Markov Decision Process

Stochastic Process

A stochastic process is a collection of random variables, which can be used to describe the evolution of a system over time. It’s mathematical definition is as follows:
This means that the probability of the next state depends on the current state and all the previous states .

Markov Process

Compared to stochastic process, Markov process has a stronger assumption, which is “the future is independent of the past given the present”. Mathematically, it’s definition is as follows:
This means that the probability of the next state only depends on the current state , and is independent of all the previous states .

Trying to understand it’s property is that the current state contains all the information about the past, so we can make decision based on the current state without worrying about the past.

Markov Decision Process

Markov Decision Process (MDP) provides a mathematical framework for modeling decision making in situations where the outcome is partly random, partly under the control of a decision maker. An MDP is defined by the following components:

  • State space (S): A set of all possible states in the environment.
  • Action space (A): A set of all possible actions that the agent can take
  • Transition function (P): A function that defines the probability of transitioning from one state to another given a specific action. It is denoted as , which represents the probability of transitioning to state from state after taking action .
  • Reward function (R): A function that defines the reward received after transitioning from one state to another given a specific action. It is denoted as , which represents the reward received after taking action in state . Sometimes only relates to the State.
  • Discount factor (γ): A factor that determines the importance of future rewards. It is a value between 0 and 1, where a value closer to 0 makes the agent prioritize immediate rewards, while a value closer to 1 makes the agent consider future rewards more heavily.

The dynamic feature of MDP

The whole process of MDP is dynamic as follows:

  1. The agent observes the current state .
  2. The agent selects an action based on its policy , which is a mapping from states to actions.
  3. The agent gets a reward .
  4. The MDP transitions to a new state according to the transition function .

The total reward that the agent receives over time is often defined as the discounted sum of rewards:

Markov Policy

In the context of MDP, a policy is a function that depends on the history:

But a Markov policy is a special type of policy that only depends on the current state:


In the RL setting, we usually assume that the policy is a Markov policy. Why?

The MDP has Markov property, which means the future is independent of the past given the present, so there is no special information in the history that can help us make better decision, so we can just use the current state to make decision.More informally, for any policy relying on the history, we can find a Markov policy that at least performs as well as it does, so we can just focus on Markov policy without loss of generality.proof(the 26th and 27th slides of the lecture)


The category of MDP Policy

At the time demension, we can categorize the policy into two types:

  • Stationary policy: A policy that does not change over time. It is defined as , which means the action taken in state is the same at any time step.
  • Non-stationary policy: A policy that can change over time. It is defined as , which means the action taken in state can be different at different time steps.

At the probability distribution demension, we can categorize the policy into two types:

  • Deterministic policy: A policy that always selects the same action for a given state. It is defined as , which means the action taken in state is always .
  • Stochastic policy: A policy that selects actions according to a probability distribution. It is defined as , which means the action taken in state is selected according to the probability

In the RL setting, we usually assume that the policy is a stationary policy. Why?
Typically, we consider the infinite horizon setting. There is also a proof that for any non-stationary policy, we can find a stationary policy that at least performs as well as it does, so we can just focus on stationary policy without loss of generality. proof(the 29th and 32th slides of the lecture)


The best policy for MDP

There is a theorem:

In a situation that the discount factor , while the state and action space are finite and the horizon is infinite, there exists a deterministic
and stationary policy that is optimal, which means for any policy , we have .

Proof: Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

The goal of MDP

Our goal is to choose the action to maximize the expected reward, which is defined as follows:

So we can define the value function for a policy as follows:
This means the expected reward that the agent can get starting from state and following policy .

Occupancy Measure

In MDP context, the occupancy measure is a way to represent the discounted state-action expectation under a policy , also known as state-action visitation distribution. It is defined as follows:

while the means the state transition, which is defined as follows:

On the other hand, the state occupancy measure is defined as follows:

How to compute the occupancy measure?

State occupancy measure

We assume that the initial state distribution is , then we can compute the state occupancy measure as follows:

then we can solve the fomula:

State-action occupancy measure

We can compute the state-action occupancy measure as follows:

Pay attention that the whole process is flow conservation. Because the state-action occupancy measure is the expected discounted number of times that the agent takes action in state , so the total flow into state must equal the total flow out of state . This is why we have the flow conservation constraint in the computation of occupancy measure.

Some Properties of Occupancy Measure

Obviously, from the definition of the measures:

We have two important theorems about the occupancy measure:

  • Theorem 1: For two policies and interacting with the same dynamic environment, if , then .
  • Theorem 2: Given a Occupancy measure , the only policy that can generate this occupancy measure is .

Accumulated reward for a policy

As we have defined the occupancy measure, we can compute the accumulated reward for a policy as follows:

Value function and Q function

The value function is used to evaluate a state or a state-action pair, given a policy .

The state value function is usually know as value function, which is defined as follows:

The state-action value function is usually know as Q function, which is defined as follows:


Obviously, we have the relationship between the value function and the Q function:
And we can also compute the value function and Q function using the occupancy measure as follows:


Summary

  • MDP provides us a simple but powerful mathematical framework to model the sequential decision making problem.

  • The five-tuple of MDP is defined as , which represents the state space, action space, transition function, reward function and discount factor respectively.

  • Markov poverty is the key assumption of MDP, which means the future is independent of the past given the present.

  • Policy is the function to choose the action, usually is a conditional probability distribution over actions given states, and we usually assume that the policy is a stationary policy.

  • Occupancy measure is a way to represent the discounted state-action expectation under a policy, which can be used to compute the accumulated reward for a policy.

  • State value function and state-action value function are used to evaluate a state or a state-action pair, given a policy, and they can be computed using the occupancy measure.

nanoPD:一个 LLM P/D 分离推理引擎的实现笔记

这个项目的起因是读 vllm, DistServe 和 Mooncake 的时候有些地方没有完全想清楚,觉得与其在论文层面反复打转,不如自己动手实现一遍,于是花了一段时间和 claude 大人一起从零写了一个支持 Prefill/Decode 分离调度的推理引擎,覆盖了从 CUDA 内核到自适应路由器的完整栈,代码大约 2000 行 Python 加 400 行 CUDA C++,每个模块有单独的文档,也顺手做了中英双语版。(菜菜勿喷呜呜)

背景和动机

LLM 推理的两个阶段在计算特性上的差异非常显著, prefill 是一次性处理整个 prompt 的过程,本质上是大规模的 GEMM ,算力密集, GPU 的计算单元在这个阶段是满载的;而 decode 每步只生成一个 token ,每次前向传播只有一个(或很少几个)新 token 需要计算,但需要从显存里读取所有历史 token 的 KV cache ,是典型的 memory bound 操作, GPU 的计算单元大部分时间在等数据,实际上 decode 阶段的算术强度非常低,主要的开销是 HBM 的读取带宽。

这两种操作对 GPU 资源的需求模式截然不同,把它们放在同一张卡上并发运行时会产生 SM 资源的竞争, prefill 的矩阵乘法会占用大量 SM ,导致同时在跑的 decode 请求的 attention 计算被推迟,表现出来就是 decode 的延迟在有 prefill 并发时会显著上升,这种干扰在高并发场景下尤其明显,也是 vLLM 等系统在负载较高时尾延迟劣化的一个重要原因。

P/D 分离( Disaggregated Serving )的思路是把 prefill 和 decode 分配到专用的 GPU 上,让两个阶段互不干扰,这个方向在工业界和学术界都有比较多的工作, DistServe 比较系统地分析了分离的收益, Mooncake 则是
月之暗面在生产系统中实践分离调度的工程经验,两篇文章读起来都很有意思,但读完之后我对一些具体的设计决策仍然有疑问,比如路由策略在不同硬件上的表现差异有多大,代价模型里的参数对结论的敏感性如何,这些问题通过实现一遍会有更直接的感受。

实现栈,从底到上

CUDA 内核部分手写了 paged attention kernel 和 KV store ops ,主要是想理解非连续内存块上的 attention 是怎么做的,传统的 attention 假设 KV cache 存储在连续内存里,但 paged KV cache 把显存切成固定大小的物理块,序列的 KV 数据分散在这些块中, attention 计算需要根据 blocktable 做间接寻址,实现上需要在 CUDA kernel 里根据 token 的位置先找到对应的物理块再读取数据,这个 gather 操作相比连续 KV cache 有额外的开销,但换来的是显存利用率的显著提升,因为不再需要为每条序列预先分配最大长度的连续显存,这个 tradeoff 在长序列和高并发场景下非常值得。内核的实现参考了 vLLM 的设计,但为了保持简单没有做 Flash Attention 那样的 tiling 优化,所以在长序列上性能差很多,这部分如果要真正优化的话工作量还是比较大的。( next work 预定了说是)

块管理器以 block 为粒度管理显存的分配和释放,逻辑上类似操作系统的虚拟内存管理,每个物理块有引用计数,引用计数降为零时才真正释放,支持 Copy-on-Write fork ,这个特性在 beam search 或者需要复制序列状态的场景下有用, fork 的时候共享物理块,只有在某条路径需要写入新 token 时才触发实际的物理块复制,避免了不必要的显存拷贝。

推理引擎实现了 chunked prefill ,长 prompt 会被切成固定大小的 chunk 和当前正在 decode 的请求交错执行,而不是一次性把整个 prompt 打进去阻塞所有 decode 请求,这个设计的好处是降低了 prefill 对 decode 的干扰,代价是单个请求的 prefill 总时间会因为被切分而略有增加,调度器维护 waiting 、 prefilling 、 running 三个队列,每一步决定哪些请求进入 prefill 、哪些进行 decode 、 chunk 大小是多少,这些调度决策对系统的整体延迟和吞吐影响很大,实现上采用了比较简单的启发式策略,没有做复杂的动态调整。

Worker 层分三类,

  • CollocatedWorker 在单卡上同时做 prefill 和 decode ,内部复用了推理引擎的调度器;
  • PrefillWorker 专门处理 prefill ,完成后把生成的 KV cache 从 GPU 显存提取到 pinned memory buffer (为什么我的服务器连 PCIe 都没有!!!),准备传输;
  • DecodeWorker 接收传输过来的 KV cache ,加载到自己的显存后把对应的序列加入 decode 队列。
    三类 Worker 可以在不同 GPU 上并发运行,由上层的 CentralScheduler 协调, CentralScheduler 把 collocated 和 disaggregated 两条 pipeline 跑在独立的线程上,每个 step 并发执行,然后汇总结果。

KV 传输部分用了独立的 transfer stream ,目标是和 compute stream 尽量 overlap ,减少等待时间,代码里用 torch.cuda. Event 来同步两个 stream 之间的依赖,实现上比较简单,没有做精细的 pipeline 重叠,实际测下来 overlap 的效果取决于传输数据量和计算量的比例,在小 batch 或短序列的时候效果不太明显。同时会自动检测当前硬件是否支持 P2P 直传,通过 torch 的 _check_p2p 接口查询,不支持的话 fallback 到 pinned memory relay ,也就是先把数据从 GPU 拷到 CPU 的 pinned memory ,再从 pinned memory 拷到目标 GPU ,这个路径的带宽会受限于 PCIe ,在没有 NVLink 的多卡环境下(比如 8 张 4090 通过 PCIe 互联)这个开销是比较显著的,实测大约 12.9 GB/s ,而有 NVLink 的 H20 可以达到约 392 GB/s ,差了将近 30 倍,这个差距直接决定了路由判断在两种硬件上的结论会有多大的不同。

代价模型和路由

为了决定每个请求走合并路径还是分离路径,我实现了一个解析式代价模型,思路是先在真实设备上跑 micro-benchmark 测四个参数, prefill 速度α( ms/token )、 decode 单步延迟 β( ms )、 prefill 对 decode 的干扰系数 γ( ms/token )、以及 GPU 间传输带宽 bw ( GB/s ),然后用这四个参数建立两条路径的延迟估算公式,每个请求来了之后实时计算预估延迟并选更小的那条,所有参数都来自实测,没有用论文里的理论值,因为不同的硬件和软件环境下这些数字差异很大,直接用实测值比从理论推导更可靠。

profiling 的过程大概是: prefill latency 用不同长度的 prompt 各跑若干次取中位数,用线性回归拟合得到 α; decode latency 在不同的 KV 长度和 batch size 下各测一遍,用来拟合 β 和 batch_thresh ( decode 从内存带宽瓶颈切换到算力瓶颈的拐点); interference 通过对比纯 decode 和混合 prefill+decode 下 decode 延迟的差值来测量,用线性回归拟合得到 γ; P2P 带宽直接测一次大块数据传输的时间来估算。

路由判断最后可以化简成比较两个都正比于 prompt 长度 L 的量,分离路径相比合并路径的额外代价是传输 KV cache 的时间,大约是 transfer_rate× L ,合并路径相比分离路径的额外代价是 prefill 对并发 decode 的干扰,大约是 γ × L × (system_load / batch_thresh),分离更划算的条件可以化简为 γ / transfer_rate > batch_thresh / system_load ,其中γ / transfer_rate 是一个只取决于硬件的比值,反映了”干扰有多贵”相对于”传输有多贵”的比例,而 batch_thresh / system_load 则反映了当前系统的负载程度。

在 RTX 4090 上实测,γ / transfer_rate ≈ 7.6 ,代入 batch_thresh = 16 可以得到 system_load ≥ 约 2.1 时分离就开始划算,也就是只要有两到三个并发请求在跑,分离路径在延迟上就已经比合并路径更好了;而在 H20 上这个比值达到 346 ,几乎在任何非零负载下分离都是更优的选择,两种硬件上的结论差距这么大主要是因为传输带宽差了将近 30 倍,γ 的差异相对没那么大( 0.087 vs 0.130 ms/token ),所以比值主要是被 transfer_rate 拉开的。这个结论确实感觉很不合理,直接原因就导致路由决策非常单一,理论上能充分展示路由决策的机器本人无钱无时间找到(谁来帮我考算分期中!),于是懒得改了。

此外还实现了一个输出长度预测器,因为路由需要估算 decode 阶段的代价,而 decode 代价正比于输出长度,但输出长度在请求来的时候是未知的,用了一个在线的贝叶斯预测器,按 prompt 长度分桶,每桶内维护历史输出长度的统计,新请求来了用对应桶的均值作为预测,桶内样本不足时 fallback 到全局均值,这部分比较粗糙,输出长度预测本身就是一个很难的问题,实际上即使是 vLLM 这样的成熟系统目前也还没有特别好的解法,这里的实现只是为了让路由可以跑起来,准确性有限。

测试和结果

写了一个 Poisson 到达过程的 benchmark ,模拟真实服务里请求按泊松过程到达的场景,固定到达率 λ,跑固定时长,然后统计完成的请求数、端到端延迟的分布、以及 drop 的请求数,这比简单的串行测试更接近实际的服务场景,因为串行测试本质上是测单个请求的延迟,没有并发,没有排队,和真实负载差距比较大。

结果上,在 RTX 4090 × 8 的环境下, adaptive 路由在中等到达率下的吞吐和延迟比 collocated 有一定改善,在 H20 上分离路径的优势更明显,和代价模型的预测基本一致,但实际的性能数字和成熟框架比是没有可比性的,因为缺少了 CUDA Graph 、 Flash Attention 、算子融合等优化,这里就不列具体数字了,主要是看各策略之间的相对趋势是否符合理论预期,从这个角度来看结果是比较合理的。
性能出现巨大问题的原因还有 disagg 路径没有像 collocated 做那么多小优化, such as cotinious batching 等等,在项目的文档中我做了具体说明。

另外一个有意思的观察是,在 H20 上 adaptive 的峰值吞吐反而低于 RTX 4090 ,原因是 H20 的 decode 步更快(β = 33ms vs 51ms ),更多请求在 collocated GPU 上就快速完成了, disaggregated 路径的利用率相对
更低,多卡的优势没有完全发挥出来,这有点反直觉,但想清楚之后也是合理的,更快的 decode 反而减少了分离的必要性,加之我较为 naive 的 paged attention 实现(肉眼可见有一处就可以改为 reduce 求和),使得 max_request_num 调大了就会影响 TTFT 等乱七八糟的事,本人又买不起 autodl 的 H20 * 4 呜呜呜,就这样吧。

一些感受

实现上没有做 CUDA Graph 、 Flash Attention 或量化这些会显著影响性能的优化(实际上懒了写不了了),主要是想保持每一层的逻辑足够清晰,方便对照论文理解设计决策,也方便文档解释,所以性能上和成熟的推理框架没有可比性,仅仅是作为理解系统设计的实践项目。

写完之后最大的收获可能不是哪个具体的技术细节,而是对整个系统各层之间的依赖关系有了更清晰的感受,代价模型的准确性依赖 micro-benchmark 的质量, micro-benchmark 又依赖引擎本身的实现是否足够接近真实推理路径,路由的效果依赖输出长度预测的准确性,调度器的策略影响 profiling 时测出来的 interference 数值,这些依赖关系在读论文的时候是模糊的,实现一遍之后会更具体地感受到每一层的假设在什么条件下成立,什么条件下会失效,以及各层之间的耦合程度,这种感受很难单纯通过读代码或读论文获得。

已经放到了 github 上,文档里对每层的设计决策有比较详细的说明。

3D Reconstruction Series

引言

更新,已决定停止更新( x


可以看到本文的 publishDate 是 4096-16-64, 实际上的 publishDate 是 2026-02-10 。
本文的初衷是一个长期更新的 3D recon 系列论文阅读,之前其实已经发过了一些该领域的论文的精读了,但是显然精读必然是不可长期持续的。因此,我想以本文——一个系列的形式记录对大多数论文的浅要阅读,当然如果有特别重要的论文,我也会单开一篇文章进行精读的。

本文的 cover image 是一个词云,记录了本文包含的工作的名称,希望它能不断地更新,成为一个 3D recon 领域的词云图谱。

CUT3R

CUT3R 的输入是视频序列,但是也可以 unordered (据作者所言训练的时候是无序训练的,但是推理的时候推理的时候是 dataloader 先计算重合率来进行初步排序。),使用一个 feed forward 网络预测 camera parameters 和点云。

cut3r

然后是一个 recurrent 模型,每一帧输入的时候添加一个 pose token 然后经过 encoder 和 decoder ,之后使用交叉注意力更新,之后再使用不同 head 来从中提取 output 。

显然这样缺少修正,对于长序列容易造成偏移。但是作者似乎也提到了一个 revisit 机制,在输入结束之后拿着全局的来做之前的预测,在 7scene 上的 acc 和 comp 是有改善的,但是 NRGBD 不怎么明显。

此外,作者也说因为数据集质量的原因,采用的 head 即使已经有一个 pose head 和 local points head ,也仍然要加入一个 world ptshead (缺乏高质量的数据集)。

">

是一个相对来说比较有趣的东西,模型结构如下:

pi3

首先与之前的最大不同是它没有显式地选取参考帧和一个特定的 scale factor ,像 VGGT 就是先选取了一个 ref frame 然后做重建,但是重建质量受 ref 影响很大,因此选择了一个方案,就是一次性将所有帧全部输入,所有帧之间均平等,然后 inference 出一组相对位姿和局部点云,这样就能规避确定某一个 frame 作为坐标原点造成的不确定性问题。

但是仔细一想,仍然不怎么好避免一个 ref 的问题,首先,在一个 batch 内部,虽然我们预测的是一组相对位姿,但是直觉上感觉仍然是把某一帧与其他帧不融洽所导致的原先的那种大的,显著的,偶然性的损失转化为了现在的看起来不明显的、高一致的、所有帧都有的系统性损失。但作者通过实验证明了损失会变小,其实这也是比较好解释的,因为原先的可能是依赖依赖……这种单向参考,而则进行了交叉注意力计算,仔细想来确实会更好。

其次,交叉注意力的复杂度大概是,显然对于长序列是不可接受的,作者训练和测试的时候均采用了有限个 batch 内 frame 的做法,但对于实际的长序列的话,感觉并不是很好做。如果切片进行拼接的话,显然也会面临 ref 的选择问题,但是这时候是一个 scene 之间的拼接,感觉确实会降低很多错误,如果分层做的话,也会降低误差,总之感觉似乎确实是一个不错的方案。

DA3

DA3 是字节 seed 的一个项目,可以说是力大飞砖,充分体现了工业界解决问题的规模( x 。

da3
DA3 的主要创新点在于:

  • 更简单的模型,作者的意思是 VGGT 即使结构很简单,但是由于其在 DINO 后接 AA 层的操作,因为 AA layers 是新训练的,因此过程中可能数据的利用率不高。而 DA3 选择了只利用 DINO 这一个方案,通过在 DINO 的层中变形数据完成了 AA 层所做的事情。因此, DA3 的几乎所有参数都是预训练过的,而 vggt 则有 的参数是从头开始训的,这是 DA3 的简洁之处。

  • 预测任务的简洁性。相比于 VGGT 通过不同 head 得出了不同结果, DA3 则使用了一个更新的表达方式: Ray-depth 表达,具体来说就是使用一个 Dual head 来分别输出一个像素的深度信息和光心与之相连的射线的信息,从而天然地同时包含了点云和 pose 信息,而且在设计 loss 的时候是可以加入一致性信息的。相比与 vggt ,这似乎加强了一致性,也提高了数据利用率,感觉 pose 和 pts3d 反而是不容易加入一致性的,作者做的消融实验也证实了这一点。

  • 使用 teacher 标定数据,首先训了一个 teacher 模型用于给深度不好的 frame 重新生成 depth ,之后依照这个 depths 训练。感觉最终效果也很依赖这个 teacher 模型。

但是, DA3 的弊端也有一些,他的效果确实非常好,但是阅读之后才发现他是用 128 x H100 训练的,这个规模确实有点难以复现。小算力情况下上面两条结论似乎很有帮助,可以尝试。

MapAnything

首先是 Meta 的项目,和 VGGT 难道不构成什么竞争关系嘛()

主要创新点在于他的输入很有意思,不同于 VGGT 还有以往的重建工作只输入图像序列, MapAnything 支持多种多样的输入,对于每一个输入都会通过一个 encoder 最后对齐到 DINOv2 输出的 image token 上,然后就是正常处理的流程,不过似乎它多加了一个 scale token ,用于预测 scale 信息。

mapanything

感觉其利用了 nlp 里面的多模态,证明了给定不同类型的输入其预测的准确性与相应的专家模型性能相似,这是很有价值的,因为他减少了很多训练量(虽然也是在 64xH200 上训了 10 天)。

另外一个比较有趣的地方在于,他最后的点云数据不是直接输出的,而是由 depth , ray , pose 联合输出,这解耦了 VGGT 的冗余预测模式,而且在设计 loss 的时候能保持更好的一致性,感觉这个跟 DA3 输出 Depth-ray 的做法还是很像的。

不过其缺点也非常明显,首先对于长序列情况下,其仍然没有摆脱的处理复杂度;其次模型是 offline 的,不过感觉各有各的应用场景;最后就是推理速度和显存占用,推理速度在 100frame 的时候就已经接近 10s ,而且这时的显存占用也已经来到了 65G 左右,即使采用了作者提出的 Mem Efficent 策略,即在 dpt 头采用串行计算策略也是 20G 左右,似乎有点太大了( x

此外,作者表示了在输入过程中模型无法对噪声数据进行处理,也就是说潜在的噪声可能会污染整个 transformer 的内容,另外融合时机是在 encoder 之后进行,而且是简单的相加,可能有更精细的融合方式。

AnySplat

anysplat

与之前讲过的大多数点云重建的工作不同, AnySplat 是 3dgs 重建。具体来讲就是他在 vggt 的基础上进行改造, backbone 与 vggt 相同,但其 head 则是一个 gaussian head, 一个 depth head ,还有一个 Camera head 。然后通过一个可微体素化将原本稠密的高斯球聚合到一起,训练的时候则监督:

  1. 每一帧位置的 rgb loss

  2. depth 的深度与 gaussian depth 的差异损失

  3. 相机参数与 vggt 预测出的损失

  4. 模型预测深度与 vggt 之间的深度差异

首先, 2 的 loss 保证了其几何一致性,也就是让不同视角的深度尽量保持一致,可以避免分层现象。此外,文章作者说他们实现了一个 Differentiable Voxelization ,可以有效解决生成的稠密高斯球产生的复杂度问题。

总体来说,这是一个高度模仿 vggt 的工作,只不过换了一下 head 和输出形式,其余部分都差不多。此外为 offline 的重建,看上去速度似乎还可以,但是同样面临长序列问题。另外,固定世界坐标系为第一张图片,去监督每一个绝对位姿是否正确,似乎也是存在所述的归纳偏置问题的。

RayZer

rayzer

令人耳目一新的自监督模型,训练过程只需要图片而不需要 gt 的 pose 和内参,训练过程大概是这样的:

  • 首先输入张图片,将其分为两个集合。

  • 然后模型通过 Camera Estimator 模块,预测出 pose 和 intrinsics 。

  • 之后对于 ,模型根据其对应的预测出来的 和本身的图片输入,生成场景的 token.

  • 然后对于,模型选择通过 预测出 然后监督 之间的损失,然后更新所有的值。

因此,推理时的大致步骤大概就是先把场景的已知几张图片输入得到 ,之后针对一个特定的 pose ,计算一个光线图,之后输入到 rendering decoder 里得到在这个特定的 pose 下的 rgb 图片。

感觉和 nerf 好像,都是一个隐式的表达整个场景,不过不同的是 RayZer 是一个更直接的模型,图里的三个模块每个都是 8 层 naive transformer , loss 仅由最后的 rgbloss 和 LPIPS loss 决定,感觉挺聪明的。不过感觉 rendering 部分采用的表现形式——类 raymap 形式似乎真的挺好用的。

另外,值得注意的是第一部分,在预测 pose 和 intrinsics 时,直接选取了中间帧作为参考帧,使得模型能跨越更长的距离。此外,如果说我们在第一部分就引入 ,能否实现定位功能?不过作者似乎做了消融实验,发现在训练的时候,从图像特征中提取几何关系比从一个未成形的 中提取容易得多。但是我觉着可以在 rendering 部分再添加一个 decoder 用于定位。

另外,这个模型完全打败了 LVSM (一个有监督的模型),感觉是一个非常惊艳的工作,看项目主页的 demo 视频感觉真的很不错啊。

Spa3R

spa3r

首先是一个自监督的模型,模型的 backbone 设计的有点复杂:

我们给出一个场景的 views ,然后将 views 分为 context view 和 target view ,首先将所有 views 通过一个改造过的 vggt (似乎是只引入了 head 之前的部分),改造内容是在 context Views 的 AA 层那里把 Target Views 给 mask 掉,然后得到 Context Views 的 feature 和 Target Views 的 camera token 和 Feature ,之后,数据流向两条路径:

  • Context views : 与一组可学习的 通往 Encoder ,然后得到 作为空间的隐式表征。

  • Target views : camera tokens 通过 camera head 生成 camera embeds ,然后与 一起输入到 Decoder 里生成对 Target views 的预测过的 feature ,然后将得到的预测 feature 与 进行监督得到 loss 。

推理的阶段我们就只看 Context Views 得到的 ,将与 qwen2.5 vl 得到的 输入到一个Adapter里,然后将这个 adapter 和 text prompt 输入到 llm 里得到最终结果。

首先,肉眼可见这项工作把大量的其他工作缝合到了一起, Target View 阶段用了 DINOv3 和 VGGT , 的后续处理用到了 qwen2.5 vl ,但是这篇文章叫 Spa3R 啊, Dust3R 被放到哪了呢?然后可训练的内容只有 Encoder 和 Decoder ,仅 6 层 Transformer ,而且通过两个作为 loss 进行训练,训练结束之后即丢弃 Decoder ,保留训好的 Encoder 和 q 。然后后续还有一个针对 Adapter 的一个微调,让其学到怎么生成一个合理的融合

模型做了几个消融实验:

  1. Target Views 阶段作者证明了同时使用 VGGT 和 DINO 会更好(包含语义和空间信息),这是一个比较显然的结论。

  2. 提取出一个场景 表征是一个更好的手段,相对于现有的几个类似于 VG-LLM 简单把所有特征输入到 llm 里效果更好(但是只提升了 3 个点,感觉有点低于预期,考虑到第二阶段训练只进行了 1 个 epoch ,有没有可能是训练量不够?我也是第一次读 VLM 相关的文章(),不过看具体的比分, Multi-Choice 涨分了,而 Numerical 几乎没变,确实是 make sense 的)。

  3. pose embedding 的影响, PRoPE 比 plucker 更好。

  4. Mask Ratio ,这也是一个比较显然的消融实验。

  5. Adapter 使用提高了点数,比较 make sense 。

模型只在 ScanNet 和 ScanNetpp 上进行了 pre train ,使用了 8 张 5090 进行训练,在 VSI-Bench 上达到了 58.6 的水平,超过了之前的大部分 model ,查看现在的 VSI-Bench Leaderboard ,其性能也是处于前列的(不过论文里的表格好像有些数据有点不对?可能有更新吧)。算是为领域开了一个新坑(),自监督看上去也不错()。

看上去这篇文章正在投 CVPR ,是笔者写阅读笔记的两天前才登上了 arxiv ,也不知道中没中,方法是很有趣

Spann3R

结构很复杂,首先大部分模型权重继承自 Dust3r ,然后模型的 backbone 大致如下:

spann3r

  • 预编码 :首先将一帧输入到 ViT Encoder 得到一个 ,此时我们手上还有一个上一帧的

  • 查询记忆: 根据 ,我们可以从历史记忆中查询出一个 来作为下一步的输入。

  • 主要推理部分: 之后我们将这两个 feature 输入到 Target Decoder 和 Reference Decoder ,这两个 Decoder 会做 self attention 和 Cross attention 然后分别得到

  • Heads : 对于 ,在推理阶段我们会使用一个 query head 来提取出,然而在训练阶段我们也会加入一个 head 将其转化为点云和置信度来监督训练;对于 ,我们会通过一个 reference head 将其重建出点云和置信度。

  • 记忆: 之后,根据 ,我们将其通过一个 Memory encoder + MLP head 生成一个 ,然后根据这个和点云通过一个 Memory Encoder 生成 ,之后会对已有记忆去重, 如果工作记忆已满剩下的就会进长期记忆然后做进一步处理。

这是一篇 24 年的文章了,主要创新点就在于他改良了 Dust3R ,使得可以对多个图片输出一个一致的全局坐标系下的点云,此外使用记忆方法,分层处理记忆。

但很显然的是,虽然该方法加入了记忆,但是记忆看上去也是近期记忆的方案,客观上因此而存在长距离漂移的现象,此外,如果遇到 reloop 现象,记忆是否能健康提取也会是一个比较大的问题。

做的消融实验大致有这几个:

  1. 关于记忆方面的消融实验,去掉长期记忆会引起很大的漂移现象,而注意力不截断的话也会引发噪声的干扰

  2. 关于长期记忆应该取多大:作者发现 1000-2000token 的过程中漂移得到极大修正,但是 4000+之后就不会有明显的提升,因此最后作者选择了 4000.

  3. Dust3R 采用了 exp confidence function ,本文将其改为了 sigmoid ,事实证明是有所改善的。

Flow4R

一个局限性很大的三维重建追踪方案,不过在表现形式上很有新意。

模型的 backbone 很优雅,首先接收两张图片作为输入,通过共享权重和 cross attention 的两个对称 encoder-ecoder-head 结构得到每张图的 其中, 是相机坐标系下的点云, 是一个场景流,描述每一个像素如何从本张图片移动到下一张点云,之后还有一个 指示哪个像素在求解 pose 的时候最可靠,最后的 是全局的置信度。

flow4r

得到这些元素之后,可以首先将 pose 通过最小二乘法求出:

是由 得到的,得到 pose 之后就可以做位姿流和场景流的分解,然后很多下游任务就可以进行处理了。

针对于长序列数据,作者提出了将第 1 张 frame 作为锚点,后续的每一张都与之输入处理,好处是可以通过 L2 norm 来归一化尺度,但是坏处也非常明显,一是稍微长一点的序列,就会出现遮挡现象,模型目前来看没有一种很好的应对方式;二是极其依赖第一张 frame 的质量,鲁棒性不算太好。观察其论文里呈现的 demo ,看起来也通常是对一个角落 or 一个相似视角区域做的重建,完整场景重建效果存疑。

此外,作者竟然只做了一个消融实验(能中吗?)对比了三种不同的网络预测和监督变体 :

  • 预测场景流 ,并用真实的  进行损失监督 。

  • 预测场景流 ,但用目标帧的真实 3D 点位置  进行监督 。

  • 直接预测目标帧的 3D 点位置 ,并用真实的  监督(场景流则通过简单的减法推导: ) 。

消融结果:实验证明,直接预测并监督 的性能最佳 。因此后来直接预测的实际上是

总体来说,这篇工作证明了一点,可以通过引入流的方式来完成 Dust3R 这种结构从静态到动态的拓展,但确实局限性很大。

这项工作似乎还没有开源()

AMB3R

把三维体素引入到了重建中,使得模型能够真正地从空间角度来考虑重建任务。简而言之就是之前的重建采用的 ViT 将图像分为一个一个 patch 造成隐式几何中缺乏空间紧凑性约束,于是论文作者想了一个办法把空间紧凑性加入到了 backbone 当中。

amb3r

大致的 backbone 分为前端和后端,其中,前端继承了 VGGT 的网络和参数,一张图片进入之后会经过 Encoder 得到一个初步的 feature ,然后数据的主题是向 decoder 移动,但是这部分 feature 也会使用一个 scale head 预测一个绝对尺度。

然后,进入 decoder 的 feature 会对 keyframes 做 cross attention ,这里的 keyframe 就可以理解为场景的隐式表达,经过该过程之后, decoder 就会输出一个 pointmap 和一个 confidence ,在推理阶段,之后会有一个门控机制:如果置信度足够高,那就直接进入下一阶段,反之则会将点云和 feature 变为体素,然后通过一个 point transformer 优化该体素的 feature ,之后再会逆变换变为 2Dfeature ,之后我们会将该 feature 注入到前端的 decoder 中,重新拿到一个高级的点云。

然后我们拿到了当前帧的点云以及物理尺度,然后系统会将该结构放大/缩小,然后根据 keyframes 和 VGGT 预测出的 pose 将该结构拼接到大的全局点云中,最后我们会评测该点云是否可以成为 keyframes ,然后将其处理掉。

将体素引入到点云重建里很厉害,作者做的几个消融实验:

  1. 移除了基于 sparse voxel 的后端,转而使用一个 2D 做 alternate attention 的后端,发现精度不如之前。

  2. 去除了零卷积机制,发现模型短时间内根本就未收敛。

  3. 在算 loss 的时候去除了 scale 发现效果变差,也就是说模型需要去专注思考几何结构。这是在训练阶段做的事情

这篇文章的训练成本非常非常的低,依赖于一个已经训好的 VGGT ,只训练了微调点云特征的一个 point Transformer 和一堆 head ,感觉非常有启发性非常厉害,同时也中了 CVPR2026 ,符合预期(似乎是 Spann3R 的续作)

VGGT-SLAM

我说这是一篇数学论文,文中没有训练任何模型,仅仅是介绍了一种局部点云拼合办法。

vggt-slam

顾名思义,这篇工作基于 VGGT 输出的点云和 pose ,作者认为 VGGT 预测出 pose 和局部点云之后直接进行 Sim(3)变化为全局点云是有问题的,主要灵感来自于传统 CV 里面的双目立体视觉:相机之间的单应性矩阵或者说是本征矩阵并非仅仅包含了 pose 中进行的旋转、平移,更有一些拉伸,透视等等等。具体来讲就是 VGGT 预测出的点云深度包含了相机的射影形变,直接使用 Sim(3)方法来还原是不准确的。

因此,作者转而使用了 SL(4)进行点云的对齐,具体来讲,当 VGGT 得出了点云和 pose 之后,会进行以下几个操作:

  • 对于一个子地图里的帧,作者选择相信 VGGT 的质量,作者在代码里设置了一个 submap_size 参数用于控制子地图的大小。

  • 对于不同子地图之间,因为我们想得到一个在不同坐标系下共享的三维点,所以作者这里采用了一个很聪明的办法,将上一个子地图的最后一帧重复输入到下一个子地图里,这样 VGGT 的输出就包含了相同图片在不同坐标系下的点云,由此可以建立点与点之间的对应关系。

  • 之后根据传统的一些算法,可以计算两个子地图之间的 SL(4)矩阵,到这里第一步就算完成了

  • 下一个步骤就是全局对齐,作者也写得太数学了吧:

  • 具体来讲,作者构建了一个基于最大后验估计的非线性因子图,目标是最小化所有子地图之间的相对单应性误差:

    $\hat{H}=argmin_{H\in SL(4)}\sum_{(i,j)\in\mathcal{L}}||Log(H_{i}^{-1}H_{j}(H_{j}^{i})^{-1})||{\Omega{ij}^{H}}^{2}$

  • 然后引入各种优化器,这里我的数学太烂了( x )根本看不懂,只知道他是需要迭代优化的。

  • 嗯嗯,所以这样我们就可以得到一个后端,对于每一个子地图,都给出了一个将其变换到潜在全局坐标系下的 SL(4)矩阵,从而消除了 Sim(3)变换带来的问题。

此外,文章还提出了一种 reloop 机制,就是说在一个子地图待输入的时候,系统会利用 SALAD 描述子去寻找历史子地图中是否有相似的图片,若有,系统就会选择将那张图片作为共享帧,我们这时候就会有多个相对的信息。

总体来说,这篇工作就是提供了一个偏传统的对齐方法,比较优雅,但是很显然缺点也很明显,首先对于单个子地图,该工作完全信任 VGGT 的输出结果,缺乏鲁棒性;其次,其得出对齐是通过迭代优化得出,相对于直接拼接会慢上很多,另外有太多的查询操作(如 reloop ),感觉复杂度还是有点高的。

不过可以从上图看到,他确实改善了点云拼接时可能产生的分层的质量。但是,查看其 github 里的 issue ,似乎稳定性存疑:

Due to potential randomness in our approach caused by RANSAC, we report the average performance over five runs, which have a low spread (small standard deviation) as shown in Sec. 5.5.

而且那个 issue 到最后作者都没有回答,感觉有点尴尬( x

学习笔记:Tensor Parallelism(TP)

引言

经历了一些对未来选择的思考之后,最近在了解 mlsys 相关的内容,本文即为对 TP 的理解和总结,目前网上已经有大量的博文详细介绍了 TP 的实现细节,本文主要是为了自己未来查阅方便而写的文章,欢迎大家指正。

TP简介

Tensor Parallelism 是在 DP, MP 之后提出的一个方法,由 Magatrion-LM 首创。其出发点在于 DP, MP 仍然需要单卡在计算时凑齐一个完整的 layer 的参数和各种激活值、梯度、优化器状态,当一个 layer 过大的时候,单卡就放不下了。
而 Tensor Parallelism 将模型的计算拆成分布式的了,使得一层能够分布于不同卡上进行计算。

Transformer-like model

一个经典的 Transformer 模型的架构大致如下图:

transformerarch

可以看到,一个 layer 主要由 Attention 和 MLP 层组成, TP 的关键优化点也就是在这两层上,下面将具体说明。

MLP

我们先从 MLP 层开始,简而言之,一个 MLP 层的数学描述大致这样:

其中:

一般来说,

我们先考虑不进行 TP ,仅仅进行单卡计算:

单卡forward

参数量:

计算量:

激活量:在 backward 里考虑。

单卡backward

首先对 dropout 反向:

这一步的 FLOPS 差一个数量级,可忽略不计,另外使用 表示

然后对 进行反向:

其中

这一步的 FLOPS 为

GeLU 的 FLOPS 几乎也可以忽略不计。

然后对 进行反向,几乎与 相同。

因此整个过程的 FLOPS 为 ,为前向传播的两倍。

然后我们从激活值占用角度分析,在没有梯度检查点的情况下,我们有:

1
2
3
4
X (BS, d_model) use for compute L_W1
H = XW_1 (BS, d_ff) use for compute the gelu
A = GeLU(H) (BS, d_ff) use for compute L_W2
Dropout mask(BS, d_model) use for Dropout

TP forward

sd

我们进行这样的切分方式:

1
W_1 -> (W_11, W_12, W_13, ... W_1n) # W_1i: (d_model, d_ff / n) 

这样我们在输入 X 的时候全部注入,然后得到:

1
H -> (XW_11, XW_12, XW_13, ... XW_1n) # XW_1i: (B, S, d_ff / n)

值得注意的是,我们选择按列切分 使得我们得到的结果是可以独立通过 gelu 的,省去了这一步通信的麻烦。

之后考虑

我们选择将 进行这样的切分:

1
2
3
4
5
6
7
W_2 -> [
W_21,
W_22,
W_23,
...
W_2n
]

之后,显然我们现在可以每张卡计算XW_11 @ W_21,而且他的形状就是最后矩阵的形状,
因此,我们算出来然后最后采用 all reduce 就可以得到最后结果啦。

ok ,我们现在对这整个过程进行分析:

  • 参数量:
    显然,我们现在把所有参数分散到了多卡上,而且分散均匀,

  • 计算量:

但是这里还要考虑一个问题,就是最后 reduce-all 操作还要对所有激活值进行累加,但是这部分数量级过小,可忽略。
  • 激活量:在 backward 里考虑。

TP backward

在每张卡上的前向是:

由于 AllReduce 之后每张卡上的 Z 完全相同,所以上游传回的梯度也完全一样,不需要额外通信。

此后,每一步的计算基本上与单张卡相同,但是要除以

因此,每张卡的反向 FLOPS 为:

然后,之后需要注意的是我们在反向传播的最后仍然需要一步 all-reduce ,因为我们此前计算的都是独立的梯度。

激活值的占用:我们有:

1
2
3
4
X (BS, d_model) use for compute L_W1
H = XW_1 (BS, d_ff / N) use for compute the gelu
A = GeLU(H) (BS, d_ff / N) use for compute L_W2
Dropout mask(BS, d_model) use for Dropout

Attention

单卡forward

输入数据:

1
2
3
4
5
6
7
8
9
10
X (B, S, d_model) 
X_h (B, S, h, d_head) W_Q (h, d_head, d_Q) W_K (h, d_head, d_K) W_V(h, d_head, d_V)
# 注意到在单卡情况下我们这一步计算 Q, K, V 通常不做维度划分,但可以这么理解,方便后续对 TP 的理解
Q (B, S, h, d_Q) K (B, S, h, d_K) V (B, S, h, d_V)
Q @ K.transpose -> S (B, h, S, S)
S @ V -> (B, h, S, d_V)
reshape -> (B, S, h *d_V)

# 接着引入一个 W_O: (h * d_V, d_model)
O (B, S, d_model)

ok ,对整个过程清晰之后我们便可以分析其各个指标:

参数量:

通常来说这几个都相等,所以总参数为

计算量:

操作 形状 FLOPS

因此,总的 FLOPS 为:

需保存的激活值:

Tensor shape num

单卡backward

首先我们做 的反向:

加起来是

然后我们回到

这一步的 FLOPS 为

然后经过 Softmax 反向, FLOPS 可以忽略,然后计算 Q, K , FLOPS 为

之后对 做反向,权重梯度和输入梯度均为 ,共计为 12 。

因此,总反向 FLOPS 为:

为前向的两倍,所以我们在这里也可以认为反向传播的 FLOPS 为前向的两倍。

TP

显然这时我们就可以完全将 head 分到多张卡上,所有的几乎均乘上一个 即可。

但此时仍然需要注意的是,我们得到 O 之后仍然需要 all-reduce ,这与 mlp 是一样的。

先写到这里.