MoE vs Dense Models in Inference //blog/MoE-vs-Dense-Models-in-Inference/

epoch.ai · 2024

How do mixture-of-experts models compare to dense models in inference?

Ege Erdil

Introduction

In a recent review conducted by Mimo, the reviewer asked me a question: “How do mixture-of-experts (MoE) models compare to dense models in inference and implementation?” I thought for a while and answered from intuition, but the answer was clearly not satisfactory. So I decided to do some research and write this post to share my findings.

This post is based on the article How do mixture-of-experts models compare to dense models in inference? by Ege Erdil, published on May 21, 2024. The article provides a detailed analysis of the differences between MoE and dense models in terms of inference and implementation.

The first main part of the article discusses the advantages of MoE models from an inference perspective:

MoE models have fewer active parameters per token than dense models of the same total size, so they do less arithmetic per token, which can lead to faster inference.
MoE models tend to be shallower and wider than dense models, which can also contribute to faster inference times.
MoE models tend to have smaller attention blocks, i.e. the product of their number of attention heads with the head dimension is smaller, but whether this happens depends on whether we use GQA or MQA.

MoE has fewer active parameters than dense models

The most discussed advantage of MoE models is that they do less arithmetic than dense models, as each token will only be processed by a subset of the model’s parameters. If the computation is the constraint of the system, then MoE models can be more efficient than dense models.

However, as we dive into the details, compute is not the only constraint on the inference process.

During the prefill stage, we can compute all the tokens in parallel. In this case, network latency and memory bandwidth are unlikely to be the main bottleneck, because the token batch is large enough to let us hide them behind the computation. So in this case, MoE models can be more efficient than dense models.

In the decode stage, if we still use a large batch size and few GPUs, the above situation still holds. However, if we want to go fast, we need more GPUs to split the arithmetic and memory workload, which is known as tensor parallelism (TP). TP requires extra network communication to keep the GPUs synchronized, and this can become a bottleneck for MoE models. Briefly, as the number of GPUs increases, the arithmetic workload per GPU decreases, and the network communication overhead becomes more significant. In practical settings, the number of GPUs is usually large enough to make communication the bottleneck. So in this case, MoE models may be no more efficient than dense models.

The article takes the example of a Llama 3.3 70B model with TP=8 on a single H100 node. In the FFN blocks alone, each processed token requires a vector of width 8192 (the model dimension) to be all-reduced in every layer. Assuming 16-bit precision and 80 layers, the total communication per token is 8192 * 80 * 16 bits ≈ 1.25 MiB. At the critical batch size of about 300 tokens, the total communication is about 375 MiB (roughly 393 MB). With an NVLink all-reduce bandwidth of around 112 GB/s, the communication takes around 3.5 ms. This is pure communication time, while the computation time is around 5 ms based on Fireworks’ experiments, and it actually decreases a lot at the critical batch size. So the communication overhead is significant, and it can make MoE models less efficient than dense models in the decode stage.
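The arithmetic in this example is easy to re-run. The sketch below simply treats the quoted figures (model dimension, layer count, precision, batch size, bandwidth) as inputs; the variable and function names are my own, and the result shifts by a few percent depending on whether MB means 10^6 or 2^20 bytes.

```python
# Back-of-the-envelope FFN all-reduce cost, using the figures quoted in the text.
D_MODEL = 8192            # model dimension (width of the all-reduced vector)
N_LAYERS = 80             # number of layers
BYTES_PER_ELEMENT = 2     # 16-bit precision
BATCH_TOKENS = 300        # approximate critical batch size
ALLREDUCE_BW = 112e9      # quoted NVLink all-reduce bandwidth, in bytes/s


def ffn_comm_bytes_per_token(d_model, n_layers, bytes_per_element):
    """Bytes all-reduced per token, counting one d_model-wide vector per layer."""
    return d_model * n_layers * bytes_per_element


per_token = ffn_comm_bytes_per_token(D_MODEL, N_LAYERS, BYTES_PER_ELEMENT)
per_batch = per_token * BATCH_TOKENS
comm_time_ms = per_batch / ALLREDUCE_BW * 1e3

print(f"per token: {per_token / 2**20:.2f} MiB")
print(f"per batch: {per_batch / 1e6:.0f} MB")
print(f"all-reduce time: {comm_time_ms:.2f} ms")
```

Changing `BATCH_TOKENS` or `ALLREDUCE_BW` shows how quickly the communication time approaches the compute time as the per-GPU workload shrinks.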

How do we compute the critical batch size?
It is the batch size at which the time to read the weights from memory equals the time to do the arithmetic. For 16-bit weights it is approximately:
$$ \text{critical batch size} \approx \frac{\text{FLOP/s}}{\text{Memory bandwidth (bytes/s)}} $$
For a matmul operation, assume the weight matrix is of size $d_{in} \times d_{out}$ and the input matrix is of size $b \times d_{in}$. Then the total FLOP is $2 \cdot b \cdot d_{in} \cdot d_{out}$, and the total memory access is $b \cdot d_{in} + d_{in} \cdot d_{out}$ elements, which is dominated by the weights when $b \ll d_{out}$. Equating the weight-loading time with the compute time gives:
$$\frac{d_{in} \times d_{out} \times \text{BytesPerElement}}{\text{Memory bandwidth}} = \frac{2 \times b \times d_{in} \times d_{out}}{\text{FLOP/s}} \Rightarrow b = \frac{\text{BytesPerElement}}{2} \times \frac{\text{FLOP/s}}{\text{Memory bandwidth}}$$
So on H100, $b$ is around 300 for a matmul operation with $d_{in} = d_{out} = 8192$.
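As a quick sanity check, the critical batch size can be computed by equating the weight-loading time with the compute time. This is only a sketch: the H100 throughput and bandwidth values below are assumed spec-sheet figures that I plugged in, not numbers taken from the article, and the function name is hypothetical.

```python
def critical_batch_size(flops_per_s, mem_bw_bytes_per_s, bytes_per_element=2):
    """Batch size at which loading the weights takes as long as the matmul itself."""
    return bytes_per_element / 2 * flops_per_s / mem_bw_bytes_per_s


# Assumed H100 figures (not from the article): dense 16-bit throughput and HBM bandwidth.
H100_FLOPS = 989e12   # FLOP/s
H100_BW = 3.35e12     # bytes/s

b = critical_batch_size(H100_FLOPS, H100_BW)

# Cross-check with an explicit 8192 x 8192 matmul in 16-bit precision.
d_in = d_out = 8192
weight_load_time = d_in * d_out * 2 / H100_BW
compute_time = 2 * b * d_in * d_out / H100_FLOPS

print(f"critical batch size: {b:.0f}")
print(f"weight load: {weight_load_time * 1e6:.1f} us, compute: {compute_time * 1e6:.1f} us")
```

Note that the matrix dimensions cancel out, so the result depends only on the hardware ratio and the precision.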

MoE models tend to be shallower and wider than dense models

It is observed in practice that MoE models tend to have fewer layers and a bigger d than dense models. So the serial computation time of MoE models can be smaller than that of dense models. This is fairly intuitive, so I won’t go into details here.

At a fixed model depth, MoE models need less communication than dense models

The amount of network communication for the feedforward blocks needed per processed or generated token scales with the product of the model dimension, the model depth, and the number of active experts.

The equation is:
$$\text{Communication} \propto d \times L \times k$$
where $d$ is the model dimension, $L$ is the model depth, and $k$ is the number of active experts (with $k = 1$ for a dense model).
Each all-reduce operation costs $d$, so $L$ layers with $k$ active experts each lead to a total communication cost of $d \times L \times k$.
Take GPT-4 as an example: it reportedly has 16 experts, 2 of which are active per token, and each expert’s parameter count per layer is $\propto d^2$. For a dense model to reach the same parameter count at the same depth, it needs:
$$ d_{dense}^2 = 16 \times d_{moe}^2 \Rightarrow d_{dense} = 4 \times d_{moe} $$
Writing $d = d_{moe}$, the MoE model’s communication is $\text{Communication}_m \propto d \times L \times 2$, while the communication cost of the dense model will be:
$$\text{Communication}_d \propto 4d \times L \times 1 = 2 \times \text{Communication}_m$$
So the dense model needs twice as much communication as the MoE model at the same model depth.
In the general case, let the total number of experts be $E$ and the number of active experts be $k$. Then the communication cost of the MoE model relative to the dense model is:
$$\frac{\text{Communication}_m}{\text{Communication}_d} = \frac{k}{\sqrt{E}}$$
So when $k < \sqrt{E}$, the MoE model needs less communication than the dense model.
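The ratio is simple enough to tabulate. In this small sketch, `total_experts` is the total number of experts and `active_experts` is the number of experts used per token; the function name and the example configurations are my own illustrations.

```python
import math


def moe_to_dense_comm_ratio(total_experts, active_experts):
    """Communication_m / Communication_d at equal depth and equal total parameter count."""
    return active_experts / math.sqrt(total_experts)


# A 16-expert model with 2 active experts needs half the communication of its dense equivalent.
ratio_16_2 = moe_to_dense_comm_ratio(16, 2)
# With 64 experts, 8 active experts is exactly the break-even point.
ratio_64_8 = moe_to_dense_comm_ratio(64, 8)

print(ratio_16_2, ratio_64_8)
```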

In practice, MoE models are usually shallower and wider, which further increases their communication advantage over dense models of the same size.

MoE models tend to have smaller attention blocks

The majority of the parameters in an MoE model are housed within the experts, which are sparsely activated. So we can use a smaller $d$.

Let the model’s total parameter count be $M$ and the number of experts be $N$:
$$ M \approx N \times d^2 \Rightarrow d \approx \sqrt{\frac{M}{N}} $$
So the model dimension of an MoE model is smaller than that of a dense model with the same parameter count, which can lead to smaller attention blocks. Besides, the KV cache of a single token is also smaller, which can reduce the communication and computation of the attention blocks.

Conclusion

Mixture-of-experts models are generally cheaper to serve for inference compared to dense models, but except in prefill this is not directly because they have a smaller number of active parameters.

They also tend to be shallower and wider than dense models, which can contribute to faster inference times. Additionally, MoE models tend to have smaller attention blocks, which can reduce the communication and computation of the attention blocks.

Streams and Concurrency on CUDA //blog/Streams-and-Concurrency-on-CUDA/

Introduction

I have been writing CUDA kernels for a long time, but I never properly learned CUDA streams; I only knew that they can be used to achieve concurrency. Today, after reading the NVIDIA slides on CUDA streams, I have a better understanding of CUDA streams and concurrency.

Default stream

By default, all CUDA operations are issued into a single stream, called the default stream. Operations in the default stream are executed sequentially, and they are not concurrent with any other operations. The special behavior of the default stream is that it is fully synchronous with respect to both host and device: each time we submit an operation to the default stream, it behaves as if an implicit cudaDeviceSynchronize() were inserted before and after the operation. There are several exceptions to this rule, all of which are asynchronous with respect to the host:

Kernel launches in the default stream. They are still serialized with respect to other operations in the default stream.
cudaMemcpyAsync() and cudaMemsetAsync() operations in the default stream.
cudaMemcpy() within the same device.
Host-to-device cudaMemcpy() of 64 KB or less.

Requirements for Concurrency

To achieve concurrency, we need to meet the following requirements:

Use different, non-default streams for the operations that should run concurrently.
Use cudaMemcpyAsync() with host memory that is pinned (page-locked).
Sufficient resources must be available on the device to execute the concurrent operations.

Some examples


cudaMalloc(&dev1, size);
double *host1 = (double *)malloc(size); // pageable host memory
...
cudaMemcpy(dev1, host1, size, cudaMemcpyHostToDevice); // blocking copy in the default stream
kernel2<<<grid, block, 0>>>(..., dev2, ...);
kernel3<<<grid, block, 0>>>(..., dev3, ...);
cudaMemcpy(host4, dev4, size, cudaMemcpyDeviceToHost);

The code above is executed sequentially, because all operations are issued into the default stream. On the nsys timeline, we can see the operations running one after another.

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; i++) {
    cudaStreamCreate(&streams[i]);
}
// Host-to-device copies, one chunk per stream (host should be pinned memory).
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * chunk;
    cudaMemcpyAsync(dev1 + offset, host + offset, chunk_size, cudaMemcpyHostToDevice, streams[i]);
}
// One kernel per stream, each working on its own chunk.
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * chunk;
    kernel1<<<(chunk + 255) / 256, 256, 0, streams[i]>>>(dev1 + offset, dev2 + offset, chunk);
}
// Device-to-host copies of the results.
for (int i = 0; i < NUM_STREAMS; i++) {
    int offset = i * chunk;
    cudaMemcpyAsync(host + offset, dev2 + offset, chunk_size,
                    cudaMemcpyDeviceToHost, streams[i]);
}

The code above can execute concurrently, because the operations are issued into different streams. On the nsys timeline, we can see copies and kernels from different streams overlapping.

Another overlap example is as follows:

cudaMemcpy(dev1, host1, size, H2D);
kernel2<<<grid, block>>>(dev2); // the kernel launch is asynchronous with respect to the host
some_CPU_method(); // overlaps with kernel2
kernel3<<<grid, block>>>(dev3);
cudaMemcpy(host4, dev4, size, D2H);

In the code above, kernel2 is launched asynchronously with respect to the host, so some_CPU_method() can be executed concurrently with kernel2. However, kernel3 and the final cudaMemcpy() are executed sequentially after kernel2, because they are all issued into the default stream.

Explicit Synchronization

Synchronize everything: cudaDeviceSynchronize(): blocks host until all issued CUDA operations are completed.
Synchronize a stream: cudaStreamSynchronize(stream): blocks host until all operations in the specified stream are completed.
Synchronize using Events: cudaEventSynchronize(event): blocks host until the specified event is completed. Events can be used to measure the time between operations in different streams.

Some Event Usage Examples

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaMemcpyAsync(dev1, host1, size, H2D, stream1);
cudaEventRecord(start, stream1); // record the start event after the memcpy

cudaMemcpyAsync(host2, dev2, size, D2H, stream2);
cudaStreamWaitEvent(stream2, start, 0); // make stream2 wait for the start event
kernel<<<grid, block, 0, stream2>>>(...); // the kernel runs only after the start event has completed
cudaEventRecord(stop, stream2); // record the stop event after the kernel
cudaEventSynchronize(stop); // block the host until the stop event completes
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop); // elapsed time between the two events, in ms
printf("Elapsed time: %f ms\n", elapsedTime);

Implicit Synchronization

Some operations cause implicit synchronization. If we are not aware of them, we may introduce unexpected synchronization points into our code, which can degrade performance. Some examples of implicit synchronization are as follows:

cudaMallocHost()/cudaFreeHost(): These functions will block the host until all previously issued CUDA operations are completed, because they need to ensure that the pinned memory is not being used by any ongoing CUDA operations.
cudaMalloc(): This function will block the host until all previously issued CUDA operations are completed, because it needs to ensure that there are sufficient resources available on the device to allocate the requested memory.
cudaMemcpy(): This non-async function needs to ensure that the data transfer is not interfered with by any ongoing CUDA operations.
cudaDeviceSetCacheConfig(): This function needs to ensure that the cache configuration is not being changed while any ongoing CUDA operations are using the cache, so it will block the host until all previously issued CUDA operations are completed.

The right way to avoid implicit synchronization is to perform memory allocation and deallocation at the beginning and the end of the program, and to use cudaMemcpyAsync() instead of cudaMemcpy() for data transfer between host and device.

Stream Scheduling

Take the Fermi architecture as an example. It has 3 queues: 1 compute engine queue and 2 copy engine queues (one for H2D and one for D2H).

The scheduling rule is as follows:
CUDA operations are pushed into the target queue, chosen by the type of operation, in launch order. An operation is dispatched only when three conditions are met:

All previously issued operations in the same stream have completed.
All preceding operations in the same queue have been dispatched.
The resources required for the operation are available on the device.

One blocked operation can block the entire queue, even if the queue holds other operations that belong to different streams. So the launch order of operations can affect the performance of the program.

An example of stream scheduling is as follows:

Concurrent Kernel Scheduling

Normally, a signal is inserted into the queues after each operation, to launch the next operation in the same stream. For the compute engine queue, however, when compute kernels are issued back to back, this signal is delayed until after the last kernel of the sequence. This delay is what allows consecutive kernels from different streams to run concurrently.

In some situations this delay of signals can block other queues.

Conclusion

The slides I read may be a bit old, but they still gave me a good understanding of CUDA streams and concurrency. I will try to use CUDA streams in my future projects to improve the performance of my code.

Foundation of Reinforcement learning(V) //blog/Foundation-of-Reinforcement-learning-V/

Introduction

In the previous post, we introduced two ways to estimate the value function: MC and TD. In this post, we will introduce two important algorithms for estimating the action value function: SARSA and Q-learning.

Looking back at our previous post, we now know how to answer ‘What is the best state?’ by estimating the state value function $V^{\pi}(S_t)$, but we still don’t know ‘What is the best action?’: $\pi(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\pi}(s') \right]$. Here we don’t know the reward function $R(s, a)$ or the transition probability $P(s'|s, a)$, so we can’t directly compute the optimal policy. So we need to estimate the action value function $Q^{\pi}(s, a)$.

SARSA

For any (state, action, reward, next state, next action) tuple generated by following the policy $\pi$, we can update the action value function as follows:
$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]
$$
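As a minimal sketch of this update, the snippet below stores $Q$ in a dictionary keyed by (state, action). The function name, the state and action labels, and the values of `alpha` and `gamma` are illustrative assumptions, not part of the original post.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Apply one SARSA update to the table Q in place and return the TD error."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical transition: in s0 take "right", receive 0.5, land in s1 and pick "right" again.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0
delta = sarsa_update(Q, "s0", "right", r=0.5, s_next="s1", a_next="right")

print(delta, Q[("s0", "right")])
```

The next action `a_next` must come from the same policy that is being evaluated, which is what makes the update on-policy.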

Foundation of Reinforcement learning(IV) //blog/Foundation-of-Reinforcement-learning-IV/

Introduction

In the previous posts, we introduced the MDP and its solution. But in practice, we often do not have a simulation of the environment, which means we cannot directly apply our knowledge of MDPs to solve the problem. One approach is to build a simulation of the environment, which is called model-based reinforcement learning. In this post, however, we will focus on model-free reinforcement learning, which requires neither a simulation of the environment nor the construction of an MDP.

Estimating Value Functions

In model-based RL, value functions can be computed by DP methods as follows:
$$
\begin{aligned}
V^{\pi}(s) &= E_{\pi} \left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots | s_0 = s \right] \\ &=
\sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\pi}(s') \right]
\end{aligned}
$$

However, in model-free RL, we cannot directly access $P(s'|s, a)$ and $R(s, a)$, but we have some ways to estimate the value function from episodes of experience.

Why do we estimate the value function?
Because we can use the value function to derive the optimal policy, which is our ultimate goal. Besides, the value function helps us reuse historical experience to make better decisions in the future, which is the essence of reinforcement learning.

Here is a graph to introduce some methods to estimate the value function:

value estimation(Slide credit: David Silver)

Monte Carlo methods

Target: Learn $V^{\pi}$ from episodes of experience.

Review: the cumulative reward (return):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

Review: value function is the expected return:
$$
V^{\pi}(s) = E_{\pi} \left[ G_t | s_t = s \right] \simeq \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_t^i
$$
The Monte Carlo method uses the empirical mean of the cumulative reward, instead of the expected return, to estimate the value function.

First-visit Monte Carlo method

The first-visit Monte Carlo method estimates the value function by averaging the returns following the first time a state is visited in an episode. The algorithm is as follows:

Initialization:
- For any $s \in S$, $V(s) \in \mathbb{R}$, $N(s) = 0$.
- For any $s \in S$, $\text{returns}(s) = \emptyset$.
Loop for each episode:
- Generate an episode following policy $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \cdots, S_T$.
- $G \leftarrow 0$
- For t = T-1, T-2, …, 0:
  - $G \leftarrow \gamma G + R_{t+1}$
  - If $S_t$ does not appear in $S_0, S_1, \cdots, S_{t-1}$ (this is its first visit in the episode):
    - Append $G$ to $\text{returns}(S_t)$.
    - $V(S_t) \leftarrow \text{average}(\text{returns}(S_t))$.

The reason why we call it ‘first-visit’ is that we only update the value function for the first time we visit a state in an episode. Thus we can avoid the bias caused by multiple visits to the same state in an episode. However, this method may have high variance because it uses only one return for each state in an episode.
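The algorithm above can be sketched in a few lines. I assume here that an episode is given as a pair of lists, `states = [S_0, ..., S_{T-1}]` and `rewards = [R_1, ..., R_T]`; this data layout, the function name, and the toy episodes are my own choices for illustration.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V from episodes given as (states, rewards) pairs.

    states = [S_0, ..., S_{T-1}] and rewards = [R_1, ..., R_T].
    """
    returns = defaultdict(list)
    V = {}
    for states, rewards in episodes:
        G = 0.0
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]        # rewards[t] holds R_{t+1}
            if states[t] not in states[:t]:   # first visit of S_t in this episode
                returns[states[t]].append(G)
                V[states[t]] = sum(returns[states[t]]) / len(returns[states[t]])
    return V


# Two hypothetical episodes over the states "A" and "B".
episodes = [
    (["A", "B", "A"], [1.0, 0.0, 2.0]),
    (["B", "A"], [0.0, 4.0]),
]
V = first_visit_mc(episodes, gamma=0.5)
print(V)
```

In the first episode the second visit to "A" is skipped, so only the return from time step 0 contributes to $V(A)$ for that episode.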

Incremental Monte Carlo method

The first-visit Monte Carlo method takes a lot of memory to store the returns for each state, which is not efficient. The incremental Monte Carlo method uses an incremental update rule to estimate the value function without storing all the returns. The algorithm is as follows:

Initialization:
- For any $s \in S$, $V(s) \in \mathbb{R}$, $N(s) = 0$.
Loop for each episode:
- Generate an episode following policy $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, \cdots, S_T$.
- $G \leftarrow 0$
- For t = T-1, T-2, …, 0:
  - $G \leftarrow \gamma G + R_{t+1}$
  - If $S_t$ does not appear in $S_0, S_1, \cdots, S_{t-1}$ (this is its first visit in the episode):
    - $N(S_t) \leftarrow N(S_t) + 1$
    - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} (G - V(S_t))$.

Interestingly, online softmax uses the same kind of running update as the incremental Monte Carlo method.

Besides, incremental MC provides more design space for us to tackle some problems in practice. For example, we can use a constant step size $\alpha$ instead of $\frac{1}{N(s)}$ to update the value function, which is called constant step size MC method. It is useful when the environment is non-stationary, which means the reward function and transition probability may change over time. In this case, we want to give more weight to recent returns than old returns, which can be achieved by using a constant step size $\alpha$:
$$
V(s) \leftarrow V(s) + \alpha (G - V(s))
$$
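To see the difference between the two step sizes, here is a small sketch that feeds the same sequence of returns to both update rules. The function names and the numbers are illustrative only.

```python
def incremental_mean_update(v, n, g):
    """One step of V <- V + (G - V) / N; returns the new (V, N)."""
    n += 1
    return v + (g - v) / n, n


def constant_step_update(v, g, alpha):
    """One step of V <- V + alpha * (G - V)."""
    return v + alpha * (g - v)


returns = [1.0, 3.0, 5.0, 7.0]   # hypothetical returns observed for one state, oldest first

v_mean, n = 0.0, 0
v_const = 0.0
for g in returns:
    v_mean, n = incremental_mean_update(v_mean, n, g)
    v_const = constant_step_update(v_const, g, alpha=0.5)

print(v_mean, v_const)
```

The $\frac{1}{N(s)}$ rule reproduces the plain average of the returns, while the constant step size ends up closer to the most recent returns.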

Some properties of Monte Carlo methods

Monte Carlo methods are model-free, which means they do not require the knowledge of the environment’s dynamics (transition probabilities and reward function).
Monte Carlo methods take the simplest approach to estimating the value function, which is to average the returns observed while following the policy. However, this may have high variance because it uses only one return for each state in an episode.
One key point is that Monte Carlo methods can only be applied to episodic MDPs, which means every episode must terminate so that a complete return is available.

Importance sampling

Let’s try to estimate the expectation of $f(x)$ under a distribution $p(x)$, using samples drawn from another distribution $q(x)$.
$$
\begin{aligned}
E_{x \sim p} [f(x)] &= \int f(x) p(x) dx \\ &=
\int f(x) \frac{p(x)}{q(x)} q(x) dx \\ &=
E_{x \sim q} \left[ f(x) \frac{p(x)}{q(x)} \right]
\end{aligned}
$$
If we define the importance sampling weight $w(x) = \frac{p(x)}{q(x)}$, we can rewrite the above equation as:
$$
E_{x \sim p} [f(x)] = E_{x \sim q} \left[ f(x) w(x) \right]
$$
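A small numerical check of this identity, under assumptions of my own: a discrete target distribution `p`, a uniform proposal `q`, and $f(x) = x$. The sample count and the seed are arbitrary.

```python
import random


def importance_sampling_estimate(f, p, q, n_samples, rng):
    """Estimate E_{x~p}[f(x)] from samples drawn from q, weighting each by p(x)/q(x)."""
    xs = list(q)
    q_weights = [q[x] for x in xs]
    total = 0.0
    for x in rng.choices(xs, weights=q_weights, k=n_samples):
        total += f(x) * p[x] / q[x]
    return total / n_samples


p = {0: 0.1, 1: 0.2, 2: 0.7}          # target distribution
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}    # proposal distribution we can sample from


def f(x):
    return float(x)


exact = sum(f(x) * p[x] for x in p)
estimate = importance_sampling_estimate(f, p, q, 100_000, random.Random(0))

print(exact, estimate)
```

The estimate is only valid when $q(x) > 0$ wherever $p(x) f(x) \neq 0$, which holds here because `q` covers the whole support.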

Off-policy Monte Carlo methods via Importance Sampling

We can use the cumulative reward collected under a behavior policy $\mu$ to evaluate a target policy $\pi$: we weight the cumulative reward by the importance ratio between $\pi$ and $\mu$ to estimate the value function of policy $\pi$. The algorithm is as follows:

The return of every episode is multiplied by the importance sampling ratio:
$$
G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_{T-1}|S_{T-1})}{\mu(A_{T-1}|S_{T-1})} G_t
$$

So we then update the value function by:
$$
V(s) \leftarrow V(s) + \frac{1}{N(s)} (G_t^{\pi/\mu} - V(s))
$$

Importance sampling can significantly increase the variance of the return, because the importance sampling ratio can be very large when $\pi$ and $\mu$ are very different.
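The weighting step can be sketched as follows. Policies are represented as nested dictionaries `policy[state][action] = probability`; this representation, the function name, and the numbers are hypothetical and only serve the illustration.

```python
def importance_weighted_return(G_t, trajectory, pi, mu):
    """Scale the return G_t by the product of pi(A_k|S_k) / mu(A_k|S_k) over the remaining steps."""
    ratio = 1.0
    for state, action in trajectory:
        ratio *= pi[state][action] / mu[state][action]
    return ratio * G_t


# Target policy pi and a uniform behavior policy mu over two actions.
pi = {"s0": {"L": 0.9, "R": 0.1}, "s1": {"L": 0.8, "R": 0.2}}
mu = {"s0": {"L": 0.5, "R": 0.5}, "s1": {"L": 0.5, "R": 0.5}}

# (S_t, A_t), ..., (S_{T-1}, A_{T-1}) as generated by mu, with return G_t = 10.
trajectory = [("s0", "L"), ("s1", "R")]
weighted = importance_weighted_return(10.0, trajectory, pi, mu)

print(weighted)
```

Because the ratios multiply along the trajectory, a long episode with a few unlikely actions can produce a very large or very small weight, which is the source of the variance mentioned above.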

Temporal-Difference Learning

Temporal-Difference (TD) learning is a method that combines the MC method and the DP method. Its name comes from the fact that it uses the difference between the estimated values at two consecutive time steps to update the value function. There are two key ideas in TD learning: the TD error and the TD target.

For the state value function $V$, after a transition from state $s$ to state $s'$ with reward $r$, the TD error is defined as:
$$
\delta = r + \gamma V(s') - V(s)
$$
The TD target is defined as:
$$
\hat{V} = r + \gamma V(s')
$$

In terms of the Bellman expectation equation, the TD target is a single-sample estimate of the expectation on its right-hand side.

Some details of TD learning

The simplest TD learning algorithm is called TD(0), which updates the value function by the TD error at each time step. The key equation of TD(0) is as follows:
$$
V(s) \leftarrow V(s) + \alpha \delta = V(s) + \alpha (r + \gamma V(s') - V(s))
$$

Why do we update like this?
The Bellman expectation equation can be rewritten as:
$$
E_{\pi} \left[ R_{t+1} + \gamma V^{\pi}(S_{t+1}) - V^{\pi}(S_t) | S_t = s \right] = 0
$$
That’s all. For the true value function, the expected TD error is zero. So we want to drive the TD error toward zero, which brings the estimated value function closer to the true value function. Thus, we can use the TD error to update the value function.

The TD method introduces the idea of bootstrapping, which means we use the estimated value function to update the value function. This is different from the MC method, which uses the actual return to update the value function. Bootstrapping can significantly reduce the variance of the return, but it may introduce bias because we are using an estimated value function to update the value function.
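A minimal sketch of the TD(0) update on a value table stored as a dictionary. The `terminal` flag, which treats the value of a terminal state as zero, is my own addition for completeness, and the function name and numbers are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # Assumption: the value of a terminal next state is taken to be 0.
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    delta = target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta


V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")

print(delta, V["A"])
```

Unlike the MC update, this needs only a single transition, so it can be applied after every step.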

Contrast between TD and MC methods

They have the same goal: Learn the value function from episodes of experience. However, they have different approaches to achieve this goal. The MC method uses the actual return to update the value function, which can have high variance but no bias. The TD method uses the estimated value function to update the value function, which can have low variance but may introduce bias.

TD method	MC method
update value function $V(s)$ like $V(s) \leftarrow V(s) + \alpha (r + \gamma V(s') - V(s))$	update value function $V(s)$ like $V(s) \leftarrow V(s) + \frac{1}{N(s)} (G_t - V(s))$

The target of TD is $R_{t+1} + \gamma V(S_{t+1})$, which is called the TD target, while the target of MC is $G_t$, which is the actual return. The TD method’s error is called the TD error, which is defined as $\delta = r + \gamma V(s') - V(s)$, while the MC method’s error is defined as $G_t - V(s)$.

The strengths and limitations of TD learning and MC learning

TD method can learn before the end of an episode:

After each step in an episode, the TD method can update the value function using the current value estimate, which means it can learn before the episode ends. The MC method, however, can only update the value function after the end of an episode, because it needs the complete return.
TD method can learn from incomplete episodes, that is, episodes that have not terminated. The MC method can only learn from complete episodes.

Tradeoff between bias and variance

	Estimator	Bias	Variance
MC	$G_t$	Unbiased: $E[G_t] = V^{\pi}(s)$	Higher
TD (real)	$R_{t+1} + \gamma V^{\pi}(S_{t+1})$	Unbiased: $E[R_{t+1} + \gamma V^{\pi}(S_{t+1})] = V^{\pi}(s)$	Lower
TD (actual)	$R_{t+1} + \gamma V(S_{t+1})$	Biased: $E[R_{t+1} + \gamma V(S_{t+1})]$ ≠ $V^{\pi}(s)$	Lower

Note: The real TD target uses the true $V^{\pi}$, which is unknown in practice. The actual TD target uses the current estimate $V$, introducing bias. Despite the bias, TD typically has lower variance than MC because it bootstraps from a single step rather than a full trajectory.

Multi-step TD learning

The TD(0) method only uses the immediate reward and the estimated value of the next state to update the value function, which may not be sufficient to capture the long-term dependencies in the environment. The multi-step TD learning method uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. We will introduce it through the n-step cumulative reward function and the n-step TD target.

N-step cumulative reward function

Consider the following n-step cumulative reward function:
$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})
$$
It makes sense to use the n-step cumulative reward to update the value function, which is called n-step TD learning. The key equation of n-step TD learning is as follows:
$$
V(s) \leftarrow V(s) + \alpha (G_t^{(n)} - V(s))
$$
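The n-step target can be computed directly from a recorded episode. In this sketch I assume `rewards[i]` holds $R_{i+1}$ and `values[i]` holds the current estimate $V(S_i)$, and that the value of the terminal state is zero, so the target falls back to the full return once $t + n$ reaches the end of the episode. These conventions and the numbers are my own.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step target G_t^(n), with rewards[i] = R_{i+1} and values[i] = V(S_i).

    Assumption: the terminal state has value 0, so no bootstrap term is added
    once t + n reaches the end of the episode.
    """
    T = len(rewards)
    steps = min(n, T - t)
    G = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:
        G += gamma**n * values[t + n]
    return G


rewards = [1.0, 2.0, 3.0]    # R_1, R_2, R_3 of a hypothetical 3-step episode
values = [0.5, 1.0, 2.0]     # current estimates V(S_0), V(S_1), V(S_2)

g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)

print(g1, g2, g3)
```

With $n = 1$ this is the TD(0) target, and with $n$ at least the remaining episode length it is the MC return.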

N-step mean cumulative reward function

Can we combine the information in different n-step cumulative rewards to update the value function? The answer is yes.
We can use a weighted average of the different n-step cumulative rewards to update the value function, which is called n-step mean TD learning. The n-step cumulative reward is given the weight $(1 - \lambda) \lambda^{n-1}$, so the weights decay geometrically with $n$ and sum to 1.

So the n-step mean cumulative reward function is defined as:
$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}
$$
Then we can update the value function by:
$$
V(s) \leftarrow V(s) + \alpha (G_t^{\lambda} - V(s))
$$

This is called TD($\lambda$) method, which is a generalization of TD(0) and MC methods. When $\lambda = 0$, TD($\lambda$) reduces to TD(0) method, and when $\lambda = 1$, TD($\lambda$) reduces to MC method. Thus, by adjusting the value of $\lambda$, we can control the bias-variance tradeoff in the estimation of the value function.
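The two limiting cases can be checked numerically. For a finite episode, all n-step targets with $n$ beyond the end of the episode equal the full return, so their total weight $\lambda^{T-t-1}$ can be placed on the MC return; this finite-episode form and the zero terminal value are standard assumptions that I add here, and the helper names and data are illustrative.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step target with rewards[i] = R_{i+1}, values[i] = V(S_i), terminal value 0."""
    T = len(rewards)
    G = sum(gamma**k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:
        G += gamma**n * values[t + n]
    return G


def lambda_return(rewards, values, t, lam, gamma):
    """Lambda-return for a finite episode: the tail weight lam^(T-t-1) goes to the full return."""
    horizon = len(rewards) - t
    G_lam = 0.0
    for n in range(1, horizon):
        G_lam += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    G_lam += lam ** (horizon - 1) * n_step_return(rewards, values, t, horizon, gamma)
    return G_lam


rewards = [1.0, 2.0, 3.0]    # R_1, R_2, R_3 of a hypothetical episode
values = [0.5, 1.0, 2.0]     # current estimates V(S_0), V(S_1), V(S_2)

g_td0 = lambda_return(rewards, values, t=0, lam=0.0, gamma=0.5)
g_mid = lambda_return(rewards, values, t=0, lam=0.5, gamma=0.5)
g_mc = lambda_return(rewards, values, t=0, lam=1.0, gamma=0.5)

print(g_td0, g_mid, g_mc)
```

Intermediate values of $\lambda$ interpolate between the one-step TD target and the full return.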

Conclusion about TD(λ) method

Unless $\lambda$ is 1, the TD($\lambda$) method is biased, because it is a weighted average of n-step TD targets that bootstrap from the estimate $V$ rather than from the true $V^{\pi}$.
The variance of the TD($\lambda$) method is lower than that of the MC method, because it uses bootstrapping to update the value function, which can reduce the variance of the return. However, the variance of the TD($\lambda$) method is higher than that of the TD(0) method, because it uses more rewards and estimated values to update the value function, which can increase the variance of the return. For uncorrelated random variables $X$ and $Y$:
$$
\text{Var}(aX + bY) = a^2 \text{Var}(X) + b^2 \text{Var}(Y)
$$
Empirically, TD($\lambda$) is not very common, because fast credit assignment for a given action is preferred. So MC or TD(0) is more commonly used in practice. However, the TD($\lambda$) method can be useful when we want to balance the bias-variance tradeoff in the estimation of the value function, which can be achieved by adjusting the value of $\lambda$.

So TD($\lambda$) uses $\lambda$ as its tuning variable, while n-step TD uses $n$. TD($\lambda$) is a generalization of n-step TD: it can be seen as a weighted average of the n-step TD targets for all $n$ up to infinity.

Conclusion

In this post, we have introduced the model-free Reinforcement learning, which does not require the simulation of the environment and the construction of MDPs. We have introduced two methods to estimate the value function from episodes of experience: Monte Carlo methods and Temporal-Difference learning. We have also introduced the n-step TD learning method, which uses the rewards and estimated values of multiple future states to update the value function, which can better capture the long-term dependencies in the environment. Finally, we have discussed the bias-variance tradeoff in the estimation of the value function, which can be controlled by adjusting the value of $\lambda$ in TD($\lambda$) method.

Foundation of Reinforcement learning(III) //blog/Foundation-of-Reinforcement-learning-III/

Introduction

In the previous post, we introduced the Bellman equation and the linear programming formulation for MDP. In this post, we will discuss model-based reinforcement learning, which solves the MDP with the help of a model of the environment.

The Settings of RL

Typically, RL is framed as an MDP: the agent explores the environment and learns the optimal policy.
Generally, we can only observe episodes, and usually we do not have a model of the environment.

So we need to introduce a model into our RL project. Model-based RL is then a method for solving the resulting MDP.

Dynamic Programming Based RL

Dynamic Programming for finite MDP

Our objective function is simple, just the expected return, which is defined as:
$$
\max_{\pi} E_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} | s_0 = s \right]
$$

In the first post of the series, we introduced backward induction, in which we start from the last step and then solve the problem recursively. However, this method is not efficient for large state spaces, and it requires the state transition model.

But meanwhile, we can also use the Bellman equation of value function to tackle the problem:
$$
V^{\pi}(s) = \sum_{a \in A} \pi(a|s)
\left[
\underset{\text{immediate reward}}{\underbrace{R(s,a)}} +
\underset{\text{discount}}{\underbrace{\gamma}}
\sum_{s' \in S} \underset{\text{transition}}{\underbrace{P(s'|s,a)}}
\underset{\text{future value}}{\underbrace{V^{\pi}(s')}}
\right]
$$

Optimal Value Function

For a state $s$, we can define the optimal value function as the maximum value function over all policies:
$$
V^*(s) = \max_{\pi} V^{\pi}(s)
$$
The optimal value function satisfies the Bellman optimality equation:
$$
V^\ast(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\ast(s') \right]
$$

So the best policy can be derived from the optimal value function as:
$$
\pi^*(s) = \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^\ast(s') \right]
$$

For any state $s$ and policy $\pi$, we have:
$$
V^\ast(s) = V^{\pi^*}(s) \geq V^{\pi}(s)
$$

Obviously, the value function depends on the policy and the policy can be improved from the value function, so we can iterate on either of them until convergence. The two resulting algorithms are called Value Iteration and Policy Iteration respectively.

Value Iteration

For an MDP which is finite in both state and action space, we can use the value iteration to solve the problem. The value iteration is as follows:

Initialize $V(s) = 0$ for all $s \in S$
For each state $s \in S$, update the value function as:
$V(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$
Repeat step 2 until convergence.

NOTE: The states do not have to be updated in any specific order; any order works, although the convergence rate may differ.

Sync & Async Value Iteration

Sync value iteration needs to store two copies of the value function:

For any state $s$, we update the value function as:
$V_{new}(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V_{old}(s') \right]$
After updating all states, we copy the new value function to the old value function:
$V_{old} \leftarrow V_{new}$

Async value iteration only needs to store one copy of the value function, which is updated in place:
$V(s) \leftarrow \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$
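
The two variants can be sketched side by side in a few lines of Python. This is only a minimal illustration: the 2-state, 2-action MDP below (`P`, `R`, `GAMMA`) is made up for the example, and the stopping tolerance `tol` is an assumption, not something the text prescribes.

```python
# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a] is a list of (next_state, probability); R[s][a] is the reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}


def backup(V, s):
    """One Bellman optimality backup for state s."""
    return max(
        R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
        for a in ACTIONS
    )


def sync_value_iteration(tol=1e-10):
    V_old = {s: 0.0 for s in STATES}
    while True:
        # Every backup reads only V_old, so two copies are needed.
        V_new = {s: backup(V_old, s) for s in STATES}
        if max(abs(V_new[s] - V_old[s]) for s in STATES) < tol:
            return V_new
        V_old = V_new


def async_value_iteration(tol=1e-10):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = backup(V, s)  # may read values already updated in this sweep
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


V_sync = sync_value_iteration()
V_async = async_value_iteration()
```

In this toy MDP both variants reach the same fixed point; they differ only in memory use and in how quickly information propagates within a sweep.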

Policy Iteration

Policy iteration makes the same assumption as value iteration: the MDP is finite in both state and action space. The policy iteration is as follows:

Randomly initialize a policy $\pi$ and a value function $V(s) = 0$ for all $s \in S$
Repeat the following steps until convergence:
Policy Evaluation: For each state $s \in S$, update the value function (repeating the update until $V$ converges to $V^{\pi}$) as:
$V(s) \leftarrow \sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$
Policy Improvement: For each state $s \in S$, update the policy as:
$\pi(s) \leftarrow \arg\max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$

Each iteration of Policy Iteration is more expensive than an iteration of Value Iteration, since it has to evaluate the current policy. However, Policy Iteration usually needs fewer iterations to converge, since every iteration ends with a full policy improvement step.
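
A minimal sketch of policy iteration with a deterministic policy is given below. The toy MDP (`P`, `R`, `GAMMA`) is hypothetical, the policy starts from a fixed action instead of a random one so that the run is reproducible, and policy evaluation is done by repeated sweeps with an assumed tolerance `tol`.

```python
# Hypothetical 2-state, 2-action MDP, used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}


def q_value(V, s, a):
    """Reward plus discounted expected value of the next state."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman expectation backup."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = q_value(V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def policy_iteration():
    policy = {s: ACTIONS[0] for s in STATES}  # fixed start (assumption)
    while True:
        V = evaluate(policy)
        # Policy improvement: act greedily with respect to V.
        improved = {
            s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES
        }
        if improved == policy:  # policy is stable, so it is optimal
            return policy, V
        policy = improved


best_policy, best_V = policy_iteration()
```

On this example the policy becomes stable after two improvement steps, while each evaluation needs many sweeps, which is the cost trade-off described above.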

Let’s contrast the two methods:

Method	Value Iteration	Policy Iteration
Update	Value function	Policy and value function
uses	Bellman optimality equation	Bellman expectation equation

NOTE:
Value iteration is a greedy method: each update always takes the best action under the current value estimate.
Updating the value function with the Bellman expectation equation in Policy Iteration is expensive.
For MDPs with a small state space, Policy Iteration is usually faster than Value Iteration, but for MDPs with a large state space, Value Iteration is usually faster.
If the state transition graph has no cycles, value iteration is the better choice.

Bellman operators

In fact, we have introduced the Bellman operators in the previous post, but we haven't discussed them in detail.

Why do Policy Iteration and Value Iteration converge to the optimal value function? The key is that the Bellman operators are contraction mappings.

The Bellman operators are the following two functions:

Bellman expectation operator, usually denoted as $\mathcal{T}^{\pi}$, which is defined as:
$$(\mathcal{T}^{\pi} V)(s) = \sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$$
Bellman optimality operator, usually denoted as $\mathcal{T}^{\ast}$ or $\mathcal{T}$, which is defined as:
$$(\mathcal{T} V)(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V(s') \right]$$

They play different roles in the two algorithms:

The expectation operator is used in policy iteration to compute the value function of a given policy; it is the inner loop of policy iteration.
The optimality operator is used in value iteration to compute the optimal value function; it is the main loop of value iteration.

Both the Bellman expectation operator and the Bellman optimality operator can be defined on the action value function and the state value function:

Bellman expectation operator on V-function:

$$\begin{aligned}
V^{\pi}(s) &= E_{\pi} \left[\sum_{t = 0}^{\infty} \gamma^t R_t \mid s_0 = s\right] \\
&= E_{\pi} \left[R(s_0, a_0) + \gamma V^{\pi}(s_1) \mid s_0 = s\right] \\
&= \sum_{a \in A} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) V^{\pi}(s') \right] \\
&= (\mathcal{T}^{\pi} V^{\pi})(s)
\end{aligned}$$

Bellman optimality operator on V-function:

$$\begin{aligned}
V^{\ast}(s) &= \max_{\pi} E_{\pi} \left[\sum_{t = 0}^{\infty} \gamma^t R_t \mid s_0 = s\right] \\
&= \max_{a \in A} \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\ast}(s') \right] \\
&= (\mathcal{T} V^{\ast})(s)
\end{aligned}$$

Bellman expectation operator on Q-function:

$$\begin{aligned}
Q^{\pi}(s, a) &= E_{\pi} \left[\sum_{t = 0}^{\infty} \gamma^t R_t \mid s_0 = s, a_0 = a\right] \\
&= E_{\pi} \left[R(s_0, a_0) + \gamma Q^{\pi}(s_1, a_1) \mid s_0 = s, a_0 = a\right] \\
&= R(s, a) + \gamma \sum_{s' \in S} P(s'|s,a) \sum_{a' \in A} \pi(a'|s') Q^{\pi}(s', a') \\
&= (\mathcal{T}^{\pi} Q^{\pi})(s, a)
\end{aligned}$$

Bellman optimality operator on Q-function:

$$\begin{aligned}
Q^{\ast}(s, a) &= \max_{\pi} E_{\pi} \left[\sum_{t = 0}^{\infty} \gamma^t R_t \mid s_0 = s, a_0 = a\right] \\
&= R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a' \in A} Q^{\ast}(s', a') \\
&= (\mathcal{T} Q^{\ast})(s, a)
\end{aligned}$$

Due to the contraction property of the Bellman operators, we can guarantee the convergence of the value iteration and policy iteration to the optimal value function.
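
The contraction property says that applying the operator to two value functions brings them closer in the sup norm by at least a factor $\gamma$: $\|\mathcal{T}U - \mathcal{T}V\|_\infty \leq \gamma \|U - V\|_\infty$. The sketch below checks this numerically for the optimality operator. The 3-state MDP and the random test vectors are assumptions made for the demonstration; a numerical check like this illustrates the property but does not prove it.

```python
import random

GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2
# Hypothetical MDP: P[s][a][s2] is a transition probability, R[s][a] a reward.
P = [
    [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]],
    [[0.1, 0.8, 0.1], [1.0, 0.0, 0.0]],
    [[0.0, 0.3, 0.7], [0.2, 0.2, 0.6]],
]
R = [[1.0, 0.0], [0.5, 2.0], [0.0, -1.0]]


def bellman_optimality(V):
    """Apply the Bellman optimality operator to a value vector."""
    return [
        max(
            R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_STATES))
            for a in range(N_ACTIONS)
        )
        for s in range(N_STATES)
    ]


def sup_norm(U, V):
    return max(abs(u - v) for u, v in zip(U, V))


def worst_ratio(trials=1000, seed=0):
    """Largest observed ||TU - TV|| / ||U - V|| over random pairs (U, V)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        U = [rng.uniform(-10, 10) for _ in range(N_STATES)]
        V = [rng.uniform(-10, 10) for _ in range(N_STATES)]
        ratio = sup_norm(bellman_optimality(U), bellman_optimality(V)) / sup_norm(U, V)
        worst = max(worst, ratio)
    return worst


ratio = worst_ratio()  # should never exceed GAMMA
```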

Model-based RL

In the sections above, the environment is a known MDP: all our methods assume that we have the model of the environment, namely the transition probability and the reward function. However, in many real-world scenarios we do not have the model, so we need to learn it from data.

There are two quantities we need to estimate in order to learn the model of the environment:

Learn the state transition probability $P(s'|s, a)$:
$$P(s'|s, a) = \frac{N(s, a, s')}{N(s, a)}$$
where $N(s, a, s')$ is the number of times we have observed the transition from state $s$ to state $s'$ when taking action $a$, and $N(s, a)$ is the number of times we have observed taking action $a$ in state $s$.
Learn the reward function $R(s, a)$:
$$R(s, a) = \textbf{average}\left( R_t \mid s_t = s, a_t = a \right)$$
that is, $R(s, a)$ is the average reward we have observed over the $N(s, a)$ times we took action $a$ in state $s$.

A simple simulation-based algorithm is as follows:

Randomly initialize a policy $\pi$
Repeat the following steps until convergence:
Collect data by executing the policy $\pi$ in the environment, and store the transition data in a replay buffer.
Learn the model of the environment from the replay buffer, which includes learning the state transition probability and the reward function.
Solve the MDP with the learned model to get the optimal policy $\pi^\ast$, and use it as the new $\pi$.
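
The model-learning step is just counting. A minimal sketch, assuming the replay buffer is a list of `(s, a, r, s_next)` tuples (this layout and the tiny buffer below are made up for illustration):

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate P(s'|s,a) and R(s,a) from (s, a, r, s_next) tuples by counting."""
    n_sa = defaultdict(int)      # N(s, a)
    n_sas = defaultdict(int)     # N(s, a, s')
    r_sum = defaultdict(float)   # sum of rewards observed for (s, a)
    for s, a, r, s_next in transitions:
        n_sa[(s, a)] += 1
        n_sas[(s, a, s_next)] += 1
        r_sum[(s, a)] += r
    p_hat = {(s, a, s2): c / n_sa[(s, a)] for (s, a, s2), c in n_sas.items()}
    r_hat = {sa: total / n_sa[sa] for sa, total in r_sum.items()}
    return p_hat, r_hat


# Hypothetical replay buffer.
buffer = [
    ("s0", "go", 1.0, "s1"),
    ("s0", "go", 0.0, "s0"),
    ("s0", "go", 1.0, "s1"),
    ("s0", "go", 2.0, "s1"),
    ("s1", "stay", 0.5, "s1"),
]
p_hat, r_hat = estimate_model(buffer)
```

State-action pairs that never appear in the buffer get no estimate at all, which is one reason the data collection step has to keep exploring.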

Another approach is to skip learning the MDP and instead learn the value function directly from the data. This is called model-free RL, and we will discuss it in the next post.

Conclusion

In this post, we have introduced model-based reinforcement learning. We have discussed value iteration and policy iteration, two basic methods to solve a known MDP. We have also introduced the Bellman operators, whose contraction property guarantees the convergence of value iteration and policy iteration. Finally, we have introduced a simple simulation-based algorithm, which learns the model of the environment from data and solves the MDP with the learned model. In the next post, we will discuss model-free reinforcement learning, which learns the value function directly from the data without learning the model of the environment.

blog Reinforcement learning review notes

Foundation of Reinforcement learning(II) //blog/Foundation-of-Reinforcement-learning-II/

Introduction

In the previous post, we introduced the MDP and some of its basic properties, so we are now ready to discuss MDP-based reinforcement learning. But first, we need to introduce how to solve an MDP.

Bellman Equation

As we saw in the previous post, there are two types of value function: the state value function and the action value function. The state value function is defined as:
$$
V^{\pi}(s) = E_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \mid s_0 = s \right]
$$

Intuitively, the state value function is the expected return when we start from state $s$ and follow policy $\pi$. Similarly, the action value function is defined as:

$$
Q^{\pi}(s, a) = E_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} | s_0 = s, a_0 = a \right]
$$

Here, it represents the expected return when we start from state $s$, take action $a$, and then follow policy $\pi$ thereafter.

On the other hand, we have the cumulative discounted reward (the return), which is defined as:
$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
It can be recursively defined as:
$$
G_t = R_{t+1} + \gamma G_{t+1}
$$
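
The recursive form is also how returns are computed in practice for a finite episode: walk the reward list backwards and accumulate. A small sketch, where `rewards[t]` is assumed to hold the reward received after step `t`:

```python
def discounted_returns(rewards, gamma):
    """Return the list of G_t, using G_t = rewards[t] + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # return after the end of the episode
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


print(discounted_returns([1.0, 0.0, 2.0], 0.5))  # [1.5, 1.0, 2.0]
```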

Taking the expectation of this recursion under policy $\pi$, we can rewrite the state value function as:
$$
V^{\pi}(s) = \sum_{a} \pi(a|s) \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\pi}(s') \right]
$$
Above is the Bellman expectation equation for the state value function. Now, our question is to choose the best policy $\pi$ to maximize the value function. We can define the optimal state value function as:
$$
V^*(s) = \max_{\pi} V^{\pi}(s)
$$

By the Principle of Optimality (each stage of an optimal policy must be optimal for the remaining stages), we can derive the Bellman optimality equation for the state value function:
$$
V^{\ast}(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\ast}(s') \right]
$$

Linear Programming for MDP

We have a key observation that the Bellman optimality equation can be rewritten as a linear programming problem. The linear programming formulation for MDP is as follows:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{s} V(s) \\
\text{subject to} \quad & V(s) \geq R(s, a) + \gamma \sum_{s'} P(s'|s, a) V(s'), \\
&\quad \forall s \in S, a \in A
\end{aligned}
$$

Proof:
The first part is to show that the optimal value function $V^{\ast}$ is a feasible solution to the above linear programming problem. For any state $s$ and action $a$, we have:
$$
\begin{aligned}
V^{\ast}(s) &= \max_{a'} \left[ R(s, a') + \gamma \sum_{s'} P(s'|s, a') V^{\ast}(s') \right] \\
&\geq R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\ast}(s')
\end{aligned}
$$
Thus, $V^{\ast}$ satisfies the constraints of the linear programming problem.
The second part is to show that any feasible solution $V$ satisfying the constraints must be greater than or equal to $V^{\ast}$ in every state.
Given that the optimal policy $\pi^{\ast}$ chooses one action $\pi^{\ast}(s)$ for each state $s$, the LP constraints imply that for any state $s$:
$$
V(s) \geq R(s, \pi^{\ast}(s)) + \gamma \sum_{s'} P(s'|s, \pi^{\ast}(s)) V(s')
$$
Let's write them in matrix form:
$$
V \geq R^{\pi^{\ast}} + \gamma P^{\pi^{\ast}} V
$$
where $R^{\pi^{\ast}}$ is the reward vector under policy $\pi^{\ast}$, and $P^{\pi^{\ast}}$ is the transition matrix under policy $\pi^{\ast}$. We can rearrange the above inequality as:
$$
(I - \gamma P^{\pi^{\ast}}) V \geq R^{\pi^{\ast}}
$$
Since $\gamma < 1$ and $P^{\pi^{\ast}}$ is a stochastic matrix, $I - \gamma P^{\pi^{\ast}}$ is invertible, and its inverse is given by the series:
$$
(I - \gamma P^{\pi^{\ast}})^{-1} = \sum_{k=0}^{\infty} (\gamma P^{\pi^{\ast}})^k
$$
Every term of this series is a non-negative matrix, so the inverse is non-negative as well, and multiplying both sides of the inequality by $(I - \gamma P^{\pi^{\ast}})^{-1}$ preserves its direction:
$$
V \geq (I - \gamma P^{\pi^{\ast}})^{-1} R^{\pi^{\ast}} = V^{\pi^{\ast}}
$$
Since $V^{\pi^{\ast}} = V^{\ast}$, we can conclude that $V \geq V^{\ast}$.
So, we have shown that any feasible solution is greater than or equal to $V^{\ast}$, and that the optimal value function $V^{\ast}$ is itself a feasible solution. Thus $V^{\ast}$ minimizes $\sum_{s} V(s)$ over the feasible set, i.e. the optimal solution to the linear programming problem is $V^{\ast}$.

Once we have $V^{\ast}$, we can easily derive the optimal policy $\pi^{\ast}$ as:
$$
\pi^{\ast}(s) = \arg\max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s'|s, a) V^{\ast}(s') \right]
$$

Here I asked Claude to give me a simple explanation of why this greedy-looking policy is equivalent to the optimal policy. The formal proof may need the fixed-point theorem, which is beyond the scope of this post. We just need to remember that the optimal policy can be derived from the optimal value function by choosing the action that maximizes the expected return.

The Dual Linear Programming for MDP

The dual linear programming formulation for MDP is as follows:
$$
\begin{aligned}
\text{maximize} \quad & \sum_{s, a} \rho(s, a) R(s, a) \\
\text{subject to} \quad & \sum_{a} \rho(s', a) = w(s') + \gamma \sum_{s, a} \rho(s, a) P(s'|s, a), \\
&\quad \forall s' \in S \\
& \rho(s, a) \geq 0, \quad \forall s \in S, a \in A
\end{aligned}
$$

Here $w(s) > 0$ is the weight of state $s$ in the primal objective $\sum_{s} w(s) V(s)$; the primal problem above corresponds to $w(s) = 1$. If the initial state distribution satisfies $\mu(s) > 0$ for all $s \in S$, then we can let $w(s) = \mu(s)$. In that case the objective is the expected accumulated reward represented by the occupancy measure, and the constraints are the flow conservation constraints.

Assume that the optimal solution is $\rho^{\ast}(s, a)$. By Theorem 2 of the previous post, we can derive the optimal policy as:
$$
\pi^{\ast}(a|s) = \frac{\rho^{\ast}(s, a)}{\sum_{a'} \rho^{\ast}(s, a')}
$$

Comparison between the Primal and Dual Linear Programming for MDP

Dimension	Primal LP	Dual LP
variable	state value function $V(s)$	occupancy measure $\rho(s, a)$
objective	minimize $\sum_{s} V(s)$	maximize $\sum_{s, a} \rho(s, a) R(s, a)$
constraints	Bellman optimality constraints	flow conservation constraints
explanation	state value	action frequency

Summary

At the beginning of this post, I intended to introduce MDP-based reinforcement learning, but I found that the solution of MDP takes a lot of space, so I only introduced the Bellman equation and the linear programming formulation for MDP. In the next post, I will introduce the value iteration and policy iteration algorithms for solving MDP, which are based on the Bellman equation.

blog Reinforcement learning review notes

Foundation of Reinforcement learning(I) //blog/Foundation-of-Reinforcement-learning-I/

The category of decision making problem

dimension	single step	multi step
one person	optimization problem	RL, to the best situation
multi person	static game	dynamic game, MARL, etc.

Dynamic programming

Dynamic programming is used to solve sequential decision-making problems. The feature of such a problem is that its decisions are made sequentially: the decision at one step affects the next step, and the reward is received at the end of the decision-making process, not at each step.

For example, given the maze-like problem below, the agent needs to find a way from Position A to Position B, and each route takes a different amount of time. The agent needs to find the route that takes the least time. A simple way to solve this is to list all the possible paths, but if there is a cycle, or if the map is large, this becomes infeasible.

A better way to solve this is backward induction: we start from the end point, and for every point we calculate the time to reach the end point; then we regard the selected point as the new end point, and repeat this process until we reach the start point. This is a dynamic programming method, and it can solve the problem in polynomial time. But since we need to find a backward path, this method is only suitable for a DAG; if there is a cycle, this method will fail.

maze
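
Backward induction on such a maze can be sketched as follows. The graph below is a made-up DAG with travel times on the edges, not the maze from the figure; nodes are processed in reverse topological order, so every node is handled only after all of its successors.

```python
# Hypothetical maze as a DAG: edges[u] = [(v, travel_time), ...].
edges = {
    "A": [("C", 2), ("D", 5)],
    "C": [("D", 1), ("B", 7)],
    "D": [("B", 3)],
    "B": [],
}


def reverse_topological_order(edges):
    """Post-order DFS: every node appears after all of its successors."""
    order, seen = [], set()

    def visit(u):
        if u in seen:
            return
        seen.add(u)
        for v, _ in edges[u]:
            visit(v)
        order.append(u)

    for u in edges:
        visit(u)
    return order


def backward_induction(edges, goal):
    """Time-to-goal and best next node for every node of a DAG."""
    cost = {u: float("inf") for u in edges}
    cost[goal] = 0
    best_next = {}
    for u in reverse_topological_order(edges):  # end point first
        for v, w in edges[u]:
            if w + cost[v] < cost[u]:
                cost[u] = w + cost[v]
                best_next[u] = v
    return cost, best_next


cost, best_next = backward_induction(edges, "B")
```

Each edge is examined once, which is where the polynomial running time comes from; with a cycle there is no valid processing order, which is why the method is restricted to DAGs.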

The example is just an introduction. We can summarize the features of dynamic programming as follows:

It starts from the end, and calculates the best action for each state.
It traverses all the states, and for each state it calculates the best action and the value of this state.
It needs to define the state, the path (state transition), and the time (online reward).

So it leads to the Principle of Optimality:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Markov Decision Process

Stochastic Process

A stochastic process is a collection of random variables, which can be used to describe the evolution of a system over time. In general, the next state is described by the conditional distribution:
$$
P(X_{t+1}|X_t,X_{t-1},\dots,X_0)
$$
This means that the probability of the next state $X_{t+1}$ depends on the current state $X_t$ and all the previous states $X_{t-1},\dots,X_0$.

Markov Process

Compared to a general stochastic process, a Markov process makes a stronger assumption: “the future is independent of the past given the present”. Mathematically, its definition is as follows:
$$
P(X_{t+1}|X_t,X_{t-1},\dots,X_0) = P(X_{t+1}|X_t)
$$
This means that the probability of the next state $X_{t+1}$ only depends on the current state $X_t$, and is independent of all the previous states $X_{t-1},\dots,X_0$.

One way to understand this property is that the current state contains all the information about the past, so we can make decisions based on the current state without worrying about the past.

Markov Decision Process

Markov Decision Process (MDP) provides a mathematical framework for modeling decision making in situations where the outcome is partly random, partly under the control of a decision maker. An MDP is defined by the following components:

State space (S): A set of all possible states in the environment.
Action space (A): A set of all possible actions that the agent can take.
Transition function (P): A function that defines the probability of transitioning from one state to another given a specific action. It is denoted as $P(s'|s,a)$, which represents the probability of transitioning to state $s'$ from state $s$ after taking action $a$.
Reward function (R): A function that defines the reward received for taking a specific action in a given state. It is denoted as $R(s,a)$, which represents the reward received after taking action $a$ in state $s$. Sometimes the reward depends only on the state.
Discount factor (γ): A factor that determines the importance of future rewards. It is a value between 0 and 1, where a value closer to 0 makes the agent prioritize immediate rewards, while a value closer to 1 makes the agent consider future rewards more heavily.

The dynamic feature of MDP

The whole process of MDP is dynamic as follows:

The agent observes the current state $s_t$.
The agent selects an action $a_t$ based on its policy $\pi(a|s)$, which maps each state to a distribution over actions.
The agent gets a reward $R(s_t,a_t)$.
The MDP transitions to a new state $s_{t+1}$ according to the transition function $P(s_{t+1}|s_t,a_t)$.

The total reward that the agent receives over time is often defined as the discounted sum of rewards:
$$
G_t = R(s_t,a_t) + \gamma R(s_{t+1},a_{t+1}) + \gamma^2 R(s_{t+2},a_{t+2}) + \dots = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k},a_{t+k})
$$

Markov Policy

In the context of MDP, a general policy is a function that depends on the whole history:
$$
\begin{aligned}
h_t &= (s_0,a_0,s_1,a_1,\dots,s_{t-1},a_{t-1},s_t) \\
\pi(a_t|h_t) &= P(a_t|h_t)
\end{aligned}
$$

But a Markov policy is a special type of policy that only depends on the current state:
$$
\pi(a_t|s_t) = P(a_t|s_t)
$$

In the RL setting, we usually assume that the policy is a Markov policy. Why?

The MDP has the Markov property, which means the future is independent of the past given the present. There is therefore no special information in the history that can help us make a better decision, so we can just use the current state to make decisions. Put differently, for any policy relying on the history, we can find a Markov policy that performs at least as well, so we can focus on Markov policies without loss of generality. Proof: see the 26th and 27th slides of the lecture.

The category of MDP Policy

Along the time dimension, we can categorize the policy into two types:

Stationary policy: A policy that does not change over time. It is defined as $\pi(a|s)$, which means the action taken in state $s$ is the same at any time step.
Non-stationary policy: A policy that can change over time. It is defined as $\pi_t(a|s)$, which means the action taken in state $s$ can be different at different time steps.

Along the probability distribution dimension, we can categorize the policy into two types:

Deterministic policy: A policy that always selects the same action for a given state. It is defined as $\pi(s) = a$, which means the action taken in state $s$ is always $a$.
Stochastic policy: A policy that selects actions according to a probability distribution. It is defined as $\pi(a|s) = P(a|s)$, which means the action taken in state $s$ is sampled from this probability distribution.

In the RL setting, we usually assume that the policy is a stationary policy. Why?
Typically, we consider the infinite horizon setting. There is also a proof that for any non-stationary policy, we can find a stationary policy that performs at least as well, so we can focus on stationary policies without loss of generality. Proof: see the 29th and 32nd slides of the lecture.

The best policy for MDP

There is a theorem:

If the discount factor satisfies $\gamma \lt 1$, the state and action spaces are finite, and the horizon is infinite, then there exists a deterministic and stationary policy $\pi^\ast$ that is optimal, which means that for any policy $\pi$ and any state $s$, we have $V^{\pi^\ast}(s) \geq V^{\pi}(s)$.

Proof: Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

The goal of MDP

Our goal is to choose the action to maximize the expected reward, which is defined as follows:
$$
\textbf{E}\left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \dots\right] = \textbf{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]
$$

So we can define the value function for a policy $\pi$ as follows:
$$
V^{\pi}(s) = \textbf{E}[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) | s_0 = s]
$$
This means the expected reward that the agent can get starting from state $s$ and following policy $\pi$.

Occupancy Measure

In the MDP context, the occupancy measure represents the expected discounted number of visits to each state-action pair under a policy $\pi$; it is also known as the state-action visitation distribution. It is defined as follows:
$$
\rho^{\pi}(s,a) = \underset{a \sim \pi(s),\, s' \sim p(s, a)}{\mathbb{E}} \left[ \sum_{t=0}^{\infty} \gamma^t \mathbb{I}(s_{t} = s, a_{t} = a) \right]
$$

where $s' \sim p(s, a)$ denotes the state transition, which is defined as follows:
$$
s_{t+1} \sim p(s_t, a_t)
$$

On the other hand, the state occupancy measure is defined as follows:
$$
\rho^{\pi}(s) = \underset{a \sim \pi(s),\, s' \sim p(s, a)}{\mathbb{E}} \left[ \sum_{t=0}^{\infty} \gamma^t \mathbb{I}(s_{t} = s) \right]
$$

How to compute the occupancy measure?

State occupancy measure

We assume that the initial state distribution is $\mu(s)$, then we can compute the state occupancy measure as follows:
$$
\rho^{\pi}(s’) = \mu(s’) + \gamma \sum_{s} p^{\pi}(s’|s)\rho^{\pi}(s)
$$

then we can solve this formula in matrix form:
$$
\rho^{\pi} = \left(I - \gamma (P^{\pi}_{SS'})^T\right)^{-1} \mu
$$
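
Instead of inverting the matrix, the same fixed point can be reached by iterating the recursion $\rho^{\pi}(s') = \mu(s') + \gamma \sum_{s} p^{\pi}(s'|s)\rho^{\pi}(s)$. A minimal sketch with a made-up two-state chain (`MU`, `P_PI` and the iteration count are assumptions for illustration):

```python
GAMMA = 0.9
MU = [1.0, 0.0]  # hypothetical initial state distribution
# P_PI[s][s2] = p^pi(s2 | s): state transitions under a fixed policy (hypothetical).
P_PI = [
    [0.5, 0.5],
    [0.0, 1.0],
]


def state_occupancy(mu, p_pi, gamma, iters=2000):
    """Iterate rho(s') = mu(s') + gamma * sum_s p_pi(s'|s) * rho(s)."""
    n = len(mu)
    rho = list(mu)
    for _ in range(iters):
        rho = [
            mu[s2] + gamma * sum(p_pi[s][s2] * rho[s] for s in range(n))
            for s2 in range(n)
        ]
    return rho


rho = state_occupancy(MU, P_PI, GAMMA)
total_mass = sum(rho)  # equals 1 / (1 - gamma) for the unnormalized measure
```

The total mass $1/(1-\gamma)$ is a handy sanity check: the discounted visit counts over all states sum to $\sum_t \gamma^t$.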

State-action occupancy measure

We can compute the state-action occupancy measure as follows:
$$
\rho^{\pi}(s',a') = \pi(a'|s') \left[ \mu(s') + \gamma \sum_{s, a} p(s'|s, a)\rho^{\pi}(s,a) \right]
$$

Note that the whole process obeys flow conservation. Because the state-action occupancy measure is the expected discounted number of times that the agent takes action $a$ in state $s$, the total flow out of a state $s'$ must equal the initial mass $\mu(s')$ plus the discounted flow into $s'$. This is why we have the flow conservation constraint in the computation of the occupancy measure.

Some Properties of Occupancy Measure

Obviously, from the definition of the measures:

$\rho^{\pi}(s) = \sum_{a} \rho^{\pi}(s,a)$
$\rho^{\pi}(s,a) = \pi(a|s)\rho^{\pi}(s)$

We have two important theorems about the occupancy measure:

Theorem 1: For two policies $\pi$ and $\pi'$ interacting with the same dynamic environment, if $\rho^{\pi} = \rho^{\pi'}$, then $\pi = \pi'$.
Theorem 2: Given an occupancy measure $\rho$, the only policy that can generate this occupancy measure is $\pi_{\rho}(a|s) = \frac{\rho(s,a)}{\sum_{a'} \rho(s,a')}$.

Accumulated reward for a policy

As we have defined the occupancy measure, we can compute the accumulated reward for a policy $\pi$ as follows:
$$
\begin{aligned}
\mathbb{V}(\pi) &= \underset{a \sim \pi(\cdot|s),\, s' \sim p(\cdot|s,a)}{\mathbb{E}} \left[R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \dots\right] \\
&= \sum_{s, a} \underset{a \sim \pi(\cdot|s),\, s' \sim p(\cdot|s,a)}{\mathbb{E}} \left[R(s, a)\right] \rho^{\pi}(s, a) \\
&= \sum_{s, a} R(s, a) \rho^{\pi}(s, a) \\
&= \underset{\rho^{\pi}}{\mathbb{E}}\left[R(s, a)\right]
\end{aligned}
$$
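
The identity can be checked numerically, together with the policy recovery of Theorem 2. Everything in the sketch below (the 2-state MDP, the stochastic policy `PI`, the initial distribution `MU`, the iteration counts) is a made-up example: it computes the state-action occupancy measure by fixed-point iteration, then compares $\sum_{s,a} R(s,a)\rho^{\pi}(s,a)$ with $\sum_s \mu(s) V^{\pi}(s)$ obtained from ordinary policy evaluation.

```python
GAMMA = 0.9
S, A = 2, 2
MU = [1.0, 0.0]                     # hypothetical initial distribution
PI = [[0.5, 0.5], [0.2, 0.8]]       # PI[s][a] = pi(a|s), hypothetical
P = [                               # P[s][a][s2], hypothetical
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [[0.0, 1.0], [2.0, 0.5]]


def occupancy(iters=2000):
    """Fixed-point iteration for the state-action occupancy measure."""
    rho = [[MU[s] * PI[s][a] for a in range(A)] for s in range(S)]
    for _ in range(iters):
        inflow = [
            sum(P[s][a][s2] * rho[s][a] for s in range(S) for a in range(A))
            for s2 in range(S)
        ]
        rho = [
            [PI[s2][a2] * (MU[s2] + GAMMA * inflow[s2]) for a2 in range(A)]
            for s2 in range(S)
        ]
    return rho


def policy_value(iters=2000):
    """Iterative policy evaluation with the Bellman expectation backup."""
    V = [0.0] * S
    for _ in range(iters):
        V = [
            sum(
                PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(S)))
                for a in range(A)
            )
            for s in range(S)
        ]
    return V


rho = occupancy()
return_from_rho = sum(R[s][a] * rho[s][a] for s in range(S) for a in range(A))
return_from_v = sum(MU[s] * v for s, v in enumerate(policy_value()))
# Theorem 2: normalizing rho over actions gives back the policy.
recovered = [[rho[s][a] / sum(rho[s]) for a in range(A)] for s in range(S)]
```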

Value function and Q function

The value function is used to evaluate a state or a state-action pair, given a policy $\pi$.

The state value function is usually known as the value function, which is defined as follows:
$$
V^{\pi}(s) = \underset{a \sim \pi(\cdot|s),\, s' \sim p(\cdot|s,a)}{\mathbb{E}} \left[R(s, a) + \gamma V^{\pi}(s')\right]
$$

The state-action value function is usually known as the Q function, which is defined as follows:
$$
Q^{\pi}(s, a) = \underset{s' \sim p(\cdot|s,a),\, a' \sim \pi(\cdot|s')}{\mathbb{E}} \left[R(s, a) + \gamma Q^{\pi}(s', a')\right]
$$

Obviously, we have the relationship between the value function and the Q function:
$$
V^{\pi}(s) = \sum_{a} \pi(a|s) Q^{\pi}(s, a)
$$
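And we can also compute the value function using the occupancy measure, and the Q function from the value function, as follows:
$$
\begin{aligned}
V^{\pi}(s) &= \sum_{s', a'} R(s', a') \rho^{\pi}_{s}(s', a') \\
Q^{\pi}(s, a) &= R(s, a) + \gamma \sum_{s'} p(s'|s, a) V^{\pi}(s')
\end{aligned}
$$
where $\rho^{\pi}_{s}$ denotes the occupancy measure obtained when the initial state distribution $\mu$ puts all of its mass on state $s$.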

Summary

MDP provides a simple but powerful mathematical framework to model sequential decision-making problems.
The five-tuple of MDP is defined as $(S, A, P, R, \gamma)$, which represents the state space, action space, transition function, reward function and discount factor respectively.
The Markov property is the key assumption of MDP, which means the future is independent of the past given the present.
A policy is the function used to choose actions; it is usually a conditional probability distribution over actions given states, and we usually assume that the policy is stationary.
The occupancy measure represents the expected discounted state-action visitation under a policy, and it can be used to compute the accumulated reward for a policy.
The state value function and the state-action value function are used to evaluate a state or a state-action pair, given a policy, and they can be computed using the occupancy measure.

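blog Reinforcement learning review notes

nanoPD: Implementation Notes on an LLM P/D-Disaggregated Inference Engine //blog/nanopd/

This project started because, while reading vLLM, DistServe and Mooncake, there were places I had not fully thought through. Rather than going around in circles at the paper level, I figured I might as well implement it myself. So I spent some time, together with the almighty Claude, writing from scratch an inference engine that supports Prefill/Decode disaggregated scheduling. It covers the full stack from CUDA kernels to an adaptive router, in roughly 2000 lines of Python plus 400 lines of CUDA C++. Every module has its own documentation, and I also made Chinese and English versions while I was at it. (I'm a noob, please go easy on me, sob.)

Background and motivation

The two phases of LLM inference differ markedly in their computational characteristics. Prefill processes the whole prompt in one pass; it is essentially large-scale GEMM, compute-intensive, and the GPU's compute units are fully loaded during this phase. Decode generates only one token per step: each forward pass has only one (or very few) new tokens to compute, but it has to read the KV cache of all historical tokens from GPU memory. This is a typical memory-bound operation, in which the GPU's compute units spend most of their time waiting for data. In practice the arithmetic intensity of the decode phase is very low, and the main cost is HBM read bandwidth.

These two kinds of operations have completely different demand patterns for GPU resources. Running them concurrently on the same card causes contention for SM resources: the prefill matrix multiplications occupy a large number of SMs, so the attention computation of the decode requests running at the same time is delayed. The visible effect is that decode latency rises significantly when there is concurrent prefill. This interference is especially pronounced under high concurrency, and it is also an important reason why the tail latency of systems such as vLLM degrades under higher load.

The idea of P/D disaggregation (Disaggregated Serving) is to assign prefill and decode to dedicated GPUs so that the two phases do not interfere with each other. There is a fair amount of work in this direction in both industry and academia: DistServe analyzes the benefits of disaggregation fairly systematically, while Mooncake is Moonshot AI's engineering experience of practicing disaggregated scheduling in a production system. Both papers are interesting reads, but after reading them I still had questions about some concrete design decisions, such as how much the behavior of the routing policy differs across hardware, and how sensitive the conclusions are to the parameters in the cost model. Implementing everything once gives a more direct feel for these questions.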

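The implementation stack, from bottom to top

For the CUDA kernels, I hand-wrote a paged attention kernel and the KV store ops, mainly to understand how attention is done over non-contiguous memory blocks. Traditional attention assumes that the KV cache is stored in contiguous memory, but a paged KV cache cuts GPU memory into fixed-size physical blocks, and a sequence's KV data is scattered across these blocks. The attention computation has to do indirect addressing through the block table: inside the CUDA kernel, it first finds the physical block from the token's position and then reads the data. This gather has extra overhead compared with a contiguous KV cache, but what it buys is a significant improvement in memory utilization, because we no longer need to pre-allocate contiguous memory of the maximum length for every sequence. This tradeoff is well worth it for long sequences and high concurrency. The kernel implementation follows vLLM's design, but to keep things simple it does not do Flash Attention style tiling, so performance on long sequences is much worse. Really optimizing this part would take quite a lot of work. (Reserved as next work, so to speak.)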
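
The indirect addressing through the block table can be illustrated in plain Python. This is only a sketch of the lookup logic, not the project's CUDA kernel: `BLOCK_SIZE`, `kv_pool` and the string placeholders standing in for key vectors are all made up for the example.

```python
BLOCK_SIZE = 4  # tokens per physical block (assumed value)


def physical_slot(block_table, token_pos):
    """Map a logical token position to (physical_block, offset_in_block)."""
    logical_block, offset = divmod(token_pos, BLOCK_SIZE)
    return block_table[logical_block], offset


def gather_keys(kv_pool, block_table, seq_len):
    """Collect the keys of one sequence from scattered physical blocks."""
    keys = []
    for pos in range(seq_len):
        block, offset = physical_slot(block_table, pos)
        keys.append(kv_pool[block][offset])
    return keys


# Pool of 4 physical blocks; this sequence owns blocks 2 and 0, in that order.
kv_pool = [
    ["k4", "k5", None, None],
    [None, None, None, None],
    ["k0", "k1", "k2", "k3"],
    [None, None, None, None],
]
block_table = [2, 0]
print(gather_keys(kv_pool, block_table, 6))  # ['k0', 'k1', 'k2', 'k3', 'k4', 'k5']
```

The sequence only occupies the blocks it actually needs, and the last block is half empty at most, which is where the memory savings over max-length pre-allocation come from.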

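The block manager manages allocation and release of GPU memory at block granularity; logically it resembles virtual memory management in an operating system. Every physical block has a reference count, and a block is only really freed when its reference count drops to zero. It supports Copy-on-Write fork, which is useful for beam search or wherever sequence state needs to be duplicated: a fork shares the physical blocks, and the actual copy of a physical block is triggered only when one of the paths needs to write a new token, which avoids unnecessary memory copies.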
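
A reference-counted block manager with Copy-on-Write can be sketched as below. The class and method names (`BlockManager`, `fork`, `writable`) are hypothetical and do not come from the project; the actual KV data copy is only indicated by a comment.

```python
class BlockManager:
    """Toy reference-counted allocator for fixed-size KV blocks."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.ref = {}  # physical block id -> reference count

    def allocate(self):
        block = self.free.pop()
        self.ref[block] = 1
        return block

    def fork(self, block_table):
        """Share all blocks with a new sequence; nothing is copied yet."""
        for block in block_table:
            self.ref[block] += 1
        return list(block_table)

    def release(self, block):
        self.ref[block] -= 1
        if self.ref[block] == 0:  # only now is the block really freed
            del self.ref[block]
            self.free.append(block)

    def writable(self, block_table, idx):
        """Return a block that is safe to write, copying it first if shared."""
        block = block_table[idx]
        if self.ref[block] == 1:
            return block
        new_block = self.allocate()  # a real engine would copy the KV data here
        self.release(block)
        block_table[idx] = new_block
        return new_block


mgr = BlockManager(4)
parent = [mgr.allocate(), mgr.allocate()]
child = mgr.fork(parent)          # shares both blocks
written = mgr.writable(child, 1)  # triggers the copy for the last block only
```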

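The inference engine implements chunked prefill: a long prompt is cut into fixed-size chunks that are interleaved with the requests currently decoding, instead of pushing the entire prompt through at once and blocking all decode requests. The benefit of this design is that it reduces the interference of prefill with decode; the cost is that the total prefill time of a single request increases slightly because of the splitting. The scheduler maintains three queues, waiting, prefilling and running, and at every step it decides which requests enter prefill, which ones decode, and what the chunk size is. These scheduling decisions have a large impact on the overall latency and throughput of the system. The implementation uses a fairly simple heuristic policy, without complex dynamic adjustment.

There are three kinds of workers in the Worker layer:

CollocatedWorker does both prefill and decode on a single card, and internally reuses the inference engine's scheduler;
PrefillWorker handles prefill only; once finished, it extracts the generated KV cache from GPU memory into a pinned memory buffer (why does my server not even have PCIe!!!), ready for transfer;
DecodeWorker receives the transferred KV cache, loads it into its own GPU memory, and then adds the corresponding sequence to the decode queue.
The three kinds of workers can run concurrently on different GPUs, coordinated by the CentralScheduler above them. The CentralScheduler runs the collocated and the disaggregated pipelines on independent threads, executes them concurrently at every step, and then aggregates the results.

The KV transfer uses an independent transfer stream, with the goal of overlapping with the compute stream as much as possible to reduce waiting time. The code uses torch.cuda.Event to synchronize the dependencies between the two streams. The implementation is fairly simple and does no fine-grained pipeline overlap; in my measurements, the effect of the overlap depends on the ratio of transferred data to computation, and is not very noticeable with small batches or short sequences. It also automatically detects whether the current hardware supports direct P2P transfer, by querying torch's _check_p2p interface; if not, it falls back to a pinned memory relay, that is, it first copies the data from the GPU to CPU pinned memory and then from pinned memory to the target GPU. The bandwidth of this path is limited by PCIe, and in a multi-GPU environment without NVLink (for example 8 4090s interconnected over PCIe) this overhead is quite significant: I measured about 12.9 GB/s, whereas the H20 with NVLink reaches about 392 GB/s, a gap of nearly 30x. This gap directly determines how different the routing conclusions are on the two kinds of hardware.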

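Cost model and routing

To decide whether each request takes the collocated path or the disaggregated path, I implemented an analytical cost model. The idea is to first run micro-benchmarks on the real device to measure four parameters: the prefill speed α (ms/token), the decode per-step latency β (ms), the prefill-to-decode interference coefficient γ (ms/token), and the inter-GPU transfer bandwidth bw (GB/s). These four parameters are then used to build latency estimation formulas for the two paths; when a request arrives, the estimated latencies are computed in real time and the smaller one is chosen. All parameters come from measurement rather than from the theoretical values in the papers, because these numbers vary a lot across hardware and software environments, and using measured values directly is more reliable than deriving them from theory.

The profiling process is roughly as follows: prefill latency is measured by running prompts of different lengths several times each and taking the median, then fitting α by linear regression; decode latency is measured across different KV lengths and batch sizes, and is used to fit β and batch_thresh (the inflection point at which decode switches from being memory-bandwidth-bound to being compute-bound); interference is measured as the difference in decode latency between pure decode and mixed prefill+decode, with γ fitted by linear regression; P2P bandwidth is estimated directly by timing one large block transfer.

The routing decision can finally be reduced to comparing two quantities that are both proportional to the prompt length L. The extra cost of the disaggregated path relative to the collocated path is the time to transfer the KV cache, about transfer_rate × L. The extra cost of the collocated path relative to the disaggregated path is the interference of prefill with concurrent decode, about γ × L × (system_load / batch_thresh). The condition under which disaggregation pays off simplifies to γ / transfer_rate > batch_thresh / system_load, where γ / transfer_rate is a ratio that depends only on the hardware and reflects "how expensive interference is" relative to "how expensive transfer is", while batch_thresh / system_load reflects how loaded the system currently is.

Measured on the RTX 4090, γ / transfer_rate ≈ 7.6; plugging in batch_thresh = 16 gives that disaggregation starts to pay off once system_load ≥ about 2.1, that is, as soon as two or three concurrent requests are running, the disaggregated path already beats the collocated path on latency. On the H20 this ratio reaches 346, so disaggregation is the better choice under almost any non-zero load. The conclusions differ so much between the two kinds of hardware mainly because the transfer bandwidth differs by nearly 30x; the difference in γ is comparatively small (0.087 vs 0.130 ms/token), so the ratio is mainly pulled apart by transfer_rate. This conclusion does feel rather unreasonable, and its direct consequence is that the routing decision is very one-sided. In theory there are machines that could fully showcase the routing decision, but I have neither the money nor the time to find one (who will take my algorithms midterm for me!), so I could not be bothered to change it.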
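
The routing inequality and the break-even load can be written out directly. The function names below are hypothetical and not taken from the project; the numbers plugged in are the hardware ratios and batch_thresh quoted in the text.

```python
def prefer_disaggregated(gamma, transfer_rate, batch_thresh, system_load):
    """Routing rule from the text:
    disaggregate when gamma / transfer_rate > batch_thresh / system_load."""
    if system_load <= 0:
        return False  # no concurrent decode, so there is nothing to interfere with
    return gamma / transfer_rate > batch_thresh / system_load


def break_even_load(hardware_ratio, batch_thresh):
    """system_load above which disaggregation wins, given gamma / transfer_rate."""
    return batch_thresh / hardware_ratio


print(round(break_even_load(7.6, 16), 1))  # 2.1   (RTX 4090 ratio from the text)
print(round(break_even_load(346, 16), 3))  # 0.046 (H20 ratio from the text)
```

The prompt length L cancels out of the comparison, so under this model the decision depends only on the hardware ratio and the current load, which is why the router ends up so one-sided on both machines.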

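In addition, I implemented an output length predictor, because routing needs to estimate the cost of the decode phase, and the decode cost is proportional to the output length, which is unknown when the request arrives. I used an online Bayesian predictor: requests are bucketed by prompt length, each bucket maintains statistics of historical output lengths, a new request uses the mean of its bucket as the prediction, and when a bucket has too few samples it falls back to the global mean. This part is fairly crude. Output length prediction is a hard problem in itself; in fact even a mature system like vLLM does not yet have a particularly good solution. The implementation here only exists so that routing can run, and its accuracy is limited.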
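
The bucket-and-fallback logic can be sketched as follows. Note that this is a plain running-mean version, not a Bayesian estimator, and the bucket edges, `min_samples` and `default` values are assumptions chosen for the example rather than the project's settings.

```python
import bisect


class OutputLengthPredictor:
    """Per-bucket running mean of output lengths with a global fallback."""

    def __init__(self, bucket_edges=(128, 512, 2048), min_samples=3, default=128.0):
        self.edges = list(bucket_edges)       # prompt-length bucket boundaries
        self.min_samples = min_samples        # below this, use the global mean
        self.default = default                # used before any observation
        self.count = [0] * (len(self.edges) + 1)
        self.total = [0.0] * (len(self.edges) + 1)

    def _bucket(self, prompt_len):
        return bisect.bisect_right(self.edges, prompt_len)

    def observe(self, prompt_len, output_len):
        b = self._bucket(prompt_len)
        self.count[b] += 1
        self.total[b] += output_len

    def predict(self, prompt_len):
        b = self._bucket(prompt_len)
        if self.count[b] >= self.min_samples:
            return self.total[b] / self.count[b]
        n = sum(self.count)
        return sum(self.total) / n if n else self.default


predictor = OutputLengthPredictor()
for out_len in (10, 20, 30):
    predictor.observe(100, out_len)   # short-prompt bucket
predictor.observe(1000, 100)          # a single sample in another bucket
```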

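Testing and results

I wrote a benchmark with a Poisson arrival process, simulating the scenario in a real service where requests arrive according to a Poisson process: fix the arrival rate λ, run for a fixed duration, and then record the number of completed requests, the distribution of end-to-end latency, and the number of dropped requests. This is closer to a real serving scenario than a simple serial test, because a serial test essentially measures the latency of a single request, with no concurrency and no queueing, which is quite far from real load.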
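
Generating Poisson arrivals amounts to drawing exponential gaps between consecutive requests. A minimal sketch using only the standard library; the function name, the seed and the example rate and duration are assumptions, not the benchmark's actual configuration.

```python
import random


def poisson_arrivals(rate, duration, seed=0):
    """Arrival timestamps (seconds) of a Poisson process with `rate` requests/s."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival gap, mean 1 / rate
        if t >= duration:
            return arrivals
        arrivals.append(t)


arrivals = poisson_arrivals(rate=5.0, duration=200.0)
# On average there are rate * duration = 1000 arrivals in this window.
```

A benchmark driver would then submit one request at each timestamp, regardless of whether earlier requests have finished, which is what creates the queueing that a serial test never sees.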

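As for the results, in the RTX 4090 × 8 environment, adaptive routing gives some improvement in throughput and latency over collocated at medium arrival rates, and on the H20 the advantage of the disaggregated path is more pronounced, which is basically consistent with the cost model's prediction. But the actual performance numbers are not comparable with mature frameworks, because optimizations such as CUDA Graph, Flash Attention and operator fusion are missing, so I will not list concrete numbers here. The point is mainly to see whether the relative trends among the strategies match the theoretical expectation, and from this angle the results are fairly reasonable.
Another reason for the huge performance problems is that the disagg path did not get as many small optimizations as the collocated one, such as continuous batching; I explain this in detail in the project documentation.

Another interesting observation is that on the H20 the peak throughput of adaptive is actually lower than on the RTX 4090. The reason is that the H20's decode step is faster (β = 33 ms vs 51 ms), so more requests finish quickly on the collocated GPU, the utilization of the disaggregated path is relatively lower, and the multi-GPU advantage is not fully exploited. This is a bit counter-intuitive, but it makes sense once you think it through: a faster decode actually reduces the need for disaggregation. On top of that, with my rather naive paged attention implementation (there is one spot that, to the naked eye, could simply be changed to a reduce sum), increasing max_request_num affects TTFT and all sorts of messy things, and I cannot afford AutoDL's H20 × 4 either (sob), so I will leave it at that.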

一些感受

实现上没有做 CUDA Graph 、 Flash Attention 或量化这些会显著影响性能的优化（实际上懒了写不了了），主要是想保持每一层的逻辑足够清晰，方便对照论文理解设计决策，也方便文档解释，所以性能上和成熟的推理框架没有可比性，仅仅是作为理解系统设计的实践项目。

写完之后最大的收获可能不是哪个具体的技术细节，而是对整个系统各层之间的依赖关系有了更清晰的感受，代价模型的准确性依赖 micro-benchmark 的质量， micro-benchmark 又依赖引擎本身的实现是否足够接近真实推理路径，路由的效果依赖输出长度预测的准确性，调度器的策略影响 profiling 时测出来的 interference 数值，这些依赖关系在读论文的时候是模糊的，实现一遍之后会更具体地感受到每一层的假设在什么条件下成立，什么条件下会失效，以及各层之间的耦合程度，这种感受很难单纯通过读代码或读论文获得。

已经放到了 github 上，文档里对每层的设计决策有比较详细的说明。

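]]> blog mlsys cuda kernel infra

3D Reconstruction Series //blog/3dreconseries/

Introduction

Update: I have decided to stop updating this series (x

You may notice that the publishDate of this post is 4096-16-64; the actual publishDate is 2026-02-10.
This post was meant to be a long-running series of reading notes on 3D recon papers. I had already published close readings of some papers in this area, but close reading is clearly not sustainable in the long run. So I wanted to use this post, in the form of a series, to record lighter readings of most papers. Of course, if a paper is particularly important, I will still write a separate close reading for it.

The cover image of this post is a word cloud of the names of the works covered here. I hope it keeps growing into a word-cloud map of the 3D recon field.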

CUT3R

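CUT3R takes a video sequence as input, but the input can also be unordered (according to the authors, training uses unordered input, while at inference the dataloader first computes overlap ratios to produce a rough ordering). It uses a feed-forward network to predict camera parameters and point clouds.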

cut3r

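It is a recurrent model: for each incoming frame a pose token is added, the frame passes through the encoder and decoder, cross-attention then updates $s_t$ and $F_t$, and different heads extract the outputs from $s_t$ and $F_t$.

This obviously lacks a correction mechanism, so long sequences tend to drift. The authors do mention a revisit mechanism: after the input ends, the global $s$ is used to redo the earlier predictions. On 7scene this improves acc and comp, but on NRGBD the gain is not very noticeable.

The authors also say that, because of dataset quality, even with a pose head and a local points head they still have to add a world points head (high-quality datasets are lacking).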

$\pi ^ 3$

$\pi^3$ is a relatively interesting piece of work. The model structure is as follows:

pi3

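The biggest difference from earlier work is that it does not explicitly pick a reference frame or a specific scale factor. VGGT, for example, first picks a ref frame and then reconstructs, but the reconstruction quality depends heavily on that ref. So $\pi^3$ feeds all frames in at once, treats them all equally, and infers a set of relative poses and local point clouds. This avoids the uncertainty caused by fixing one particular frame as the coordinate origin.

On reflection, though, $\pi^3$ still cannot easily escape the ref problem. First, within a batch, although we predict a set of relative poses, my intuition is that this only converts the old large, conspicuous, incidental error (caused by one frame fitting poorly with the others) into a less visible, highly consistent, systematic error shared by all frames. But the authors show experimentally that the error becomes smaller, which is fairly easy to explain: previously $T_2$ might depend on $T_1$, $T_3$ on $T_2$, and so on in a one-way chain of references, whereas $\pi^3$ computes cross-attention across frames, which on reflection should indeed be better.

Second, the complexity of cross-attention is roughly $O(n^2)$, which is clearly unacceptable for long sequences. The authors use a limited number of frames per batch for both training and testing, but for real long sequences this does not seem easy to handle. If you slice the sequence and stitch the pieces, you again face a ref selection problem, but now the stitching is between scenes, which should reduce many errors; doing it hierarchically would also reduce error. Overall it does look like a decent approach.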

DA3

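DA3 is a project from ByteDance Seed. It is a case of brute force working wonders, and it fully shows the scale at which industry solves problems (x.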

da3
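The main innovations of DA3 are:

A simpler model. The authors' point is that although VGGT has a simple structure, it appends AA layers after DINO, and since the AA layers are trained from scratch, data utilization during training may be low. DA3 instead uses only DINO, and achieves what the AA layers do by rearranging the data inside the $L_g$ layers of DINO. As a result, almost all of DA3's parameters are pretrained, whereas VGGT trains $\frac{2}{3}$ of its parameters from scratch. This is where DA3 is more concise.
A simpler prediction task. Whereas VGGT uses different heads to produce different outputs, DA3 uses a newer representation, Ray-depth: a Dual head outputs, for each pixel, its depth and the ray connecting it to the optical center, which naturally contains both point cloud and pose information, and consistency terms can be added when designing the loss. Compared with VGGT this seems to strengthen consistency and improve data utilization; pose and pts3d actually seem harder to make consistent, and the authors' ablations confirm this.
Teacher-labeled data. A teacher model is trained first and used to regenerate depth for frames with poor depth, and training then follows these depths. The final result probably depends heavily on this teacher model.

DA3 also has drawbacks. Its results are indeed very good, but after reading I found it was trained on 128 × H100, which is hard to reproduce. With limited compute, the first two points above seem very helpful and worth trying.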

MapAnything

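First of all, this is a Meta project. Doesn't it compete with VGGT? ()

The main innovation is its input. Unlike VGGT and earlier reconstruction work that only take image sequences, MapAnything supports many kinds of input. Each input goes through its own encoder and is aligned to the image tokens output by DINOv2, after which the normal pipeline follows. It also seems to add a scale token to predict scale information.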

mapanything

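It feels like it borrows the multimodal idea from NLP. It shows that, given different types of input, its prediction accuracy is similar to that of the corresponding specialist models, which is valuable because it saves a lot of training (although it was still trained on 64 × H200 for 10 days).

Another interesting point is that the final point cloud is not output directly but is composed from depth, ray, and pose. This decouples VGGT's redundant prediction scheme and keeps better consistency when designing the loss. It is quite similar to DA3's Depth-ray output.

The drawbacks are also obvious. First, for long sequences it still has not escaped $O(n^2)$ complexity. Second, the model is offline, although each setting has its own use cases. Finally, inference speed and memory: inference already takes close to 10 s at 100 frames, with about 65 GB of GPU memory. Even with the authors' Mem Efficient strategy, which runs the DPT head serially, it is still around 20 GB, which seems a bit too large (x

In addition, the authors state that the model cannot handle noisy input data, meaning potential noise may contaminate the whole transformer. Also, fusion happens after the encoder and is a simple addition, so a finer-grained fusion scheme may exist.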

AnySplat

anysplat

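Unlike most of the point cloud reconstruction work discussed so far, AnySplat does 3DGS reconstruction. Specifically, it is built on VGGT: the backbone is the same as VGGT, but the heads are a gaussian head, a depth head, and a camera head. A differentiable voxelization then aggregates the originally dense Gaussians, and training supervises:

1. The RGB loss at each frame's position
2. The difference between the depth head's depth and the Gaussian depth
3. The loss between the camera parameters and those predicted by VGGT
4. The depth difference between the model's predicted depth and VGGT's

First, loss 2 enforces geometric consistency, i.e. it keeps depths from different views as consistent as possible, which avoids layering artifacts. In addition, the authors say they implemented a Differentiable Voxelization that effectively handles the complexity caused by the dense Gaussians generated.

Overall, this work closely imitates VGGT, swapping the heads and the output form while keeping the rest mostly the same. It is also an offline reconstruction. Speed looks acceptable, but it faces the same long-sequence problem. Also, fixing the world coordinate frame to the first image and supervising each absolute pose seems to suffer from the inductive bias problem described by $\pi^3$.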

RayZer

rayzer

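A refreshing self-supervised model. Training needs only images, with no gt pose or intrinsics. The training process is roughly:

First, input $K$ images and split them into two sets, $L_a$ and $L_b$.
Then the Camera Estimator module predicts pose and intrinsics.
Next, for $L_a$, the model takes the corresponding predicted $R_a$ together with the images themselves and generates the scene token $z$.
Then, for $L_b$, the model uses $z$ and the predicted $R_b$ to predict $\hat{L}_b$, supervises the loss between $\hat{L}_b$ and $L_b$, and updates all parameters.

So inference roughly works like this: feed the known images of the scene to get $z$, then for a specific pose compute a ray map and feed it to the rendering decoder to get the RGB image under that pose.

It feels a lot like NeRF in that both represent the whole scene implicitly. The difference is that RayZer is a more direct model: each of the three modules in the figure is an 8-layer naive transformer, and the loss consists only of the final RGB loss and the LPIPS loss. Quite clever. The representation used in the rendering part, a raymap-like form, also seems really useful.

Also worth noting in the first part: when predicting pose and intrinsics, it directly picks the middle frame as the reference frame, which lets the model span longer distances. If we introduced $z$ already in the first part, could we get localization? The authors seem to have done an ablation and found that during training it is much easier to extract geometric relations from image features than from an unformed $z$. But I think one could add another decoder in the rendering part for localization.

In addition, this model completely beats LVSM (a supervised model). It feels like very impressive work, and the demo video on the project page looks really good.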

Spa3R

spa3r

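This is also a self-supervised model, and the backbone design is a bit complicated:

We are given the views of a scene and split them into context views and target views. All views first pass through a modified VGGT (apparently only the part before the heads). The modification is that the Target Views are masked out in the AA layers of the Context Views. This yields the Context Views feature $F_c$, plus the camera tokens and feature $F_t$ of the Target Views. The data then flows along two paths:

Context views: $F_c$ and a set of learnable $q$ go into the Encoder, producing $z$ as the implicit representation of the space.
Target views: the camera tokens pass through the camera head to produce camera embeds $r$, which together with $z$ are fed to the Decoder to produce the predicted feature $\hat{F}_t$ for the Target views. The predicted feature is then supervised against $F_t$ to obtain the loss.

At inference we only use the $z$ obtained from the Context Views. $z$ and the $F_V$ from qwen2.5 vl are fed into an Adapter, and the Adapter output together with the text prompt goes into the LLM to produce the final result.

First, it is plain to see that this work stitches together many other works: the Target View stage uses DINOv3 and VGGT, and the later processing of $z$ uses qwen2.5 vl. But the paper is called Spa3R, so where did Dust3R go? The only trainable parts are the Encoder and Decoder, just 6 Transformer layers, trained with the loss between the two $F_t$ (predicted and actual). After training, the Decoder is discarded and the trained Encoder and q are kept. There is then a fine-tuning stage for the Adapter so that it learns to produce a reasonable fused $F_{input}$.

The paper includes several ablations:

In the Target Views stage, the authors show that using both VGGT and DINO is better (semantic plus spatial information). This is a fairly obvious conclusion.
Extracting a scene representation $z$ is the better approach, outperforming existing methods such as VG-LLM that simply feed all features into the LLM (but the gain is only 3 points, which is below my expectations; given that the second stage trains for only 1 epoch, could it be undertrained? This is also my first time reading a VLM paper (). Looking at the detailed scores, Multi-Choice improved while Numerical barely changed, which does make sense).
The effect of pose embedding: PRoPE is better than plucker.
Mask Ratio, also a fairly obvious ablation.
Using the Adapter improves the score, which makes sense.

The model is pretrained only on ScanNet and ScanNetpp, using 8 × 5090 for training, and reaches 58.6 on VSI-Bench, beating most previous models. On the current VSI-Bench Leaderboard its performance is also near the top (though some numbers in the paper's table seem slightly off? Maybe they were updated). It opens up a new direction for the field (), and self-supervision looks promising ().

It looks like this paper is under submission to CVPR. It only appeared on arXiv two days before I wrote these notes. No idea whether it got in, but the method is interesting.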

Spann3R

The structure is complicated. Most of the model weights are inherited from Dust3R, and the backbone is roughly as follows:

spann3r

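Pre-encoding: a frame is first fed into the ViT Encoder to get $f_t^I$. At this point we also hold $f_{t-1}^Q$ from the previous frame.
Memory query: using $f_{t-1}^Q$, we query the historical memory to get $f_{t-1}^G$ as input for the next step.
Main inference: these two features are fed into the Target Decoder and the Reference Decoder, which perform self attention and cross attention and output $f_t^{H'}$ and $f_{t-1}^{H}$ respectively.
Heads: for $f_t^{H'}$, at inference a query head extracts $f_t^{Q}$, while during training an extra head also converts it into a point cloud and confidence for supervision. For $f_{t-1}^H$, a reference head reconstructs the point cloud and confidence.
Memory: from $f_{t-1}^H$ and $f_{t-1}^I$, a Memory encoder + MLP head generates $f_{t-1}^K$; from this and the point cloud, a Memory Encoder generates $f_{t-1}^V$. $f_{t-1}^K$ is then deduplicated against existing memory, and if working memory is full, the rest goes into long-term memory for further processing.

This is a paper from 2024. Its main contribution is improving Dust3R so that it can output a point cloud in one consistent global coordinate frame for multiple images, plus a memory mechanism that handles memory hierarchically.

But clearly, although the method adds memory, the memory looks like a recent-memory scheme, so long-range drift objectively exists. Also, when a reloop occurs, whether the memory can be retrieved cleanly is a big question.

The ablations are roughly:

On memory: removing long-term memory causes large drift, and not truncating attention introduces noise.
On how large long-term memory should be: the authors find that drift is greatly corrected going from 1000 to 2000 tokens, but beyond 4000 there is no noticeable gain, so they settle on 4000.
Dust3R uses an exp confidence function; this paper changes it to sigmoid, which turns out to help.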

Flow4R

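A 3D reconstruction and tracking approach with major limitations, but with a novel representation.

The backbone is elegant. It takes two images as input and, through two symmetric encoder-decoder-head branches with shared weights and cross attention, produces $\{P, F, W, C\}$ for each image. $P$ is the point cloud in the camera coordinate frame, $F$ is a scene flow describing how each pixel moves from this image to the next one, $W$ indicates which pixels are most reliable when solving for pose, and $C$ is the global confidence.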

flow4r

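With these in hand, the pose can first be solved by least squares:
$\hat{T} = \arg \min_{T \in SE(3)} \sum_{i=1}^{HW} W^i \|P_{vt}^i - T P^i\|_2$

$P_{vt}$ is obtained as $P + F$. Once the pose is known, the motion can be decomposed into pose flow and scene flow, and many downstream tasks can then be handled.

For long sequences, the authors use frame 1 as the anchor and pair each later frame with it as input. The advantage is that the scale can be normalized with the L2 norm, but the downsides are obvious. First, with even slightly longer sequences occlusion appears, and the model currently has no good way to handle it. Second, it depends heavily on the quality of the first frame, so robustness is not great. Looking at the demos in the paper, they usually reconstruct a corner or a region with similar viewpoints; full-scene reconstruction quality is doubtful.

Also, the authors only did a single ablation (can this get accepted?), comparing three variants of network prediction and supervision:

Predict scene flow $F$ and supervise with ground-truth $F$.
Predict scene flow $F$ but supervise with the ground-truth 3D point positions $\overline{P}_{vt}$ of the target frame.
Directly predict the target frame's 3D point positions $P_{vt}$ and supervise with ground-truth $P_{vt}$ (scene flow is then derived by simple subtraction: $F = P_{vt} - P$).

Ablation result: directly predicting and supervising $P_{vt}$ performs best. So what is actually predicted in the end is $\mathcal{P} = \{P, P_{vt}, W, C\}$

Overall, this work shows that introducing flow can extend a Dust3R-style architecture from static to dynamic scenes, but the limitations are indeed large.

This work does not seem to be open-sourced yet ()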

AMB3R

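This work brings 3D voxels into reconstruction, letting the model truly reason about the task from a spatial perspective. In short, earlier reconstruction methods use a ViT that splits the image into patches, so the implicit geometry lacks a spatial compactness constraint. The authors found a way to build spatial compactness into the backbone.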

amb3r

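The backbone is roughly split into a front end and a back end. The front end inherits VGGT's network and parameters. An incoming image passes through the Encoder to get an initial feature; the main data flow then moves on to the decoder, but this feature is also used by a scale head to predict an absolute scale.

The feature entering the decoder then does cross attention with the keyframes, where the keyframes can be understood as the implicit representation of the scene. After this, the decoder outputs a pointmap and a confidence. At inference there is then a gating mechanism: if confidence is high enough, proceed directly to the next stage; otherwise the point cloud and feature are converted into voxels, a point transformer refines the voxel features, the result is mapped back to a 2D feature, and that feature is injected into the front-end decoder to obtain a refined point cloud.

Now we have the current frame's point cloud and physical scale. The system scales the structure up or down and, using the keyframes and the pose predicted by VGGT, stitches it into the large global point cloud. Finally it evaluates whether this frame qualifies as a keyframe and handles it accordingly.

Bringing voxels into point cloud reconstruction is impressive. The authors' ablations:

Removing the sparse-voxel back end and using a 2D alternate-attention back end instead: accuracy is worse.
Removing the zero-convolution mechanism: the model fails to converge at all in the short term.
Removing scale when computing the loss: results get worse, meaning the model needs to focus on geometric structure. This is done at training time.

The training cost of this paper is very, very low. It relies on an already trained VGGT and only trains a point Transformer that refines point cloud features plus a handful of heads. Very inspiring and very impressive. It was also accepted to CVPR 2026, as expected (it seems to be a follow-up to Spann3R).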

VGGT-SLAM

I'd call this a math paper. It trains no model at all and only introduces a method for stitching local point clouds.

vggt-slam

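As the name suggests, this work builds on the point clouds and poses output by VGGT. The authors argue that, after VGGT predicts poses and local point clouds, directly applying a Sim(3) transformation to get the global point cloud is problematic. The main inspiration comes from stereo vision in traditional CV: the homography matrix between cameras (or, say, the essential matrix) contains not only the rotation and translation in the pose but also stretching, perspective effects, and so on. Specifically, the point cloud depth predicted by VGGT contains the camera's projective distortion, so recovering it with Sim(3) alone is inaccurate.

So the authors use SL(4) for point cloud alignment instead. Specifically, once VGGT has produced point clouds and poses, the following steps are performed:

For frames within a submap, the authors choose to trust VGGT's quality. The code has a submap_size parameter that controls the submap size.
Between submaps, we want 3D points shared across different coordinate frames, so the authors use a clever trick: the last frame of the previous submap is fed again into the next submap. VGGT's output then contains point clouds of the same image in different coordinate frames, which gives point-to-point correspondences.
Then, using some traditional algorithms, the SL(4) matrix between two submaps can be computed. This completes the first step.
The next step is global alignment, and the authors wrote it in a very mathematical way:
Specifically, the authors build a nonlinear factor graph based on maximum a posteriori estimation, with the goal of minimizing the relative homography error between all submaps:
$\hat{H}=\arg\min_{H\in SL(4)}\sum_{(i,j)\in\mathcal{L}}\|\mathrm{Log}(H_{i}^{-1}H_{j}(H_{j}^{i})^{-1})\|_{\Omega_{ij}^{H}}^{2}$
Various optimizers are then brought in. My math is too weak here (x) and I can't follow it at all; I only know that it requires iterative optimization.
So this gives us a back end that, for each submap, provides an SL(4) matrix transforming it into a latent global coordinate frame, eliminating the problems caused by the Sim(3) transformation.

The paper also proposes a reloop mechanism: when a submap is about to be fed in, the system uses SALAD descriptors to search historical submaps for similar images. If one is found, that image is used as a shared frame, which gives us multiple relative constraints.

Overall, this work provides a fairly traditional alignment method. It is elegant, but the drawbacks are clear. First, within a single submap it fully trusts VGGT's output, so it lacks robustness. Second, the alignment is obtained by iterative optimization, which is much slower than direct stitching, and there are too many query operations (such as reloop), so the complexity still feels high.

Still, the figure above shows that it does improve the layering that can occur when stitching point clouds. However, looking at the issues on its GitHub, stability seems questionable: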

Due to potential randomness in our approach caused by RANSAC, we report the average performance over five runs, which have a low spread (small standard deviation) as shown in Sec. 5.5.

And the authors never answered that issue in the end, which feels a bit awkward (x

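]]> blog 3Dreconstruction paper reading

Study notes: Tensor Parallelism (TP) //blog/TensorParallel/

Introduction

After some thinking about my future choices, I have recently been learning about mlsys. This post is my understanding and summary of TP. There are already plenty of blog posts online that cover TP's implementation details; I wrote this one mainly for my own future reference. Corrections are welcome.

About TP

Tensor Parallelism is a method proposed after DP and MP, pioneered by Megatron-LM. Its starting point is that DP and MP still require a single GPU to hold a complete layer's parameters together with its activations, gradients, and optimizer states during computation, so when one layer is too large, a single GPU can no longer hold it.
Tensor Parallelism makes the computation of the model itself distributed, so that a single layer can be computed across different GPUs.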

Transformer-like model

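The architecture of a classic Transformer model is roughly as shown below:

transformerarch

As the figure shows, a layer mainly consists of an Attention block and an MLP block. These two are where TP does its key work, as detailed below.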

MLP

Let's start with the MLP. In short, an MLP layer can be written as:

$$
\mathrm{Out} = \mathrm{Dropout}(\mathrm{GeLU}(X W_1) W_2)
$$

where:

$$
\begin{aligned}
X &: (B, S, d_{\text{model}}) \\
W_1 &: (d_{\text{model}}, d_{\text{ff}}) \\
W_2 &: (d_{\text{ff}}, d_{\text{model}})
\end{aligned}
$$

Typically, $d_{\text{ff}} = 4 \times d_{\text{model}}$

First consider no TP, just single-GPU computation:

Single-GPU forward

Parameters:

$$
\begin{aligned}
W_1 &: d_{\text{model}} \times d_{\text{ff}} \\
W_2 &: d_{\text{ff}} \times d_{\text{model}} \\
\text{Total parameters} &: 8 d_{\text{model}}^2
\end{aligned}
$$

Compute:

$$
\begin{aligned}
&\text{Each of the two matmuls contributes } 2 \times B \times S \times d_{\text{model}} \times d_{\text{ff}} \\
&\mathrm{FLOPs} = 16 \times B \times S \times d_{\text{model}}^2
\end{aligned}
$$

Activations: considered in the backward section.

Single-GPU backward

First, the backward of dropout (with $Z = A W_2$ denoting the dropout input):

$$
\frac{\partial L}{\partial Z} = \frac{\partial L}{\partial \mathrm{Out}} \odot \frac{\mathrm{mask}}{1 - p}
$$

The FLOPs of this step are orders of magnitude smaller and can be ignored. From here on, $LX$ denotes $\partial L / \partial X$.

Next, the backward through $W_2$:

$$
\begin{aligned}
LW_2 &= A^T \cdot LZ \\
LA &= LZ \cdot W_2^T
\end{aligned}
$$

where $A = \mathrm{GeLU}(\cdots)$.

The FLOPs of this step are $2 \times (2 \times BS \, d_{\text{model}} \, d_{\text{ff}}) = 16 \, BS \, d_{\text{model}}^2$

The FLOPs of GeLU are also nearly negligible.

Then the backward through $W_1$, which is almost identical to $W_2$.

So the whole backward costs $32 \, BS \, d_{\text{model}}^2$ FLOPs, twice the forward pass.

Now the activation footprint. Without gradient checkpointing, we need to keep:

X (BS, d_model): used to compute L_W1
H = XW_1 (BS, d_ff): used for the GeLU backward
A = GeLU(H) (BS, d_ff): used to compute L_W2
Dropout mask (BS, d_model): used for the Dropout backward

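TP forward

We split $W_1$ by columns like this:

W_1 -> (W_11, W_12, W_13, ... W_1n) # W_1i: (d_model, d_ff / n)

The full input X is fed to every GPU, which gives:

H -> (XW_11, XW_12, XW_13, ... XW_1n) # XW_1i: (B, S, d_ff / n)

Note that splitting $W_1$ by columns means each shard of the result can go through GeLU independently, which saves a communication step here.

Next consider $W_2$.

We split $W_2$ by rows like this:

W_2 -> [
 W_21, 
 W_22, 
 W_23, 
...
 W_2n 
] # W_2i: (d_ff / n, d_model)

Each GPU i can now compute GeLU(XW_1i) @ W_2i, whose shape is already the shape of the final output but which holds only a partial sum.
So after each GPU computes its part, a single all-reduce (sum) gives the final result.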
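The column/row split can be checked numerically on tiny matrices. The sketch below is pure Python with no real communication: the ranks are simulated by a loop, the all-reduce by an element-wise sum, Dropout is left out, and d_ff is assumed to be divisible by n. All function names are made up for the illustration.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def mlp(x, w1, w2):
    """Reference single-GPU MLP without dropout: GeLU(X W_1) W_2."""
    return matmul(gelu(matmul(x, w1)), w2)


def tp_mlp(x, w1, w2, n):
    """Simulate n ranks: W_1 split by columns, W_2 split by rows, then all-reduce."""
    shard = len(w2) // n                       # d_ff / n, assumes exact division
    out = None
    for r in range(n):
        cols = range(r * shard, (r + 1) * shard)
        w1_r = [[row[c] for c in cols] for row in w1]   # (d_model, d_ff / n)
        w2_r = [w2[c] for c in cols]                     # (d_ff / n, d_model)
        partial = mlp(x, w1_r, w2_r)           # full output shape, partial sum
        out = partial if out is None else [
            [a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, partial)
        ]                                      # element-wise sum stands in for all-reduce
    return out
```

Because GeLU is applied element-wise to each column shard, the sharded result matches `mlp(x, w1, w2)` up to floating-point rounding for any n that divides d_ff.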

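OK, let's analyze the whole process.

Parameters:
All parameters are now spread evenly across the GPUs, so each GPU holds:

$$
\begin{aligned}
W_1^i &: d_{\text{model}} \times d_{\text{ff}} / n \\
W_2^i &: d_{\text{ff}} / n \times d_{\text{model}} \\
\text{Parameters per GPU} &: 8 d_{\text{model}}^2 / n
\end{aligned}
$$

Compute:

$$
\begin{aligned}
&\text{Each of the two matmuls contributes } 2 \times B \times S \times d_{\text{model}} \times d_{\text{ff}} / n \\
&\mathrm{FLOPs} = 16 \times B \times S \times d_{\text{model}}^2 / n
\end{aligned}
$$

One more thing to consider is that the final all-reduce also has to sum all the output activations, but this is orders of magnitude smaller and can be ignored.

Activations: considered in the backward section.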

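TP backward

The forward on each GPU is:

$$
Z^i = \mathrm{GeLU}(X W_1^i) W_2^i, \quad \text{AllReduce} \to Z = \sum_i Z^i
$$

Since Z is identical on every GPU after the AllReduce, the gradient coming back from upstream is also identical, and no extra communication is needed at this point.

After that, each step of the computation is basically the same as on a single GPU, divided by $N$ (where $N = n$ is the number of GPUs).

So the backward FLOPs per GPU are:

$$
\mathrm{FLOPs} = 32 \times BS \, d_{\text{model}}^2 / N
$$

Note that one more all-reduce is still needed at the end of the backward pass: each GPU has only computed the contribution of its own shard to the input gradient $LX$, and these partial gradients have to be summed.

Activation footprint per GPU:

X (BS, d_model): used to compute L_W1
H = XW_1 (BS, d_ff / N): used for the GeLU backward
A = GeLU(H) (BS, d_ff / N): used to compute L_W2
Dropout mask (BS, d_model): used for the Dropout backward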

Attention

Single-GPU forward

Input data and shapes:

X (B, S, d_model) 
W_Q (d_model, h, d_Q) W_K (d_model, h, d_K) W_V (d_model, h, d_V)
# On a single GPU the Q, K, V projections are usually not split by head, but viewing the weights per head makes the TP case easier to follow
Q (B, S, h, d_Q) K (B, S, h, d_K) V (B, S, h, d_V)
Q @ K.transpose -> S (B, h, S, S) # softmax is applied to S
S @ V -> (B, h, S, d_V)
reshape -> (B, S, h * d_V)

# Then apply W_O: (h * d_V, d_model)
O (B, S, d_model)

OK, with the process clear we can analyze each metric:

Parameters:

$$
\begin{aligned}
W_Q &: d_{\text{model}} \times (h \times d_Q) \\
W_K &: d_{\text{model}} \times (h \times d_K) \\
W_V &: d_{\text{model}} \times (h \times d_V) \\
W_O &: (h \times d_V) \times d_{\text{model}}
\end{aligned}
$$

Usually $d_Q = d_K = d_V = d_{\text{head}} = d_{\text{model}} / h$, so each matrix has $d_{\text{model}}^2$ parameters and the total is $4 d_{\text{model}}^2$.

Compute:

Operation	Shape	FLOPs
$X \to Q, K, V$	$(BS, d_{\text{model}}) \times (d_{\text{model}}, d_{\text{model}})$	$6 \, BS \, d_{\text{model}}^2$
$Q K^T \to S$	$(B, h, S, d_{\text{head}}) \times (B, h, d_{\text{head}}, S)$	$2 \, BS^2 d_{\text{model}}$
$SV \to AV$	$(B, h, S, S) \times (B, h, S, d_{\text{head}})$	$2 \, BS^2 d_{\text{model}}$
$AV \cdot W_O$	$(B, S, d_{\text{model}}) \times (d_{\text{model}}, d_{\text{model}})$	$2 \, BS \, d_{\text{model}}^2$

So the total FLOPs are:

$$
8 \, BS \, d_{\text{model}}^2 + 4 \, BS^2 d_{\text{model}}
$$

Activations to keep:

Tensor	Shape	Elements
$X$	$(B, S, d_{\text{model}})$	$BS \, d_{\text{model}}$
$Q, K, V$	$(B, S, d_{\text{model}})$	$3 \, BS \, d_{\text{model}}$
$S$	$(B, h, S, S)$	$BhS^2$
$AV$	$(B, h, S, d_{\text{head}})$	$BS \, d_{\text{model}}$

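Single-GPU backward

First the backward through $W_O$:

$$
\begin{aligned}
LW_O &= (AV)^T \cdot LO \quad (d_{\text{model}}, BS) \times (BS, d_{\text{model}}) \\
LAV &= LO \cdot W_O^T \quad (B, S, d_{\text{model}}) \times (d_{\text{model}}, d_{\text{model}})
\end{aligned}
$$

Together these cost $4 \, BS \, d_{\text{model}}^2$

Then back through $AV$:

$$
\begin{aligned}
LS &= LAV \cdot V^T \quad (B, h, S, d_{\text{head}}) \times (B, h, d_{\text{head}}, S) \\
LV &= S^T \cdot LAV \quad (B, h, S, S) \times (B, h, S, d_{\text{head}})
\end{aligned}
$$

The FLOPs of this step are $4 \, BS^2 d_{\text{model}}$.

Next comes the Softmax backward, whose FLOPs are negligible, followed by the gradients of Q and K ($LQ = LS \cdot K$, $LK = LS^T \cdot Q$), which cost $4 \, BS^2 d_{\text{model}}$.

Finally the backward through the three projection matrices $W_Q, W_K, W_V$: for each of them the weight gradient and the input gradient cost $2 \, BS \, d_{\text{model}}^2$ each, for a total of $12 \, BS \, d_{\text{model}}^2$.

So the total backward FLOPs are:

$$
16 \, BS \, d_{\text{model}}^2 + 8 \, BS^2 d_{\text{model}}
$$

This is twice the forward cost, so here too we can treat the backward FLOPs as twice the forward FLOPs.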
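The bookkeeping above is easy to double-check in code. The helper below is only a sketch under the same assumptions as the text (softmax ignored, $h \cdot d_{\text{head}} = d_{\text{model}}$); the function name is made up.

```python
def attention_flops(B, S, d_model):
    """Matmul FLOPs of one attention block, forward and backward (softmax ignored)."""
    matmuls = {
        "qkv_proj": 3 * 2 * B * S * d_model * d_model,  # X -> Q, K, V
        "qk_scores": 2 * B * S * S * d_model,           # Q K^T
        "scores_v": 2 * B * S * S * d_model,            # S V
        "out_proj": 2 * B * S * d_model * d_model,      # AV W_O
    }
    forward = sum(matmuls.values())
    # Each forward matmul needs two backward matmuls of the same size:
    # one gradient per operand.
    backward = 2 * forward
    return forward, backward
```

For example, `attention_flops(2, 8, 16)` returns `(40960, 81920)`, which matches $8BSd^2 + 4BS^2d$ and $16BSd^2 + 8BS^2d$ respectively.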

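TP

Here we can simply distribute the heads across multiple GPUs, and almost every quantity is just multiplied by $\frac{1}{N}$.

Note, however, that after computing O we still need an all-reduce, just as in the MLP.

That's all for now.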

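]]> blog Parallelism learning notes

HPCGames solutions: problems D and E //blog/HPCGamesDE/

In the previous post we covered the solutions to HPCGames problems A, B, and C. This post continues with problems D and E, looking at the challenges of each problem and how we approached them.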

D. Hyperlane Hopper

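]]> blog HPC games parallel computing High Performance Computing AI Infrastructure

HPCGames solutions: problems A, B, and C //blog/HPCGamesABC/

Problem A

Problem B

Here we arrive at the classic 小北问答 quiz round. Combining some theory with lookups in specific papers, we can analyze and answer each question in detail.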

1. Amdahl & Gustafson

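In a certain program, 10% of the code must run serially and 90% can be perfectly parallelized.

According to Amdahl's Law, no matter how many cores are added, the theoretical maximum speedup of this program is ____ ×;
if, on a 10-core system, the problem size is scaled up so that the per-core workload stays constant, then according to Gustafson's Law the speedup of the system reaches ____ ×.

First, by Amdahl's Law the speedup S is given by:
$$
S = \frac{1}{(1 - P) + \frac{P}{N}}
$$
where P is the parallelizable fraction and N is the number of processors. Here P = 0.9 (90% parallelizable) and the serial part is 0.1 (10% must be serial). As N goes to infinity, the formula simplifies to:
$$
S_{max} = \frac{1}{(1 - P)} = \frac{1}{0.1} = 10
$$
So the theoretical maximum speedup of the program is 10×.
Next, by Gustafson's Law the speedup S is given by:
$$
S = N - (1 - P) \times (N - 1)
$$
On a 10-core system, N = 10 and P = 0.9, so:
$$
S = 10 - (1 - 0.9) \times (10 - 1) = 10 - 0.1 \times 9 = 10 - 0.9 = 9.1
$$
So on a 10-core system, scaling up the problem size to keep the per-core workload constant gives a speedup of 9.1×.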
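Both laws are one-liners, so the two answers are easy to verify. The function names below are just for illustration.

```python
def amdahl_speedup(p, n):
    """Fixed problem size: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_limit(p):
    """Upper bound of Amdahl's speedup as n goes to infinity."""
    return 1.0 / (1.0 - p)


def gustafson_speedup(p, n):
    """Scaled problem size: speedup with parallel fraction p on n processors."""
    return n - (1.0 - p) * (n - 1)
```

Note the contrast on the same 10 cores: under Amdahl's fixed-size assumption `amdahl_speedup(0.9, 10)` is only about 5.26, while Gustafson's scaled-size assumption gives 9.1.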

2. OpenMP

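The following code uses OpenMP to run a loop in parallel:

int sum = 0;
#pragma omp parallel for
for (int i = 0; i < 100; i++) {
    sum += i;
}
printf("sum = %d\n", sum);

Which of the following statements about this code is correct? ____

Option	Description
A	The code always correctly computes the sum of 0 to 99 (4950).
B	The code has a data race and the result is nondeterministic.
C	The `sum` variable is `private` by default, and each thread has its own copy.
D	OpenMP automatically adds atomic operations for `sum`, guaranteeing a correct result.

The correct answer is B.

Explanation:

Option A is wrong, because the code has a data race: multiple threads modify the shared variable sum at the same time, so the result is nondeterministic.
Option B is correct, because during parallel execution multiple threads may read and write sum simultaneously, which is a data race and makes the final result nondeterministic.
Option C is wrong, because sum is shared by default, not private. All threads therefore access the same sum variable.
Option D is wrong, because OpenMP does not automatically add atomic operations for shared variables. To get a correct result you must explicitly protect accesses to sum, for example with #pragma omp atomic, a reduction(+:sum) clause, or another synchronization mechanism.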

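3. Low precision

FP32 in the IEEE 754 standard has 8 exponent bits. Question:

BF16 has ____ exponent bits and ____ mantissa bits
NVFP4 has ____ exponent bits and ____ mantissa bits

Hint: look up how NVFP4 keeps a high numeric range and dynamic range at low precision.

From the references:
- BF16 has 8 exponent bits and 7 mantissa bits
- NVFP4 has 2 exponent bits and 1 mantissa bit

NVFP4 is a bit special. It has only 4 bits (1 sign bit, 2 exponent bits, 1 mantissa bit), so used directly its range would be tiny and its precision poor. According to the references, NVFP4 treats a group of numbers as a block, and each block shares a higher-precision scale factor that fixes the rough order of magnitude. Each element then only stores its value relative to this scale factor (the value divided by the scale, rounded to the nearest 4-bit number). This keeps a large numeric range while using fewer bits per element, which improves storage efficiency and compute speed.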
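The block-scaling idea can be illustrated with a toy quantizer. This is a sketch of the general technique, not NVIDIA's actual format: the scale is kept as an ordinary Python float instead of a low-precision encoded scale, the block size is whatever list you pass in, and the function names are made up. The only format-specific fact used is the set of magnitudes representable with 2 exponent bits and 1 mantissa bit.

```python
# Magnitudes representable by a sign + 2-bit exponent + 1-bit mantissa value.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(block):
    """Quantize one block to a shared scale plus one 4-bit value per element."""
    amax = max(abs(v) for v in block)
    # Choose the scale so that the largest magnitude maps to 6.0, the top of the grid.
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_VALUES, key=lambda q: abs(abs(v) / scale - q))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]
```

With the block `[12.0, -6.0, 1.0, 0.0]` the shared scale becomes 2.0 and the stored values are `[6.0, -3.0, 0.5, 0.0]`, so values far outside the raw 4-bit range of ±6 are still representable.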

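4. MPI communication

4.1 Basic primitives

4 processes run the code below. Each process has a local value local_val, and after the operation every process holds the values of all processes.

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int local_val = rank; // rank is the process index, 0~3
int recv_buf[4];
/* fill in this line of code */

4.2 Communicators

Create a 2-D Cartesian topology of size 2×2, in row-major order, with wraparound connections.

MPI_Comm comm_cart;
int dims[2] = {2, 2};
int periods[2] = {1, 1}; // wraparound connections
/* fill in this line of code */

4.1 Basic primitives (answer)

MPI_Allgather does the job. The code is:

MPI_Allgather(&local_val, 1, MPI_INT, recv_buf, 1, MPI_INT, MPI_COMM_WORLD);

This line gathers each process's local_val into the recv_buf of every process.

4.2 Communicators (answer)

MPI_Cart_create creates a 2-D Cartesian topology. The code is:

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_cart);

This line creates a 2x2 Cartesian topology communicator comm_cart with wraparound connections enabled (the fifth argument, reorder = 1, allows MPI to renumber ranks).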

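5. NCCL latency

In parallel inference and training for deep learning, processes frequently perform collective communication. NCCL, NVIDIA's open-source collective communication library, provides a high-performance solution for collectives between GPUs.

When running large-scale collectives on heterogeneous hardware, the choice of communication algorithm strongly affects efficiency. NCCL's solution is to estimate, from a set of hard-coded tuning constants, the completion time of the collective under each algorithm, and pick the best one.

Question: in the default tuning constants of NCCL 2.28, for two GPUs connected by NVLink using the Tree algorithm and the LL protocol, the per-hop (single-step) hardware latency used in the estimate is ______ µs.

According to the default tuning constants of NCCL 2.28, for two GPUs connected by NVLink with the Tree algorithm and the LL protocol, the per-hop (single-step) hardware latency is 0.6 µs.

For the reference, see line 151 of src/graph/tuning.cc in the NCCL GitHub repository.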

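6. High-performance networking

Rail-optimized networking and Clos are both high-performance network designs. Which of the following statements are correct?

Option	Description
A	In a Rail-optimized network, GPUs that come from different HB domains (High-Bandwidth Domains) but share the same local rank are connected to the same rail switch, to reduce cross-domain communication latency
B	In common deployments, the main advantage of a Rail-optimized network over a traditional Clos network is that it needs no Spine-layer switches at all, which greatly reduces network equipment cost
C	Clos networks have scalability problems at large scale because they use the Spanning Tree Protocol (STP), and this is one of the core problems Rail-optimized networks aim to solve
D	A Rail-optimized network guarantees that any two GPUs in the cluster can communicate at network line rate (e.g. 400 Gbps InfiniBand) under all circumstances, whether or not they are on the same rail
E	The PXN feature introduced in NCCL 2.12 can combine NVLink and PCI communication to optimize network traffic, and this optimization is especially important for Rail-optimized networks
F	For LLM training workloads, the optimal communication strategy concentrates most network traffic between NICs of the same local rank and makes heavy use of high-speed interconnects such as NVLink for cross-rail exchange, which makes the Rail-optimized architecture particularly suited to this scenario

The correct answers are A, E, and F.

Explanation:

Option A is correct. This is the core definition of a Rail-optimized network. In this architecture the network topology is physically partitioned by GPU rank. For example, GPU 0 of every server connects to the same group of switches (Rail 0), and GPU 1 connects to another group (Rail 1). During data-parallel (Data Parallelism) training, operations such as AllReduce can then stay within a single rail without crossing complex switching tiers, which greatly reduces congestion and latency.
Option B is wrong. Rail-optimized is still based on the Clos architecture and usually needs Spine switches. Rail-optimized describes how Leaf switches connect to GPUs and how traffic is steered, not a new topology without a Spine. For large clusters (beyond the capacity of one Leaf switch), the Leaf switches of Rail 0 still need to be interconnected through Spine switches to form a complete Rail 0 network plane.
Option C is wrong. Clos networks do not use STP; STP is a pain point of traditional Ethernet. Traditional layer-2 Ethernet uses STP (Spanning Tree Protocol) to prevent loops, which blocks many links and wastes bandwidth. Modern data center Clos networks (whether IP-routed with ECMP or InfiniBand) are designed from the start to load-balance over all links and drop STP entirely. So the premise of option C is itself wrong.
Option D is wrong. Rail-optimized does not guarantee that cross-rail communication is as efficient as same-rail communication. Its design philosophy is dedicated lanes for dedicated traffic. Cross-rail communication through the Spine is physically possible (for example GPU 0 of Node A sending to GPU 1 of Node B), but it is usually not the optimal path and may run into oversubscription. In practice this architecture relies on the techniques mentioned in options E and F to avoid cross-rail transfers at the network level.
Option E is correct. PXN (PCI × NVLink) is the key to making the rail architecture flexible. In a Rail-optimized network, if GPU 0 needs to send data onto Rail 1, the traditional path is very inefficient (PCIe -> CPU -> NIC -> Switch -> ...). NCCL's PXN feature lets GPU 0 pass the data over NVLink directly to GPU 1 in the same machine, and GPU 1's NIC (attached to Rail 1) then sends it out. This effectively switches rails inside the node over NVLink, making full use of the rail network.
Option F is correct. It accurately describes the hybrid communication pattern of LLM training, which usually combines data parallelism (DP) with model parallelism (TP/PP).

+ TP (Tensor Parallelism) traffic is very heavy and is usually confined to a single machine, entirely over NVLink.
+ DP (Data Parallelism) needs to synchronize gradients across machines. The traffic flows between GPUs of the same rank, which matches the network paths of a Rail-optimized design perfectly.
+ For cross-rank operations (such as certain stages of Pipeline Parallelism or specific All-to-All patterns), combining NVLink (intra-node) with Rail (inter-node) is currently the best strategy.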

7. GPU

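NVIDIA's Hopper architecture introduced TMA (Tensor Memory Accelerator) to improve GPU memory access efficiency. Which of the following statements are correct?

Option	Description
A	Compared with cp.async, TMA can load data from global memory directly into shared memory without staging through registers, which saves registers
B	In cutlass's asynchronous pipeline abstraction, the Producer calls producer_acquire to obtain a free buffer stage and, after loading the data, calls producer_commit to notify the Consumer; the Consumer waits for the data with consumer_wait and, when done, calls consumer_release to free the buffer
C	When transferring data with TMA, all participating threads must execute the same TMA instruction, and the TMA hardware automatically coordinates between threads
D	Cutlass Pipeline uses multi-stage buffering and tracks the current read/write stage index and phase through PipelineState, overlapping the Producer and Consumer in a pipeline
E	TMA's multicast feature lets a single TMA operation broadcast the same block of data to the shared memory of multiple Thread Blocks within a Cluster, reducing redundant global memory accesses
F	The TMA descriptor (TMA Descriptor) must be created on the host before the kernel launches; it contains the tensor's shape, strides, swizzle mode, and so on, and during kernel execution the descriptor is prefetched (prefetch_tma_descriptor) to reduce the latency of the first TMA operation

The correct answers are B, D, E, and F.

Explanation:

- Option A is wrong. The cp.async instruction introduced with Ampere also bypasses the register file and moves data directly from global memory (GMEM) to shared memory (SMEM).
- Option B is correct. It describes how the Producer and Consumer interact in the Cutlass asynchronous pipeline, in line with Cutlass's design.
- Option C is wrong. TMA lets threads in a group issue the TMA instruction selectively; not all threads have to execute the same instruction. If all threads did, multiple duplicate copies would be issued (unless special masking is applied). This differs from Ampere's cp.async, where each thread usually handles part of the copy.
- Option D is correct. Cutlass Pipeline does use multi-stage buffering and tracks the current read/write stage index and phase through PipelineState, overlapping the Producer and Consumer.
- Option E is correct. TMA's multicast feature lets a single TMA operation broadcast the same block of data to the shared memory of multiple Thread Blocks in a Cluster, reducing redundant global memory accesses.
- Option F is correct. The TMA descriptor must be created on the host before kernel launch and contains the tensor's shape, strides, swizzle mode, and so on; during kernel execution the descriptor is prefetched to reduce the latency of the first TMA operation.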

8. LLM

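Consider a standard Transformer-Decoder model with the parameters below. All all-reduce operations use ring all-reduce, and there are 4 GPUs in total.

Model parameters

Parameter	Value
Layers	32
Hidden dimension (h)	4096
FFN structure	Two linear layers, intermediate dimension 4h
Sequence length	2048
Batch Size	32
Optimizer	Adam + mixed-precision training
Precision	Parameters and gradients in fp16; Adam optimizer states in fp32 (including momentum, variance, and master weights)

Question

Compute, for each of the following three parallelism schemes, the amount of data each GPU has to send (in GB) for the forward and backward pass of one batch:

Data parallelism: each GPU holds the full model, the batch is split evenly across GPUs, and after computation the gradients are All-Reduced.
Pipeline parallelism: the model is split by layers across GPUs; only activations are sent in the forward pass and gradients in the backward pass. (Only consider a GPU in the middle when computing the communication volume.)
Tensor parallelism: for MHA, split by head across GPUs. For the FFN, the first linear layer is split along the output dimension and the second along the input dimension.

The calculation is as follows (1 GB is taken as $2^{30}$ bytes):
- Data parallelism:
  - Model parameter count (per layer $4h^2$ for attention plus $8h^2$ for the FFN):
    $$
    P = 32 \times 12 \times (4096 ^ 2)
    $$
  - Gradient size:
    $$
    G = P \times 2 \text{ bytes (fp16)} = 12 \text{ GB}
    $$
  - Communication volume:
    $$
    \text{volume} = 2 \times \frac{N-1}{N} \times G = 18 \text{ GB}
    $$
- Pipeline parallelism:
  - Activation size at a stage boundary:
    $$
    A = 32 \times 2048 \times 4096 \times 2 \text{ bytes (fp16)} = 0.5 \text{ GB}
    $$
  - Communication volume (one activation forward, one gradient backward):
    $$
    \text{volume} = 2 \times A = 1 \text{ GB}
    $$
- Tensor parallelism:
  - Size of one All-Reduce:
    $$
    AR = 32 \times 2048 \times 4096 \times 2 \text{ bytes (fp16)} = 0.5 \text{ GB}
    $$
  - Volume of one Ring All-Reduce:
    $$
    \text{volume per All-Reduce} = 2 \times \frac{N-1}{N} \times AR = 0.75 \text{ GB}
    $$
  - Total volume (4 All-Reduces per layer: MHA and FFN, forward and backward):
    $$
    \text{volume} = 32 \times 4 \times 0.75 \text{ GB} = 96 \text{ GB}
    $$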
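The three numbers can be reproduced with a short script. It is a sketch under the same assumptions as the calculation above (12h² parameters per layer, ring all-reduce sending 2(N-1)/N of the buffer per GPU, 4 all-reduces per layer for tensor parallelism, 1 GB = 2^30 bytes); the function name and defaults are made up.

```python
GIB = 2**30


def comm_volumes_gib(layers=32, h=4096, seq=2048, batch=32, n=4, bytes_per_elem=2):
    """Per-GPU send volume in GiB for data, pipeline, and tensor parallelism."""
    ring = 2 * (n - 1) / n                       # ring all-reduce send factor per GPU

    grad_bytes = layers * 12 * h * h * bytes_per_elem
    dp = ring * grad_bytes                       # one all-reduce over all gradients

    act_bytes = batch * seq * h * bytes_per_elem
    pp = 2 * act_bytes                           # activation forward + gradient backward

    tp = layers * 4 * ring * act_bytes           # 4 all-reduces per layer
    return dp / GIB, pp / GIB, tp / GIB
```

Calling `comm_volumes_gib()` returns `(18.0, 1.0, 96.0)`, and changing `n` shows how the data-parallel and tensor-parallel volumes scale with the ring factor while the pipeline volume does not.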

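9. UB interconnect

In high-performance computing systems, the performance of collective communication (Collective Communication) is mainly limited by two factors: bandwidth (Bandwidth) and latency (Latency).

NVIDIA builds a high-speed scale-up interconnect between GPUs with NVLink and NVSwitch, while Huawei has proposed the Unified Bus (UB) protocol as a unified interconnect and memory access mechanism for NPUs. The UB protocol is based on Huawei's in-house UB Switch chip and connects over the high-bandwidth physical link HCCS (High-Capacity Coherent System).

Traditional AI clusters usually scale out with 8-GPU servers as the basic unit, whereas in the CloudMatrix 384 (CM384) architecture Huawei uses a two-level UB Switch network to build 384 Ascend 910C NPUs into one unified supernode (SuperPod). Within the supernode, all NPUs sit in the same low-latency rail-optimized network, giving a fully peer-to-peer scale-up interconnect.

CM384 further divides the UB network into 7 mutually independent physical planes. The 7 HCCS interfaces of each NPU connect to different switching planes, which guarantees that during large-scale parallel communication the data flows are fully isolated on physical paths with no link conflicts.

Question

In the standard fully populated deployment of CloudMatrix 384, to support a non-blocking, fully peer-to-peer scale-up interconnect among 384 Ascend 910C NPUs, the system uses a two-level switching architecture. The physical topology of the supernode uses:

____ Level 1 UB Switches
____ Level 2 UB Switches
and finally achieves a theoretical system-level aggregate bandwidth of ____ GB/s

Assume that a single port of the switch chip provides 28 GB/s of communication bandwidth.

The answer to this question can be found directly in Huawei's CloudMatrix 384 white paper.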

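10. Cache behavior analysis

Suppose we need to perform a matrix multiplication $C=A×B$.

Test environment

To simplify the analysis, assume:

Parameter	Configuration
Data type	double (8 Bytes)
L1 Cache size	4KB (4096 Bytes)
Associativity	Direct mapped (Direct Mapped, E=1)
Block size	64 Bytes (1 Cache Line holds 8 doubles)
Matrix size	A, B, C are all 64×64 square matrices (N=64)
Storage layout	Arrays are stored in row-major order
Memory alignment	The start addresses of A, B, C are all aligned to the first Set of the cache

Code

// Assume the variable sum has been optimized into a register; ignore memory accesses to C
// Only consider the reads of A and B in the inner loop
for (int j = 0; j < 64; ++j) { // Loop 1
    for (int i = 0; i < 64; ++i) { // Loop 2
        double sum = 0.0;
        for (int k = 0; k < 64; ++k) { // Loop 3
            sum += A[i][k] * B[k][j];
        }
        C[i][j] = sum;
    }
}

Question 10.1

We want to analyze how the innermost loop, Loop 3, accesses matrix $B$.

The cache has $4096/64=64$ Sets in total.

While computing $C[i][j]$ (i.e. one complete Loop 3), which statement about the Cache Miss Rate of $B[k][j]$ is correct?

Option	Description
A	12.5% - there is good spatial locality, with only 1 miss per 8 doubles
B	25% - although access is column-wise, the cache is large enough, so there are only cold misses
C	About 50% - A and B fight each other (conflict), evicting half the data
D	100% - severe Cache Thrashing occurs, and every read is a miss

💡 Hint: compute the address difference (Stride) between accesses to $B[k][j]$ and $B[k+1][j]$, and the span between the Set Indexes they map to.

The correct answer is D.

Explanation:

Matrix B is stored in row-major order, so when accessing B[k][j], each increment of k jumps a whole row in memory.
Compute the address difference (Stride):
$$
\text{Stride} = \text{sizeof(double)} \times N = 8 \text{ Bytes} \times 64 = 512 \text{ Bytes}
$$
Each access to B[k][j] advances the address by 512 Bytes, while a Cache Line is 64 Bytes, so every access lands on a different Cache Line.
Compute the Set Index span:
$$
\text{Set Index Span} = \frac{\text{Stride}}{\text{Cache Line Size}} = \frac{512 \text{ Bytes}}{64 \text{ Bytes}} = 8
$$
The cache has 64 Sets, so with a span of 8 the 64 accesses for k = 0 to 63 touch 64 different cache lines but only 64 / 8 = 8 distinct Sets. Every 8 steps the access wraps around to a Set that is already occupied and, since the cache is direct mapped (one line per Set), evicts the line stored there.
Within one Loop 3 every access therefore touches a new line, and by the time the same line is needed again (for the next i) it has already been evicted. The result is that every read of B[k][j] is a Cache Miss, i.e. Cache Thrashing.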
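The 100% miss rate can be confirmed with a tiny direct-mapped cache model. The sketch below only tracks accesses to B (the interference from A is ignored, which can only add more misses), assumes B starts at an address that maps to Set 0, and uses a made-up function name.

```python
def b_column_misses(n=64, line=64, sets=64, elem=8, passes=2, j=0):
    """Count misses on B[k][j] in a direct-mapped cache over repeated Loop 3 passes."""
    cache = [None] * sets            # the single line resident in each set
    misses = accesses = 0
    for _ in range(passes):          # consecutive values of i reuse the same column j
        for k in range(n):
            addr = (k * n + j) * elem
            line_id = addr // line
            s = line_id % sets
            accesses += 1
            if cache[s] != line_id:
                misses += 1
                cache[s] = line_id   # evict whatever was there
    return misses, accesses
```

With the default parameters the function returns `(128, 128)`: even the second pass over the same column misses every time, because each of the 8 Sets in use has been overwritten 8 times in between.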

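Question 10.2
To further improve the efficiency of the matrix multiplication, we decide to use blocking. You split the matrices into $8×8$ tiles (Block Size = 8).

// 8x8 blocking demo
for (int jj = 0; jj < 64; jj += 8) {
    for (int ii = 0; ii < 64; ii += 8) {
        for (int kk = 0; kk < 64; kk += 8) {
            // process the 8x8 sub-block multiplication here
            for (int j = jj; j < jj + 8; ++j) {
                for (int i = ii; i < ii + 8; ++i) {
                    double sum = C[i][j]; // simplified
                    for (int k = kk; k < kk + 8; ++k) {
                        sum += A[i][k] * B[k][j];
                    }
                    C[i][j] = sum;
                }
            }
        }
    }
}

For one $8×8$ sub-block of matrix $B$ (assume the sub-block has been preloaded), which analysis of its state in the L1 Cache during the computation inside this sub-block is correct?

Option	Description
A	An 8×8 sub-block is 512 Bytes, far smaller than the cache, so there are no conflicts at all and all data can stay in the cache
B	Although the sub-block is small, the original row width (Stride) of B is large, so all 8 rows of the sub-block map to the same Set and severe conflicts remain
C	The 8 rows of the sub-block map to 8 different Sets (Set indexes 8 apart), and no self-conflict (Self-Conflict) occurs during the sub-block computation
D	Blocking is mainly for exploiting the L2/L3 cache; for such a small L1 Cache (4KB), 8×8 blocking is pointless

The correct answer is C.

Explanation:

The size of an 8×8 sub-block is:
$$
\text{Block Size} = 8 \times 8 \times \text{sizeof(double)} = 64 \times 8 \text{ Bytes} = 512 \text{ Bytes}
$$
The sub-block (512 Bytes) is indeed much smaller than the L1 Cache (4KB), but the key is the access pattern.
While processing the sub-block, each increment of k in B[k][j] still jumps a whole row in memory.
Compute the Set Index span:
$$
\text{Set Index Span} = \frac{\text{Stride}}{\text{Cache Line Size}} = \frac{512 \text{ Bytes}}{64 \text{ Bytes}} = 8
$$
So the 8 rows of the sub-block map to 8 different Sets, and no self-conflict (Self-Conflict) occurs during the sub-block computation.
Blocking makes effective use of cache locality: each line of the sub-block is reused many times before it is evicted, which reduces conflicts and raises the hit rate.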
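The same kind of toy cache model shows what happens inside one sub-block. As before this is a sketch that only tracks accesses to B, assumes B starts at Set 0, and uses a made-up function name; it starts from a cold cache, so the 8 cold misses it reports correspond to the preload that the question assumes has already happened.

```python
def blocked_b_misses(kk=0, jj=0, bs=8, n=64, line=64, sets=64, elem=8):
    """Misses, accesses, and sets used by B while computing one bs x bs sub-block."""
    cache = [None] * sets
    misses = accesses = 0
    used_sets = set()
    for j in range(jj, jj + bs):
        for _i in range(bs):                 # i only selects the row of A and C
            for k in range(kk, kk + bs):
                line_id = ((k * n + j) * elem) // line
                s = line_id % sets
                used_sets.add(s)
                accesses += 1
                if cache[s] != line_id:
                    misses += 1
                    cache[s] = line_id
    return misses, accesses, sorted(used_sets)
```

For the first sub-block this returns `(8, 512, [0, 8, 16, 24, 32, 40, 48, 56])`: the 8 rows occupy 8 distinct Sets spaced 8 apart, and after the 8 initial loads all remaining 504 accesses hit.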

Problem C

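]]> blog HPC games parallel computing High Performance Computing AI Infrastructure

SF3D paper reading notes //blog/SF3D/

Introduction

Mesh reconstruction is a direction I have only just started learning about. Today I read the paper SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement, and these notes are for my own later reference.

After reading the paper, I feel mesh reconstruction is quite different from point cloud reconstruction, especially given the several new mesh-specific modules this paper introduces; it seems somewhat more complex than point cloud reconstruction. OK, enough preamble, let's get to the point.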

Introduction

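The authors start by raising several issues:
Issues raised by SF3D

Light bake-in: existing models bake lighting directly into the texture, which makes the generated mesh hard to reuse. SF3D addresses this with explicit illumination and a different shading model based on Spherical Gaussians (first row of the figure above).
Vertex Coloring: in existing work the number of generated vertices is too large, which makes performance costly. The authors argue that one key problem is the extra processing time of UV unwrapping, so they propose a highly parallelizable, fast box projection-based UV unwrapping method (second row of the figure above). This cuts the time from 10-30 s to 0.5 s, and judging from the figure the details are better than the TripoSR baseline.
Marching Cube Artifacts: feed-forward networks usually generate volumetric grids similar to Triplane NeRFs and then extract the mesh with marching cubes, but this introduces artifacts. The authors propose an architecture that is more efficient for high-resolution triplanes and use DMTet together with predicted vertex displacements and a normal map to produce the final mesh, which effectively reduces the artifacts introduced by marching cubes (third row of the figure above).
Lack of Material Properties: meshes generated by existing work look dull under different lighting because explicit material properties are missing. To solve this, the authors predict non-spatially varying material properties (rows 4 and 5 of the figure above).

With these improvements, SF3D can generate a high-quality mesh from a single image; the resulting 3D asset is small (1 MB) and can be generated within 0.5 s.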

Method

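To solve the problems above, the authors propose SF3D.

First, SF3D is an improvement on TripoSR. TripoSR trains a transformer that generates a Triplane 3D representation. It encodes the image with DINO and feeds the tokens into the transformer, which outputs a triplane at $64 \times 64$ resolution; the triplane features are then decoded into color and rendered as a standard NeRF. TripoSR only learns colors and cannot handle material properties such as reflection.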

Overview

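The overall architecture of SF3D is shown below:
SF3D architecture diagram
As shown, SF3D consists of 5 main modules:

Enhanced Transformer: predicts high-resolution triplane features.
Material Estimation: predicts material properties.
Illumination Modeling: handles lighting.
Mesh extraction and refinement: extracts the mesh from the triplane and refines it.
UV Unwrapping and Export: produces a low-poly mesh and a high-resolution texture map.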

Enhanced Transformer

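To generate high-resolution triplane features, the authors improve TripoSR's transformer, mainly in the following ways:

First, the authors replace DINO with DINOv2, which gives better image features.
Second, the authors discuss the aliasing problem caused by the triplane.

As shown in the figure above, a low-resolution triplane causes aliasing, but simply raising the triplane resolution makes the model more complex. The authors say they took inspiration from PointInfinity (which provides an architecture that does not need self-attention over the triplane), and so raised the resolution to $96 \times 96$, which reduces aliasing.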

Material Estimation

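SF3D outputs two material properties, metallic and roughness. The paper notes that ideally material properties would be spatially varying, but this is not realistic. So the authors simplify the problem and predict these two properties for the whole object. They note that although such non-spatially-varying material properties are mainly suited to homogeneous objects, in practice they significantly improve rendering quality.

To make this prediction, the authors introduce a Material net: the image is first encoded with a CLIP encoder, and then 2 MLPs predict metallic and roughness.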

Illumination Modeling

作者提出要显式 estimating 光照, 如果不这样做的话, 输出的 RGB 颜色会将光照信息 bake 进去, 使得生成的 mesh 难以利用. 为此, 作者提出了一个 Light net, estimate SG 光照. 因为 triplane encode 了场景的几何信息, 所以可以能够推断光照变化.

具体实现上, 作者使用 Transformer 输出的 $96 \times 96$ 分辨率的 triplane 作为输入, 使其通过 2 个 CNN 层, 接着进行 max pool,
最后通过一个 MLP 。 Light Net 输出 24 个 SG 的 grayscale amplitude values, 并使用 Softplus 以确保值为正数。这些 SG 的轴和锐度值保持固定, 其设置旨在覆盖整个球体。
利用这些振幅值, 作者实施了一种类似于 NeRD [4] 中使用的 deferred physically based rendering 方法.

此外, 作者的方法在训练阶段还引入了一个 lighting demodulation loss $\mathcal{L}_{\text{Demod}}$, 该损失函数旨在确保：一个具有 entirely white albedo 的物体上的光照,
能与输入图像的亮度紧密匹配。 lighting demodulation loss 强制学习到的光照与训练数据中观察到的光照条件保持一致.
这可以被视为一种 bias, 用于解决 appearance 和 shading 之间的 ambiguity.

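Mesh Extraction and Refinement

To extract a mesh from the triplane, the authors use DMTet. They add two MLP heads that predict vertex offsets and vertex normals. Inspired by MeshLRM, they also use separate decoder MLPs to help train these two heads.
They find that the vertex offsets reduce aliasing, while the vertex normals improve fine detail. Because the normal predictions are inaccurate early on, the authors use slerp to stabilize training during the first 5K steps.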
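The slerp stabilization can be sketched as below. The `slerp` function is the standard spherical linear interpolation between unit vectors; the linear warm-up schedule in `blended_normal` is an assumption made for illustration (the notes only say slerp is used during the first 5K steps), and both function names are hypothetical.

```python
import math


def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b, t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)                 # angle between the two normals
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        return tuple(a)                    # (anti)parallel: interpolation is ill-defined
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return tuple(wa * x + wb * y for x, y in zip(a, b))


def blended_normal(geometry_normal, predicted_normal, step, warmup_steps=5000):
    """Fade from the geometry normal to the predicted normal over the warm-up steps.

    The linear schedule is an assumed choice, not taken from the paper.
    """
    t = min(1.0, step / warmup_steps)
    return slerp(geometry_normal, predicted_normal, t)


halfway = slerp((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5)
```

Unlike a plain linear blend, slerp keeps the interpolated vector on the unit sphere, so no renormalization is needed.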

Several losses are then introduced to train the mesh extraction and refinement module:

$$\mathcal{L}_{\text{Nrmconsistency}}$$: normal consistency loss
$$\mathcal{L}_{\text{Laplacian}}$$: Laplacian smoothness loss
$$\mathcal{L}_{\text{Offset}} = v_o^2$$: vertex offset regularization
$$\mathcal{L}_{\text{Nrmrepl}} = 1 - n \cdot \hat{n}$$: normal replication loss
$$\mathcal{L}_{\text{Nrmsmooth}} = (\hat{n}(x) - \hat{n}(x + \epsilon))^2$$: normal smoothness loss

UV Unwrapping and Export

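The final stage of SF3D is an efficient export pipeline. The key challenge is that conventional UV unwrapping is computationally expensive, which conflicts with the goal of fast generation. The authors therefore propose an unwrapping method based on cube projection: each face's normal independently decides its projection direction, which makes the unwrapping parallelizable.
In the implementation, 2D triangle-triangle intersection tests handle occlusion in the UV atlas, and intersecting faces are reassigned according to depth and proximity. UV islands are rotated to follow the radial $z$ tangent direction in order to minimize shading seams. The world positions and occupancy are then baked onto the UV atlas through the unwrapping and used to query albedo and surface normals from the triplane. To prevent seam artifacts, the authors use an iterative process with $3 \times 3$ partial convolutions and max pooling to extend the UV borders, so that textures blend smoothly outward.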
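The per-face decision at the core of cube projection can be illustrated with a short sketch: each face picks the cube side whose axis best matches its normal, with no dependence on any other face. This only shows that first assignment step under assumed names (`cube_face_for_normal`, `assign_faces`); the occlusion handling and island rotation described above are not covered.

```python
def cube_face_for_normal(normal):
    """Pick the cube side (+x, -x, +y, -y, +z, -z) matching the dominant normal component."""
    axis = max(range(3), key=lambda k: abs(normal[k]))
    sign = "+" if normal[axis] >= 0 else "-"
    return sign + "xyz"[axis]


def assign_faces(face_normals):
    """Group face indices by projection direction; every face is decided independently."""
    groups = {}
    for index, normal in enumerate(face_normals):
        groups.setdefault(cube_face_for_normal(normal), []).append(index)
    return groups


groups = assign_faces([(0.0, 0.0, 1.0), (0.0, 0.2, 0.9), (-1.0, 0.0, 0.0)])
```

Because the loop body looks only at one normal at a time, the same rule maps directly onto a batched GPU operation.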

Finally, everything is exported as a single glb file.

Overall Training and Loss Functions

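Training directly on the mesh rendering task gives unsatisfactory results, so the authors first pretrain on the NeRF task. After pretraining, the model switches to mesh training, replacing NeRF rendering with differentiable mesh rendering and SG-based shading.

The per-stage loss functions are: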
$$
\begin{split}
\mathcal{L}_{\rm render} &= \underbrace{\lambda_{\rm MSE}}_{10}\mathcal{L}_{\rm MSE} + \underbrace{\lambda_{\rm LPIPS}}_{2}\mathcal{L}_{\rm LPIPS} + \underbrace{\lambda_{\rm Mask}}_{10}\mathcal{L}_{\rm Mask} \\
\mathcal{L}_{\rm mesh} &= \underbrace{\lambda_{\rm Laplacian}}_{0.01}\mathcal{L}_{\rm Laplacian} + \underbrace{\lambda_{\rm Nrmconsistency}}_{0.001}\mathcal{L}_{\rm Nrmconsistency} + \underbrace{\lambda_{\rm Offset}}_{0.1}\mathcal{L}_{\rm Offset} \\
\mathcal{L}_{\rm shading} &= \underbrace{\lambda_{\rm Nrmrepl}}_{0.2}\mathcal{L}_{\rm Nrmrepl} + \underbrace{\lambda_{\rm Nrmsmooth}}_{0.02}\mathcal{L}_{\rm Nrmsmooth} + \underbrace{\lambda_{\rm Demod}}_{0.01}\mathcal{L}_{\rm Demod}
\end{split}
$$
The total loss is:
$$
\mathcal{L}=\mathcal{L}_{\rm render}+\mathcal{L}_{\rm mesh}+\mathcal{L}_{\rm shading}
$$

Results

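The authors evaluate SF3D on the GSO and OmniObject3D datasets. The results are shown below:
Results figure
SF3D is clearly better than the other methods visually, and it also improves the quantitative metrics significantly.

As for speed, SF3D's UV unwrapping is indeed very fast, as the authors claim: it takes only 0.5 s, far less than the 10-30 s of other methods.
Speed comparison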

Conclusion

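That roughly covers the main structure of SF3D, which generates a high-quality mesh from a single image. Could the same be done for video? This task clearly relies on a large amount of generative prior knowledge, and I wonder whether a purely image-based 3D reconstruction method could work without depending on such priors.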

]]> blog 3Dreconstruction paper reading mesh reconstruction Reading ViT Transformer? (I suppose it counts as reading) //blog/ViT_transformer/ Introduction

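Today, with 2026 almost here, ViT looks fairly simple next to current architectures. My strongest impression while reading the paper was that it carries the wild, unrestrained energy of the period when Transformers were spreading into every field.
Still, it is a milestone for Transformers in computer vision, and as a beginner in this area I felt I should read the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale and keep a short record.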

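Overall structure of ViT

The overall structure of ViT is shown below:

Its special treatment lies in the input stage. A conventional CNN extracts local information by sliding kernels over the image, and the output of such a CNN is hard to feed directly into a Transformer, because a Transformer expects a sequence as input while a CNN outputs a three-dimensional feature map.

So, unlike other approaches of the same period, ViT directly splits the input image into small patches, flattens each patch, and maps it into a fixed-dimensional vector space. This forms a sequence that can be fed straight into a Transformer.

Concretely, suppose the input image has size $H \times W \times C$ (height, width, channels). We split it into non-overlapping patches of size $P \times P$, which gives $N = \frac{HW}{P^2}$ patches in total.
Each patch is flattened into a vector and mapped by a linear projection into a $d$-dimensional space, forming the input matrix $X \in \mathbb{R}^{N \times d}$.
To let the model capture positional information, ViT also adds learnable position embeddings to the input sequence to form the final input representation.
This sequence is then fed into a standard Transformer Encoder, and after several Transformer Encoder Layers we obtain the final output representation.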

A concrete example of the shape changes looks roughly like this:
$$
\begin{aligned}
X \in \mathbb{R}^{224 \times 224 \times 3} &\rightarrow 196 \times \text{Patches}^{16 \times 16 \times 3} \rightarrow \text{Flattened Patches}^{196 \times 768} \\
&\rightarrow \text{Transformer Input}^{197 \times 768} \rightarrow \text{Transformer Output}^{197 \times 768} \rightarrow \text{Classifier Output}^{1 \times 1000}
\end{aligned}
$$
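The shape bookkeeping above can be checked with a small dependency-free sketch. `vit_token_shapes` reproduces the arithmetic $N = HW/P^2$ plus the extra class token, and `patchify` shows the split-and-flatten step on a nested-list image; both names are hypothetical helpers, and the learned linear projection is deliberately left out.

```python
def vit_token_shapes(height, width, channels, patch):
    """Return (num_patches, flattened_patch_dim, sequence_length_with_class_token)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)   # N = HW / P^2
    patch_dim = patch * patch * channels                 # length of one flattened patch
    return num_patches, patch_dim, num_patches + 1       # +1 for the class token


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([value
                            for row in image[top:top + patch]
                            for pixel in row[left:left + patch]
                            for value in pixel])
    return patches


shapes = vit_token_shapes(224, 224, 3, 16)
tiny = [[[r * 4 + c] for c in range(4)] for r in range(4)]   # 4 x 4 image, 1 channel
tiny_patches = patchify(tiny, 2)
```

For the 224 x 224 x 3 example this gives 196 patches of length 768 and a sequence of 197 tokens, matching the chain of shapes above.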
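Why does the added class token work?

In a transformer every pair of tokens can attend to each other, so the class token can attend to all patch tokens and aggregate global information. We can then use the class token in the final output for classification.

Some other details

Compared with a CNN, ViT has very little built-in prior (inductive bias), so it does not perform well on small and medium-sized datasets. The paper notes that it needs pretraining on a large-scale dataset followed by fine-tuning to achieve good results.

The encoder layers of ViT are the same as in the original Transformer: each consists of Multi-Head Self-Attention and a Feed-Forward Neural Network (FFN), computed as in the Transformer, so I will not repeat the details here.

Overall, by splitting the image into patches and processing them with a Transformer, ViT offers a new way to approach image classification in computer vision. It performs very well on large-scale datasets and has become an important milestone in the field.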

]]> blog Learning note computer vision model architecture A review of the Transformer //blog/transformer/ Introduction

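The Transformer was introduced in the paper Attention is All You Need. I had planned to read the original paper, but time is short, so here I will just walk through the main ideas.

Overall structure

The overall structure of the Transformer is shown below:
Transformer architecture diagram
It consists of two main parts, the Encoder and the Decoder.

The Transformer workflow:
- First obtain the representation $X$ of each input word; $X$ is the sum of the word embedding and the position embedding.
- Then feed $X$ into the Encoder; after several Encoder Layers we obtain the encoded representation $Z$.
  - $Z$ is an $n \times d$ matrix, where $n$ is the sequence length and $d$ is the word-vector dimension.
- Next feed the target-sequence input $Y$ into the Decoder; after several Decoder Layers, combined with the Encoder output $Z$, we obtain the prediction $\hat{Y}$.
  - In use, when translating word $i + 1$, a Mask hides the future information so that the model cannot see later words while predicting.

OK, let us now look at the structure of the Encoder Layer and the Decoder Layer in detail.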

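The Self-Attention mechanism

The core of the Transformer is the Self-Attention mechanism. Its structure is shown below:

Self-Attention architecture diagram

The left side is the Encoder block.
The right side is the Decoder block.
The part circled in red is the Multi-Head Attention mechanism, which is composed of several Self-Attention heads.

The Encoder block contains one Multi-Head Attention layer.
The Decoder block contains two Multi-Head Attention layers: the first processes the target-sequence input, and the second combines it with the Encoder output.
Each block ends with a **Feed-Forward Neural Network (FFN)** layer.

Because Self-Attention is the core of the Transformer, we focus on how it is computed.

Self-Attention computation flow diagram

The figure above shows the Self-Attention computation flow. It uses three matrices: Query ($Q$), Key ($K$), and Value ($V$). In practice all three are obtained from the input representation $X$ through linear transformations.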

Computing Q, K, V

In Self-Attention, for an input representation $X \in \mathbb{R}^{n \times d}$, we use the linear transformation matrices $W_Q, W_K, W_V \in \mathbb{R}^{d \times d_k}$ to compute $Q, K, V$:
$$
Q = X W_Q, \quad K = X W_K, \quad V = X W_V
$$

Implementation

from math import sqrt

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v):
        """
        input: X: (batch_size, n, d_model)
        q: (batch_size, n, d_k)
        k: (batch_size, n, d_k)
        v: (batch_size, n, d_v)
        """
        super(SelfAttention, self).__init__()
        self.d_k = d_k
        self.W_Q = nn.Linear(d_model, d_k)
        self.W_K = nn.Linear(d_model, d_k)
        self.W_V = nn.Linear(d_model, d_v)
        self._norm_factor = sqrt(d_k)

    def forward(self, X):
        Q = self.W_Q(X)  # Q: (batch_size, n, d_k)
        K = self.W_K(X)  # K: (batch_size, n, d_k)
        V = self.W_V(X)  # V: (batch_size, n, d_v)

        # Scaled dot-product scores between every pair of positions.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self._norm_factor  # (batch_size, n, n)
        attn_weights = torch.softmax(scores, dim=-1)  # (batch_size, n, n)
        output = torch.matmul(attn_weights, V)  # (batch_size, n, d_v)

        return output

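Once we have $Q, K, V$, we can compute the Attention output:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
$$

After computing $QK^T$ and scaling by $\sqrt{d_k}$, Softmax normalizes each row so that every row sums to 1.

Finally the normalized weight matrix is multiplied by $V$ to give the Attention output.

The first row of the softmax matrix can be read as how much word 1 attends to every word (itself included); the output $Z_1$ for word 1 is the weighted sum of the values $V$ of all words.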

Multi-Head Attention

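In the previous step we saw how to compute the Attention output with Self-Attention. The Transformer, however, uses Multi-Head Attention, whose structure is shown below:

Multi-Head Attention architecture diagram

Multi-Head Attention contains several parallel Self-Attention heads, each with its own set of linear transformation matrices $W_Q^i, W_K^i, W_V^i$.

The input $X$ is first passed to each of the h Self-Attention heads, producing h different Attention outputs. An implementation follows (the original paper uses h = 8):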

from math import sqrt

import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v, h):
        """
        input: X: (batch_size, n, d_model)
        per-head q: (batch_size, n, d_k)
        per-head k: (batch_size, n, d_k)
        per-head v: (batch_size, n, d_v)
        """
        super(MultiHeadAttention, self).__init__()
        self.h = h
        self.d_k = d_k
        self.d_v = d_v

        # One independent projection per head.
        self.W_Q = nn.ModuleList([nn.Linear(d_model, d_k) for _ in range(h)])
        self.W_K = nn.ModuleList([nn.Linear(d_model, d_k) for _ in range(h)])
        self.W_V = nn.ModuleList([nn.Linear(d_model, d_v) for _ in range(h)])
        self.linear = nn.Linear(h * d_v, d_model)  # output projection W_O

    def forward(self, X):
        heads = []
        for i in range(self.h):
            Q = self.W_Q[i](X)
            K = self.W_K[i](X)
            V = self.W_V[i](X)

            scores = torch.matmul(Q, K.transpose(-2, -1)) / sqrt(self.d_k)
            attn_weights = torch.softmax(scores, dim=-1)
            head = torch.matmul(attn_weights, V)  # (batch_size, n, d_v)
            heads.append(head)

        concat_heads = torch.cat(heads, dim=-1)  # (batch_size, n, h * d_v)
        output = self.linear(concat_heads)  # (batch_size, n, d_model)

        return output

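After obtaining the h outputs (8 in the original paper), we concatenate them along the last dimension and map the result back to the original dimension $d_{model}$ with a linear transformation matrix $W_O$.

The output of Multi-Head Attention therefore has the same shape as its input, which makes it easy to connect to the following layers.

Other components

The remaining layers are simple, so I will not go through them.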

Decoder Layer

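The structure of the Decoder Layer is shown inside the red box in the figure below:

Its main difference from the Encoder Layer is an additional Masked Multi-Head Attention layer. This layer processes the target-sequence input and hides future information when computing Attention, so that the model cannot see later words while predicting.

The first Multi-Head Attention

We focus on the Mask operation.

The first step takes the Decoder input matrix and the Mask matrix. The Mask matrix hides future information: position $i$ may attend only to positions $j \le i$, so the entries above the diagonal (the strict upper triangle) are the masked ones.
The next steps are the same as in Self-Attention: compute $Q, K, V$ from the input matrix, then compute $QK^T$.
The Mask is then applied to $QK^T$ by setting the masked positions to negative infinity, so that after Softmax their weights become 0.

Finally Softmax normalization is applied and the result is multiplied by $V$ to obtain the Attention output.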
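The masking step is easy to verify on a toy example. The sketch below (a hypothetical helper, pure Python, operating on an already computed $n \times n$ score matrix) sets future positions to negative infinity before the row-wise softmax, which drives their weights to exactly zero.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over an n x n score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may look at positions 0..i only; later ones become -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)                            # subtract the max for stability
        exps = [math.exp(s - row_max) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

With all-zero scores, row 0 puts its whole weight on itself, row 1 splits it evenly over the first two positions, and row 2 over all three.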

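The second Multi-Head Attention

The second Multi-Head Attention layer is similar to the Multi-Head Attention layer in the Encoder Layer, except that $K$ and $V$ come from the Encoder output $Z$, while $Q$ comes from the output of the first Attention layer.

That is, $K$ and $V$ are computed from the Encoder output $Z$, $Q$ is computed from the output $D$ of the previous Attention layer, and the Attention output is then computed as before.

Time complexity analysis

The time complexity of the Transformer is dominated by Self-Attention. For a sequence of length $n$, Self-Attention costs $O(n^2 \cdot d)$, where $d$ is the word-vector dimension: $QK^T$ has $n \times n$ entries, and each entry is a dot product of $d$-dimensional vectors.
So for a Transformer with $L$ Encoder and Decoder layers, the total time complexity is $O(L \cdot n^2 \cdot d)$.

Summary

That, roughly, is how the Transformer works.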

]]> blog Learning note model architecture transformer Reading SLAM-Former //blog/slam_former/ Introduction

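Over the past few days I read SLAM-Former: Putting SLAM into One Transformer, a very recent work. These notes are for my own later reference.

Like all the papers I have read before, SLAM-Former aims to recover 3D scene structure, camera poses, and related properties from a sequence of RGB images. Unlike earlier work, which involves a long and complicated pipeline, SLAM-Former boldly modifies the existing transformer architecture to make it better suited to reconstruction, and it obtains competitive results in experiments.

Model structure

SLAM-Former architecture diagram

According to the authors, the main pipeline of SLAM-Former consists of a frontend and a backend. As for the backbone, SLAM-Former is built on a Transformer that aggregates intra-frame and inter-frame information and uses task-specific heads to predict different 3D properties.
Notably, the input to this Transformer is similar to that of $\pi^3$: all input image tokens share the same set of register tokens, so the model does not depend on an unstable reference frame.

The backbone consists of $L$ layers that combine intra-frame attention and inter-frame attention to jointly capture the image content and the relationships between images.

The frontend handles incremental frame-by-frame reconstruction, and the backend handles global point-cloud alignment and camera optimization. They share one Transformer backbone.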

Front end

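Most of the figure shows the details of the frontend. When a new frame arrives, the frontend first decides whether it is a keyframe, and processes it further if so.

Given a frame sequence, the frontend maps each frame to a set of map tokens:
$$
\mathbb{F}_t = f_{fn}(\mathbb{I}_t)_{\{C_k\}_{k\in S}}
$$
Here $\{C_k\}_{k\in S}$ denotes the KV cache of the previous keyframes, $S$ is the index set of the keyframes, and $\mathbb{F}_t$ is the map token of the current frame, which serves as an implicit neural representation of that frame. A new KV cache is also produced through $C_t = Cache(f(\mathbb{F}_t))$ and is added to $\{C_k\}_{k\in S}$ when appropriate.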

Keyframe detection

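In the previous step we generated the map token for the current frame. Next we need to decide whether the frame is a keyframe.

The authors use a pose head to predict the pose of the current frame:
$$
g_t = h_{pose}(\mathbb{F}_t)
$$

When the relative pose between the current frame and the most recent keyframe exceeds a threshold, the current frame is marked as a keyframe.

However, the authors also state that keyframe detection does not rely on the KV cache. Instead they directly apply $f_{fn}(I_{k_{prev}}, T_t)$. In other words, with the KV cache the frame attends to all keyframes, whereas here it attends only to the most recent keyframe.
This improves efficiency and avoids choosing a particular reference frame. (I did not really understand how this relates to a particular reference frame.)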

Front end tracking and mapping

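Continuing from the previous step: once a new frame has been accepted as a keyframe, we reuse the full KV cache to recompute its map token and update M and S.

That is essentially the end of the frontend. The authors note that the frontend depends only on past keyframes, which makes it suitable for online tracking. This sequential processing, however, leads to error accumulation and local inconsistency. To solve this, the authors introduce a backend module for global refinement.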

Backend

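The main task of the backend is to refine all frames to achieve global consistency. Traditional SLAM systems usually rely on loop closure and bundle adjustment for this, but these methods are very costly. In contrast, the authors use a transformer-based backend for global optimization.

The authors attribute the effectiveness of this design to the full attention inside the backend transformer, whose global receptive field lets the model correct errors and enforce structural consistency.

In addition, to carry over the benefits of backend refinement, the frontend and backend share the KV cache, so the frontend benefits from the backend's global optimization.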

Training Strategy

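Unlike some earlier papers, SLAM-Former's contributions lie not only in the model architecture but also in its training strategy.

The authors want a single transformer to handle both the frontend and the backend tasks. To achieve this, they train jointly in three modes, each with its own input-output pairs.

Training modes diagram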

Training Frontend

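The frontend uses a causal mask so that each frame can access only earlier keyframes.

However, using a purely causal mask automatically makes the first frame the reference frame. The authors observe that when two or more frames are processed jointly, there is no single reference frame, which removes the requirement that later frames have poses similar to the reference frame's.

The authors therefore use full attention for the first two frames and a causal mask for all later frames. At inference time, keyframe detection then processes the last keyframe together with the current input frame, while in tracking and mapping the first two keyframes are processed jointly to determine the global coordinate frame.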
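The mask described above (full attention among the first two frames, causal attention afterwards) can be sketched at frame level as follows. This is an illustrative reconstruction, not the authors' code: the function name is hypothetical, and expanding each frame-level entry to all tokens of the two frames is an assumption.

```python
def frontend_attention_mask(num_frames, num_joint=2):
    """Frame-level mask: mask[i][j] is True when frame i may attend to frame j."""
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            causal = j <= i                            # past and current frames
            joint = i < num_joint and j < num_joint    # first frames see each other fully
            row.append(causal or joint)
        mask.append(row)
    return mask


mask = frontend_attention_mask(4)
```

With `num_joint=1` this reduces to an ordinary causal mask, in which frame 0 sees only itself and therefore acts as the implicit reference frame.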

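The authors' original wording is:

For tracking and mapping, the
first two keyframes are jointly processed to determine the
global coordinate.

Taking the first two frames seems inconsistent with the use of the full KV cache mentioned earlier in the tracking and mapping part, and I do not quite understand it.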

Training Frontend with Backend Cooperation

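To connect the frontend and the backend, the authors use mixed attention to simulate the backend and the cache-sharing process.

Specifically, mixed attention performs map refinement (backend, full attention) and processing of new data in a single unified forward pass. The frontend's causal attention does not work in isolation: it is conditioned on the KV cache, which gives an efficient frontend-backend cooperation with consistent information flow and ensures that the frontend's real-time results are immediately aligned to the global structure corrected by the backend.

$$
F = f_{fn}(I)_{C_{M}}
$$

Wow, what kind of fancy trick is this.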

Training Backend

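Finally, the authors train the backend transformer with full attention.

Joint Training

In all three modes, the 3D properties are predicted by task-specific heads:

$$
\mathbf{P}^*,\mathbf{\Sigma}^*,\mathbf{g}^*=h(\mathbf{F}).
$$

Note that, unlike other works, SLAM-Former predicts only a local pointmap for each frame, which avoids having to fix a particular world coordinate frame. This is very similar to $\pi^3$.

The remaining loss functions are fairly standard.
The three modes share weights and are trained one after another within a batch.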

Pipeline

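The pipeline is already clear from the figures and the description above, so I will not repeat it.

Experimental Setup

The model has 36 transformer layers combining frame-wise and global attention. It was trained for 10 epochs, which took 11 hours on 32 A100s. Not bad.

Results

The model achieves strong metrics on pose estimation, tracking, and reconstruction. The numbers are lengthy, so I will skip them.
What is worth mentioning is the authors' view of the relationship between the frontend and the backend.

That the backend assists the frontend is obvious, but the authors also find that the backend benefits from the frontend. Their explanation is that the backend uses implicit ordering information from the frontend, which helps it understand the relationships between frames. (I find this puzzling.)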

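Summary

In short, through its changes to the transformer architecture and its training strategy, SLAM-Former successfully delivers a single unified model for the SLAM task.

It still has limitations. The authors replace traditional loop closure and bundle adjustment with full attention, so the model struggles with very long sequences because of the computational complexity of full attention.
Also, the frontend does not support local inference, because the entire KV cache must be fed into the frontend before inference.

Something the paper does not mention: when I looked at their demo, the reconstructions showed obvious blocky artifacts. I do not yet know whether this is related to the transformer architecture.

At the time of writing, the SLAM-Former code had not been released. I look forward to it.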

]]> blog 3Dreconstruction paper reading Revisiting VGGT //blog/vggt_new/ Introduction

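After learning some fundamentals and running some experiments, I realized that my earlier reading of several classic papers had not been thorough. So I decided to reread the VGGT paper and write this post for future reference.

VGGT is a fully feed-forward neural network for multi-view reconstruction. Looking into its code, there is essentially no pipeline: images go into the network and the various 3D properties come out. According to the authors, with BA applied, the predicted quantities all reach SOTA in their respective sub-fields, which is very impressive.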

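Model structure

The VGGT backbone is a standard transformer. It takes a large number of images as input, extracts patch-wise features with DINO, processes these features with the main network (alternating frame-wise attention layers and global attention layers), and finally outputs different 3D properties through several task-specific heads.
VGGT architecture diagram
Next we go through the individual parts in detail: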

Alternating attention frame-wise layer

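According to the authors, the Alternating-Attention (AA) mechanism differs from standard transformer attention in that it lets the Transformer focus on each frame and on the whole set in alternation.

frame-wise attention layer: attention is computed only within a frame, so each patch attends only to other patches of the same frame. This helps capture the local features inside each frame.
global attention layer: attention is computed across all frames, so each patch can attend to patches of every frame. This helps capture global features across frames.

The authors use $L = 24$ AA layers and show the effectiveness of AA with an ablation study. They also state that the architecture uses no cross attention, only self attention.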

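Task-specific heads

After the backbone processes the input images we obtain a global feature representation, from which several task-specific heads output different 3D properties. Note that the DINO-encoded features are not fed into AA directly: they are augmented with an extra camera token $t_i^g \in \mathbb{R}^{1 \times C}$ and four register tokens $t_i^R \in \mathbb{R}^{4 \times C}$, and $(t_i^I, t_i^g, t_i^R)$ is the final input.

Note also that the input tokens of the first frame are $(t_1^g = t_{ini}^g, t_1^R = t_{ini}^R)$, while those of the following frames are $(t_i^g = t_{follow}^g, t_i^R = t_{follow}^R)$. That is, the camera token and register tokens of the first frame differ from those of the later frames, although the authors say all of them are learnable. This lets the model distinguish the first frame from the others and express the global point cloud and other quantities in the first camera's coordinate frame. After the AA layers, however, the camera and register tokens that started from the same initial value become frame-specific, because they interact with each frame's own image tokens in the frame-wise attention layers.

Finally, following common practice, the register tokens are discarded, and the camera tokens and image tokens are kept for prediction.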

Camera parameter head

As the backbone figure above shows, this head passes the camera token through 4 self-attention layers and then an MLP to predict each frame's camera parameters (intrinsics and extrinsics).

Dense Prediction

The output image tokens are used here to predict the depth map $D_i$, the point map $P_i$, and the tracking features $T_i$. More specifically, $\hat{t}_i^I$ is first converted by a DPT head into a dense feature map $F_i \in \mathbb{R}^{C'' \times H \times W}$, and each $F_i$ then goes through a $3 \times 3$ convolution layer to produce the corresponding depth and point maps. The DPT head also outputs a dense feature map $T_i$ for later tracking.
At the same time, VGGT outputs confidence maps $\Sigma_i^D \in \mathbb{R}_+^{H \times W}$ and $\Sigma_i^P \in \mathbb{R}_+^{H \times W}$ for the depth and point predictions. These confidences are used in the loss computation and as the conf output at inference time.

Tracking

I do not plan to go deep into this part, so I skip it for now.

Training

Loss function

The VGGT loss function has several parts:

Camera loss: supervises the camera parameters, $\mathcal{L}_{camera} = \sum_{i=1}^{N} \|\hat{g}_i - g_i\|_{\epsilon}$, using the Huber loss.
Depth loss: follows the DUSt3R loss design, $\mathcal{L}_{\mathrm{depth}}=\sum_{i=1}^N\|\Sigma_i^D\odot(\hat{D}_i-D_i)\|+\|\Sigma_i^D\odot(\nabla\hat{D}_i-\nabla D_i)\|-\alpha\log\Sigma_i^D$
Point loss: also follows the DUSt3R loss design, $\mathcal{L}_{\mathrm{point}}=\sum_{i=1}^N\|\Sigma_i^P\odot(\hat{P}_i-P_i)\|+\|\Sigma_i^P\odot(\nabla\hat{P}_i-\nabla P_i)\|-\beta\log\Sigma_i^P$
Tracking loss: supervises the quality of the tracking features. I do not plan to go into the details, so I skip it for now.

The final loss function is therefore:
$$
\mathcal{L}_{total} = \mathcal{L}_{camera} + \mathcal{L}_{depth} + \mathcal{L}_{point} + \lambda_{tracking} \mathcal{L}_{tracking}
$$

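Coordinate normalization

A reconstruction that is uniformly rescaled is equally valid, so the scale is ambiguous. To remove this ambiguity the authors normalize the data: all quantities are first expressed in the first camera's coordinate frame, then the mean Euclidean distance of all points to the origin is computed, and that scale is used to normalize the camera translations, the point-cloud coordinates, and the depth values.

Notably, the authors apply no normalization to the predictions. Instead they force the model to learn to predict the normalized values, which helps it adapt to scenes of different scales.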
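The normalization of the ground truth can be written in a few lines. The sketch below assumes that points, camera translations, and depths are already expressed in the first camera's frame and uses plain tuples; the function name and data layout are illustrative, not taken from the VGGT code.

```python
import math


def normalize_scene(points, translations, depths):
    """Rescale points, camera translations and depths by the mean point distance to the origin."""
    scale = sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)
    if scale == 0:
        raise ValueError("degenerate scene: all points sit at the origin")
    norm_points = [(x / scale, y / scale, z / scale) for x, y, z in points]
    norm_translations = [(x / scale, y / scale, z / scale) for x, y, z in translations]
    norm_depths = [d / scale for d in depths]
    return norm_points, norm_translations, norm_depths, scale


pts, trans, deps, scale = normalize_scene(
    points=[(2.0, 0.0, 0.0), (0.0, 4.0, 0.0)],
    translations=[(3.0, 0.0, 0.0)],
    depths=[6.0],
)
```

After this step the mean distance of the points to the origin is 1, so two captures of the same scene at different metric scales map to the same training target.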

Details

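I can hardly imagine the scale of the training. According to the authors, this transformer has $1.2B$ parameters and was trained on 64 A100s for 9 days. I have honestly never seen anything like it.

The number of training datasets is just as hard to imagine:
Training datasets figure
That is a bit outrageous.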

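Conclusion

VGGT's metrics essentially reach SOTA. Note, however, that the direct outputs do not: SOTA is reached only after the authors add BA optimization. BA is a costly optimization, so I think this could perhaps be improved. The authors mention the possibility of applying differentiable BA in the paper, but did not pursue it further because BA is too expensive to compute.

VGGT also shows that high-quality multi-view reconstruction does not need a complicated pipeline (I am looking at you, SLAM3R, which I am sick of modifying). Together with the recently released SLAM-Former, I think this is a very meaningful direction.

Very importantly, VGGT shows that jointly predicting multiple tasks is beneficial. Although the tasks do not supervise each other in the loss, supervising each task separately still lets the model learn a better representation.

Another important finding of VGGT is that the point cloud recovered from depth and pose is better than the directly predicted point cloud.

OK, here is the repository link:

Also, is this really possible?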

iasdf

]]> blog 3Dreconstruction paper reading Paper reading notes: reloc3r //blog/bloc3r/ Introduction

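Recently we have been trying to extend SLAM3R so that its output is not limited to point clouds but also includes pose estimates, depth maps, local localization, and so on. Broadly speaking, this feels like building a VGGT-like reconstruction structure. To learn how transformers are currently used for pose estimation, I picked a paper by a senior student in our group: Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization. This post records my understanding of the model and my personal impressions.

The paper opens with the classic critique of prior work 🤣. It divides earlier work into APR and RPR, each with its own drawbacks:

APR: absolute pose regression. It regresses the pose directly from the image. Its advantages are faster inference and higher accuracy, but the drawbacks are clear: most such methods are scene-specific and need dense point maps during training, which limits their real-world use.
RPR: relative pose regression. It estimates the relative pose of an image pair. Compared with APR it does not need dense point maps for training, but its accuracy is very poor, far below APR.

To solve these problems, the paper proposes a new symmetric and efficient network, trains it on a very large dataset, and reaches state-of-the-art performance.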

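Model structure

The model consists of two main modules: a relative pose regression network and a motion averaging module.

Relative pose regression network

As the left of the figure shows, this network is made of two identical ViT transformer branches that share weights. This removes the adverse effect of input order, greatly simplifies training, and improves computation speed and storage efficiency.

In detail, after the ViT encoder turns the images into feature sequences, they pass through a decoder with cross attention, which lets the model reason about the information in both images at once. Finally the decoder output goes through the pose regression head, which converts it into a relative rotation and a relative translation. The relative rotation is first represented as a 9-dimensional vector and then converted into a rotation matrix through SVD.

The final output of the network is therefore the pose transform of image A relative to image B and that of image B relative to image A.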

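Motion averaging module

In theory the accuracy of the first-stage network output should already be adequate, and the two relative pose transforms output by the network should be inverses of each other. Empirically the two transforms have similar accuracy, so the authors simply use a non-learned module to convert the predicted relative poses into an absolute pose.

Some details:

Rotation averaging: the model converts the multiple relative rotations for an image into absolute rotations, represents them as quaternions, and takes the median as the final absolute rotation, which improves robustness.
Camera center triangulation: because averaging or taking the median of geometric points has no direct solution, the authors instead use least squares to find the point that minimizes the sum of distances to all translation directions, and take it as the predicted camera center.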

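Loss function

The loss has two parts: a rotation loss and a translation loss. The paper expresses both as angles:
$$
\ell_R = \arccos\left(\frac{\mathrm{tr}(\hat{R}^{-1}R) - 1}{2}\right), \quad \ell_T = \arccos\left(\frac{\hat{t} \cdot t}{\|\hat{t}\|\,\|t\|}\right)
$$
The two are then added to give the total loss. This is clearly a scale-free formulation, which solves the problem of inconsistent metric scales across datasets.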
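Both angular losses are simple to evaluate by hand. The sketch below is a pure-Python illustration on nested lists (the function names are hypothetical); it uses the fact that for rotation matrices $\hat{R}^{-1} = \hat{R}^\top$, so $\mathrm{tr}(\hat{R}^{-1}R)$ equals the element-wise product sum of the two matrices.

```python
import math


def rotation_angle_loss(R_pred, R_gt):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    # tr(R_pred^T R_gt) is the sum of element-wise products of the two matrices.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))   # clamp against rounding error
    return math.acos(cos_angle)


def translation_angle_loss(t_pred, t_gt):
    """Angle (radians) between two translation directions; vector lengths do not matter."""
    dot = sum(a * b for a, b in zip(t_pred, t_gt))
    norms = math.sqrt(sum(a * a for a in t_pred)) * math.sqrt(sum(b * b for b in t_gt))
    return math.acos(max(-1.0, min(1.0, dot / norms)))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
total = rotation_angle_loss(identity, quarter_turn_z) + translation_angle_loss((1.0, 0.0, 0.0), (3.0, 0.0, 0.0))
```

The second term is zero here even though the two translations differ in length by a factor of 3, which is exactly the scale-free behaviour noted above.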

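Inference pipeline

The model's processing pipeline is roughly:

Input: a query image $I_q$ and a database of posed images $\{I_{d_n}\}$.
Retrieval: use NetVLAD to retrieve the Top-K most similar images $\{I_{d_K}\}$ for $I_q$ from the database.
Relative pose prediction: feed the $K$ image pairs $(I_q, I_{d_i})$ one by one into the relative pose regression network to get $K$ relative pose estimates (a rotation matrix and a scale-free translation direction each).
Absolute pose aggregation:
- Combine the known absolute rotations of the database images with the predicted relative rotations to get $K$ absolute rotation estimates for the query, then take the median as the final rotation $\hat{R}_q$.
- Use all valid image pairs and the estimated $\hat{R}_q$ to triangulate the camera center, solving for it with least squares, which completes the pose estimate.
Output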

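Data analysis

This is my first time writing a data analysis section 🧐, so please forgive any gaps 🥺.

Evaluation metrics

Relative pose

Relative pose results figure (RRA/RTA/mAA)

RRA@15, RTA@15, mAA@30: the accuracy of relative rotation and relative translation within a 15° threshold, and the mean average accuracy under a 30° threshold.

AUC results figure

AUC@5°/10°/20°: the area under the accuracy curve of the pose error (the larger of the rotation and translation angular errors) at thresholds of 5°/10°/20°.

Absolute pose

Median translation and rotation errors (m and degree):
Absolute pose results figure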

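Validation

The tables above show a comprehensive comparison against current state-of-the-art methods (both non-regression and regression) on six mainstream public datasets (ScanNet1500, RealEstate 10K, ACID, CO3Dv2, 7 Scenes, Cambridge Landmarks):

Relative pose estimation: on ScanNet1500, RealEstate 10K, and ACID, Reloc3r clearly outperforms all other relative pose regression (PR) methods and matches or exceeds the best non-PR methods while being orders of magnitude faster (for example, more than 50x faster than NoPoSplat on ScanNet). On CO3Dv2, Reloc3r reaches SOTA on all multi-view metrics.
Visual localization
- On 7Scenes (indoor), Reloc3r's mean error is 0.04m / 1.02°, better than all previous RPR methods evaluated on unseen scenes and comparable to APR methods that require scene-specific training.
- On Cambridge Landmarks (outdoor), Reloc3r also outperforms all RPR methods, roughly halving the mean pose error of the previous SOTA RPR method, and its mean rotation error is even better than that of all APR methods.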

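Ablation study

Ablation results figure

Symmetry
The paper also trains a relative pose regression network with two independent ViT branches. Its performance is worse than the default version.
No scale information
A model that also outputs scale information is trained as well. Its accuracy is even lower than the asymmetric variant.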

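An interesting finding

When inspecting the decoder's cross-attention maps, the paper finds that the model, without any direct supervision, spontaneously learns to establish meaningful patch-level matches between the two images (see the figure below).
Cross-attention finding figure

Limitations

The authors find that when the retrieved database images are collinear with the query image, the motion averaging module cannot recover the scale.

Summary

Reloc3r reaches SOTA with a rather simple model structure, but the price is a very large amount of training data. This seems to tell us that with enough data we can train a sufficiently strong model, and that we should build more data for SLAM3R V2 🤣.

OK, the code repository of this paper is: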

]]> blog paper reading 3d relocalization camera pose estimation Paper reading notes: Fast3R //blog/Fast3R/ Introduction

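OK, yesterday I read another paper on 3D reconstruction: Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. This post shares my understanding and findings.

In essence, Fast3R addresses the same kind of problem as SLAM3R: a limitation of the original DUSt3R, which can process only two images at a time. To handle many images, DUSt3R reconstructs them pair by pair and then aligns everything in a global coordinate frame, which is clearly an $\mathcal{O}(N^{2})$ process. Fast3R proposes a way to process many unordered images (1000+), while SLAM3R addresses reconstruction from video. The essential difference between the two seems to be whether the input image set is ordered, and the later differences in network structure follow from that.

According to the introduction, the paper makes three main contributions:

It introduces Fast3R, an end-to-end Transformer-based model that reconstructs point maps from multi-view images. According to the paper, it is significantly faster and scales well.
It shows that model performance improves as the number of training views increases. When more views are given at inference time, the per-view reconstruction accuracy also improves, and the model can handle far more views than it saw during training.
It reaches SOTA in camera pose estimation while also being very fast.

Now for the part everyone enjoys: introducing the model!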

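Model

Fast3R presents a structure diagram that looks heavy even at inference time:
Fast3R architecture diagram

Problem definition

The right side of the figure shows that Fast3R uses two heads, a Global Head and a Local Head, to process the output tokens. Fast3R therefore predicts two point maps per image: a point map $X_L$ in the local coordinate frame and a point map $X_G$ in the global coordinate frame. As a formula:
$$
\mathrm{Fast3R}:\mathbf{I}\to(\mathbf{X}_\mathrm{L},\Sigma_\mathrm{L},\mathbf{X}_\mathrm{G},\Sigma_\mathrm{G})
$$
$\Sigma_X$ denotes the confidence of point map $X$.

Note that the global coordinate frame is the frame of the first image, and the local coordinate frame is the frame of each corresponding image. (Fast3R has no notion of order, but it still needs a starting point, so one image is picked at random as the first.)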

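Training objective

Like DUSt3R, Fast3R applies the same loss to both the local and the global point maps:
$$
\mathcal{L}_\mathrm{total}=\mathcal{L}_\mathrm{X_G}+\mathcal{L}_\mathrm{X_L}
$$
From the paper, this is essentially the same as the DUSt3R loss, so I will not go into more detail.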

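Model architecture

Image Encoder

As the figure above shows, every input image goes through a weight-sharing ViT encoder that produces the corresponding token sequence $H_i = \{h_{i,j}\}_{j=1}^{HW/P^2}$, that is:
$$
H_i=\mathcal{F}(I_i),\quad i\in\{1,\dots,N\}
$$
The paper uses the same encoder as DUSt3R, CroCo ViT, but notes that DINOv2 performs similarly.

Before the tokens are passed to the fusion transformer, the authors add a one-dimensional position embedding to every token. Its purpose is to tell the model which patches come from the same image and to help it identify the first image defined above. It also lets the model implicitly reason about the camera poses reflected in the images.

Fusion Transformer

Most of the computation happens in the Fusion Transformer. The authors use a 24-layer transformer similar to ViT-L as the body of this module. It takes the tokens from all views as input and processes them with all-to-all self-attention, so the model can reason about every view at once, far beyond the two views DUSt3R can handle.

Pointmap Decoding Heads

Finally, Fast3R uses two separate DPT decoding heads to decode the Fusion Transformer output into point maps (the right part of the figure).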

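Position embedding

The paper's ultimate goal is multi-image processing where the number of images at inference time far exceeds the number used in training, so we need to consider how position embeddings are assigned to tokens at inference time.

At first the paper tried the same spherical-harmonic embedding, noting that this approach performs poorly in LLMs. Sure enough, the initial implementation also showed that when the number of input images exceeded the number used in training, the model performed badly.
The paper therefore borrows the position interpolation idea from large language models: during training, $N$ indices are drawn uniformly at random from a set $\{1,\dots,N'\}$, which forces the model to learn to handle a much wider range of indices.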
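The index-sampling trick is small enough to show directly. The sketch below draws the per-view position indices for one training sample from a pool much larger than the number of views. The function name is hypothetical, and drawing the indices without replacement is an assumption on my part; the notes only say they are sampled uniformly at random.

```python
import random


def sample_view_indices(num_views, max_index, rng=random):
    """Draw `num_views` distinct position indices uniformly from {1, ..., max_index}."""
    if num_views > max_index:
        raise ValueError("cannot draw more distinct indices than the pool contains")
    return rng.sample(range(1, max_index + 1), num_views)


rng = random.Random(0)                        # seeded only to make the example repeatable
indices = sample_view_indices(20, 1000, rng)  # e.g. 20 training views, pool of 1000 indices
```

Over many training samples every index in the pool is eventually seen, so at inference time the model can be given up to `max_index` views without encountering a position embedding it never trained on.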

For a transformer this strategy feels much the same as masking. The paper also says:

This strategy enables Fast3R to handle N = 1000 images during inference, even if only trained with N = 20 images.

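Using GPU memory efficiently

Judging from the architecture figure, this looks like a model that consumes a lot of GPU memory. The paper argues, however, that thanks to its meta-architecture the model can make broad use of parallelization and sharding techniques.
The paper says that training and inference use two different forms of parallelism together with FlashAttention, and that the model will keep benefiting as these techniques mature (obviously).

The concrete strategies used for efficient training are as follows.

First, FlashAttention improves time and memory efficiency. Even so, for N>16 a naive implementation runs out of memory even with a batch size of 1 (on 128 x A100-80GB, which is insane).
So DeepSpeed ZeRO stage 2 is used for training, which partitions the optimizer states, momentum estimates, and gradients across machines. This allows training with up to N=28 views per data sample with a batch size of 1 per GPU.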

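Model results:

Results table
Judging by the tables the paper provides, it does reach SOTA.

Inference speed is also significantly improved thanks to the various optimizations.

What I am more curious about, though, is how it compares with the contemporaneous SLAM3R. The two papers do not share a common accuracy metric. In my own local test on a very small dataset (82 ordered images), the two were close in speed, but SLAM3R's reconstruction quality was far better than Fast3R's. This fits well with SLAM3R being designed specifically for ordered image sequences, while Fast3R is a method for reconstructing from an arbitrary image set.

So when I saw that the Fast3R demo has an option for video reconstruction, it did not seem like a good fit to me. Intuitively, when people understand an environment from an unordered image set, they also roughly sort first and then reconstruct. In other words, recovering a 3D scene from an unordered image set is much harder than recovering it from video.

The paper also mentions limitations:

A lack of data containing large scenes, and hence a lack of generalization to such scenes.
No better position embedding, although the paper suggests drawing on large language models that handle very long context sequences.

OK, that is all from me on Fast3R. I think I should start reading training details and experiment sections carefully. Only ever looking at model structure feels a bit detached from practice, and I should read more code (x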

]]> blog 3Dreconstruction paper reading Paper reading notes: MASt3R //blog/mst3r/ Introduction

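After a week of inefficient work converting SLAM3R to online processing and building a visualization demo, which is now done, I finally have time to write up a paper I finished reading nearly two weeks ago: Grounding Image Matching in 3D with MASt3R.

My first impression after reading it: this is a patch on top of DUSt3R. It does not make a pioneering move like DUSt3R's application of transformers to two-view 3D reconstruction. Instead it makes a few modifications to the DUSt3R model and proposes some original methods within those modifications, so it reads like a paper about small tricks. Put differently, MASt3R observes that DUSt3R, although focused on 3D reconstruction, also reaches SOTA on pixel matching, and so it lightly modifies DUSt3R to obtain a model that is even stronger at pixel matching: MASt3R.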

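Model overview

The model structure of MASt3R is largely the same as that of DUSt3R:
MASt3R architecture diagram

Encoder

As in DUSt3R, the MASt3R encoder is a ViT, and the encoder weights are likewise shared between the two images:
$$
H^1 = Encoder(I^1) \\
H^2 = Encoder(I^2)
$$

Decoder

The MASt3R decoder also uses cross-attention, which lets MASt3R reason about the same pixel seen from different viewpoints and helps the later pixel matching.
$$
H'^1, H'^2 = Decoder(H^1, H^2)
$$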

Heads

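DUSt3R has only one head, which directly converts the decoder output into a point map and a confidence map (the gray part of the figure above).

3D Heads

This head in MASt3R is essentially the same as the DUSt3R head: it converts the decoder output into a point map and a confidence map.

Matching Heads

MASt3R adds another head dedicated to pixel matching (the blue part of the figure above). This head is a simple two-layer MLP with GELU activation, followed by normalization, and it outputs two dense feature maps:
$$
D^1 = Head_{desc}^1([H^1, H'^1]) \\
D^2 = Head_{desc}^2([H^2, H'^2])
$$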

Loss

The MASt3R loss has two parts:
$$
\mathcal{L}_{total}=\mathcal{L}_{conf}+ \beta\mathcal{L}_{match}
$$

3D Loss

The MASt3R 3D loss is essentially the same as the DUSt3R 3D loss: a point-map regression loss weighted by the predicted confidence, plus a log-confidence regularization term.
The difference lies in the regression loss. The original DUSt3R formula is:
$$
\ell_{\mathrm{regr}}(\nu,i)=\left\|\frac{1}{z}X_i^{\nu,1}-\frac{1}{\hat{z}}\hat{X}_i^{\nu,1}\right\|,
$$
MASt3R argues that in its use case scale invariance is not desirable and that consistency in absolute scale matters more, so it changes the formula to:
$$
\ell_{\mathrm{regr}}(\nu,i)=\frac{\left\|X_i^{\nu,1}-\hat{X}_i^{\nu,1}\right\|}{\hat{z}}
$$
The MASt3R 3D loss is therefore:
$$
\mathcal{L}_{\mathrm{conf}}=\sum_{\nu\in\{1,2\}}\sum_{i\in\mathcal{V}^\nu}C_i^\nu\ell_{\mathrm{regr}}(\nu,i)-\alpha\log C_i^\nu.
$$

Matching Loss

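This loss supervises the feature maps output by the Matching Head. The basic idea is that each descriptor in one image should match at most one descriptor in the other image, namely the one that represents the same 3D point.
Note that this matching is essentially a cross-entropy classification loss: the network is rewarded only when it picks the correct pixel, not a neighboring one.

In the implementation, the InfoNCE loss realizes this idea. It acts on a set of correspondences $\hat{\mathcal{M}} = \{ (i, j) \mid \hat{X}_i^{1,1} = \hat{X}_j^{2,1} \}$, with the formula:
$$
\mathcal{L}_{\mathrm{match}}=-\sum_{(i,j)\in\hat{\mathcal{M}}}\log\frac{s_\tau(i,j)}{\sum_{k\in\mathcal{P}^1}s_\tau(k,j)}+\log\frac{s_\tau(i,j)}{\sum_{k\in\mathcal{P}^2}s_\tau(i,k)}
$$
where $s_\tau(i,j)=\exp(\frac{D_i^1\cdot D_j^2}{\tau})$, $\tau$ is a temperature parameter, and $\mathcal{P}^1$ and $\mathcal{P}^2$ are the sets of all pixels in image 1 and image 2.

This strongly encourages the network to produce high-precision matches.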
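To see how the matching loss rewards only exact matches, here is a small pure-Python sketch of the formula above. The function name, the list-of-lists descriptor layout, and the default temperature are assumptions for illustration; following the description in these notes, $\mathcal{P}^1$ and $\mathcal{P}^2$ are taken to be all descriptors of each image.

```python
import math


def info_nce_matching_loss(desc1, desc2, matches, tau=0.07):
    """InfoNCE loss over ground-truth correspondences (i, j) between two descriptor sets."""
    def similarity(i, j):
        dot = sum(a * b for a, b in zip(desc1[i], desc2[j]))
        return math.exp(dot / tau)                                    # s_tau(i, j)

    loss = 0.0
    for i, j in matches:
        s_ij = similarity(i, j)
        over_image1 = sum(similarity(k, j) for k in range(len(desc1)))  # sum over P^1
        over_image2 = sum(similarity(i, k) for k in range(len(desc2)))  # sum over P^2
        loss -= math.log(s_ij / over_image1) + math.log(s_ij / over_image2)
    return loss


desc1 = [[1.0, 0.0], [0.0, 1.0]]
desc2 = [[1.0, 0.0], [0.0, 1.0]]
good = info_nce_matching_loss(desc1, desc2, [(0, 0), (1, 1)])
bad = info_nce_matching_loss(desc1, desc2, [(0, 1), (1, 0)])
```

When the labelled pairs coincide with the most similar descriptors, both softmax terms are close to 1 and the loss is near zero; labelling the swapped pairs as correct gives a much larger loss.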

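Finally, the two losses are combined into the total MASt3R loss:
$$
\mathcal{L}_{total}=\mathcal{L}_{conf}+ \beta\mathcal{L}_{match}
$$
With the model and loss above, training can proceed, but the network output still needs some processing to yield the desired matches. Note that the network outputs only the point map and a local feature per pixel, whereas what we want is pixel-level matches between the two images. The matching part is the NN module newly added in the figure.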

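Fast reciprocal matching

Given two predicted feature maps $D^1,D^2\in\mathbb{R}^{H\times W\times d}$, our goal is to extract a set of reliable pixel correspondences, namely the reciprocal nearest neighbors.

Mathematical definition:

The set of reciprocal nearest neighbors is defined by:
$$
\mathcal{M}=\{(i,j)\mid j=\mathrm{NN}_2(D_i^1)\mathrm{~~and~~}i=\mathrm{NN}_1(D_j^2)\}
$$
Here $\mathrm{NN}_A(D_j^B)$ denotes the index of the feature in feature map $D^A$ that is closest to the feature $D_j^B$. It is defined as:
$$
\mathrm{NN}_A(D_j^B)=\arg\min_i\|D_i^A-D_j^B\|
$$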

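Traditional approach

Traditionally, reciprocal nearest neighbors are computed by brute-force search, which has time complexity $O((HW)^2)$ and is not feasible for high-resolution images.

Nearest-neighbor search can be optimized, for example with K-d trees, but such optimizations usually become very inefficient in high-dimensional feature spaces. In some cases they are even orders of magnitude slower than the inference time MASt3R needs to output $D^1$ and $D^2$.

The MASt3R approach

MASt3R proposes a fast method based on subsampling.

The method starts from a sparse set of pixels in the first image. It finds the nearest neighbor in the second image for each pixel in this set, then finds the nearest neighbor in the first image for each of those, and finally checks reciprocity to obtain the final set of reciprocal nearest neighbors.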

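The whole process can be written as:
$$
U^t\mapsto[\mathrm{NN}_2(D_u^1)]_{u\in U^t}=V^t\mapsto[\mathrm{NN}_1(D_v^2)]_{v\in V^t}=U^{t+1}
$$

When $U_n^t = U_n^{t+1}$, these pixels form a closed cycle and are collected as a set of reciprocal matches $\mathcal{M}_k^t = \{ (U_n^t, V_n^t) \mid U_n^t = U_n^{t+1} \}$.
For the next iteration, the pixels that have already converged (that is, $U_n^t = U_n^{t+1}$) are filtered out, and the new $U^t$ is updated to $U^{t+1} \setminus U^t$.
This process is repeated for a fixed number of iterations, by which point most correspondences have converged to stable (reciprocal) pairs.
The final set of correspondences $\mathcal{M}$ is the union of all the reciprocal match sets: $\mathcal{M} = \bigcup_t \mathcal{M}_k^t$.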
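The iteration above can be sketched in a few lines. This is a simplified, dependency-free illustration with hypothetical names: descriptors are plain lists, pixels are flat indices, and the nearest-neighbor search is brute force, so it shows the control flow (start from sparse seeds, follow $U \to V \to U$, keep the closed cycles, continue from the rest) rather than the optimized implementation.

```python
def nearest(query, bank):
    """Index of the descriptor in `bank` closest (squared L2) to `query`."""
    return min(range(len(bank)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(query, bank[k])))


def fast_reciprocal_matches(desc1, desc2, seeds, max_iters=6):
    """Iterate U -> V -> U' from sparse seed pixels and collect converged cycles."""
    matches = set()
    current = list(dict.fromkeys(seeds))          # de-duplicate while keeping order
    for _ in range(max_iters):
        if not current:
            break
        survivors = []
        for u in current:
            v = nearest(desc1[u], desc2)          # V^t = NN_2(D^1_u)
            u_next = nearest(desc2[v], desc1)     # U^{t+1} = NN_1(D^2_v)
            if u_next == u:
                matches.add((u, v))               # cycle closed: reciprocal pair
            else:
                survivors.append(u_next)          # not converged: continue from U^{t+1}
        current = list(dict.fromkeys(survivors))
    return matches


matches = fast_reciprocal_matches([[0.0], [1.0], [5.0]], [[0.1], [4.9], [1.2]], seeds=[0, 1, 2])
late = fast_reciprocal_matches([[0.0], [3.0]], [[2.9]], seeds=[0])
```

In the second call the seed pixel 0 is not part of a reciprocal pair, but following the chain for one more step lands on pixel 1, which is. Each iteration costs one nearest-neighbor query per surviving seed in each direction, which is where the saving over matching every pixel comes from.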

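The overall complexity of this fast matching algorithm is roughly $O(kWH)$, a significant improvement over the naive method's $O((WH)^2)$.
Complexity comparison chart

The detailed proof can be found in the appendix of the paper.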

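Personal summary

That is about it for my reading of the MASt3R paper, my own understanding of MASt3R, and my understanding of how transformers are applied to 3D reconstruction. I did not read much of the experiments, because I felt they offer little novelty and mostly confirm that MASt3R reaches SOTA on each task.
In my view the main contributions of MASt3R are:

On top of DUSt3R, it adds a matching head for pixel matching. The head is simple in design but very effective.
In the 3D loss, it changes how the point-map regression loss is computed so that it better suits tasks that need consistency in absolute scale.
It proposes a fast reciprocal matching algorithm that greatly improves matching efficiency.
Overall, MASt3R is a fairly practical paper. Through small changes and innovations it brings the model to SOTA on several tasks, and it is worth learning from.

The MASt3R code is also open source:

Backlog cleared, meow, although it feels like I said nothing at all 😭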

]]> blog 3Dreconstruction paper reading Thoughts after reading VGGT //blog/VGGT/ Introduction

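After finishing the online processing for SLAM3R, I turned to this year's CVPR best paper: VGGT: Visual Geometry Grounded Transformer. Do not ask why I studied the 3R models without reading VGGT first 😂. The answer is that I was too lazy to read it at the start.

VGGT presents a powerful offline model for multi-view reconstruction, pose estimation, and point tracking. Compared with earlier reconstruction methods such as SfM and DUSt3R, its advances are:

It removes the expensive post-processing these methods depend on (which is usually not counted in the performance evaluation of earlier models).
It outputs multiple tasks at once (depth estimation, pose estimation, view reconstruction, point tracking, and so on) and even outperforms previous single-task SOTA methods.
While outputting all these results together, the authors find that introducing the inherent mathematical relationships between the different outputs as constraints greatly improves model performance.

Project architecture

Unlike earlier modular approaches, the main structure of VGGT is one large Transformer that takes a set of images as input and outputs different 3D properties of the scene.

Notably, the multi-view 3D properties it can solve for cover almost every aspect of 3D vision:

Camera poses and intrinsics
Point map reconstruction
Tracking of key regions
A depth map for each image

VGGT also goes a step further: it builds the inherent geometric relationships among the multi-task outputs into the model as an inductive bias and observes a large performance gain. This is well worth studying.

Summary

VGGT feels like one giant transformer that solves the problem by brute force. Objectively, it does demonstrate how transformers can be applied to 3D reconstruction, but I have some doubts:
Work such as natural language processing cannot be studied quantitatively, so we brought in transformers, which feels like fighting uncertainty with the unknown. But does 3D reconstruction really involve that much uncertainty?
My feeling is still that transformer results in 3D reconstruction look acceptable, but that reaching higher accuracy this way will be a confusing road.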

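]]> blog 3Dreconstruction paper reading Adding an online processing function to SLAM3R //blog/SLAM3R_online%20edit/ After I finished reading the SLAM3R paper last week, a senior student asked me to look at its source code. Having read the code, I found that although the paper says it "can reconstruct in real time", the scene_recon_pipeline function in recon.py actually first feeds all input_views into i2p_model to get res_feats, and only then feeds the tokens of all images into the l2w network for reconstruction.

This is clearly not the online processing described in the paper. So over the past week, while practicing for part three of my driving test (which I failed this morning, damn the straight-line driving 😡), I found a little time to finish recon_online.py, a change that turns the original scene_recon_pipeline into online processing.

Processing logic of the original function

Reading the code of the original function, we can divide it into the following parts:

Preprocessing & getting the tokens of all views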

# Pre-save the RGB images along with their corresponding masks
# in preparation for visualization at last.
rgb_imgs = []
for i in range(len(data_views)):
    if data_views[i]['img'].shape[0] == 1:
        data_views[i]['img'] = data_views[i]['img'][0]
    rgb_imgs.append(transform_img(dict(img=data_views[i]['img'][None]))[..., ::-1])
if 'valid_mask' not in data_views[0]:
    valid_masks = None
else:
    valid_masks = [view['valid_mask'] for view in data_views]

#preprocess data for extracting their img tokens with encoder
for view in data_views:
    view['img'] = torch.tensor(view['img'][None])
    view['true_shape'] = torch.tensor(view['true_shape'][None])
    for key in ['valid_mask', 'pts3d_cam', 'pts3d']:
        if key in view:
            del view[key]
    to_device(view, device=args.device)
# pre-extract img tokens by encoder, which can be reused
# in the following inference by both i2p and l2w models
res_shapes, res_feats, res_poses = get_img_tokens(data_views, i2p_model)  # 300+fps
print('finish pre-extracting img tokens')

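The key line is the last one, res_shapes, res_feats, res_poses = get_img_tokens(data_views, i2p_model). It uses the _encode_multiview method of i2p_model to process data_views in batches and obtain the tokens of all views.

Running inference over all views to find the best key_frame_stride

The core code here is: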

# decide the stride of sampling keyframes, as well as other related parameters
if args.keyframe_stride == -1:
    kf_stride = adapt_keyframe_stride(input_views, i2p_model,
                                      win_r=3,
                                      adapt_min=args.keyframe_adapt_min,
                                      adapt_max=args.keyframe_adapt_max,
                                      adapt_stride=args.keyframe_adapt_stride)
else:
    kf_stride = args.keyframe_stride

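Here adapt_keyframe_stride is a typical offline function. It iterates over the candidate kf_stride values across all input views, draws random samples for each candidate, uses i2p_inference_batch to obtain confidences (apparently as a measure of similarity), and picks the kf_stride with the highest score as the best value.

Building the initial global scene from the first few sliding windows & initializing the buffer set

SLAM3R initialization is a special case:

For the special case of the first frame, we run I2P repeatedly to obtain enough initial frames to serve as the buffering set

In the original offline recon.py, this is implemented as follows: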

initial_pcds, initial_confs, init_ref_id = initialize_scene(input_views[:initial_winsize*kf_stride:kf_stride],
                                                           i2p_model,
                                                           winsize=initial_winsize,
                                                           return_ref_id=True)  # 5*(1,224,224,3)

# start reconstrution of the whole scene
init_num = len(initial_pcds)
per_frame_res = dict(i2p_pcds=[], i2p_confs=[], l2w_pcds=[], l2w_confs=[])
for key in per_frame_res:
    per_frame_res[key] = [None for _ in range(num_views)]

registered_confs_mean = [_ for _ in range(num_views)]

# set up the world coordinates with the initial window
for i in range(init_num):
    per_frame_res['l2w_confs'][i*kf_stride] = initial_confs[i][0].to(args.device)  # 224,224
    registered_confs_mean[i*kf_stride] = per_frame_res['l2w_confs'][i*kf_stride].mean().cpu()

# initialize the buffering set with the initial window
assert args.buffer_size <= 0 or args.buffer_size >= init_num
buffering_set_ids = [i*kf_stride for i in range(init_num)]

# set up the world coordinates with frames in the initial window
for i in range(init_num):
    input_views[i*kf_stride]['pts3d_world'] = initial_pcds[i]

initial_valid_masks = [conf > conf_thres_i2p for conf in initial_confs]  # 1,224,224
normed_pts = normalize_views([view['pts3d_world'] for view in input_views[:init_num*kf_stride:kf_stride]],
                             initial_valid_masks)
for i in range(init_num):
    input_views[i*kf_stride]['pts3d_world'] = normed_pts[i]
    # filter out points with low confidence
    input_views[i*kf_stride]['pts3d_world'][~initial_valid_masks[i]] = 0
    per_frame_res['l2w_pcds'][i*kf_stride] = normed_pts[i]  # 224,224,3

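Here,

initial_pcds, initial_confs, init_ref_id = initialize_scene(input_views[:initial_winsize*kf_stride:kf_stride],
                                                            i2p_model,
                                                            winsize=initial_winsize,
                                                            return_ref_id=True)  # 5*(1,224,224,3)

this call reconstructs the scene from the tokens of the initial views and selects the initial init_ref_id.

After that, all the initial frames are put into the buffer set and some normalization is applied.

Reconstruct pointmaps for the original views with I2P

Here we iterate over all images again, which corresponds to reconstructing the pointmap of every view with the I2P decoder in the paper. Note that the keyframes of the initial window already received pointmaps during the initialization above, so we skip them and run I2P only on the frames that do not have a pointmap yet. As in the paper, each frame is fed in together with the other frames of its window, and the point cloud reconstructed for each frame becomes the input to the L2W model.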

for view_id in tqdm(range(num_views), desc="I2P reconstruction"):
    # skip the views in the initial window
    if view_id in buffering_set_ids:
        # trick to mark the keyframe in the initial window
        if view_id // kf_stride == init_ref_id:
            per_frame_res['i2p_pcds'][view_id] = per_frame_res['l2w_pcds'][view_id].cpu()
        else:
            per_frame_res['i2p_pcds'][view_id] = torch.zeros_like(per_frame_res['l2w_pcds'][view_id], device="cpu")
        per_frame_res['i2p_confs'][view_id] = per_frame_res['l2w_confs'][view_id].cpu()
        continue
    # construct the local window
    sel_ids = [view_id]
    for i in range(1, win_r+1):
        if view_id-i*adj_distance >= 0:
            sel_ids.append(view_id-i*adj_distance)
        if view_id+i*adj_distance < num_views:
            sel_ids.append(view_id+i*adj_distance)
    local_views = [input_views[id] for id in sel_ids]
    ref_id = 0
    # recover points in the local window, and save the keyframe points and confs
    output = i2p_inference_batch([local_views], i2p_model, ref_id=ref_id,
                                 tocpu=False, unsqueeze=False)['preds']
    # save results of the i2p model
    per_frame_res['i2p_pcds'][view_id] = output[ref_id]['pts3d'].cpu()  # 1,224,224,3
    per_frame_res['i2p_confs'][view_id] = output[ref_id]['conf'][0].cpu()  # 224,224

    # construct the input for L2W model
    input_views[view_id]['pts3d_cam'] = output[ref_id]['pts3d']  # 1,224,224,3
    valid_mask = output[ref_id]['conf'] > conf_thres_i2p  # 1,224,224
    input_views[view_id]['pts3d_cam'] = normalize_views([input_views[view_id]['pts3d_cam']],
                                                        [valid_mask])[0]
    input_views[view_id]['pts3d_cam'][~valid_mask] = 0
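The local-window construction in the loop above is easy to check in isolation. The sketch below lifts it into a standalone function; build_local_window is a hypothetical name, while the body mirrors the sel_ids logic from the snippet.

```python
def build_local_window(view_id, num_views, win_r, adj_distance):
    """Return the frame ids of the local window, with the keyframe first.

    Neighbours are taken at multiples of adj_distance on both sides and
    dropped when they fall outside [0, num_views).
    """
    sel_ids = [view_id]
    for i in range(1, win_r + 1):
        if view_id - i * adj_distance >= 0:
            sel_ids.append(view_id - i * adj_distance)
        if view_id + i * adj_distance < num_views:
            sel_ids.append(view_id + i * adj_distance)
    return sel_ids


window_at_start = build_local_window(0, 10, win_r=2, adj_distance=1)
window_in_middle = build_local_window(5, 10, win_r=2, adj_distance=2)
```

Since the keyframe always sits at index 0 of the returned list, ref_id = 0 in the original code selects it as the reference view. Windows near the start or end of the sequence are simply smaller.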

Register the non-keyframes in the initial window

The scene initialization above registered only the keyframes, so we now register the non-keyframes:

# Special treatment: register the frames within the range of initial window with L2W model
# TODO: batchify
if kf_stride > 1:
    max_conf_mean = -1
    for view_id in tqdm(range((init_num-1)*kf_stride), desc="pre-registering"):
        if view_id % kf_stride == 0:
            continue
        # construct the input for L2W model
        l2w_input_views = [input_views[view_id]] + [input_views[id] for id in buffering_set_ids]
        # (for definition of ref_ids, see the doc of l2w_model)
        output = l2w_inference(l2w_input_views, l2w_model,
                               ref_ids=list(range(1, len(l2w_input_views))),
                               device=args.device,
                               normalize=args.norm_input)

        # process the output of L2W model
        input_views[view_id]['pts3d_world'] = output[0]['pts3d_in_other_view']  # 1,224,224,3
        conf_map = output[0]['conf']  # 1,224,224
        per_frame_res['l2w_confs'][view_id] = conf_map[0]  # 224,224
        registered_confs_mean[view_id] = conf_map.mean().cpu()
        per_frame_res['l2w_pcds'][view_id] = input_views[view_id]['pts3d_world']

        if registered_confs_mean[view_id] > max_conf_mean:
            max_conf_mean = registered_confs_mean[view_id]
    print(f'finish aligning {(init_num-1)*kf_stride} head frames, with a max mean confidence of {max_conf_mean:.2f}')

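As the comment says, this is special-case handling for the frames inside the initial window.

Scaling the confs

Notice that the L2W network predicted confidences only for the non-keyframes, while the keyframes' confidences were predicted earlier by the I2P network. To keep the computational cost down, the author simply multiplies the latter by a constant factor, which roughly reflects the scene's confidence scores: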

# A problem is that the registered_confs_mean of the initial window is generated by I2P model,
# while the registered_confs_mean of the frames within the initial window is generated by L2W model,
# so there exists a gap. Here we try to align it.
max_initial_conf_mean = -1
for i in range(init_num):
    if registered_confs_mean[i*kf_stride] > max_initial_conf_mean:
        max_initial_conf_mean = registered_confs_mean[i*kf_stride]
factor = max_conf_mean/max_initial_conf_mean
# print(f'align register confidence with a factor {factor}')
for i in range(init_num):
    per_frame_res['l2w_confs'][i*kf_stride] *= factor
    registered_confs_mean[i*kf_stride] = per_frame_res['l2w_confs'][i*kf_stride].mean().cpu()
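To see what this alignment does numerically, here is a minimal sketch on plain floats. align_initial_confs is a hypothetical helper; it works on per-frame mean confidences only, which is equivalent to scaling the full confidence maps because the mean is linear.

```python
def align_initial_confs(conf_means, keyframe_ids, other_ids):
    """Rescale I2P-based keyframe confidences to the range of the L2W-based ones.

    conf_means:   per-frame mean confidences
    keyframe_ids: frames whose confidence came from I2P (initial keyframes)
    other_ids:    frames whose confidence came from L2W (pre-registered frames)
    """
    max_initial = max(conf_means[i] for i in keyframe_ids)
    max_registered = max(conf_means[i] for i in other_ids)
    factor = max_registered / max_initial
    aligned = list(conf_means)
    for i in keyframe_ids:
        aligned[i] = conf_means[i] * factor
    return aligned, factor


aligned, factor = align_initial_confs([4.0, 1.0, 2.0, 2.0],
                                      keyframe_ids=[0, 3], other_ids=[1, 2])
```

After the rescaling, the best keyframe and the best pre-registered frame share the same mean confidence, so later comparisons between the two groups are at least on a common scale.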

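Register the remaining views

OK, after all the special handling of the initial frames above, we finally reach the main path: processing each frame as it arrives.

Select the sel_num most similar frames from the buffer set: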

# select scene frames in the buffering set to work as a global reference
cand_ref_ids = buffering_set_ids
ref_views, sel_pool_ids = scene_frame_retrieve(
    [input_views[i] for i in cand_ref_ids],
    input_views[ni:ni+num_register:2],
    i2p_model, sel_num=num_scene_frame,
    # cand_recon_confs=[per_frame_res['l2w_confs'][i] for i in cand_ref_ids],
    depth=2)

As described in the paper, the first 2 decoder blocks of i2p_model are used for similarity scoring.
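Once each buffered frame has a similarity score, the selection itself is a plain top-k. The sketch below assumes the scores are already computed; select_top_k and its arguments are hypothetical names, and in the real code the scoring happens inside scene_frame_retrieve.

```python
def select_top_k(candidate_ids, scores, k):
    """Return the ids of the k candidates with the highest scores, best first."""
    ranked = sorted(zip(scores, candidate_ids), key=lambda pair: pair[0], reverse=True)
    return [cand_id for _, cand_id in ranked[:k]]


# Four buffered frames with made-up similarity scores.
ref_ids = select_top_k([0, 8, 16, 24], [0.2, 0.9, 0.5, 0.7], k=2)
```

Running only the first few decoder blocks keeps this scoring cheap compared with a full I2P forward pass, which matters because it runs against the whole buffer set for every batch of new frames.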

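Use the selected most similar frames as references, together with the current frames, for L2W reconstruction

This one is straightforward. In short: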

# register the source frames in the local coordinates to the world coordinates with L2W model
l2w_input_views = ref_views + input_views[ni:max_id+1]
input_view_num = len(ref_views) + max_id - ni + 1
assert input_view_num == len(l2w_input_views)

output = l2w_inference(l2w_input_views, l2w_model,
                       ref_ids=list(range(len(ref_views))),
                       device=args.device,
                       normalize=args.norm_input)

# process the output of L2W model
src_ids_local = [id+len(ref_views) for id in range(max_id-ni+1)]  # the ids of src views in the local window
src_ids_global = [id for id in range(ni, max_id+1)]  # the ids of src views in the whole dataset
succ_num = 0
for id in range(len(src_ids_global)):
    output_id = src_ids_local[id]  # the id of the output in the output list
    view_id = src_ids_global[id]  # the id of the view in all views
    conf_map = output[output_id]['conf']  # 1,224,224
    input_views[view_id]['pts3d_world'] = output[output_id]['pts3d_in_other_view']  # 1,224,224,3
    per_frame_res['l2w_confs'][view_id] = conf_map[0]
    registered_confs_mean[view_id] = conf_map[0].mean().cpu()
    per_frame_res['l2w_pcds'][view_id] = input_views[view_id]['pts3d_world']
    succ_num += 1

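Note that there is still room for improvement here: the reference frames could be refined based on the output of l2w_model.

Update the buffer set

The buffer set is chosen much as described in the paper. It is essentially random selection.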

# update the buffering set
if next_register_id - milestone >= update_buffer_intv:
    while next_register_id - milestone >= kf_stride:
        candi_frame_id += 1
        full_flag = max_buffer_size > 0 and len(buffering_set_ids) >= max_buffer_size
        insert_flag = (not full_flag) or ((strategy == 'fifo') or
                                          (strategy == 'reservoir' and np.random.rand() < max_buffer_size/candi_frame_id))
        if not insert_flag:
            milestone += kf_stride
            continue
        # Use offset to ensure the selected view is not too close to the last selected view
        # If the last selected view is 0,
        # the next selected view should be at least kf_stride*3//4 frames away
        start_ids_offset = max(0, buffering_set_ids[-1]+kf_stride*3//4 - milestone)

        # get the mean confidence of the candidate views
        mean_cand_recon_confs = torch.stack([registered_confs_mean[i]
                                             for i in range(milestone+start_ids_offset, milestone+kf_stride)])
        mean_cand_local_confs = torch.stack([local_confs_mean[i]
                                             for i in range(milestone+start_ids_offset, milestone+kf_stride)])
        # normalize the confidence to [0,1], to avoid overconfidence
        mean_cand_recon_confs = (mean_cand_recon_confs - 1)/mean_cand_recon_confs  # transform to sigmoid
        mean_cand_local_confs = (mean_cand_local_confs - 1)/mean_cand_local_confs
        # the final confidence is the product of the two kinds of confidences
        mean_cand_confs = mean_cand_recon_confs*mean_cand_local_confs

        most_conf_id = mean_cand_confs.argmax().item()
        most_conf_id += start_ids_offset
        id_to_buffer = milestone + most_conf_id
        buffering_set_ids.append(id_to_buffer)
        # print(f"add ref view {id_to_buffer}")
        # since we have inserted a new frame, overflow must happen when full_flag is True
        if full_flag:
            if strategy == 'reservoir':
                buffering_set_ids.pop(np.random.randint(max_buffer_size))
            elif strategy == 'fifo':
                buffering_set_ids.pop(0)
        # print(next_register_id, buffering_set_ids)
        milestone += kf_stride
# transfer the data to cpu if it is not in the buffering set, to save gpu memory
for i in range(next_register_id):
    to_device(input_views[i], device=args.device if i in buffering_set_ids else 'cpu')
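Stripped of the confidence-based choice of which frame to insert, the buffer update above is a bounded buffer with either FIFO or reservoir replacement. A minimal standard-library sketch of just that part follows; update_buffer and its parameter names are hypothetical, and the candidate id is assumed to be chosen already.

```python
import random


def update_buffer(buffer_ids, new_id, seen_count, max_size,
                  strategy="reservoir", rng=random):
    """Offer new_id, the seen_count-th candidate so far, to a bounded buffer.

    Returns a new list; the input list is left untouched.
    max_size <= 0 means the buffer is unbounded.
    """
    if strategy not in ("reservoir", "fifo"):
        raise ValueError(f"unknown strategy: {strategy}")
    buffer_ids = list(buffer_ids)
    full = max_size > 0 and len(buffer_ids) >= max_size
    # Reservoir sampling: once full, accept with probability max_size / seen_count.
    if full and strategy == "reservoir" and rng.random() >= max_size / seen_count:
        return buffer_ids  # candidate rejected, buffer unchanged
    buffer_ids.append(new_id)
    if full:
        if strategy == "reservoir":
            buffer_ids.pop(rng.randrange(max_size))  # evict a random old entry
        else:
            buffer_ids.pop(0)  # evict the oldest entry
    return buffer_ids


fifo_result = update_buffer([0, 4, 8], 12, seen_count=4, max_size=3, strategy="fifo")
```

With the reservoir strategy, every candidate seen so far stays in the buffer with roughly equal probability, so old parts of the scene remain represented. FIFO only keeps the most recent keyframes.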

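Saving

Once all frames have been processed, we save the point clouds of every frame and merge them into the final scene point cloud.

Review

Clearly, the pipeline in the original recon.py is a fully offline method, so I wrote a truly (?) online version. Its processing logic is as follows:

Processing logic of the online function

Since it has to be online, the first thing to write is obviously:

for i in range(len(data_views)):

After that, we perform a series of processing steps:

Preprocess & get the token of the current view

From the analysis of the original offline function, this step does not depend on initialization, so we can safely run it on every view we iterate over: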

# Pre-save the RGB images along with their corresponding masks
# in preparation for visualization at last.

if data_views[i]['img'].shape[0] == 1:
    data_views[i]['img'] = data_views[i]['img'][0]
rgb_imgs.append(transform_img(dict(img=data_views[i]['img'][None]))[..., ::-1])

if is_have_mask_rgb:
    valid_masks.append(data_views[i]['valid_mask'])

# process the current image for extracting its img token with encoder
data_views[i]['img'] = torch.tensor(data_views[i]['img'][None])
data_views[i]['true_shape'] = torch.tensor(data_views[i]['true_shape'][None])
for key in ['valid_mask', 'pts3d_cam', 'pts3d']:
    if key in data_views[i]:
        # delete from the current view's dict, not from the list of views
        del data_views[i][key]
to_device(data_views[i], device=args.device)

# pre-extract img tokens by encoder, which can be reused
# in the following inference by both i2p and l2w models
temp_shape, temp_feat, temp_pose = get_single_img_tokens([data_views[i]], i2p_model, True)
res_shapes.append(temp_shape[0])
res_feats.append(temp_feat[0])
res_poses.append(temp_pose[0])
print(f"finish pre-extracting img token of view {i}")

input_views.append(dict(label=data_views[i]['label'],
                        img_tokens=temp_feat[0],
                        true_shape=data_views[i]['true_shape'],
                        img_pos=temp_pose[0]))
for key in per_frame_res:
    per_frame_res[key].append(None)
registered_confs_mean.append(i)

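Here I use a get_single_img_tokens function. Compared with the earlier get_img_tokens, it cannot batch (a constraint of the online setting), but its output is otherwise identical.

Accumulate frames for scene initialization

Note that while the frame index is smaller than the number of frames needed for initialization, none of the later steps can run, so my code simply skips ahead and waits 🤣

Once enough frames for scene initialization have been accumulated, the function initializes the scene and the buffer set, and normalizes the point clouds of the initialized frames: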

# accumulate the initial window frames
if i < (initial_winsize - 1)*kf_stride and i % kf_stride == 0:
    continue
elif i == (initial_winsize - 1)*kf_stride:
    initial_pcds, initial_confs, init_ref_id = initialize_scene(input_views[:initial_winsize*kf_stride:kf_stride],
                                                                i2p_model,
                                                                winsize=initial_winsize,
                                                                return_ref_id=True)
    # set up the world coordinates with the initial window
    init_num = len(initial_pcds)
    for j in range(init_num):
        per_frame_res['l2w_confs'][j * kf_stride] = initial_confs[j][0].to(args.device)
        registered_confs_mean[j * kf_stride] = per_frame_res['l2w_confs'][j * kf_stride].mean().cpu()
    # initialize the buffering set with the initial window
    assert args.buffer_size <= 0 or args.buffer_size >= init_num
    buffering_set_ids = [j*kf_stride for j in range(init_num)]
    # set up the world coordinates with frames in the initial window
    for j in range(init_num):
        input_views[j*kf_stride]['pts3d_world'] = initial_pcds[j]
    initial_valid_masks = [conf > conf_thres_i2p for conf in initial_confs]
    normed_pts = normalize_views([view['pts3d_world'] for view in input_views[:init_num*kf_stride:kf_stride]],
                                 initial_valid_masks)
    for j in range(init_num):
        input_views[j*kf_stride]['pts3d_world'] = normed_pts[j]
        # filter out points with low confidence
        input_views[j*kf_stride]['pts3d_world'][~initial_valid_masks[j]] = 0
        per_frame_res['l2w_pcds'][j*kf_stride] = normed_pts[j]

elif i < (initial_winsize - 1) * kf_stride:
    continue

Note that once enough initial frames have been accumulated, we no longer hit continue here and instead proceed directly to the next part.
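The branching above effectively splits the stream into three phases, keyed on the frame index. A small sketch makes the boundaries explicit; frame_phase and the phase labels are hypothetical names used only for illustration.

```python
def frame_phase(i, initial_winsize, kf_stride):
    """Classify frame index i into the phase the online loop is in."""
    init_frame = (initial_winsize - 1) * kf_stride  # index of the last initial keyframe
    if i < init_frame:
        return "accumulate"   # not enough frames yet, just store the tokens
    if i == init_frame:
        return "initialize"   # build the initial scene and buffer set
    return "register"         # normal per-frame processing


phases = [frame_phase(i, initial_winsize=3, kf_stride=2) for i in range(6)]
```

With initial_winsize=3 and kf_stride=2, the initial keyframes are frames 0, 2 and 4, so initialization fires exactly when frame 4 arrives.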

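Reconstruct pointmaps with I2P for the accumulated views (including the frame being processed) & register the non-keyframes of the initial window

Here we follow an order similar to the offline version. Only the outward form changes; the underlying order of operations is essentially the same: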

# first recover the accumulated views
if i == (initial_winsize - 1) * kf_stride:
    for view_id in range(i + 1):
        # skip the views in the initial window
        if view_id in buffering_set_ids:
            # trick to mark the keyframe in the initial window
            if view_id // kf_stride == init_ref_id:
                per_frame_res['i2p_pcds'][view_id] = per_frame_res['l2w_pcds'][view_id].cpu()
            else:
                per_frame_res['i2p_pcds'][view_id] = torch.zeros_like(per_frame_res['l2w_pcds'][view_id], device="cpu")
            per_frame_res['i2p_confs'][view_id] = per_frame_res['l2w_confs'][view_id].cpu()
            print(f"finish recovering pcd of frame {view_id} in their local coordinates (in buffer set), with a mean confidence of {per_frame_res['i2p_confs'][view_id].mean():.2f} up to now.")
            continue
        # construct the local window with the initial views
        sel_ids = [view_id]
        for j in range(1, win_r + 1):
            if view_id - j * adj_distance >= 0:
                sel_ids.append(view_id - j * adj_distance)
            if view_id + j * adj_distance < i:
                sel_ids.append(view_id + j * adj_distance)
        local_views = [input_views[id] for id in sel_ids]
        ref_id = 0

        # recover points in the initial window, and save the keyframe points and confs
        output = i2p_inference_batch([local_views], i2p_model, ref_id=ref_id,
                                     tocpu=False, unsqueeze=False)['preds']
        # save results of the i2p model for the initial window
        per_frame_res['i2p_pcds'][view_id] = output[ref_id]['pts3d'].cpu()
        per_frame_res['i2p_confs'][view_id] = output[ref_id]['conf'][0].cpu()

        # construct the input for L2W model
        input_views[view_id]['pts3d_cam'] = output[ref_id]['pts3d']
        valid_mask = output[ref_id]['conf'] > conf_thres_i2p
        input_views[view_id]['pts3d_cam'] = normalize_views([input_views[view_id]['pts3d_cam']],
                                                            [valid_mask])[0]
        input_views[view_id]['pts3d_cam'][~valid_mask] = 0

        local_confs_mean_up2now = [conf.mean() for conf in per_frame_res['i2p_confs'] if conf is not None]
        print(f"finish recovering pcd of frame {view_id} in their local coordinates, with a mean confidence of {torch.stack(local_confs_mean_up2now).mean():.2f} up to now.")

    # Special treatment: register the frames within the range of initial window with L2W model
    if kf_stride > 1:
        max_conf_mean = -1
        for view_id in tqdm(range((init_num - 1) * kf_stride), desc="pre-registering"):
            if view_id % kf_stride == 0:
                continue
            # construct the input for L2W model
            l2w_input_views = [input_views[view_id]] + [input_views[id] for id in buffering_set_ids]
            # (for definition of ref_ids, see the doc of l2w_model)
            output = l2w_inference(l2w_input_views, l2w_model,
                                   ref_ids=list(range(1, len(l2w_input_views))),
                                   device=args.device,
                                   normalize=args.norm_input)
            # process the output of L2W model
            input_views[view_id]['pts3d_world'] = output[0]['pts3d_in_other_view']  # 1,224,224,3
            conf_map = output[0]['conf']  # 1,224,224
            per_frame_res['l2w_confs'][view_id] = conf_map[0]  # 224,224
            registered_confs_mean[view_id] = conf_map.mean().cpu()
            per_frame_res['l2w_pcds'][view_id] = input_views[view_id]['pts3d_world']

            if registered_confs_mean[view_id] > max_conf_mean:
                max_conf_mean = registered_confs_mean[view_id]
        print(f'finish aligning {(init_num)*kf_stride} head frames, with a max mean confidence of {max_conf_mean:.2f}')
        # A problem is that the registered_confs_mean of the initial window is generated by I2P model,
        # while the registered_confs_mean of the frames within the initial window is generated by L2W model,
        # so there exists a gap. Here we try to align it.
        # (j is used as the index so that the outer loop variable i is not shadowed)
        max_initial_conf_mean = -1
        for j in range(init_num):
            if registered_confs_mean[j*kf_stride] > max_initial_conf_mean:
                max_initial_conf_mean = registered_confs_mean[j*kf_stride]
        factor = max_conf_mean/max_initial_conf_mean
        # print(f'align register confidence with a factor {factor}')
        for j in range(init_num):
            per_frame_res['l2w_confs'][j*kf_stride] *= factor
            registered_confs_mean[j*kf_stride] = per_frame_res['l2w_confs'][j*kf_stride].mean().cpu()
    # register the rest frames with L2W model
    next_register_id = (init_num - 1) * kf_stride + 1
    milestone = init_num * kf_stride + 1
    update_buffer_intv = kf_stride*args.update_buffer_intv  # update the buffering set every update_buffer_intv frames
    max_buffer_size = args.buffer_size
    strategy = args.buffer_strategy
    candi_frame_id = len(buffering_set_ids)  # used for the reservoir sampling strategy
    continue

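After all of this is done, we continue straight to the next loop iteration.

Processing new images

In the next iteration we receive a new image. At this point the online function is also on its main path and can process each frame in real time.

The processing logic is similar to the first method, except that I process the frames one at a time.

Saving

Slightly differently from the previous method, I provide options for saving online, once every few frames, so I wrote a class for incremental saving: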

import numpy as np
from os.path import join
# to_numpy and save_ply are assumed to be imported from the project's utilities


class IncrementalReconstructor:
    """
    A class used for reconstructing the points incrementally
    """
    def __init__(self):
        self.res_pcds = None
        self.res_rgbs = None
        self.res_confs = None
        self.res_valid_masks = None
        self.is_initialized = False

    def add_frame(self, view: dict, img: np.ndarray, conf: np.ndarray = None, valid_mask: np.ndarray = None):
        """
        Incrementally add a new frame of view data.

        Args:
            view (dict): a dictionary for a new view
            img (np.ndarray): rgb_img
            conf (np.ndarray, optional): per-pixel confidence of the view
            valid_mask (np.ndarray, optional): per-pixel validity mask of the view
        """
        try:
            new_pcd = to_numpy(view['pts3d_world']).reshape(-1, 3)
            new_rgb = to_numpy(img).reshape(-1, 3)
        except KeyError:
            print("Warning: 'pts3d_world' not found in the new view. Frame skipped.")
            return
        if not self.is_initialized:
            self.res_pcds = new_pcd
            self.res_rgbs = new_rgb
            if conf is not None:
                self.res_confs = to_numpy(conf).reshape(-1)
            if valid_mask is not None:
                self.res_valid_masks = to_numpy(valid_mask).reshape(-1)
            self.is_initialized = True
        else:
            self.res_pcds = np.concatenate([self.res_pcds, new_pcd], axis=0)
            self.res_rgbs = np.concatenate([self.res_rgbs, new_rgb], axis=0)
            if conf is not None:
                new_conf = to_numpy(conf).reshape(-1)
                self.res_confs = np.concatenate([self.res_confs, new_conf], axis=0)
            if valid_mask is not None:
                new_mask = to_numpy(valid_mask).reshape(-1)
                self.res_valid_masks = np.concatenate([self.res_valid_masks, new_mask], axis=0)

    def save_snapshot(self, snapshot_id: int, save_dir: str, num_points_save: int = 200000, conf_thres_res: float = 3.0):
        """
        Filter the accumulated points, subsample them, and save a .ply snapshot
        """
        if not self.is_initialized:
            print("Warning: Reconstructor not initialized. Nothing to save.")
            return
        save_name = f"recon_snapshot_{snapshot_id:05d}.ply"
        pts_count = len(self.res_pcds)
        final_valid_mask = np.ones(pts_count, dtype=bool)

        if self.res_valid_masks is not None:
            final_valid_mask &= self.res_valid_masks

        if self.res_confs is not None:
            conf_masks = self.res_confs > conf_thres_res
            final_valid_mask &= conf_masks

        valid_ids = np.where(final_valid_mask)[0]

        if len(valid_ids) == 0:
            print(f"Warning for snapshot {snapshot_id}: No valid points left after filtering.")
            return

        print(f'Snapshot {snapshot_id}: Ratio of points filtered out: {(1. - len(valid_ids) / pts_count) * 100:.2f}%')
        n_samples = min(num_points_save, len(valid_ids))
        print(f"Snapshot {snapshot_id}: Resampling {n_samples} points from {len(valid_ids)} valid points.")
        sampled_idx = np.random.choice(valid_ids, n_samples, replace=False)
        sampled_pts = self.res_pcds[sampled_idx]
        sampled_rgbs = self.res_rgbs[sampled_idx]
        save_path = join(save_dir, save_name)
        print(f"Saving reconstruction snapshot to {save_path}")
        save_ply(points=sampled_pts, save_path=save_path, colors=sampled_rgbs)

It is called at the end of each loop iteration:

reconstructor.add_frame(
    view=input_views[i],
    img=rgb_imgs[i],
    conf=per_frame_res['l2w_confs'][i],
    # pass only the current frame's mask, not the whole list of masks
    valid_mask=valid_masks[i] if is_have_mask_rgb else None
)
if args.save_online:
    if (i + 1) % args.save_frequency == 0:
        reconstructor.save_snapshot(
            snapshot_id=i + 1,
            save_dir=save_dir,
            num_points_save=num_points_save,
            conf_thres_res=conf_thres_l2w
        )

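OK, that completes my explanation of the original processing logic and my introduction to the new online processing logic. To be honest, the online logic is not all that complicated, but I lost too much time to driving lessons these past few days and did not get much done (x

Another filler blog post 😋

The new repository: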

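]]> blog 3Dreconstruction coding

Thoughts after reading SLAM3R //blog/SLAM3R/

Over the past few days I finished reading the SLAM3R paper, a CVPR 2025 Highlight paper and the third paper I have read in the 3R direction.

The paper introduces SLAM3R, a system for real-time reconstruction from video. It seems to take its inspiration from DUSt3R. The difference is that DUSt3R reconstructs 3D pointmaps from two images and works offline, whereas SLAM3R reconstructs online, in real time, from a monocular video, and is far more efficient than some earlier methods.

Main modules of SLAM3R

SLAM3R consists mainly of two modules, I2P and L2W. I2P (Image-to-Points) reconstructs pointmaps from keyframes in the video, and L2W (Local-to-World) uses those pointmaps to build the global pointmap incrementally. The structure is as follows:

[Figure: SLAM3R architecture]

Video preprocessing

First, SLAM3R uses a sliding window to split the video into short clips, which are then fed into I2P.

The I2P network

The I2P module takes a clip produced by the preprocessing step, consisting of frames $\{F_i\}, i = 1, \dots, N$. We usually pick the middle frame as the keyframe $F_{key}$ and feed the remaining $N - 1$ frames into I2P as supporting frames.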
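The windowing and keyframe choice described above can be sketched as follows. This is an illustration only: sliding_windows and step are hypothetical, and the handling of the first and last few frames, where a full window does not fit, is left out.

```python
def sliding_windows(num_frames, window_size, step=1):
    """Split frame ids into overlapping windows with the middle frame as keyframe.

    Returns a list of (key_id, supporting_ids) tuples.
    """
    windows = []
    for start in range(0, num_frames - window_size + 1, step):
        frame_ids = list(range(start, start + window_size))
        key_id = frame_ids[window_size // 2]  # middle frame of the window
        supporting_ids = [f for f in frame_ids if f != key_id]
        windows.append((key_id, supporting_ids))
    return windows


clips = sliding_windows(num_frames=6, window_size=5)
```

Choosing the middle frame means the keyframe has supporting views on both sides, which gives it the most overlap with the rest of the clip.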

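First, all frames pass through $E_{img}$, which is made up of $m$ ViT encoder blocks, to produce the corresponding tokens, and the tokens are then decoded. Specifically, the keyframe tokens go into a specially designed decoder $D_{key}$ (as shown in the figure below), while the remaining $N - 1$ supporting frames share a single decoder (inherited from DUSt3R and made up of $n$ ViT decoder blocks), each producing its own $G_{sup_i}$.

Then, with a method similar to the one in DUSt3R, these frames (the keyframe in particular) are turned into the most confident 3D reconstruction, which gives the pointmap $\hat{X}_{key}$ for the clip.

The L2W network

This module takes the $\hat{X}_{key}$ produced by the I2P module as input. Because it is an online method, it introduces a key component, the buffering set.

First, from the keyframe pointmaps that have already been processed, we use a reservoir strategy to select $B$ registered frames as the buffering set (for the special case of the first frame, I2P is run several times to obtain enough initial frames for the buffering set). Then, whenever a new frame arrives, a retrieval module (built from decoder blocks of I2P) matches its features against the buffering set by similarity. We select the $K$ keyframe pointmaps with the highest scores and use them, together with the new keyframe's pointmap, as the input to this module:
$$
\hat{X}_i^{(H \times W \times 3)},\quad i = 1, \dots, K + 1.
$$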

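As shown in the figure above, we feed these $K + 1$ pointmaps into the encoder $E_{pts}$ of the L2W module:
$$
\mathcal{P}_i^{(T\times d)}=E_{pts}(\hat{X}_i^{(H\times W\times3)}),\quad i=1,\dots,K+1.
$$
Then, since we cannot actually model the scene from pointmap information alone (for example, two different planes with the same texture, or different patches of ground), we fuse these features with the features from the I2P network:
$$
\mathcal{F}_i^{(T\times d)}=F_i^{(T\times d)}+\mathcal{P}_i^{(T\times d)},\quad i=1,\dots,K+1.
$$
After this step we have a joint position and appearance feature sequence for each pointmap.

Next, the features of these $K + 1$ pointmaps are fed into two decoders: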

Registration Decoder

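The Registration Decoder takes all tokens as input. Its goal is to transform the reconstruction of the keyframe into the scene coordinate system, and it uses the same architecture as $D_{key}$.

The decoding step is roughly:
$$
\mathcal{G}_{key}=D_{reg}(\mathcal{F}_{key},\mathcal{F}_{sce})
$$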

Scene Decoder

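The Scene Decoder also takes all tokens as input, but its goal is to refine the scene geometry without changing the scene coordinate system. It also uses the same architecture as $D_{key}$, but it is applied to each of the selected keyframe pointmaps:
$$
\mathcal{G}_{sce_i}=D_{sce}(\mathcal{F}_{sce_i},\mathcal{F}_{key}),\quad i=1,\dots,K
$$
In this way the pointmaps that have already been generated are refined.

Finally, as in the I2P module, we reconstruct pointmaps from the tokens of all the keyframes:
$$
\tilde{X}_i^{(H\times W\times3)},\tilde{C}_i^{(H\times W\times1)}=\mathrm{H}(\mathcal{G}_i^{(T\times d)}),\quad i=1,\dots,K+1.
$$

This gives a real-time 3D representation.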

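Conclusion

I have not gone very deep into this area yet, but the efficiency shown in the paper's final comparison with other systems really impressed me, and the system's two modules feel clean and pleasant. I will read other 3R papers to better appreciate how strong this SOTA result is 😋

GitHub project:

Meow, another fulfilling day 🥳. My understanding may be off (not really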

]]> blog 3Dreconstruction paper reading

Celebrate and Introduce My First Page //blog/celebrate/

Here I have built my first website (not literally the first, but the first one I am serious about building and running) 😋.

My website will include:

Course study experiences

This kind of content will record my experiences with some meaningful courses at PKU. I hope it will help me review my courses.

Research experiences

As a college student, research will be my main task in the future. Currently I am interested in 3R (3D reconstruction), so I may post a lot of reflections on individual papers.

My own projects

Of course, some of my great (by my own standard) projects will be posted on the site. A project is meaningful to me as long as I think it is great, regardless of how others see it.

…

The above are likely to be the main topics of the site.

Additions

The posts will be in Chinese or English at random (probably Chinese most of the time 🤣). Please forgive my poor English.

]]> blog Introduction