Fast Sampling Through The Reuse Of Attention Maps In Diffusion Models (2024)

Rosco Hunter, Łukasz Dudziak, Mohamed S. Abdelfattah, Abhinav Mehrotra, Sourav Bhattacharya, Hongkai Wen

Affiliations: University of Warwick, UK; Samsung AI Centre Cambridge, UK; Cornell University, USA

Abstract

Text-to-image diffusion models have demonstrated unprecedented capabilities for flexible and realistic image synthesis. Nevertheless, these models rely on a time-consuming sampling procedure, which has motivated attempts to reduce their latency. When improving efficiency, researchers often use the original diffusion model to train an additional network designed specifically for fast image generation. In contrast, our approach seeks to reduce latency directly, without any retraining, fine-tuning, or knowledge distillation. In particular, we find the repeated calculation of attention maps to be costly yet redundant, and instead suggest reusing them during sampling. Our specific reuse strategies are based on ODE theory, which implies that the later a map is reused, the smaller the distortion in the final image. We empirically compare these reuse strategies with few-step sampling procedures of comparable latency, finding that reuse generates images that are closer to those produced by the original high-latency diffusion model.

keywords:

Efficient Sampling, Attention Reuse, Diffusion Models

[Figure 1: Samples generated by a 20-step DDIM sampler (top row), the same 20-step DDIM sampler with 10 reuse steps (middle row), and a 13-step DDIM sampler (bottom row).]

1 Introduction

Diffusion probabilistic models (DPMs) have become increasingly popular for text-conditioned image generation [1, 2, 3, 4]. While DPMs can generate images of unprecedented quality, they require considerable amounts of time in order to do so, motivating researchers to improve their efficiency. Currently, there are two main approaches for improving DPM efficiency: (1) decrease the number of calls to the U-Net (the deep neural network that powers DPMs), and (2) decrease the cost of calling the U-Net. This is typically facilitated by knowledge distillation or alterations to the U-Net's training objective. Notably, there has been minimal work on methods to improve a DPM's latency without any retraining, fine-tuning, or knowledge distillation.

Our paper focuses on this gap by directly removing an expensive yet redundant aspect of the sampling procedure: the repeated calculation of attention maps. Specifically, instead of recalculating attention maps from the key-query pairs at each step, the most recently calculated attention maps are stored in memory and can be reused during the sampling procedure. The main contribution of this paper is identifying and examining the reuse strategies that produce the smallest distortions to the original image. In particular:

  1. We locate a heuristic reuse strategy by analysing error propagation in the reverse diffusion process.

  2. We adapt this heuristic strategy to account for dependencies between different steps of the sampling procedure that the heuristic neglects.

  3. We show that our reuse strategies outperform step-reduced samplers of comparable latency.

2 Related Work

Sampling from a Diffusion Model. The forward diffusion process iteratively injects noise into an input image in order to transform it into a standard Gaussian. In the reverse direction, de-noising involves repeatedly invoking a decoder, which is typically trained to predict a score function, $\nabla_{\boldsymbol{x}}\log q_t(\boldsymbol{x}_t)$. This is either learned directly, $\boldsymbol{s}_\theta(\boldsymbol{x}_t, t)$, or indirectly via error estimation, $\boldsymbol{\epsilon}_\theta(\boldsymbol{x}_t, t)$. Once the network is trained, the model must be paired with a sampling procedure that uses information from the decoder to reproduce the original image from Gaussian noise, potentially with the aid of a prompt. The reverse diffusion process is often modelled by the following ODE [5]:

$$\frac{d\boldsymbol{x}_t}{dt} = f(t)\,\boldsymbol{x}_t - \frac{1}{2}\,g^2(t)\,\nabla_{\boldsymbol{x}}\log q_t(\boldsymbol{x}_t), \tag{1}$$

where $f(t)$ and $g(t)$ are determined by the noise schedule. Successfully solving this equation corresponds to the DPM sampling a realistic image.

Decreasing the Number of U-Net Calls. A DPM's latency is often reduced by lowering the total number of steps in its sampling procedure. But by shortening a DPM's sampling process, a developer might degrade its performance, as shown in Figure 1. Therefore, researchers typically use knowledge distillation to maintain sample quality, training a new (step-reduced) network to emulate the original (high-latency) DPM. For example, progressive distillation [6, 7] iteratively trains a student network to predict the original DPM's two-step output in a single step, repeatedly halving the number of calls to the U-Net. Since then, a variety of new methods have been proposed to extend this simple approach. Several researchers [8, 9, 10] have introduced a notion of consistency, producing single-step solvers that generate (or sample from) any point along the whole diffusion trajectory. Alternatively, Liu et al. [11] reduced the curvature of the score function, which made the U-Net more accurate for large step-sizes.

All of the methods listed above reduce latency with a minimal impact on sample quality. However, they require some form of (re-)training and, as such, can't directly bolster pre-trained DPMs. In contrast, Lu et al. [12] leveraged ODE theory to directly improve a DPM's sampling procedure. Specifically, they derived an analytical solution to the reverse diffusion ODE expressed as an exponentially weighted integral. Based on this, they developed a numerical method that approximates the Taylor expansion of the exponential integral, allowing the U-Net to track the curvature of the score function with larger step-sizes [13]. Unlike the approaches that require knowledge distillation, this method can be directly applied to improve pre-trained DPMs in a 'plug-and-play' [12] manner.

Decreasing the Cost of U-Net Calls. Some researchers focus on reducing the cost of a single step rather than reducing the total number of steps. In particular, they perform knowledge distillation by training a small (i.e. low-cost) U-Net to emulate the large U-Net used by the original DPM. For example, Kim et al. [14] manually removed blocks from the U-Net that powers Stable Diffusion [15] and trained it to approximate the unpruned network. Li et al. [16] extended this manual approach by pruning the U-Net in accordance with a formalised trade-off between performance (CLIP-Score) and latency. Both of these papers retain sample quality by employing knowledge distillation, but can the cost of calling a U-Net be reduced without (re-)training?

Bhojanapalli et al. [17] improved transformers' efficiency by exploiting redundancies in their repeated calculation of attention maps. These maps are particularly consequential for latency as their calculation involves a costly outer product between a high-dimensional key and query. Upon finding a reasonable degree of similarity between a transformer's attention maps, they reused an arbitrary subset of the first layer's maps in the following layers. This approach can be applied in a 'plug-and-play' manner to any pre-trained transformer, as it doesn't require any retraining or knowledge distillation. We wonder whether similar redundancies exist in the attention maps produced by DPMs, and if so, whether this can be exploited to reduce the cost of calling a DPM's U-Net.

3 Analysis

We start this section by examining redundancies in the repeated calculation of attention maps by DPMs. Upon finding a significant degree of redundancy, we then define and locate reuse strategies that attempt to optimally exploit this redundancy. In particular, we propose a simple strategy by considering the Lyapunov exponents (a common tool for studying ODEs) of the reverse diffusion process. All of the figures presented in this section are generated using Stable Diffusion v1.5 with 20-step DDIM [18] as the (base) sampling procedure.

Attention Map Redundancy. We conjecture that DPM attention maps are (predictably) similar to their temporally adjacent neighbours. As such, we empirically investigate the normalised L1 difference [17] between them, averaged over a variety of prompts. Figure 2 illustrates a high degree of similarity, in which temporally adjacent maps are (on average) within 0.2 of each other. This suggests a relatively stable degree of redundancy in the repeated calculation of attention maps that can be exploited to reduce a DPM's latency. Instead of repeatedly calculating attention maps from key-query pairs, certain maps can be used more than once. But what exactly would this look like?
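To make this measurement concrete, the sketch below computes the normalised L1 difference between attention maps captured at adjacent sampling steps. It is a minimal PyTorch illustration; the exact normalisation used in our experiments is not reproduced here, so we assume the difference is scaled by the combined L1 mass of the two maps, which keeps the score in [0, 1] for row-stochastic attention maps.

import torch

def normalised_l1(a: torch.Tensor, b: torch.Tensor) -> float:
    # Normalised L1 difference between two attention maps of identical shape.
    # The assumed normalisation divides by the combined L1 mass of both maps.
    return ((a - b).abs().sum() / (a.abs().sum() + b.abs().sum())).item()

def adjacent_step_redundancy(maps_per_step: list) -> list:
    # maps_per_step[s] holds the attention map captured at sampling step s
    # (same layer throughout); returns the difference between adjacent steps.
    return [normalised_l1(maps_per_step[s], maps_per_step[s + 1])
            for s in range(len(maps_per_step) - 1)]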

[Figure 2: Normalised L1 difference between temporally adjacent attention maps, averaged over a variety of prompts.]

Defining Reuse and Success. Let $\boldsymbol{A}_s^l$ denote the attention map calculated in the $s$-th call of the $l$-th layer of the U-Net in the reverse diffusion process. (In fact, the conditioned and unconditioned prompts each have a corresponding cross- and self-attention map; for brevity, we refer to just one map at each layer and step.) Moreover, let $\boldsymbol{M}^l$ denote a set of memory variables which store the maps that we may wish to reuse. During sampling, set $\boldsymbol{M}^l \leftarrow \boldsymbol{A}_s^l$ every time that an attention map is directly calculated from key-query pairs. Then, for a reuse step $r > 1$, set $\boldsymbol{A}_r^l \leftarrow \boldsymbol{M}^l$, instead of directly calculating $\boldsymbol{A}_r^l$ from key-query pairs. We parameterise a reuse strategy by a binary vector, $\boldsymbol{\pi}$, where each entry corresponds to a step in the sampling procedure.

In particular, $\boldsymbol{\pi}_s = 1$ indicates that the attention map $\boldsymbol{A}_s$ is directly calculated from a key-query pair and $\boldsymbol{\pi}_s = 0$ denotes reuse at sampling step $s$. So the reuse strategy $\boldsymbol{\pi} = [1, 1, 0, 0, 1, 0]$ for a 6-step sampling procedure would directly calculate the attention maps from key-query pairs at steps 1, 2 and 5. Consequently, it would reuse the second attention map in steps 3 and 4 and reuse the fifth attention map at step 6.
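To illustrate the mechanics above, the sketch below wires a reuse strategy $\boldsymbol{\pi}$ into a single attention layer: when the entry is 1 the map is computed from keys and queries and cached in the memory variable $\boldsymbol{M}^l$, and when it is 0 the cached map is returned instead. The wrapper class and its interface are hypothetical simplifications, not Stable Diffusion's actual attention module.

import torch

class ReusableAttention:
    # Hypothetical single-head attention layer (layer index l) with reuse.
    def __init__(self, strategy):
        self.strategy = strategy  # binary vector pi, one entry per sampling step (0-indexed)
        self.memory = None        # M^l: the most recently calculated attention map

    def attention_map(self, q, k, step):
        if self.strategy[step] == 1 or self.memory is None:
            # Direct calculation: the costly outer product between queries and keys.
            scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
            self.memory = scores.softmax(dim=-1)   # cache for later reuse
        return self.memory                         # fresh or reused map

    def forward(self, q, k, v, step):
        return self.attention_map(q, k, step) @ v

With the 6-step example above, strategy = [1, 1, 0, 0, 1, 0] computes the map at (0-indexed) steps 0, 1 and 4 and reuses the cached map at steps 2, 3 and 5.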

While we have now defined what reuse involves, it is also important to consider when it can be deemed successful. In this paper, we consider a reuse strategy to be successful if it generates images that are close to the original DPM's output given the same prompt and initial conditions, as measured by PSNR. A looser definition of success would require only that reuse strategies generate realistic outputs that align with their prompt. This could be evaluated by measuring the outputs' CLIP-Score or FID with a distribution of natural images.

The PSNR-measurable definition of success is preferable as it's easier to evaluate and less subjective. In particular, the PSNR between two samples can be calculated quickly, which facilitates fast search algorithms. Moreover, PSNR doesn't depend on an arbitrary choice of natural images, unlike FID and CLIP-Score. For the reasons outlined above, our reuse strategies are optimised (via PSNR) to approximate the behaviour of a normal DPM, which itself could be trained to optimise FID or CLIP-Scores. Nevertheless, in Section 5 we present results regarding both characterisations of success.
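For completeness, the PSNR criterion can be computed directly between the original DPM's sample and the reuse-generated sample; a minimal version, assuming both images are scaled to [0, 1], is sketched below.

import torch

def psnr(reference: torch.Tensor, sample: torch.Tensor, max_value: float = 1.0) -> float:
    # Peak signal-to-noise ratio between the original DPM's output and a
    # reuse-generated output, both assumed to lie in [0, max_value].
    mse = torch.mean((reference - sample) ** 2)
    return float(10.0 * torch.log10(max_value ** 2 / mse))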

Locating a Successful Reuse Strategy. A brute-force approach for finding a successful reuse strategy would evaluate all possibilities and select the one whose output is closest to the true sample. However, an exhaustive search is infeasible given the exponential growth in possible strategies as the number of steps increases. As such, we probe the diffusion process for insights that might warm up our search. Common tools for analysing dynamical systems ($\textbf{z}_t$) are 'Lyapunov exponents' [19], which assume that errors ($\delta\textbf{z}_t$) grow (or shrink) exponentially in time. Formally, they assume there exists a $k \in \mathbb{R}\backslash\{0\}$ such that for every $t > \tau > 0$:

$$\left|\delta\textbf{z}_t\right| \approx e^{k\tau}\left|\delta\textbf{z}_{t-\tau}\right| \tag{2}$$

For our case, let $\boldsymbol{x}_t$ represent the reverse diffusion process, where $\boldsymbol{x}_0$ is Gaussian noise and $\boldsymbol{x}_T$ is a sample. Furthermore, let $\textbf{A}_s$ denote the attention maps at sampling step $s$. Adapting Lyapunov exponents for a reverse diffusion process in which attention maps are perturbed, we conjecture that there exists a $\lambda > 0$ such that for every step $s \in \mathbb{N}$ and corresponding timestep $t_s$ with $T > t_s > 0$:

$$\left|\delta\boldsymbol{x}_T\right| \propto e^{\lambda(T - t_s)}\left|\delta\textbf{A}_s\right| \propto e^{-\lambda t_s}\left|\delta\textbf{A}_s\right| \tag{3}$$

That is, the change in the final sample ($\delta\boldsymbol{x}_T$) is proportional to the change in the attention maps ($\delta\textbf{A}_s$) introduced by reuse at step $s$, subject to an exponential growth over the remaining steps. If our conjecture is true, then a heuristic strategy that follows from this would be to reuse (i.e., perturb the attention map) as late as possible in the sampling procedure, maximising $s$.
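The decay predicted by Equation 3 can be probed empirically by perturbing the attention maps at one step at a time and measuring the distortion this induces in the final sample, which is essentially what Figure 3 reports. The sketch below assumes a hypothetical run_sampler(perturb_step=...) helper that runs the full reverse diffusion with a fixed seed and prompt, injecting a small perturbation into the attention maps at the given step; it is an outline of the measurement, not our exact experimental code.

def final_distortion_per_step(run_sampler, num_steps):
    # run_sampler(perturb_step=None) returns the unperturbed final sample;
    # run_sampler(perturb_step=s) perturbs the attention maps at step s only.
    reference = run_sampler(perturb_step=None)
    distortions = []
    for s in range(num_steps):
        perturbed = run_sampler(perturb_step=s)
        # Mean absolute change in the final sample caused by perturbing step s.
        distortions.append((perturbed - reference).abs().mean().item())
    # Under Equation (3), these values should decay roughly exponentially in s.
    return distortions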

We will refer to this HeURistic Reuse StrategY as HURRY, defined for $r$ reuse steps in an $N$-step sampler by:

$$\text{HURRY} = \left[\textbf{1} \mid \textbf{0}\right] = \Bigl[\overbrace{1, \ldots, 1}^{N-r}, \overbrace{0, \ldots, 0}^{r}\Bigr] \tag{4}$$

Equally, if $\lambda$ were negative, then the error would shrink exponentially, suggesting that reuse is most suitable early in the sampling procedure. Figure 3 empirically explores whether the (attention-adapted) Lyapunov exponent appears to be positive or negative. We find that the impact of perturbations to the U-Net's attention maps decreases (roughly) monotonically over the sampling procedure, at least between steps one and eighteen (inclusive). The correlation between the empirical data and the theoretical curve (exponential decay) is 0.96 for the first eighteen steps. This decay supports our conjecture that the (attention-adapted) Lyapunov exponent of the reverse diffusion process is positive; see Equation 3.

[Figure 3: Distortion in the final sample caused by perturbing the attention maps at each step of the sampling procedure, alongside an exponential-decay fit.]

4 Method

In Section 3 we developed a reuse strategy based on the following conjectures: (1) the temporally adjacent attention maps of a pre-trained DPM are (predictably) similar to each other; (2) the Lyapunov exponent of a pre-trained DPM is positive. The first conjecture implies that reuse is a sensible idea, and the second implies that if you are going to reuse, it is best to do so late in the sampling procedure. We provided supporting evidence for these conjectures in Figures 2 and 3, respectively.

However, HURRY also relies on an assumption that we have so far neglected. In particular, it assumes that the suitability of step $s$ for reuse is independent of whether the model has already reused at step $r < s$. This might be problematic when creating reuse strategies that contain more than one reuse step. For instance, a 'later is better' strategy might be ideal for a single instance of reuse. But clustering several reuse steps towards the end of the sampling procedure without any intermediate recalculations might be sub-optimal.

Perturbing HURRY. To address this limitation, we evaluate perturbations of our heuristic reuse strategy to determine whether they perform unexpectedly well due to some unforeseen dependencies. Specifically, we set HURRY as our initial 'Best Strategy' and then perform a greedy search to locate the closest locally optimal strategy. A reuse strategy is said to be locally optimal if it outperforms all strategies that are one bit-flip away, where a bit-flip swaps a reuse and a non-reuse step (the smallest strategy mutation that preserves the number of reuse steps). We refer to the result of this search algorithm, our second reuse strategy, as PHAST: Perturbed Heuristic Attention STrategy. The utility function below, $U$, is PSNR in our case.

Algorithm 1:
    Best Strategy ← HURRY
    Optima(0) ← U(Best Strategy)
    Run ← 0
    repeat
        Run ← Run + 1
        Optima(Run) ← Optima(Run − 1)
        for π ∈ BitFlipSet(Best Strategy)
            if U(π) > Optima(Run) + ε
                Best Strategy ← π
                Optima(Run) ← U(π)
            end if
        end for
    until Optima(Run) = Optima(Run − 1)
    return Best Strategy
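A compact Python rendering of Algorithm 1 is given below. The utility U is left abstract (PSNR against the full sampler in our setting), HURRY provides the starting point, and BitFlipSet swaps one reuse step with one non-reuse step. This is a sketch of the search logic under those assumptions rather than a verbatim implementation.

from typing import Callable, List

def hurry(num_steps: int, reuse_steps: int) -> List[int]:
    # HURRY: calculate the first N - r maps directly, reuse for the last r steps.
    return [1] * (num_steps - reuse_steps) + [0] * reuse_steps

def bit_flip_set(strategy: List[int]) -> List[List[int]]:
    # All strategies one bit-flip away: swap one reuse step (0) with one
    # non-reuse step (1), preserving the total number of reuse steps.
    ones = [i for i, b in enumerate(strategy) if b == 1]
    zeros = [i for i, b in enumerate(strategy) if b == 0]
    neighbours = []
    for i in ones:
        for j in zeros:
            candidate = list(strategy)
            candidate[i], candidate[j] = 0, 1
            neighbours.append(candidate)
    return neighbours

def phast_search(utility: Callable[[List[int]], float],
                 num_steps: int, reuse_steps: int, eps: float = 1e-3) -> List[int]:
    # Greedy search of Algorithm 1, starting from HURRY and terminating once a
    # full round of bit-flips yields no improvement beyond the threshold eps.
    best = hurry(num_steps, reuse_steps)
    best_score = utility(best)
    improved = True
    while improved:
        improved = False
        for candidate in bit_flip_set(best):
            score = utility(candidate)
            if score > best_score + eps:
                best, best_score = candidate, score
                improved = True
    return best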

Algorithm 1 has a number of desirable properties. One such property is that, unlike evolutionary approaches, it's deterministic. This determinism stems from the fact that it (effectively) performs gradient descent on the set of reuse strategies, calculating the finite difference between adjacent strategies w.r.t. PSNR and moving in the direction that (approximately) improves it. (We include a small threshold, $\varepsilon$, to prevent insignificant rounds of bit flipping.) Another desirable property is that it's quick. For an $N$-step sampling procedure with $r$ reuse steps, there are only $\mathcal{O}(N^2)$ bit-flipped strategies, in comparison to an exhaustive search, which has $\mathcal{O}(N^r)$ strategies. In the case of a 20-step sampling procedure with 10 reuse steps, this corresponds to one round of bit-flipping covering 100 strategies, while an exhaustive search must cover 184,756 strategies.
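The counts quoted above follow from simple combinatorics: one round of bit-flipping examines (N − r) · r strategies, whereas an exhaustive search must cover all C(N, r) placements of the reuse steps. The snippet below reproduces the 20-step, 10-reuse-step figures.

import math

N, r = 20, 10
bit_flip_round = (N - r) * r       # 100 strategies in one round of bit-flipping
exhaustive = math.comb(N, r)       # 184,756 strategies in an exhaustive search
print(bit_flip_round, exhaustive)  # 100 184756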

5 Evaluation

In this section, we begin by examining the effectiveness of Algorithm 1 at locating the optimal reuse strategy. Following this, we compare the performance of attention reuse and step-reduction across various datasets, models, and sampling procedures. Our results demonstrate that reuse outperforms step-reduction in terms of PSNR and is competitive w.r.t. FID and CLIP-Score.

Comparison with Alternative Reuse Strategies. How does Algorithm 1 compare to an exhaustive search? To investigate this, we evaluated every reuse strategy in the space of 10-step DPM++ samplers with 3 reuse steps. As a result of this exhaustive search, we found HURRY to be the third-best strategy, with the two best strategies differing by a single bit-flip:

  1. Strategy = [1,1,1,1,1, 1,0,1,0,0], PSNR = 27.2 (i.e., PHAST)

  2. Strategy = [1,1,1,1,1, 1,0,0,1,0], PSNR = 26.8

  3. Strategy = [1,1,1,1,1, 1,1,0,0,0], PSNR = 25.9 (i.e., HURRY)

While we can't perform an exhaustive search on larger spaces, we find that Algorithm 1 rapidly settles into a local optimum, typically within a single bit-flip. For instance, given a 20-step DDIM sampler with 10 reuse steps, Algorithm 1 terminates with the following solution:
PHAST = [1,1,1,1,1, 1,1,1,0,1, 0,0,0,1,0, 0,0,0,0,0]

These results indicate that HURRY is a near-optimal strategy, stunted by its assumption of step-wise independence, which PHAST amends. To further support this assessment, and given the infeasibility of an exhaustive search in larger spaces, we compare our reuse strategies with several alternatives in Figure 4:

Strategy | No Reuse | Random | Early | Late  | PHAST
PSNR     | ref.     | 15.98  | 15.77 | 18.77 | 21.16
[Figure 4: 'Macaw' samples generated under each strategy.]

Initial Comparison Between Step-Reduction and Reuse. We have now demonstrated that HURRY and PHAST are promising reuse strategies. But do they outperform step-reduction, the canonical approach to reducing latency? Figure 5 compares step-reduced DDIM samplers with reuse strategies acting on a full 20-step DDIM sampler, for latencies between 3500ms and 5500ms. The results demonstrate that our reuse strategies consistently outperform step-reduced samplers at comparable latencies. Figure 6 visually supports these results with samples taken from a cross-section of Figure 5, at a latency of ∼4000ms. The samples generated by the reuse strategies are more similar to those produced by the original sampler than their step-reduced counterparts. Notably, the step-reduced samplers produce distorted images, such as the firetruck which looks more like a building, the detached tail of the Terrier, or the distorted macaw.

The latencies in Figure 5 are calculated by appropriately summing two components: (1) the latency of a full U-Net call; and (2) the latency of a reuse U-Net call. These latencies are estimated by calculating the share of total clock cycles used by each block of the U-Net and scaling the total latency accordingly. We establish that a full U-Net call has an approximate latency of 152ms, and a reuse call has an approximate latency of 47ms. (We derive the reuse latency under the assumption that reuse can reduce the latency of attention blocks by 90%, a conservative estimate.) For each number of reuse steps (i.e., each latency) in Figure 5, PHAST is separately selected by Algorithm 1.
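Given these two per-step estimates, the latency of any strategy in Figure 5 reduces to a simple sum; the sketch below uses the figures quoted above (roughly 152ms for a full U-Net call and 47ms for a reuse call). Note that the measured latencies in Figure 5 and Table 2 include additional fixed overheads, so these sums only track the U-Net component.

FULL_STEP_MS = 152   # approximate latency of a full U-Net call
REUSE_STEP_MS = 47   # approximate latency of a U-Net call that reuses attention maps

def strategy_latency_ms(strategy):
    # Estimated U-Net latency of one sampling run under a binary reuse strategy.
    return sum(FULL_STEP_MS if bit == 1 else REUSE_STEP_MS for bit in strategy)

# 20-step sampler with 10 reuse steps versus a plain 13-step sampler:
reuse_20_10 = strategy_latency_ms([1] * 10 + [0] * 10)   # 10*152 + 10*47 = 1990ms
step_reduced_13 = 13 * FULL_STEP_MS                      # 1976ms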

[Figure 5: PSNR against latency for step-reduced DDIM samplers and for reuse strategies applied to the full 20-step DDIM sampler.]

[Figure 6: Samples generated by DDIM 20-step, DDIM 12-step, DDIM HURRY, and DDIM PHAST at a latency of ∼4000ms.]

Further Comparisons Between Step-Reduction and Reuse. So far, this paper has focused (almost solely) on the PSNR of modifications to a 20-step DDIM sampler acting on Stable Diffusion (SD) v1.5 with ImageNet labels as prompts. We must now demonstrate that our reuse strategies also excel with different datasets, models, sampling procedures, and measures of performance. Table 1 starts this wider evaluation by taking the PSNR and FID for SD v1.5 on numerous datasets. Additionally, it evaluates PHAST on a more powerful model (SDXL) with a complex dataset (PartiPrompts). For this table, and all datasets in this section, we don't re-run Algorithm 1. As the underlying dynamics of the model are independent of its prompts, we reuse the same strategy (PHAST) that was selected by Algorithm 1 given ImageNet labels as prompts.

For every dataset in Table 1, PHAST significantly outperforms step-reduction w.r.t. PSNR. However, they are relatively indistinguishable w.r.t. FID, which suggests that these approaches generate images that are equally coherent and realistic. Yet, on SDXL with PartiPrompts (see Fig. 1), the step-reduced sampler generates a bowl of Pho with an atypical sub-bowl and an impossibly contorted tree that lacks a reflection. As such, we suggest that FID might not faithfully track whether or not an image is realistic. PHAST and HURRY avoid these unrealistic distortions, as they are designed to maximise fidelity w.r.t. the original model, which was trained to minimise FID.

Table 1:
Model  | Dataset                                     | Method   | PSNR (↑) | FID (↓)
SD 1.5 | MS-COCO 2017 (5k val., 512x512, see Tab. 2) | Base(13) | 17.50    | 28.22
SD 1.5 | MS-COCO 2017 (5k val., 512x512, see Tab. 2) | PHAST    | 25.90    | 29.18
SD 1.5 | MS-COCO 2014 (40k val., 512x512)            | Base(13) | 17.53    | 18.71
SD 1.5 | MS-COCO 2014 (40k val., 512x512)            | PHAST    | 26.03    | 17.62
SD 1.5 | Instruct-Pix2Pix (30k val., 512x512)        | Base(13) | 17.83    | 59.57
SD 1.5 | Instruct-Pix2Pix (30k val., 512x512)        | PHAST    | 24.98    | 60.08
SDXL   | PartiPrompts (100 random, 1024x1024)        | Base(13) | 20.30    | 202.58
SDXL   | PartiPrompts (100 random, 1024x1024)        | PHAST    | 29.92    | 199.88

Table 2:
Model                | Method     | DDIM PSNR (↑) | DDIM CLIP (↑) | DDIM FID (↓) | DPM++ PSNR (↑) | DPM++ CLIP (↑) | DPM++ FID (↓) | Lat.† (ms) | Mem.‡ (MiB)
SD 1.5               | Base(20)   | ref.          | 0.2671        | 28.23        | ref.           | 0.2673         | 27.11         | 4052       | 8091
SD 1.5               | Base(13)   | 17.50         | 0.2681        | 28.22        | 17.16          | 0.2676         | 27.39         | 2664       | 8091
SD 1.5               | PHAST      | 25.90         | 0.2667        | 29.18        | 23.02          | 0.2664         | 27.95         | 2937       | 12157
SD 1.5               | PHAST_FP16 | 25.95         | 0.2667        | 29.21        | 23.03          | 0.2664         | 27.95         | 3164       | 10141
SD 1.5 + CFG Dist.†† | Base(20)   | ref.          | 0.2611        | 31.71        | ref.           | 0.2613         | 31.66         | 2090       | 7273
SD 1.5 + CFG Dist.†† | Base(13)   | 17.68         | 0.2625        | 31.96        | 16.93          | 0.2619         | 31.16         | 1382       | 7273
SD 1.5 + CFG Dist.†† | PHAST      | 28.63         | 0.2614        | 32.56        | 25.34          | 0.2612         | 32.66         | 1517       | 9093
SD 1.5 + CFG Dist.†† | PHAST_FP16 | 28.63         | 0.2614        | 32.57        | 25.34          | 0.2612         | 32.66         | 1623       | 8045

† avg. time needed to generate one image when running continuously for 5 minutes on an NVidia A40 (46GB) with maxed-out batch size (DPM++, full precision).

‡ total memory needed to generate one image (DPM++, full precision). †† our reimplementation of "stage one distillation" from [7].

In Table 2, we evaluate PHAST against step-reduction for two samplers (DDIM and DPM++) on the MS-COCO [20] validation dataset. Algorithm 1 was run twice to separately select PHAST for each sampler; however, we found that, up to a timestep rounding error, these two strategies were the same. In particular, by directly comparing the timesteps (1-1000) rather than the steps (1-20), any differences between the two reuse strategies reduced to a rounding error. This suggests that our search algorithm can generalise across different samplers, which is perhaps unsurprising as the U-Net is unchanged between samplers.

Table 2 shows that PHAST significantly outperforms step-reduction for both samplers w.r.t. PSNR, yet it is marginally worse for CLIP and FID. But, as we've previously alluded to, these more subjective measures of performance might not be completely faithful (see Figs. 1 and 6). Interestingly, the PSNR of PHAST is consistently larger for DDIM than for DPM++. We attribute this difference to the more linear nature of DDIM, which might aid reuse.

Along with evaluating PHAST for two different samplers, Table 2 also considers two different models. The top row is a vanilla version of Stable Diffusion, and the bottom row evaluates a model whose conditional and unconditional forward passes of the classifier-free guidance (CFG) are distilled into one, following [7]. In particular, we distilled the model using the MS-COCO 2013 training dataset with 4 A100s for 2-3 days. (Since Meng et al. didn't release their code or checkpoints, we had to re-implement their method. As we're not interested in improving the raw performance of DPMs but in approximating them efficiently, we didn't perform a very exhaustive distillation.) We observe that even when a model has been additionally optimised, the same PHAST strategy, searched for on vanilla Stable Diffusion, results in very similar behaviour, with a high fidelity (PSNR) and a slight reduction in FID and CLIP. This demonstrates that our reuse strategies can be used in tandem with other approaches for optimising a DPM's latency.

In summary, both of the reuse strategies proposed in this paper appear to be optimal or near-optimal. Moreover, they consistently and significantly outperform step-reduced samplers w.r.t. PSNR, and remain competitive for the more subjective measures of image quality. For a given number of reuse steps, we have shown that the strategy selected using SD v1.5 on ImageNet with a DDIM sampler can generalise across models, datasets, and sampling procedures.

6 Discussion

The Memory-Latency Trade-Off. The reductions in latency achieved by our reuse strategies come at the expense of increased memory usage, which is required to cache the attention maps for reuse. Although Table 2 shows that the speedup is not undermined by this extra cost, the possibility of utilising reuse might be limited in memory-constrained systems. To alleviate this, developers could opt to cache attention maps in reduced precision. For instance, we observe that storing attention maps in FP16 does not degrade our results (see Tab. 2), but allows us to halve the memory required to cache attention maps.
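Concretely, reduced-precision caching only changes how the memory variables are stored; a minimal adaptation of the caching step, assuming PyTorch tensors, is sketched below.

import torch

def cache_attention_map(attn: torch.Tensor, use_fp16: bool = True) -> torch.Tensor:
    # Store an attention map for later reuse, optionally in half precision to
    # roughly halve the memory footprint of the cache.
    return attn.detach().to(torch.float16) if use_fp16 else attn.detach().clone()

def load_cached_map(cached: torch.Tensor, compute_dtype: torch.dtype) -> torch.Tensor:
    # Cast a cached map back to the U-Net's compute dtype before reuse.
    return cached.to(compute_dtype)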

We could reduce memory even further by considering 8-bit quantisation: PHAST_INT8 with our CFG-distilled model and the DPM++ solver achieves an FID of 32.65 and a PSNR of 25.35. However, we also notice that the overhead of performing quantisation becomes a bottleneck in this case, hurting latency; this is why we did not include a full comparison. Having said that, we expect that this might be a feasible way forward for devices that operate on quantised tensors, since the cost of performing quantisation will be amortised in those cases. We leave further investigation into memory reduction for future work.

Refining our Reuse Strategies. There are numerous ways to refine the reuse strategies proposed in this paper. For example, the step-wise strategies ignore a natural axis for search that might bolster performance, namely layer-wise search. Rather than employing a reuse vector, we might envision a reuse matrix in which the rows index the U-Net's layers and the columns index the steps of the sampling procedure. We excluded this search from the main body of the paper due to its computational complexity and limited impact on performance for SD v1.5 (see SM). However, we acknowledge that for certain DPMs this layer-wise refinement might produce significant improvements over the proposed reuse strategies.

Additionally, fine-tuning the U-Net to better tolerate reuse might further enhance performance. This could involve first fixing a reuse strategy, then fine-tuning the model whilst applying this strategy. Alternatively, the U-Net could be preemptively fine-tuned to better accommodate reuse strategies in general. For example, a developer might fine-tune the U-Net with random reuse strategies or modify it to have a more linear score function. We didn't investigate this in the paper, as our focus was to create a training-free method for reducing the cost of calling the U-Net, an area that has so far been overlooked.

Selecting the Parameters for Reuse. This paper has explored the following problem: given an $N$-step sampling procedure with $r$ reuse steps, how should these reuse steps be allocated in order to maximise the performance of the model? In reality, $N$ and $r$ are not necessarily fixed; they could be selected alongside their corresponding strategy ($\boldsymbol{\pi}_N^r$) to maximise the model's performance while keeping the latency below a certain threshold. However, considerable amounts of computation would be required to solve this constrained optimisation problem. As such, we leave it for developers to select sensible parameters ($N$, $r$) for their specific applications, perhaps via a short process of trial and error with HURRY$_N^r$.
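As a starting point for that trial-and-error process, a developer could enumerate the (N, r) pairs whose estimated U-Net latency fits a budget (reusing the per-step estimates from Section 5) and then score HURRY or PHAST for each; a rough, assumption-laden sketch follows.

FULL_STEP_MS = 152   # approximate latency of a full U-Net call (Section 5)
REUSE_STEP_MS = 47   # approximate latency of a reuse U-Net call (Section 5)

def feasible_configs(latency_budget_ms: float, max_steps: int = 50):
    # Enumerate (N, r) pairs whose estimated U-Net latency fits the budget; each
    # pair can then be trialled with HURRY (or PHAST) and scored by the developer.
    configs = []
    for n in range(1, max_steps + 1):
        for r in range(0, n):  # keep at least one directly calculated step
            latency = (n - r) * FULL_STEP_MS + r * REUSE_STEP_MS
            if latency <= latency_budget_ms:
                configs.append((n, r, latency))
    return configs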

7 Conclusion

This paper introduces two reuse strategies that reduce a DPM's latency while retaining the original DPM's weights and number of calls to the U-Net. We started by analysing the dynamics of the reverse diffusion process, which pointed to a 'later-is-better' reuse strategy. By conducting a local search around this heuristic strategy, we improved the model's performance for a fixed latency. Both strategies outperformed naive step reduction, especially when remaining faithful to the model's baseline behaviour is of primary concern. Moreover, we showed that reuse strategies can generalise across models, sampling procedures, and datasets. We hope that future work can further investigate redundancies in the reverse diffusion process, and their potential for improving a DPM's latency.

Broader Impact. Our work reduces the latency of high-quality DPM image synthesis. While this may offer societal benefits, DPMs can also be used to produce biased or harmful content [21]. Reductions in a DPM's latency might increase the ease with which malicious actors can produce harmful content.

References

[1] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
[2] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 8780–8794 (2021)
[3] Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., Chen, M.: GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021)
[4] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 36479–36494 (2022)
[5] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
[6] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)
[7] Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14297–14306 (2023)
[8] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models (2023)
[9] Gu, J., Zhai, S., Zhang, Y., Liu, L., Susskind, J.: BOOT: Data-free distillation of denoising diffusion models with bootstrapping. arXiv preprint arXiv:2306.05544 (2023)
[10] Berthelot, D., Autef, A., Lin, J., Yap, D.A., Zhai, S., Hu, S., Zheng, D., Talbot, W., Gu, E.: TRACT: Denoising diffusion models with transitive closure time-distillation. arXiv preprint arXiv:2303.04248 (2023)
[11] Liu, X., Zhang, X., Ma, J., Peng, J., Liu, Q.: InstaFlow: One step is enough for high-quality diffusion-based text-to-image generation. arXiv preprint arXiv:2309.06380 (2023)
[12] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
[13] Dockhorn, T., Vahdat, A., Kreis, K.: GENIE: Higher-order denoising diffusion solvers. Advances in Neural Information Processing Systems, 30150–30166 (2022)
[14] Kim, B.-K., Song, H.-K., Castells, T., Choi, S.: On architectural compression of text-to-image diffusion models. arXiv preprint arXiv:2305.15798 (2023)
[15] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
[16] Li, Y., Wang, H., Jin, Q., Hu, J., Chemerys, P., Fu, Y., Wang, Y., Tulyakov, S., Ren, J.: SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. arXiv preprint arXiv:2306.00980 (2023)
[17] Bhojanapalli, S., Chakrabarti, A., Veit, A., Lukasik, M., Jain, H., Liu, F., Chang, Y.-W., Kumar, S.: Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821 (2021)
[18] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[19] Layek, G.C.: An Introduction to Dynamical Systems and Chaos, vol. 449. Springer (2015)
[20] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014, Part V, pp. 740–755 (2014). Springer
[21] Anderljung, M., Hazell, J.: Protecting society from AI misuse: When are restrictions on capabilities warranted? arXiv preprint arXiv:2303.09377 (2023)