University of California, Berkeley
*Equal Contribution
In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm for not only generation but also visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods using significantly less data and compute.
We demonstrate the effectiveness of leveraging iterative computation through diffusion models for visual perception tasks. We perform an in-depth study of train/test-time compute scaling laws across all layers of the stack, including pre-training, fine-tuning, and diffusion inference. We conduct our core study on the monocular depth estimation task, and show how to transfer the scaling laws derived for depth estimation to boost performance on tasks such as optical flow and amodal segmentation, for both training and inference. Finally, we apply all of our scaling strategies to efficiently train a generalist mixture-of-experts model on perception tasks, achieving state-of-the-art results across various benchmarks.
We derive scaling laws for generative pre-training and fine-tuning of diffusion models for perceptual tasks. We pre-train DiT models of varied sizes on the ImageNet-1K dataset for class-conditional image generation. We observe clear power-law scaling behavior as we grow model size by linearly increasing the hidden dimension and number of layers.
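For reference, the generic power-law form behind fits of this kind is shown below; the prefactor and exponent are fit per setting, and the symbols here are placeholders rather than the paper's reported constants:

```latex
% Generic power-law fit of validation loss L against model size N.
% a (prefactor) and b (exponent) are fit per setting; they are
% placeholders here, not the paper's reported constants.
L(N) = a \, N^{-b}
\qquad\Longleftrightarrow\qquad
\log L(N) = \log a - b \log N
```

On a log-log plot, such a fit appears as a straight line with slope -b, which is how the scaling behavior is typically visualized.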
In addition to pre-training, we also derive scaling laws for fine-tuning on the downstream task of monocular depth estimation. We fine-tune the pre-trained DiT models by posing depth estimation as an image-to-image translation problem, training for conditional denoising diffusion generation on the Hypersim dataset. We show that larger dense DiT models predictably converge to a lower fine-tuning loss. We also observe a strong correlation between the fine-tuning loss scaling law and the validation metric scaling laws.
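A minimal sketch of one such conditional fine-tuning step is shown below, assuming a Marigold-style setup in which the clean RGB latent is channel-concatenated with the noisy depth latent; the `model` and the diffusers-style `scheduler` interface are illustrative placeholders, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def finetune_step(model, scheduler, optimizer, rgb_latent, depth_latent):
    """One conditional denoising fine-tuning step: condition on the RGB
    latent, learn to denoise the depth latent (image-to-image translation)."""
    b = depth_latent.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (b,),
                      device=depth_latent.device)
    noise = torch.randn_like(depth_latent)
    noisy_depth = scheduler.add_noise(depth_latent, noise, t)
    # Channel-wise concatenation of the condition and the noisy target.
    pred = model(torch.cat([rgb_latent, noisy_depth], dim=1), t)
    loss = F.mse_loss(pred, noise)  # standard epsilon-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```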
Finally, we also explore the effect of scaling pre-training compute, image resolution, and mixture-of-experts upcycling during fine-tuning. These results can be found in our paper.
Scaling test-time compute has been explored for autoregressive LLMs to improve performance on long-horizon reasoning tasks. Diffusion models, by design, allow efficient scaling of test-time compute. First, we can simply increase the number of denoising steps to increase the compute spent at inference. Second, since we are estimating deterministic outputs, we can initialize multiple noise latents and ensemble the predictions to obtain a better estimate. Finally, we can reallocate test-time compute between low- and high-frequency denoising by modifying the noise variance schedule.
The most natural way to scale diffusion inference is to increase the number of denoising steps. Since the model is trained to denoise inputs at various timesteps, we can scale the number of denoising steps at test time to produce finer, more accurate predictions. This coarse-to-fine denoising paradigm is well established in the generative case, and we can take advantage of it in the discriminative case as well. We show clear power-law scaling behavior in depth estimation validation metrics simply by increasing the number of diffusion sampling steps at test time.
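As a sketch, scaling this axis amounts to turning a single knob on the sampler. The DDIM-style loop below reuses the placeholder `model` and diffusers-style `scheduler` assumed above:

```python
import torch

@torch.no_grad()
def predict_depth(model, scheduler, rgb_latent, num_steps):
    """More denoising steps -> more test-time compute -> finer predictions."""
    scheduler.set_timesteps(num_steps)            # e.g. 1, 2, 5, 10, 25, 50, ...
    depth_latent = torch.randn_like(rgb_latent)   # start from pure noise
    for t in scheduler.timesteps:                 # high noise -> low noise
        model_in = torch.cat([rgb_latent, depth_latent], dim=1)
        noise_pred = model(model_in, t)
        depth_latent = scheduler.step(noise_pred, t, depth_latent).prev_sample
    return depth_latent
```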
We can also exploit the fact that denoising different noise latents will generate different downstream predictions. We do so through a test-time ensembling approach in which we compute predictions from multiple independently sampled noise latents and aggregate them into a single, more robust estimate.
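A sketch of this ensembling, reusing the `predict_depth` helper above; the pixel-wise median is one reasonable aggregation choice, not necessarily the paper's exact reduction:

```python
import torch

@torch.no_grad()
def ensemble_predict(model, scheduler, rgb_latent, num_steps, num_samples):
    """Denoise several independent noise latents, then reduce pixel-wise."""
    preds = torch.stack([
        predict_depth(model, scheduler, rgb_latent, num_steps)
        for _ in range(num_samples)
    ])
    return preds.median(dim=0).values  # one aggregation choice among several
```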
Finally, we can scale test-time compute by reallocating compute across different points of the denoising process. In diffusion noise schedulers, we define a schedule for the variance of the Gaussian noise applied to the image over the total diffusion timesteps. By modifying this schedule at test time, we can shift denoising steps toward the high-noise (low-frequency) or low-noise (high-frequency) phases of the process.
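One simple way to realize such a reallocation, sketched below under our own illustrative parameterization (not the paper's exact schedule), is to warp the test-time timestep spacing with an exponent:

```python
import numpy as np

def warped_timesteps(num_steps, num_train_timesteps=1000, gamma=0.5):
    """Power-warped timestep spacing for the reverse diffusion process.

    gamma < 1 packs steps toward high noise levels (low-frequency structure);
    gamma > 1 packs steps toward low noise levels (high-frequency detail).
    """
    u = np.linspace(0.0, 1.0, num_steps)
    t = (u ** gamma) * (num_train_timesteps - 1)
    return np.unique(t.round().astype(int))[::-1]  # descending for sampling
```

Sweeping `gamma` at a fixed step budget then trades compute between recovering coarse scene layout and refining fine detail.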
We train a unified generalist model capable of performing depth estimation, optical flow estimation, and amodal segmentation tasks. We apply all of our training and inference scaling techniques, highlighting the generalizability of our approach.
To train our generalist model, we modify the DiT-XL architecture by replacing the patch embedding layer with a separate patch embedding layer for each task, so that inputs from different modalities are projected into a shared token space for the transformer backbone.
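A minimal sketch of this per-task routing is shown below; the per-task channel counts in the usage example are hypothetical, not the model's actual configuration:

```python
import torch
import torch.nn as nn

class TaskPatchEmbed(nn.Module):
    """Per-task patch embeddings feeding a shared DiT backbone."""

    def __init__(self, task_channels, patch_size=2, embed_dim=1152):
        super().__init__()
        # One Conv2d patchifier per task; input channel counts may differ.
        self.embedders = nn.ModuleDict({
            task: nn.Conv2d(c, embed_dim, kernel_size=patch_size,
                            stride=patch_size)
            for task, c in task_channels.items()
        })

    def forward(self, x, task):
        tokens = self.embedders[task](x)          # (B, D, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)  # (B, N, D) token sequence

# Hypothetical channel counts: each task concatenates its conditioning
# latents with the noisy target latent along the channel dimension.
embed = TaskPatchEmbed({"depth": 8, "flow": 12, "amodal": 8})
```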
Our results demonstrate the effectiveness of our training and test-time scaling strategies, showing that high-quality visual perception with diffusion models does not require pre-training on internet-scale datasets. We hope to inspire future work on scaling training and test-time compute for iterative generative paradigms.
@article{ScalingDiffusionPerception2024,
title={Scaling Properties of Diffusion Models for Perceptual Tasks},
author={Rahul Ravishankar and Zeeshan Patel and Jathushan Rajasegaran and Jitendra Malik},
year={2024},
journal={arXiv preprint arXiv:2411.08034},
url={https://arxiv.org/abs/2411.08034}
}