Scalable Pre-Training of Large Autoregressive Image Models

Apple’s Machine Learning research team recently introduced a series of Autoregressive Image Models (AIM) ranging in size from a few hundred million to several billion parameters. The study examines how the performance of these models changes with model size and with the amount of training data. This article covers the experiments conducted, the datasets used, and the conclusions drawn. But first, let’s look at autoregressive modeling and how it applies to images.


**Autoregressive Models:** Autoregressive models are a family of models that predict future data points from past ones. They capture the patterns and temporal dependencies in a sequence and use them to forecast the next values. Examples include Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA), commonly used in sales and revenue time-series forecasting.
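
To make the idea concrete, here is a minimal sketch of the simplest autoregressive model, AR(1), where each value is predicted as a constant multiple of the previous one. It is far simpler than ARIMA or SARIMA; the function names `fit_ar1` and `forecast` and the toy series are illustrative assumptions, not part of any library.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise (no intercept)."""
    num = sum(prev * cur for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


def forecast(series, phi, steps):
    """Roll the model forward, feeding each prediction back in as the next input."""
    out = []
    last = series[-1]
    for _ in range(steps):
        last = phi * last
        out.append(last)
    return out


# Toy series that halves at every step, so the fitted coefficient is 0.5.
history = [16.0, 8.0, 4.0, 2.0, 1.0]
phi = fit_ar1(history)
future = forecast(history, phi, steps=3)
print(phi, future)
```

The key property is in `forecast`: every prediction depends only on values that came before it, which is the same constraint an autoregressive image model places on image patches.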

**Autoregressive Image Models:** Autoregressive Image Modeling (AIM) applies the same principle to images, with pixels or groups of pixels as the data points. The image is divided into segments that are treated as a sequence, and the model learns to predict the next segment from the ones before it. Popular models like PixelCNN and PixelRNN use autoregressive modeling for visual data prediction, particularly in applications like image enhancement and generative networks for creating new images.
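
The following sketch shows one way an image could be turned into a sequence and paired with next-segment targets. It assumes square, non-overlapping patches read in raster order (left to right, top to bottom) and image dimensions divisible by the patch size; `patchify` and `next_patch_pairs` are hypothetical helper names used only for illustration.

```python
def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches.

    Patches are returned in raster order. Assumes the height and width
    are both divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def next_patch_pairs(patches):
    """Pair each prefix of the sequence with the patch that follows it."""
    return [(patches[:k], patches[k]) for k in range(1, len(patches))]


# 4x4 toy "image" with pixel values 0..15, split into four 2x2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
pairs = next_patch_pairs(patches)

for context, target in pairs:
    print(f"given {len(context)} patch(es) -> predict {target}")
```

Each `(context, target)` pair is one training example: the model sees the earlier patches and is asked to produce the next one.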

**Pre-training Large-Scale Autoregressive Image Models:** Pre-training an AI model involves training a foundation model on a large, generic dataset before adapting it to specific tasks. Image models have commonly been pre-trained on datasets like MS COCO and ImageNet; Apple’s researchers instead used the DFN dataset introduced by Fang et al. The version they used comprises 2 billion cleaned and filtered images and is referred to as DFN-2B.

**Architecture:** The input image is divided into segments (patches), which are then processed by a transformer that uses self-attention to model the relationships between them. A multi-layer perceptron sits on top of the transformer and serves as the prediction head.
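
As a rough illustration of this pipeline, the sketch below runs a short token sequence through single-head self-attention and then a small multi-layer perceptron. It makes several simplifying assumptions that are not from the study: a strictly causal mask (each position sees only itself and earlier positions), identity query/key/value projections, and hand-picked weights `w1` and `w2` for the head.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(tokens):
    """Single-head attention with identity projections (q = k = v = tokens).

    Position i attends only to positions 0..i, which keeps the
    prediction autoregressive.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for i, query in enumerate(tokens):
        visible = tokens[: i + 1]  # causal mask: future positions are dropped
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        attn = softmax(scores)
        # Pad with zeros so every row has one weight per position.
        weights.append(attn + [0.0] * (len(tokens) - i - 1))
        outputs.append([sum(a * v[d] for a, v in zip(attn, visible))
                        for d in range(dim)])
    return outputs, weights


def mlp_head(vec, w1, w2):
    """Two-layer perceptron with ReLU; w1 and w2 are lists of weight rows."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


# Three toy "patch embeddings" of dimension 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, weights = causal_self_attention(tokens)

# Hypothetical head weights: 2 -> 3 hidden units -> 2 outputs.
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predictions = [mlp_head(vec, w1, w2) for vec in contextual]

for row in weights:
    print([round(w, 3) for w in row])
```

The printed attention matrix is lower triangular: every weight above the diagonal is zero, so no position can use information from patches that come later in the sequence.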

**Experimentation:** Several Autoregressive Image Models were built with differing widths and depths and trained on datasets of varying sizes. The models were evaluated on their performance at different numbers of training iterations.
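
The size of a transformer is governed mainly by its width (the embedding dimension) and its depth (the number of blocks). The sketch below uses a common rule of thumb of roughly 12 × depth × width² parameters, assuming a 4× hidden expansion in each block's MLP and ignoring biases, normalization layers, and embeddings. The configurations are hypothetical and are not the ones used in the study.

```python
def approx_transformer_params(width, depth):
    """Rough parameter count for a stack of transformer blocks.

    Per block: about 4 * width^2 attention weights plus about 8 * width^2
    MLP weights (assuming a 4x hidden expansion). Biases, normalization
    layers, and embeddings are ignored.
    """
    return 12 * depth * width * width


# Hypothetical (width, depth) configurations, chosen only for illustration.
configs = {"small": (1024, 12), "medium": (2048, 24), "large": (4096, 32)}

for name, (width, depth) in configs.items():
    billions = approx_transformer_params(width, depth) / 1e9
    print(f"{name}: width={width}, depth={depth}, ~{billions:.2f}B parameters")
```

Because the count grows with the square of the width but only linearly with depth, doubling the width quadruples the parameter count while doubling the depth merely doubles it.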

**Results:** The study observed that increasing the number of model parameters slightly improved training performance, and that models trained on larger datasets performed better. Together, these experiments indicate that the proposed models scale with both model size and data.

**Conclusions:** The experiments show that the AIM models scale effectively, with results improving as both dataset size and model capacity increase. On downstream tasks, the models were competitive with other generative and autoregressive models, striking a balance between speed and accuracy.

In summary, Apple’s Autoregressive Image Models scale well and offer a stable pre-training experience across various model sizes and dataset combinations, while remaining competitive with similar models in both speed and accuracy.
