This paper presents PaLI-3, a smaller, faster, and stronger vision-language model (VLM) that compares favorably to similar models 10x its size. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained with classification objectives to contrastively pretrained (SigLIP) ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters and achieve a new state of the art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on the fundamental pieces of complex VLMs and fuels a new generation of scaled-up models.
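The paper does not ship code, but the SigLIP objective it refers to is described in the SigLIP paper: instead of a softmax contrastive loss over the batch, each image-text pair is scored independently with a sigmoid, so matching pairs get label +1 and all other pairs in the batch get label -1. Below is a minimal NumPy sketch of that pairwise sigmoid loss; the `temperature` and `bias` values are illustrative defaults, not the trained parameters from the paper.

```python
import numpy as np

def siglip_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss in the style of SigLIP.

    Every (image, text) pair in the batch is treated as an independent
    binary classification: the matching pair (the diagonal) is positive,
    all other pairs are negative. No softmax normalization over the batch.
    """
    # L2-normalize both sets of embeddings, shape (B, D)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = temperature * img @ txt.T + bias        # (B, B) pair scores
    labels = 2.0 * np.eye(len(img)) - 1.0            # +1 diagonal, -1 elsewhere
    z = labels * logits
    # -log sigmoid(z) computed stably as log(1 + exp(-z))
    return np.mean(np.logaddexp(0.0, -z))
```

With orthonormal embeddings the loss is lower when images and texts are correctly paired (diagonal match) than when the pairing is shuffled, which is the behavior the objective is meant to enforce.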


Impressive results! I only wish they had shared some code, or any easy way to replicate the experiments.


Indeed, it would be great if the authors did so. I personally found some non-official implementations:


Machine Learning

!machinelearning@kbin.social
