MaMMUT: The Revolutionary Multimodal Architecture Uniting Vision and Language

AJ Piergiovanni and Anelia Angelova, researchers at Google Research, have presented MaMMUT, a new architecture that unites vision and language for multimodal tasks. The model, introduced in the paper “MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks”, is built on an idea that lets it train simultaneously on contrastive, generative, and localization objectives. MaMMUT is a compact multimodal model with 2 billion parameters that outperforms or is competitive with baseline models on tasks such as image-text retrieval, visual question answering (VQA), and image captioning.

MaMMUT uses a single image encoder and a single text decoder, so both components can be reused directly across tasks. The architecture also adapts easily to video-text tasks: it draws spatio-temporal information from videos through “video tubes”, which lets it process videos with more frames than previous models could.
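To make the idea of video tubes more concrete, here is a minimal sketch in plain Python. It treats a video as a nested list indexed by (frame, row, column) and cuts it into 3D blocks that span several frames at once. The function names `tube_origins` and `extract_tube`, the tube shape, and the strides are assumptions chosen for illustration; the article does not describe how MaMMUT actually samples its tubes.

```python
from itertools import product


def tube_origins(video_shape, tube_shape, stride):
    """Return the (t, y, x) start position of every tube that fits in the video.

    All three arguments are (time, height, width) triples. These names and the
    regular-grid sampling are illustrative assumptions, not the paper's method.
    """
    ranges = [
        range(0, dim - size + 1, step)
        for dim, size, step in zip(video_shape, tube_shape, stride)
    ]
    return list(product(*ranges))


def extract_tube(video, origin, tube_shape):
    """Slice one tube (a block of frames x rows x columns) out of a nested-list video."""
    t0, y0, x0 = origin
    dt, dy, dx = tube_shape
    return [
        [row[x0:x0 + dx] for row in frame[y0:y0 + dy]]
        for frame in video[t0:t0 + dt]
    ]


# A toy 16-frame, 8x8 video whose pixel values encode their own coordinates.
video = [[[t * 100 + y * 10 + x for x in range(8)] for y in range(8)] for t in range(16)]

dense = tube_origins((16, 8, 8), (4, 4, 4), (4, 4, 4))
sparse = tube_origins((16, 8, 8), (4, 4, 4), (8, 4, 4))  # larger temporal stride
first_tube = extract_tube(video, (4, 4, 0), (4, 4, 4))
```

In this sketch, each tube becomes one token covering several frames, so the token count depends on the tube size and stride rather than growing by a full grid of patches per frame. That is one plausible reading of why a tube-based representation leaves room for more frames.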

A surprising aspect of MaMMUT is that a single language decoder is enough to handle all of these multimodal tasks. This removes the need for the complex structures and training procedures used previously. The model's design combines contrastive and generative objectives inside the decoder itself, so the model learns representations for both image-text retrieval and caption generation.
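The combination of a contrastive and a generative objective can be sketched as a weighted sum of two losses. The example below is a minimal, standard-library-only illustration: a symmetric softmax contrastive loss over paired image and text embeddings, plus a token-level cross-entropy for captioning. The function names, the temperature of 0.07, and the equal loss weights are assumptions made for this sketch, not values taken from MaMMUT.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images matches pair k of texts.

    The temperature value is an assumption for illustration.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same idea over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def captioning_loss(token_logits, target_ids):
    """Mean cross-entropy of the target token at each decoding step."""
    total = sum(-log_softmax(step)[t] for step, t in zip(token_logits, target_ids))
    return total / len(target_ids)


def joint_loss(image_embs, text_embs, token_logits, target_ids, w_con=1.0, w_gen=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_gen * captioning_loss(token_logits, target_ids))
```

With matched pairs, for example `contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])`, the contrastive term is close to zero, and it grows when the pairs are swapped. In a real model, both terms would be computed from the same decoder's outputs and minimized together.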

MaMMUT outperforms baseline models on image-to-text and text-to-image retrieval without task-specific adaptation. It also achieves competitive results on VQA despite being significantly smaller than the models it is compared against. MaMMUT adapts efficiently to video-text tasks as well, processing more frames than previous models.

MaMMUT's architecture offers significant advantages in both model size and performance. Its versatility also allows it to be applied to a wide range of multimodal tasks, including image-text retrieval, VQA, and image captioning.

In conclusion, MaMMUT represents a step forward in vision-and-language research, offering a simple, compact architecture that outperforms baseline models on several multimodal tasks.
