Microsoft has announced the launch of "Visual ChatGPT," a new system that combines ChatGPT with several types of Visual Foundation Models (VFMs). The models used include Transformers, ControlNet, and Stable Diffusion. The system lets users interact with ChatGPT beyond plain language: they can send and receive both text and images in the chat, and insert visual model prompts into the conversation to edit their images.
According to the research paper "Visual ChatGPT: Talking, Drawing, and Editing with Visual Foundation Models," each visual foundation model handles its own specific set of tasks with fixed inputs and outputs, much as ChatGPT itself is trained only on text. Combined, however, these models cover a far wider range of image generation and editing tasks than any one of them does alone.
To bridge the gap between ChatGPT and VFMs, the paper proposes a Prompt Manager with several functions: telling ChatGPT what each VFM can do and specifying its required input and output formats; converting visual information into language; and managing the histories, priorities, and conflicts of the different VFMs.
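As a rough illustration of the first of these functions, the minimal sketch below renders a list of tool descriptions as plain text that a language model could read. It is not the paper's implementation: the `VisualTool` class, the `build_system_prompt` function, and the tool names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VisualTool:
    """Hypothetical description of one visual foundation model."""
    name: str
    capability: str
    inputs: str
    outputs: str


def build_system_prompt(tools: list[VisualTool]) -> str:
    """Render each tool's capability and input/output format as plain text."""
    lines = ["You can use the following visual tools:"]
    for tool in tools:
        lines.append(
            f"- {tool.name}: {tool.capability} "
            f"Input: {tool.inputs}. Output: {tool.outputs}."
        )
    return "\n".join(lines)


# Illustrative tool entries; the names and wording are assumptions.
tools = [
    VisualTool("Text2Image", "Generates an image from a description.",
               "a text prompt", "an image file path"),
    VisualTool("ImageCaptioning", "Describes an image in words.",
               "an image file path", "a text caption"),
]
prompt = build_system_prompt(tools)
print(prompt)
```

Passing images around as file paths is one simple way to turn visual information into something a text-only model can refer to; it is used here only as an assumption for the sketch.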
The Prompt Manager lets ChatGPT call VFMs and receive their responses iteratively until the user's request is fulfilled. Users can send images to ChatGPT and make complex image-related or visual editing requests, which are carried out in multiple steps by several AI models working together. Users can also give feedback on the results and ask for corrections.
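The multi-step behavior described here can be sketched as a simple loop that keeps calling tools until no further step is needed. Everything in this example is a stand-in: `choose_next_tool` replaces the language model's decision with a fixed plan, and the two tools only rename a file path rather than processing an image.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical tools: each takes an image path and returns a new image path.
TOOLS: dict[str, Callable[[str], str]] = {
    "detect_edges": lambda path: path.replace(".png", "_edges.png"),
    "edges_to_image": lambda path: path.replace("_edges.png", "_redrawn.png"),
}


def choose_next_tool(history: list[str]) -> Optional[str]:
    """Stand-in for the language model: follow a fixed plan, then stop."""
    plan = ["detect_edges", "edges_to_image"]
    used = len(history)
    return plan[used] if used < len(plan) else None


def run_request(image_path: str, max_steps: int = 5) -> tuple[str, list[str]]:
    """Call tools one at a time until the planner reports nothing left to do."""
    history: list[str] = []
    current = image_path
    for _ in range(max_steps):
        tool_name = choose_next_tool(history)
        if tool_name is None:
            break  # request considered fulfilled
        current = TOOLS[tool_name](current)
        history.append(f"{tool_name} -> {current}")
    return current, history


result, steps = run_request("cat.png")
print(result)
print(steps)
```

The `max_steps` cap is a design choice for the sketch, so that a planner that never signals completion cannot loop forever.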
In summary, the Visual ChatGPT system offers a new way for users to interact with ChatGPT, but only time will tell whether users find these features useful. You can also read about the earlier ChatGPT announcement on adding customization features to the popular AI tool.