Article reprint source: AI Trends
Source: Quantum Bit
A 166-page "instruction manual" for the super-powerful multimodal model GPT-4V has been released, and it comes from a Microsoft team.
What kind of paper runs to 166 pages?
It not only evaluates GPT-4V's performance on ten major tasks in detail, from basic image recognition to complex logical reasoning;
it also teaches a complete set of prompting techniques for multimodal large models,
walking you step by step through writing prompts from 0 to 1, with answers whose professional quality is evident at a glance. It practically lowers the barrier to using GPT-4V to zero.
It is worth mentioning that the authors of this paper form an all-Chinese team: all seven are Chinese, led by a female principal research manager who has worked at Microsoft for 17 years.
Before this 166-page report came out, they had also taken part in research on OpenAI's latest DALL·E 3 and know the field well.
Compared with OpenAI's own 18-page GPT-4V paper, this 166-page "user guide" was regarded as a must-read for GPT-4V users as soon as it was released:
Some netizens marveled: this is not a paper at all, it is practically a 166-page book.
Other netizens were alarmed after reading it:
Never mind the details of GPT-4V's answers; the potential capability this AI demonstrates is genuinely frightening.
So what exactly does this Microsoft "paper" cover, and what "potential" of GPT-4V does it demonstrate?
What does Microsoft’s 166-page report say?
The core of this paper's approach to studying GPT-4V comes down to one word: "try".
Microsoft researchers designed a series of inputs covering multiple domains, fed them to GPT-4V, and observed and recorded GPT-4V's outputs.
Subsequently, they assessed GPT-4V's ability to complete each task and distilled new prompting techniques for using GPT-4V, covering four aspects:
1. How to use GPT-4V:
Five kinds of input: images, sub-images, texts, scene texts, and visual pointers.
Three supported techniques: instruction following, chain-of-thought, and in-context few-shot learning.
For example, here is the instruction-following ability GPT-4V shows once the question is rephrased with chain-of-thought prompting:
2. GPT-4V's performance on 10 tasks:
open-world visual understanding, visual description, multimodal knowledge, commonsense, scene text understanding, document reasoning, coding, temporal reasoning, abstract reasoning, and emotion understanding.
This includes "image reasoning questions" that require some IQ to answer:
3. Prompting tips for GPT-4V-like multimodal large models:
A new multimodal prompting technique, "visual referring prompting," is proposed: the input image itself is edited (for example, by drawing on it) to point out the task of interest, and it can be combined with other prompting techniques (see the sketch after this list).
4. The research and deployment potential of multimodal large models:
The report predicts two areas multimodal learning researchers should focus on: deployment (potential application scenarios) and future research directions.
For example, here is one scenario in which the researchers found GPT-4V useful: fault detection.
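To make points 1 and 3 above more concrete, here is a minimal sketch of how such a prompt might be assembled in practice. It assumes an OpenAI-style chat-completions endpoint that accepts base64-encoded images alongside text; the model name, file name, and box coordinates are placeholders rather than anything the report prescribes. The "visual referring" part is simply a red box drawn on the input image with Pillow before it is sent.

```python
# Minimal sketch (not from the report): send one image plus a text instruction
# to an OpenAI-style chat-completions endpoint, with "visual referring prompting"
# done by drawing a red box on the image before sending it.
# Model name, file name, and box coordinates are placeholders.
import base64
from io import BytesIO

from PIL import Image, ImageDraw
from openai import OpenAI

def to_data_url(img: Image.Image) -> str:
    """Serialize a PIL image as a base64 PNG data URL."""
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

# Visual referring prompting: edit the image itself to point at the region of interest.
image = Image.open("machine_photo.png").convert("RGB")   # hypothetical input image
ImageDraw.Draw(image).rectangle([120, 80, 260, 200], outline="red", width=4)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",                        # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Look at the part inside the red box. Reason step by step "
                     "about any visible fault before giving your conclusion."},
            {"type": "image_url", "image_url": {"url": to_data_url(image)}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The step-by-step instruction in the text prompt corresponds to the chain-of-thought technique in point 1, and the drawn red box is the visual referring prompt from point 3.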
But whether it is the new prompting technique or the application scenarios, what everyone cares about most is GPT-4V's real strength.
So this "instruction manual" goes on to spend more than 150 pages on demos, detailing GPT-4V's capabilities across different tasks.
Let’s take a look at how far GPT-4V’s multimodal capabilities have evolved.
Proficient with specialist imagery, and able to learn new skills on the spot
Image recognition
The most basic recognition is of course a piece of cake, such as celebrities from the technology, sports, and entertainment circles:
Not only can it tell who these people are, it can also interpret what they are doing. For example, in the picture below, Jensen Huang is introducing Nvidia's newly launched graphics card products.
In addition to people, landmark buildings are also a piece of cake for GPT-4V. It can not only determine the name and location, but also give a detailed introduction.
△Left: Times Square, New York; Right: Kinkaku-ji Temple, Kyoto
However, the more famous a person or place is, the easier it is to identify, so harder images are needed to show what GPT-4V can really do.
For example, in medical imaging, GPT-4V gives the following conclusion for the lung CT image below:
There are consolidation and ground-glass opacities in multiple areas of both lungs, which may indicate infection or inflammation of the lungs. There may also be a mass or nodule in the right upper lobe.
Even without being told the imaging modality or the body part shown, GPT-4V can make the judgment on its own.
For the image below, GPT-4V correctly identified it as a magnetic resonance imaging (MRI) scan of the brain.
It also found a large fluid build-up, which it judged to be consistent with a high-grade glioma.
Professionals who reviewed it confirmed that GPT-4V's conclusion was correct.
Besides such "serious" content, GPT-4V has also mastered meme images, that "intangible cultural heritage" of contemporary human society.
△Machine translation, for reference only
GPT-4V can interpret not only the jokes in memes but also the emotions expressed by real human facial expressions.
Beyond such natural images, text recognition is also an important task in machine vision.
Here GPT-4V can read languages written in the Latin script, as well as other scripts such as Chinese, Japanese, and Greek.
It can even read handwritten mathematical formulas:
Image reasoning
However specialized or hard to parse, the demos above are still all about recognition, and recognition is only the tip of the iceberg of GPT-4V's skills.
In addition to understanding the content of the picture, GPT-4V also has certain reasoning capabilities.
To start simple, GPT-4V can find the differences between two images (although it still makes some mistakes).
In the pair of pictures below, GPT-4V spotted the differences in the crown and the bow.
If the difficulty is raised, GPT-4V can also solve the graphical puzzles found in IQ tests.
The features and logical relationships in the three questions above are relatively simple, but the following questions get harder:
Of course, the difficulty does not lie in the figure itself. Note the fourth text instruction in the image: in the original question, the shapes are not arranged the way they are shown here.
Image annotation
In addition to answering various questions with text, GPT-4V can also perform a range of operations on images.
For example, given a group photo of four AI luminaries, we want GPT-4V to box each person and label them with their name and a brief introduction.
GPT-4V first answered in text, and then produced the annotated picture:
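In practice, such annotation works because the model's reply is text; a small rendering step turns coordinates in that text into boxes on the picture. Below is a minimal sketch of that step, under the assumption (ours, not the report's) that the model was asked to output one `name: (x1, y1, x2, y2)` line per person.

```python
# Minimal sketch: render "name: (x1, y1, x2, y2)" lines from a model's text reply
# as labeled boxes on the photo. The reply format and file names are assumptions
# made for illustration; the report does not prescribe them.
import re
from PIL import Image, ImageDraw

reply = """Person A: (40, 35, 210, 360)
Person B: (230, 30, 400, 365)"""                       # placeholder model output

image = Image.open("group_photo.png").convert("RGB")   # hypothetical input image
draw = ImageDraw.Draw(image)

pattern = re.compile(r"(.+?):\s*\((\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)")
for line in reply.splitlines():
    match = pattern.match(line.strip())
    if not match:
        continue
    name = match.group(1)
    x1, y1, x2, y2 = (int(v) for v in match.groups()[1:])
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 14)), name, fill="red")

image.save("group_photo_annotated.png")
```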
Dynamic content analysis
Beyond static content, GPT-4V can also handle dynamic analysis, although it is not fed a video directly.
The five pictures below are taken from a tutorial video on making sushi, and GPT-4V’s task is to infer (based on understanding the content) the order in which these pictures appear.
The same series of pictures can be read in more than one way, and GPT-4V makes its judgment based on the text prompt.
For example, in the following set of pictures, whether the person is opening or closing the door leads to completely opposite orderings.
Of course, by watching how a person's state changes across several pictures, GPT-4V can also infer what they are doing.
It can even predict what will happen next:
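As a rough illustration of how this kind of frame-level prompting could be set up, here is a sketch that samples a few frames from a video with OpenCV, shuffles them, and asks a GPT-4V-style endpoint for the most likely chronological order. The video file name, frame count, and model name are assumptions; the report only shows the frames and prompts, not code.

```python
# Minimal sketch: sample frames from a video, shuffle them, and ask a GPT-4V-style
# model to recover their temporal order. File name and model name are placeholders.
import base64
import random

import cv2
from openai import OpenAI

def sample_frames(path: str, n: int = 5) -> list[str]:
    """Grab n evenly spaced frames and return them as base64 PNG data URLs."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    urls = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if not ok:
            break
        _, buf = cv2.imencode(".png", frame)
        urls.append("data:image/png;base64," + base64.b64encode(buf.tobytes()).decode())
    cap.release()
    return urls

frames = sample_frames("sushi_tutorial.mp4")     # hypothetical video file
random.shuffle(frames)                           # mimic the shuffled frames in the demo

content = [{"type": "text",
            "text": "These frames from one video are shuffled. Based on what is happening "
                    "in them, give the most likely chronological order and explain why."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in frames]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",                # assumed model name
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```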
"Learning on the spot"
GPT-4V not only has strong visual capabilities; more importantly, it can learn something new and apply it immediately.
Take another example: asking GPT-4V to read a car dashboard. Its first answer was wrong:
The method was then explained to GPT-4V in words, but its answer was still wrong:
A worked example was then shown to GPT-4V; its answer followed the same pattern, but unfortunately the numbers were made up.
A single example is indeed too few, but once the number of examples was increased (by just one more), the effort finally paid off and GPT-4V gave the correct answer.
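This is exactly the in-context few-shot learning mentioned at the start of the report: worked image-answer pairs are placed in the prompt before the query image. Below is a minimal sketch of a two-shot prompt against an OpenAI-style endpoint; the file names, example answers, and model name are placeholders rather than values from the report.

```python
# Minimal sketch of in-context few-shot prompting with images: two worked
# (dashboard image, correct reading) examples precede the query image.
# File names, answer strings, and model name are placeholders.
import base64
from openai import OpenAI

def image_part(path: str) -> dict:
    """Wrap a local image as an image_url content part (base64 PNG data URL)."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}}

few_shot = [
    ("dashboard_example_1.png",
     "The needle points just past 20, so the speed is about 22 mph."),
    ("dashboard_example_2.png",
     "The needle points midway between 60 and 80, so the speed is about 70 mph."),
]

content = [{"type": "text", "text": "Read the speed shown on the dashboard in each image."}]
for path, answer in few_shot:
    content.append(image_part(path))
    content.append({"type": "text", "text": f"Answer: {answer}"})
content.append(image_part("dashboard_query.png"))       # the image we actually care about
content.append({"type": "text", "text": "Answer:"})

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",                        # assumed model name
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```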
That is a sample of what GPT-4V can do. It of course covers more domains and tasks than can be shown here one by one; if you are interested, read the original report.
So, what kind of team is behind this showcase of GPT-4V's stunning capabilities?
Led by a Tsinghua alumna
The paper has seven authors in total, all Chinese, six of whom are core authors.
The project lead, Lijuan Wang, is a principal research manager in Microsoft's Cloud & AI group.
She earned her bachelor's degree from Huazhong University of Science and Technology and her Ph.D. from Tsinghua University, joined Microsoft Research Asia in 2006, and moved to Microsoft Research in Redmond in 2016.
Her research focuses on deep learning and machine learning for multimodal perceptual intelligence, including vision-language model pre-training, image captioning, and object detection.
Original address: https://arxiv.org/abs/2309.17421