Can Multimodal LLMs Truly "See" Images? A Deep Dive with ASCII Art

With last week’s Pixtral release, multimodal large language models (LLMs) like OpenAI’s GPT-4o, Google’s Gemini Pro, and Pixtral are making significant strides. These models are not only able to generate text from images but are often touted for their ability to “see” images similarly to human perception. But how true is this claim?

Update (2024-09-25): I added a new test for Vision Llama 3.2. Unfortunately, it failed again. Update (2024-09-18): I added a new test for Qwen2-VL, which still fails to generate the correct ASCII art.

Experimenting with ASCII Art

To explore this, I conducted an experiment to test the model’s ability to understand the spatial structure of simple images and to convert them into ASCII art. This task challenges the model to interpret and express visual details in textual form, testing their spatial reasoning.

The ASCII Art Test

I used a simple image: the SkyPilot logo. Yes, it’s a bit random, but I came across it when I was writing the SkyPilot deployment guide for Pixtral on any cloud or Kubernetes (check it out!).

I used three different prompts for each model:

Give me the ascii art for this logo
Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background
Covert this logo to ASCII art

This test is basically asking the model to understand the spatial structure of the image, and replicate it with text.

Spoiler Alert: All but one attempts on five multimodal LLMs with three prompts each failed to fully understand the spatial structure, i.e., failed to generate the given image in ASCII art form.

What’s the Expected Output?

An online image to ASCII art converter generates this test’s desired output: a faithfully reproduced ASCII version of the given logo:

                                                          ++ 
                                                    +++++++++
                                             ++++++++++++++++
                                       +++++++++++++++++++++ 
                                ++++++++++++++++++++++++++++ 
                          +++++++++++++++++++++++++++++++++  
                   +++++++++++++++++++++++++      +++++++++  
             +++++++++++++++++++++++++           +++++++++   
      ++++++++++++++++++++++++                   +++++++++   
++++++++++++++++++++++++                        +++++++++    
 ++++++++++++++++                               +++++++++    
   +++++++++++                                 +++++++++     
     +++++++++++                   ---         +++++++++     
       +++++++++++           -------++        +++++++++      
         +++++++++++     ---------+++         +++++++++      
           ++++++++++-----------++++         +++++++++       
             +++++++----------++++++         +++++++++       
              ++++++---------++++++         +++++++++        
                ++++-------+++++++          +++++++++        
                  +--     ++++++++         +++++++++         
                         ++++++++          ++++++++          
                        +++++++++         +++++++++          
                        +++++++++         ++++++++           
                         ++++++++++      +++++++++           
                           ++++++++++    ++++++++            
                             ++++++++++ +++++++++            
                               +++++++++++++++++             
                                 +++++++++++++++             
                                   ++++++++++++              
                                     ++++++++++              
                                      ++++++++               
                                        ++++++               
                                          +++

Open-Source Models

Vision Llama 3.2

To deploy Vision Llama 3.2, I used the single command from the SkyPilot Vision Llama 3.2 example:

sky serve up -n llama3_2-vision llama3_2-vision-11b.yaml

Attempt 1

Give me the ascii art for this logo

  /\
 /  \
|  | |
 _______
    |
    |

Attempt 2

Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background

>>>>>>
>>>>>
>>>

Attempt 3

Covert this logo to ASCII art

        _____
    __/      \
   /          \
  /            \
 /____________  \
|               |
|               |
|               |
|_______________|

Pixtral

To deploy Pixtral, I used the single command from the SkyPilot Pixtral example:

sky serve up -n pixtral pixtral.yaml

Attempt 1

Give me the ascii art for this logo


 ____        __
/\  _`\      /\ \
\ \ \L\ __ __\ \ \_____  _____ ____
 \ \ ,__\/ __/\ \ \/ / \/ / _ \/ __/
  \ \ \/ /\ \ \ \ \  __  >  __/\ \
   \ \_\ \_\ \ \ \ \ \ \/\ \ \ \ \
    \/_/ \/_/ \_\_\_\_\_\_\_\__\___\_\

Attempt 2

Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background


  ______
 /      \
| () () |
 \  __  /
  `---'

Attempt 3

Covert this logo to ASCII art


     ______
  _ //__  __\
 / \/  \/  \/
 \_/\______/

Qwen2-VL

To deploy Qwen2-VL, I used the single command with SkyPilot (see here for the full example):

sky serve up -n qwen2 qwen2-vl-7b.yaml

Attempt 1

Give me the ascii art for this logo


  ____
 /    \
|   _   |
|  | |  |
|  | |  |
 \___\/

Attempt 2

Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background


  ____
 /    \
|  _   |
| |_|  |
|  _   |
 \___/

Attempt 3

Covert this logo to ASCII art


   _______
  /         \
 /           \
|           |
 \           /
  \_____   _/
    |   |   |
     |_| |_| |

Proprietary Models

Proprietary multimodal vision-text models are often considered more powerful than open-source models. Let’s see how they perform.

Gemini 1.5 Pro

I now switch to Google’s Gemini 1.5 Pro Experimental 0827 on Google AI studio, which is currently the top multimodal model on the LMSYS Chatbot Arena Leaderboard.

Attempt 1 (links to the chat session)

Google AI Studio Screeshot

Give me the ascii art for this logo

.....
     .......
    .........
   ...........
  ............
 .............
...............
. .............
 .. ...........
  ... ........
   .... .......
    ..... .....
     ...... ....
      ....... ...
       ........ ..
        ......... .
         ..........
          .........
           .........
            ........
             .......
              ......
               .....
                ....
                 ...
                  ..
                   .

Attempt 2

Google AI Studio Screenshot

Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background

_.--""--._
    .'          `.
   /   O      O   \
  |    \  ^^  /    |
  \      `--'      /
   `.__________.'

Attempt 3

Google AI Studio Screenshot

Covert this logo to ASCII art

     /\\
    /  \\
   /____\\
  /______\\
 /________\\
/__________\\
\__________/
 \________/
  \______/
   \____/
    \  /
     \/

GPT-4o

Last chance for multimodal LLMs to shine, let’s try GPT-4o.

Attempt 1

ChatGPT 4o Screenshot

Give me the ascii art for this logo

     /\
    /  \
   /    \
  /      \
 /________\
  \      /
   \    /
    \  /
     \/

Attempt 2

ChatGPT 4o Screenshot

Convert this SkyPilot logo into an ASCII art in a minimalist way, i.e. no background

       /\
      /  \
     /    \
    /      \
   /        \
  /__________\
     |    |
     |____|
     /\  /\
    /  \/  \
   /        \

Attempt 3

ChatGPT 4o Screenshot

Covert this logo to ASCII art

This time the model “cheated” by generating some code that reads the image and programatically replaces the pixels with ASCII characters. It didn’t really understand the spatial structure. (Below, we re-rendered the output of the program since ChatGPT did not show the ASCII output correctly in markdown.)

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@%%%%%%%%%%%%@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@%%%%%%%%%%%%@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@%%%%%%%%%%%%@@
@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@%%%%%%%%%%%%@@@
@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@
@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@
@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@
@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@
@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@
@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@
@%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@
@@%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@
@@@%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@
@@@@%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@
@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@
@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@
@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@@@@%#@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@
@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@@@@@#*+#@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@
@@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@@@#*+++#%@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@
@@@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@%*+++++##@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@
@@@@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@%*+++++++##%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@
@@@@@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@%*+++++++++###%@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@
@@@@@@@@@@@@@%%%%%%%%%%%%%@@@@@%#*++++++++++####@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@
@@@@@@@@@@@@@@%%%%%%%%%%%%%@%#*++++++++++++####%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@
@@@@@@@@@@@@@@@%%%%%%%%%%%%*++++++++++++++#####@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@
@@@@@@@@@@@@@@@@%%%%%%%%%%#++++++++++++++######@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@
@@@@@@@@@@@@@@@@@%%%%%%%%%#+++++++++++++######%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@
@@@@@@@@@@@@@@@@@@%%%%%%%%#++++++++++++#######@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@
@@@@@@@@@@@@@@@@@@@%%%%%%%*+++++++++++########@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@%%%%%%*++++++++++########%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@%%%%#++++++++++#########@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@%%%#+++++++*##########%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@%%*++++*#%@@#########%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@%*+*#%@@@@%#########@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@#%@@@@@@@#########%@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#########@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%#########@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#########%@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#########@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%########%@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#########%@@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@##%%%%%%%%%@@@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@%%%%%%%%%%%%@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@%%%%%%%%%%%%@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@%%%%%%%%%%%%@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%%%@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%%@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%%@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%%@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%%%@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%%@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%%@@@@@@@@@@@@@@@@@@@@
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@%%%%@@@@@@@@@@@@@@@@@@@@@

Summary

Multimodal LLMs are impressive due to their capability to understand text and images. However, when it comes to spatial reasoning, there’s still a lot of room for improvement.

It is possible that the issue above is just another example of tokenization issue as we saw in the “strawberry test”, but it is still a long way to go before we see AGI.

Maybe the latest o1-preview from OpenAI could resolve this problem. I am excited to wait for the new model to support images.

If you would like to self-host LLMs in your cloud/cluster to test these out, check out some LLM deployment examples in SkyPilot for running AI on any cloud or Kubernetes.

Experimenting with ASCII Art#

The ASCII Art Test#

What’s the Expected Output?#

Open-Source Models#

Vision Llama 3.2#

Pixtral#

Qwen2-VL#

Proprietary Models#

Gemini 1.5 Pro#

GPT-4o#

Summary#

Experimenting with ASCII Art

The ASCII Art Test

What’s the Expected Output?

Open-Source Models

Vision Llama 3.2

Pixtral

Qwen2-VL

Proprietary Models

Gemini 1.5 Pro

GPT-4o

Summary