In today’s column, I examine a rather ingenious idea that cleverly turns the usual approach of generative AI and large language models (LLMs) on its head. Simply stated, consider the bold notion that instead of generative AI receiving pure text, the text is first captured as images, and the images are then fed into the AI.
Say what?
For anyone versed in the technical underpinnings of LLMs, this seems entirely oddball and counterintuitive. You might already be shouting aloud that it makes no sense. Here’s why. An LLM is designed to handle natural languages such as English and, therefore, makes abundant use of text. Text is how we normally enter prompts and pose our questions to LLMs. Opting to use pictures of text, in place of actual text, has got to be a screwball notion. Blasphemous.
Hold onto your hat, because some earnest researchers tried the approach, and there is enough merit that we ought to give this flight of fancy a dose of seriously devoted, thorough attention.
Let’s talk about it.
This analysis of an AI advancement is part of my ongoing Forbes column coverage of the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
Tokenization Is Vital
The heart of the matter involves the tokenization underpinnings of modern-era generative AI and LLMs. I have covered the details of tokenization at the link here. Here is a quick overview to get you up to speed.
When you enter text into AI, the text gets converted into a series of numbers. Those numbers are then used throughout the rest of the processing of your prompt. Once the AI has arrived at an answer, that answer is actually in a numeric format and needs to be converted back into text so that it is readable by the user. The AI converts the numbers into text and displays the response accordingly.
That whole process is known as tokenization. The text that you enter is encoded into a set of numbers. The numbers are referred to as tokens. The numbers, or shall we say tokens, flow through the AI and are used to figure out answers to your questions. The response is initially in the numeric format of tokens and needs to be decoded back into text.
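To make that concrete, here is a minimal sketch of the encode-decode round trip using OpenAI’s open-source tiktoken tokenizer; the particular encoding name is simply one widely used choice and is shown only for illustration.

```python
# Minimal sketch of the tokenization round trip using the open-source
# tiktoken library (pip install tiktoken).
import tiktoken

# "cl100k_base" is one commonly used encoding; different models use different ones.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "The quick brown fox jumps over the lazy dog."

# Encode: the text becomes a list of integer token IDs.
tokens = enc.encode(prompt)
print(tokens)        # a list of integers
print(len(tokens))   # how many tokens the prompt consumes

# Decode: the integers are converted back into readable text.
print(enc.decode(tokens))  # "The quick brown fox jumps over the lazy dog."
```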
Thankfully, an everyday user is blissfully unaware of the tokenization process. There is no need for them to know about it. The topic is of keen interest to AI developers, but of little interest to the public at large. All sorts of numerical trickery is routinely employed to try to make the tokenization process as fast as possible so that the AI isn’t held up by the encoding and decoding that needs to take place.
Tokens Are A Problem
I mentioned that the public generally doesn’t know about the tokenization aspects of LLMs. That’s not always the case. Anyone who has pushed AI to its limits is likely vaguely familiar with tokens and tokenization.
The deal is this.
Most of the major LLMs, such as OpenAI’s ChatGPT and GPT-5, Anthropic Claude, Meta Llama, Google Gemini, xAI Grok, and others, are somewhat constrained in the number of tokens they can properly handle at one time. When ChatGPT first burst onto the scene, the number of allowed tokens in a single conversation was quite limited.
You would rudely discover this reality when ChatGPT suddenly could no longer recall the earlier parts of your conversation. This was due to the AI hitting the wall on how many active tokens could exist at one time. The tokens from earlier in your conversation were summarily being tossed aside.
If you were carrying on any lengthy and complex conversations, these limits were exasperating and pretty much knocked out of contention any big-time use of generative AI. You were confined to relatively short conversations. The same problem arose when you imported text via a method such as RAG (see my discussion at the link here). The text had to be tokenized and once again counted against the cap on how many active tokens the AI could handle.
It was maddening to those who had dreams of using generative AI for larger-scale problem-solving.
Limits Are Higher But Still Exist
The early versions of ChatGPT had a limit of less than 10,000 tokens that could be active at any one time. If you think of a token as roughly representing a small word, such as “the” or “dog”, this means you hit the wall once your conversation had consumed approximately ten thousand everyday words. This was intolerable at the time for any extended or complex usage.
Nowadays, the standard version of GPT-5 has a token context window of about 400,000 tokens. That is considered the total capacity associated with both the input tokens and the output tokens as a combined whole. Context window sizes can vary. For example, Claude has a limit of about 200,000 tokens on some of its models, while others stretch further to around 500,000 tokens.
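For illustration, here is a small sketch of how one might count the tokens in a conversation and compare them against an assumed context window; the 400,000 figure is simply the GPT-5 number cited above, used as an example budget.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Assumed combined input-plus-output budget (the ~400,000-token figure cited above).
CONTEXT_WINDOW = 400_000

def tokens_used(messages: list[str]) -> int:
    """Rough count of tokens consumed by a list of conversation turns."""
    return sum(len(enc.encode(m)) for m in messages)

conversation = [
    "Please summarize this lengthy contract for me...",
    "Sure, here is a summary of the key clauses...",
]

used = tokens_used(conversation)
print(f"{used} of {CONTEXT_WINDOW} tokens used")
if used > CONTEXT_WINDOW:
    print("Earlier turns would start getting dropped from the AI's working memory.")
```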
A visionary view of the future is that there won’t be any limits on the allowed number of tokens. There is advanced work underway on so-called infinite or unlimited memory in AI that would essentially permit any number of tokens. Of course, in a practical sense, there is only so much server memory available; thus, it isn’t truly infinite, but the claim is catchy and reasonably fair. For my explanation of how AI infinite memory works, see the link here.
Managing The Token Issue
Because tokenization is at the core of how most LLMs are built and used, a great deal of effort has been strenuously devoted to improving the tokenization machinery. The aim is to somehow make tokens smaller, if possible, allowing more tokens to fit within whatever memory constraints the system has.
AI developers have repeatedly sought to compress tokens. Doing so could be a huge help. Whereas a token window might typically be limited to 200,000 tokens, if you could shrink each token down to half its usual size, you could double the limit to 400,000 tokens. Nice.
There is a vexing catch associated with the compression of tokens. Generally, yes, you can squeeze tokens down in size, but precision gets undercut when you do so. That’s bad. It might not be overly bad, in the sense that the tokens are still viable and usable. It all depends on how much precision gets sacrificed.
Ideally, you would want the maximum possible compression while retaining 100% of the precision. That’s a lofty goal. The odds are that you will need to weigh compression levels against precision. Like most things in life, there is never a free lunch.
Knock Your Socks Off
Suppose we allowed ourselves to think outside the box.
The usual approach with LLMs is to accept pure text, encode the text into tokens, and proceed on our merry way. We would ordinarily begin our thinking about tokenization by logically and naturally assuming that the input from the user will be pure text. They enter text via their keyboard, and text is what gets converted into tokens. It’s a straightforward approach.
Ponder what else we might do.
Seemingly out of left field, suppose we treated text as images.
You already know that you can take a picture of text, have it optically scanned, and either keep it as an image or later convert it into text. The process is a longstanding practice known as OCR (optical character recognition). OCR has been around since the early days of computing.
The typical OCR process involves converting images into text and is referred to as image-to-text. Sometimes you might want to do the reverse, namely, you have text and want to turn the text into images, which is text-to-image processing. There are lots and lots of existing software applications that will happily do image-to-text and text-to-image. It’s old hat.
Here’s the crazy idea about LLMs and tokenization.
We still have people enter text, but we take that text and convert it to an image (i.e., text-to-image). Next, the image of the text is used by the token encoder. Thus, rather than encoding pure text, the encoder is encoding based on pictures of text. When the AI is ready to give a response to the user, the tokens are converted back into text, making use of image-to-text conversion.
Boom, drop the mic.
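As a rough sketch of the front half of that pipeline, the snippet below renders a prompt onto a page image using the Pillow library; the vision encoder in the comments is a hypothetical stand-in for whatever model (such as the DeepSeek-OCR encoder discussed below) would turn that page image into a compact set of vision tokens.

```python
# Rough sketch of the "text as image" front end. Pillow (pip install pillow)
# renders the prompt onto a page image; the vision encoder is hypothetical.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, height: int = 1024) -> Image.Image:
    """Text-to-image: draw the prompt onto a blank white page."""
    wrapped = "\n".join(textwrap.wrap(text, width=80))
    page = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(page).multiline_text(
        (20, 20), wrapped, fill="black", font=ImageFont.load_default()
    )
    return page

prompt = "A lengthy document that would normally consume thousands of text tokens..."
page = render_text_to_image(prompt)

# Hypothetical second half of the pipeline (not a real API):
# vision_tokens = vision_encoder(page)          # far fewer tokens than text tokens
# answer_tokens = llm.generate(vision_tokens)   # the LLM reasons over vision tokens
# answer_text = decode(answer_tokens)           # decoded back into readable text
```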
Understanding The Surprise
Whoa, you might be saying, what good does this fiddling with images achieve?
If converting images into tokens can get us to fewer tokens for the same content, we are effectively compressing the tokens. This, in turn, means we can potentially fit more within the bounds of limited memory. Remember that the compression of tokens is squarely on our minds.
In a recently posted research paper entitled “DeepSeek-OCR: Contexts Optical Compression” by Haoran Wei, Yaofeng Sun, and Yukun Li, arXiv, October 21, 2025, the researchers made these claims (excerpts):
- “A single image containing document text can represent rich information using significantly fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens may achieve much higher compression ratios.”
- “This insight motivates us to rethink vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs’ efficiency in processing textual information rather than traditional VQA, which humans excel at.”
- “OCR tasks, as an intermediate modality bridging vision and language, offer an ideal testbed for this vision-text compression paradigm, as they establish a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.”
- “Our method achieves 96%+ OCR decoding precision at 9-10x text compression, ∼90% at 10-12x compression, and ∼60% at 20x compression on the Fox benchmark featuring diverse document styles (with real accuracy being even higher when accounting for formatting differences between output and ground truth).”
As noted above, the experimental work seemed to suggest that a 10x compression ratio could at times be achieved with 96% precision. If that could be done across the board, it would mean that, whereas a token window limit today might be 400,000 tokens, the limit could effectively be raised to 4,000,000 tokens, albeit at a 96% precision rate.
The precision at 96% might be bearable or unbearable, depending on what the AI is being used for. You can’t get a free lunch, at least so far. A compression rate of 20x would be even better, though the precision of 60% would seem rather unappealing. Still, there might be circumstances in which one could begrudgingly accept the 60% in exchange for the 20x boost.
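Here is a back-of-the-envelope sketch of that tradeoff, plugging the compression ratios and precision figures reported in the paper into the 400,000-token window mentioned earlier:

```python
# Back-of-the-envelope math using figures reported in the DeepSeek-OCR paper.
BASE_WINDOW = 400_000  # assumed text-token context window, per the GPT-5 figure above

scenarios = [
    ("10x compression", 10, 0.96),  # ~96% decoding precision
    ("20x compression", 20, 0.60),  # ~60% decoding precision
]

for name, ratio, precision in scenarios:
    effective = BASE_WINDOW * ratio
    print(f"{name}: ~{effective:,} effective tokens at ~{precision:.0%} precision")

# 10x compression: ~4,000,000 effective tokens at ~96% precision
# 20x compression: ~8,000,000 effective tokens at ~60% precision
```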
Famed AI luminary Andrej Karpathy posted his initial thoughts online about this overall approach: “I quite like the new DeepSeek-OCR paper. It’s a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn’t matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input. Maybe it makes more sense that all inputs to LLMs should only ever be images.” (source: Twitter/X, October 20, 2025).
Brainstorming The Possibilities
The researchers also tried using a wide variety of natural languages. This is yet another virtue of using images rather than pure text. As you know, there are natural languages that use pictorial characters and words. Those languages would seem especially well-suited to an image-based approach to tokenization.
Yet another intriguing aspect is that we already have VLMs, namely AI that deals with visual images rather than text per se (i.e., vision-language models). We don’t have to reinvent the wheel when it comes to doing likewise with LLMs. Simply borrow what has worked with VLMs and adapt it for use in LLMs. That’s using the whole noggin and leveraging reuse when possible.
The idea deserves acknowledgment and additional digging. I wouldn’t suggest going around and immediately declaring that all LLMs need to switch to this kind of approach. The jury is still out. We need more research to see how far this goes, along with understanding both the upsides and the downsides.
Meanwhile, I believe we can at least make this bold statement: “Sometimes, an image really is worth a thousand words.”
Source: Forbes.