Pushing Electrons


2023 was a blowout year for Generative AI systems. We saw huge advances in a range of systems – not only LLM's like ChatGPT, but also text-to-image models like DALL-E and Midjourney, text-to-speech and speech-to-text models, and even text-to-video and the ability to create 3D images from 2D. The future is wild, y'all.

So, naturally, this leads to the question of “what's next?”

The following is my best guess, based on my knowledge of the technology and observation of the ecosystem. My hit rate last year was pretty darn good – hopefully these predictions hold up as well. I look forward to your thoughts on it.

It may seem a bit tardy to be predicting the year in GenAI...in April. I'm a bit behind on a lot of posts at the moment, but figured better late than never. Hopefully, this means that the predictions are a bit better than they would be otherwise – and also more embarrassing if I'm off the mark. As Yogi Berra supposedly said, “Prediction is hard – especially about the future.” :)

Better and More Efficient Models

First and foremost, we'll see more and better big commercial models, at prices at or below what we see today. The big players – OpenAI, Google, Anthropic, etc. – will continue to build bigger and more powerful models – more parameters, higher throughput, more functionality. We're already seeing this play out. Google's Gemini model recently premiered with a 1M token context window, allowing users to upload and query multiple documents simultaneously. Anthropic's most recent Claude update already performs at or above GPT-4's level at prices below OpenAI's, on the order of a few dollars per million tokens (a token is roughly 0.5-1 words). Rumors already abound about OpenAI's next model, GPT-5. And that's all before April. It's going to be a fun year.

We'll likely see many more systems in the open-source space as well. The easy version of this prediction is: smarter models, approaching or surpassing GPT-4, that can run locally and fit on a single consumer-grade GPU (meaning 24GB of VRAM or less). However, I believe we can go further. The availability and ease-of-use of llama.cpp as an alternative to other deep learning frameworks (like PyTorch or TensorFlow) has been a major source of innovation in the community, bringing the efficiency and portability of C code to the world of LLM's. This means running generative AI systems at usable speed on just a CPU, not a GPU. Justine Tunney has also been doing miraculous work in the world of system-level optimizations – the llamafile project she's engineering at Mozilla lets you download a single file and run an LLM on any common computer. It doesn't matter what OS you have, what brand of CPU or GPU you have, or what special languages or programs you have installed – it just runs, everywhere. Incredible work!
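To make that concrete, here's a minimal sketch of what local, CPU-only inference looks like using the llama-cpp-python bindings to llama.cpp. The model path and prompt are placeholders, and API details may have shifted since this was written, so treat it as illustrative rather than gospel.

    # Minimal local-inference sketch using the llama-cpp-python bindings
    # (pip install llama-cpp-python). The model path is a placeholder --
    # point it at any GGUF-format model you've downloaded.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/example-7b.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,    # context window to allocate
        n_threads=8,   # CPU threads; no GPU required
    )

    result = llm(
        "Q: Explain what a context window is, in one sentence. A:",
        max_tokens=64,
        stop=["Q:"],
    )
    print(result["choices"][0]["text"])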

There are also many other avenues of development.

A major criticism of GenAI tools in recent months has been the sheer quantity of resources supposedly required for these models. Energy, water, chips – there is no shortage that these systems cannot be blamed for, evidence be damned. It's gotten to the point where Sam Altman, the head of OpenAI, is actively exploring investment in fusion technology, suggesting that energy is the major bottleneck to AI development going forward.

Of course, as with other resource systems (energy grids, water supplies, etc.), the biggest dividends usually come not from breakthrough new technologies, but from improved efficiency. So, along with advancements in task performance, I also expect this year will see major advances in the speed and efficiency of GenAI systems, especially LLM's.

Unsurprisingly, there have already been a number of great developments in this direction. Some have explored shrinking models, including Microsoft's 1.58-bit LLM's paper, along with studies showing that a huge portion of model parameters can be zeroed out with minimal performance impact. Google released a paper recently showing that some words are easier to predict than others – meaning that significant compute can be saved by simply spending more effort on the hard words and less on the easy ones. These sorts of tricks accounted for major advances in basic computing efficiency at the chip level decades ago – I expect we'll see many more approaches before the year is out, potentially offering order-of-magnitude improvements in resource consumption in the near future.
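For a flavor of how simple some of these ideas are, here's a toy illustration of the "zero out parameters" trick – plain magnitude pruning on a random stand-in weight matrix. This is a generic sketch for intuition, not the method from any of the papers above.

    # Toy magnitude pruning: zero out the smallest weights and see how much
    # of the layer survives. A generic illustration, not any paper's method.
    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.normal(size=(4096, 4096))     # stand-in for one weight matrix

    sparsity = 0.5                               # drop the smallest 50% by magnitude
    threshold = np.quantile(np.abs(weights), sparsity)
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

    kept = np.count_nonzero(pruned) / pruned.size
    print(f"{kept:.0%} of parameters kept")      # ~50%, ideally with little quality loss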

Compositional LLM's

As people experimented with LLM's over 2023, a common theme kept arising – the more the models were tuned, the dumber they became. GPT-4 became famous for half-assing programming tasks (framing out the effort, then suggesting the user fill in the details), avoiding details in knowledge tasks, and so on. Conversely, using open-source models that had not yet been aligned (meaning “tuned for human preferences”) was eye-opening, both for the stark improvement in performance and for their readiness to create offensive content.

My theory is that we are simply asking too much of a single model. General purpose models have their place for general purpose tasks, in the same way that Wikipedia is a great starting place for learning about things, but rarely a final resource. Models trained on specific tasks, like programming or translation, tend to do much better at those tasks, but worse at others (like planning or question-answering).

Based on this, one promising direction for LLM development has been with Mixture-of-Expert models (or MoE for short) – rather than having a single super-smart expert on everything, designers blend together several smaller models, each expert in a specific area, to complete the user's task. These systems tend to require more memory overall, but surprisingly less than the sum of their parts would suggest, and their performance has been very promising so far.
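For intuition, here's a minimal sketch of the routing idea at the heart of MoE: a small gating network scores the experts for each input, and only the top few actually run, which is why the compute per token stays well below the sum of the parts. The shapes and expert definitions here are purely illustrative, not taken from any real model.

    # Minimal Mixture-of-Experts routing sketch: a gating network scores the
    # experts, and only the top-k actually run for a given input.
    # Shapes and experts are illustrative, not taken from any real model.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    n_experts, d_model, top_k = 8, 16, 2
    rng = np.random.default_rng(0)

    gate_w = rng.normal(size=(d_model, n_experts))               # router weights
    experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

    x = rng.normal(size=d_model)                                 # one token's hidden state
    scores = softmax(x @ gate_w)                                 # router's score per expert
    chosen = np.argsort(scores)[-top_k:]                         # keep only the top-k experts

    # Output is the score-weighted mix of just the chosen experts' outputs.
    y = sum(scores[i] * (x @ experts[i]) for i in chosen) / scores[chosen].sum()
    print("ran", top_k, "of", n_experts, "experts; output shape", y.shape)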

However, I don't think that's quite far enough. In addition to answering a broad diversity of questions, we're also challenging these systems to detect harmful prompts, produce aligned (i.e. human-respecting) or structured outputs, and be less sensitive to specific prompt wording. Along with using multiple LLM's in parallel (as in the MoE approach), I expect we'll also start to see them used in series as well.

Imagine a multi-step LLM system, where a single prompt passes through multiple smaller LLM's in order, each accomplishing a specific task to serve the user (a minimal sketch follows the list), such as:

  • Security filtering (block harmful prompts)
  • Prompt optimization (reduce sensitivity to specific prompt phrasing)
  • Expert routing
  • Expert solution
  • Output alignment (keep it from saying something offensive)
  • Structured formatting (output results as JSON, YAML, or other formats)
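Here's what that series could look like, with each stage standing in for a small, specialized model. The call_model() helper is a placeholder for whatever inference API you'd actually use, so treat this as an architectural outline rather than a working implementation.

    # Sketch of the "LLM's in series" idea. Each stage is a small specialized
    # model; call_model() is a placeholder for your inference backend.
    def call_model(name: str, prompt: str) -> str:
        raise NotImplementedError("plug in your inference backend here")

    def handle_request(user_prompt: str) -> str:
        if call_model("safety-filter", user_prompt) == "BLOCK":
            return "Request refused."
        cleaned = call_model("prompt-optimizer", user_prompt)   # normalize phrasing
        expert = call_model("router", cleaned)                  # pick an expert, e.g. "code"
        draft = call_model("expert-" + expert, cleaned)         # the specialist answers
        aligned = call_model("alignment-check", draft)          # tone/safety pass
        return call_model("formatter", aligned)                 # e.g. emit JSON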

There have been some developments in this regard, especially with respect to structured formatting (Anthropic is ahead of the game here, as are tools like Outlines), but this overall concept seems underexplored by the community. I expect to see more soon.

Agentic Systems

But let's take this a step further. One of the tantalizing dreams of LLM's has been the potential to extend them beyond chatbots, into fully agentic assistants. There are now a number of frameworks (most notably DSPy) that enable a single LLM to assume multiple personas and talk to other versions of itself. These are now being built into fully functional systems, such as OpenDevin, that can be given a goal and automatically complete the work required as a cohort of differently-prompted model instances, acting as a synthetic “team”.

The best thing about this approach is that you don't need powerful models to make it happen. Rather than using one super-smart LLM as a jack-of-all-trades, you can use a virtual group of more efficient AI models, where each member performs better within its own specialty.
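Stripped down, the "synthetic team" idea is just the same base model (or different ones) given different system prompts, with each persona's output feeding the next. Here's a toy sketch, where complete() is a placeholder for whatever chat-completion call you're using:

    # Toy "synthetic team": different system prompts over the same (or
    # different) base models, each persona feeding the next.
    # complete() is a placeholder for whatever chat-completion API is in use.
    def complete(system_prompt: str, user_message: str) -> str:
        raise NotImplementedError("plug in your model call here")

    PERSONAS = {
        "planner":  "You break goals into small, concrete steps.",
        "coder":    "You write code for exactly one step at a time.",
        "reviewer": "You point out bugs and missing cases in code you are shown.",
    }

    def run_team(goal: str) -> str:
        plan = complete(PERSONAS["planner"], goal)
        code = complete(PERSONAS["coder"], "Goal: " + goal + "\nPlan: " + plan)
        review = complete(PERSONAS["reviewer"], code)
        return complete(PERSONAS["coder"], "Revise this code.\n" + code + "\nReview: " + review)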

This opens up a whole range of use cases and implementation approaches.

  • For privacy, security, cost, or convenience reasons, you could run a model on your own local computer versus having to use pre-defined or pre-tuned models from elsewhere.
  • You are not constrained by using any one model. You could use one base model (but different prompts) for each agent, use different models for each agent, or some mix of the two. You're also not constrained by what's available today – as new models become available, simply switch them in as needed.
  • This also potentially opens up use cases far beyond what could be achieved with a single-prompted LLM. In today's world, you're essentially finding one super-smart person, then asking them to consider a task from multiple points of view. You're carrying around the previous history of the conversation through a single personality lens, even if exploring different facets of a problem. In an agentic structure, you're starting from that same super-smart core person but, through unique prompts, essentially creating independent experts that collaborate with each other. It enables a much broader range of exploration than a single LLM could handle, staging out the work in different phases and potentially using different LLM's for different parts of the work. It's a whole new ball game, and we're still in the first inning.

Outside of LLM's

So far I've written primarily about LLM-based GenAI, and haven't touched on other systems yet. This is in part because I know less about other systems, and I don't have as clear a vision for them as for LLM's. However, based on previous trends, I think we can safely expect to see them continue to improve along multiple lines:

  • Fidelity: higher accuracy and detail in their outputs, as well as larger outputs (longer songs, bigger pictures, etc.)
  • Steering: staying closer to user expectations, and more consistent across a series of generations. (Think of generating a series of marketing pictures using a single consistent character, or an album of songs with consistent instrumentation and vocalists.)
  • Cost: prices will hold steady or drop over time, where these tools aren't already free.

The one direction I'm not prepared to predict is the availability of open-source models. StabilityAI had been a major leader in this area across a range of models, but has recently fallen on difficult times. It's not clear that the OSS community has the same level of skill with non-text-to-text models as it does with LLM's, so I wouldn't be surprised to see open-source development of locally-deployable non-text-to-text models slow down.

The one wildcard here is multimodal models, which can operate with a wide range of input (and potentially output) modalities, such as text, images, video, and so on. I don't think these will fully supplant targeted models focusing on specific modalities, but they could be a great addition to the assortment.

A long-standing challenge with Generative AI is its troubled origin story – most major GenAI projects started with a scrape of the Internet, without asking for permission first. For publicly-owned or -committed data, like Wikipedia, that's not so much of an issue. However, for other data – forum posts, artistic output, pirated media, etc. – where no explicit permission was granted for reuse, it's been a grey area, a sticking point, and a source of both enmity and lawsuits that has cast a shadow over the whole undertaking.

It's clear that some of these discussions will reach a point of resolution this year. The New York Times lawsuit against OpenAI is clearly a gambit for a licensing arrangement, and both parties are motivated to resolve it quickly. The EU is moving forward with legislation to regulate AI model creation, including improved documentation of training sources and standards for release. US Representative Adam Schiff recently proposed legislation that would require disclosure of training data sources as well. This is something that AI researchers and companies should have been doing all along – now, it looks like the law is coming to force their hand.

What intrigues me is how this may impact GenAI perception and adoption going forward. Much of the reaction to GenAI on social media has clung to this particular issue as a reason to scuttle the whole enterprise. But as these tools become ever more present, easier to use, and proven in both commercial and personal terms, will people come to accept that these systems are genuinely beneficial once the whole process is consensual, or will the original sin tarnish them for a generation?

As with all other aspects of this ecosystem – only time will tell. :)

#llm #llms #GenerativeAI #GenAI #ChatGPT #OpenAI #Anthropic

Written by Dulany Weaver. Copyright 2022-2024. All rights reserved.

“The future is already here – it's just not evenly distributed.” -William Gibson

It's been six months since OpenAI's ChatGPT system exploded onto the scene, enchanting the world with its incredible fluidity, range of knowledge, and potential for insight. The reactions have been predictably wide-ranging, from those sneering at the technology as a fancy parlor trick to those seeing it as the future of personal automation. I'm on the record as bullish on large language model (LLM) technology. However, given the frenetic pace of AI innovation nowadays and the proven profit motive for bringing it to business, I've been surprised at how slow this transition has been so far.

Let me explain.

A Road Taken

As an example of what could happen, consider text-to-image systems. You describe an image in words, and the AI generates pictures based on your description.

Throughout the 2010's, researchers hacked away at the problem, slowly growing the size of the pictures, the fidelity to the prompt, the array of styles available. However, examples never went much beyond poor, small images.

Then, in the course of 2 years, models were released by OpenAI, then Midjourney, then StabilityAI, that created high-quality pictures of a size that could actually be used commercially. Prompts went from a long, complicated paragraph to a simple sentence. Most recently, new tools allow you to take these techniques even further, making videos, using reference pictures, or guiding specific elements of the image in certain ways. These systems are now standard plug-ins for Canva, Bing, Adobe products, and others. Once the underlying techniques were widely available, innovation exploded and business applications followed.

A Road To Be Explored

In the world of LLM's, there has been a low, steady drumbeat of progress. The standard architecture used today – the Transformer – was published by Google in 2017. GitHub Copilot – a coding-focused LLM – became available as a subscription in June 2022. OpenAI released ChatGPT in November 2022, Meta released their open-source (but “research-only”) LLAMA model in February 2023, and OpenAI released GPT-4 (an even more advanced model) in March 2023. StabilityAI released a fully open-source, commercially-licensed model in April 2023, but it has a lot of room for improvement. There are a few other models available from Google, AnthropicAI, EleutherAI, etc., but they're relatively minor players in this field with decent, not great, models available via the Web.

Meanwhile, the hacker community has experimented extensively with different ways of using these tools, but no killer apps have evolved yet. And despite their enormous potential, LLM's have barely made a dent in the business world outside of coding and some basic writing applications.

There are a few reasons for this.

1) Most good models are proprietary, served by 3rd parties, and can get expensive fast. Although they say they don't keep your data, software companies have historically been less-than-honest here. You want to be very careful about putting confidential information into the systems – which significantly limits their utility to business.

2) The one good open-source model (LLAMA) has a very limited license, meaning that it's only useful for experimentation today – not commercial use.

3) Even the experiments based on LLAMA use bespoke frameworks and only run well on very specific hardware or operating systems (cough Mac cough) that are expensive or not widely-used. Trying to port them outside of these constraints has so far yielded poor results. (Trust me – I've tried!)

(It should be noted that there are still some technical limitations, too. Training these models is not cheap (six to seven figures), so it's likely only to be undertaken by an organization with some profit motive. Some elements of the model – like the amount of text it can consider in one step – also have notable limits today, although techniques for overcoming these are being developed quickly.)

A Roadmap For The Future

Despite the lack of killer apps, people have been very clever at exploring a range of use cases for these systems. (Big shout out to Ethan Mollick at OneUsefulThing, who has started re-imagining both business and education through the use of LLM's.) Overall, they seem to be settling into three main application types:

1) Chatbots for interaction or text generation. Imagine using your LLM as a personalized tutor, a creative partner, or just a rubber ducky. Likewise, if you need to generate code, documentation, or rote marketing material, an LLM can take you quite far. The technology for this basically exists today, but the main problem to solve is democratization – enabling people to own their conversations and data by running the LLM on their local computer.

2) Data Oracles: LLM's trained on (or with access to) a wide variety of documents which can then answer questions or summarize material. Imagine a law office using an LLM for discovery, or a scientist loading an LLM with relevant papers on a subject and exploring the known and unknown. Along with privacy, this use case has a technical hurdle arising from how much data the LLM can keep “front of mind” – but there are multiple solutions being actively explored (a minimal retrieval sketch follows this list).

3) Personal Assistants: agents with access to the Internet who can do work on our behalf, developing plans and executing them autonomously (or with minor oversight). Imagine JARVIS from Iron Man, who can be your travel agent, personal secretary, and project manager all in one. Today, the barrier to this mode is both privacy and cost. Your personal secretary needs all of your passwords, plus your credit card number, and, today, every action they take (big or small) costs $0.06. How far would you trust an automated system with this, and what would you let them do?
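On the Data Oracle hurdle: the usual workaround for the “front of mind” limit is to retrieve only the passages relevant to a question and feed just those into the prompt. Here's a deliberately naive sketch of that pattern – keyword-overlap scoring instead of a real search index, with ask_llm() as a placeholder for the model call.

    # Toy Data Oracle: retrieve the most relevant passages for a question,
    # then ask the LLM to answer using only those. The keyword scoring is
    # deliberately naive; ask_llm() is a placeholder for the model call.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your model call here")

    def retrieve(question: str, passages: list[str], k: int = 3) -> list[str]:
        words = set(question.lower().split())
        scored = sorted(passages,
                        key=lambda p: len(words & set(p.lower().split())),
                        reverse=True)
        return scored[:k]

    def answer(question: str, passages: list[str]) -> str:
        context = "\n\n".join(retrieve(question, passages))
        return ask_llm("Using only this context:\n" + context +
                       "\n\nQuestion: " + question)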

If these tools could be realized, the possibilities for their personal and commercial use are enormous. But, for businesses to adopt this technology, it must be private, affordable, and high-performing.

Based on this, what should the target for model builders be? Here are my thoughts:

  • Can be run locally on either Windows desktop computers or Windows/Linux servers. Windows has 75% of market share for desktop and laptop computers, and 20% share for servers. (Linux has 80% of servers.) If businesses are to use it, it must be Windows-compatible.
  • If it needs a GPU, it can use mid-to-high-end consumer GPU's (models that fit in 12GB-24GB of VRAM). 12GB GPU's are $400-500 today, and 24GB GPU's start at $1200. A big company could run a server farm, but a small company would likely aim for $3-5k in overall system cost, depending on its performance needs. That also puts it in the range of pro-sumer users.
  • Can be accessed from outside programs (“server”-type construction vs “chat”-type construction). Chatbot architectures are great for today's common use cases, but Data Oracles and Personal Assistants will need to interface with outside systems to be useful. A chat interface just doesn't work for that. (A minimal client sketch follows this list.)
  • Can execute “fast enough” to meet user needs. Mac users are seeing full answers from LLAMA-based models in about 30 seconds, or roughly 10 words/sec. This (or perhaps down to 5 words/sec) seems to be the limit of utility for these systems – anything slower might as well be done another way. And that would be per user – if a central LLM server is being used for a company of 10 users, it should generate results at a minimum of 50 words/sec.
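To make the “server-type” point concrete: several local runners (llama.cpp's built-in server among them) expose an HTTP endpoint that mimics OpenAI's chat-completions API, so any program can call the model directly. The URL and model name below are assumptions for illustration, not any product's guaranteed defaults.

    # Sketch of "server-type" access: a program calling a locally hosted LLM
    # over HTTP instead of through a chat window. Assumes an OpenAI-compatible
    # endpoint; the URL and model name are placeholders.
    import json
    import urllib.request

    def local_complete(prompt: str,
                       url: str = "http://localhost:8080/v1/chat/completions") -> str:
        payload = {
            "model": "local-model",    # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        }
        req = urllib.request.Request(url,
                                     data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            body = json.load(resp)
        return body["choices"][0]["message"]["content"]

    # Example use: print(local_complete("Summarize this contract clause: ..."))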

LLM's have huge potential to transform the way we work with technology and each other. If we can cross the threshold of both easy deployment and easy use – the results will be incredible.

Tags: #ChatGPT #LLM #AI #ArtificialIntelligence

Written by Dulany Weaver. Copyright 2022-2024. All rights reserved.

Following the enthusiastic reception of OpenAI's public ChatGPT release last year, there is now a gold rush by Big Tech to capitalize on its capabilities. Microsoft recently announced the integration of ChatGPT with its Bing search engine. Google – which has published numerous papers on this tech previously, but never made the systems public – announced the introduction of their Bard system into Google Search. And, naturally, there's a host of start-ups building their own versions of ChatGPT, or leveraging integration with OpenAI's version to power a variety of activities.

What's fascinating about this explosion of applications is that the underlying tech – LLM's, or large language models – is conceptually simple. These systems take a block of input text (broken into chunks called “tokens”), run it through a neural network, and output a block of text that (according to the system) “best matches” what should come next. It's a language pattern-matcher. That's it.
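The whole loop, stripped to its essence, is “predict the next token, append it, repeat” – something like the sketch below, where predict_next_token() stands in for the trained network and tokenization details are waved away.

    # The "language pattern-matcher" loop in miniature: predict one token,
    # append it, repeat. predict_next_token() stands in for the trained
    # neural network; tokenization details are omitted.
    def predict_next_token(tokens: list[str]) -> str:
        raise NotImplementedError("this is where the trained model goes")

    def generate(prompt_tokens: list[str], max_new_tokens: int = 100) -> list[str]:
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            nxt = predict_next_token(tokens)   # the "best match" for what comes next
            if nxt == "<end>":
                break
            tokens.append(nxt)
        return tokens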

And yet, its capabilities are surprisingly powerful. It can compose poems in a variety of styles over a range of subjects. It can summarize. It can assume personas. It can write computer code and (bad) jokes. It can offer advice. And the responses are tight, well-composed English. It's like chatting with another person.

Except when it's not. As many have noted, it often returns incorrect, if confident, answers. It makes up data and citations. Its code is often buggy or just flat-out wrong. It has a gift for creating responses that sound correct, regardless of the actual truth.

It's language without intelligence.

Let's sit with that for a minute. Engineers have created a machine that can manipulate language with a gift rivalling great poets, and yet often fails simple math problems.

The implications of this are fascinating.

First, from a cognitive science perspective, it suggests that language skill and intelligence – definitely in a machine, possibly in humans, maybe as a general rule – are two completely separate things. Someone compared ChatGPT to “a confident white man” – which a) oof and b) may be more accurate than they realized. In an environment where performance is measured by verbal fluidity or writing skill, but not actual knowledge, ChatGPT would absolutely excel. There are many jobs in the world that fit this description (and unsurprisingly, they seem to be dominated by white men!) For these sorts of activities, an agent – human or machine – doesn't have to be good at any particular thing except for convincing others it is smart through verbal acuity and vague allusions to data, either actual or imagined. (Give it an opinion column in the New York Times!)

Second, technologically, it immediately suggests both the utility and the limits of the system. Need to write an email, an essay, a poem – any product that primarily requires high language skill? ChatGPT and its successors can now do that with ease. If the ultimate outcome of the activity is influencing a human's opinion (a teacher, a client, a loved one), you're all set. However, if you require a result that is actually right and factual, it requires human intervention. ChatGPT has the human gift for reverse-engineering justifications for its actions, no matter how outlandish, and so there's no circumstance where you should trust it, on its own, to do or say the right thing. A person's judgment is still required.

You might ask “how useful is its output if you still have to revise it?” To which you might also ask “what value is a writer to an editor?” You don't hammer with a chainsaw – not every tool needs to fit every purpose. But, if you need to quickly generate readable text with a certain style about a certain subject, it offers a great starting point with minimal labor. For knowledge workers, that offers an incredible potential for time savings.

Finally, these systems do suggest a path toward artificial general intelligence. These models essentially solve the “challenge” of language, but lack both 1) real, truthful information, as well as 2) the ability to sort and assemble that information into knowledge. The first of those is easily answered – hook it up to the Internet, books, your email account, or any other source of meaningful reference data. Part of ChatGPT's limitations comes from the fact that it is deliberately not connected to the Internet, both constraining it and (at this stage) enhancing its safety.

And, as for the ability to manipulate knowledge – that is underway, with some working proofs-of-concept already developed. If engineers can develop a reasoning system to complement LLM's – enabling them to decompose questions into a connected set of simpler knowledge searches, and perhaps with the tools to integrate that data in various ways – these systems have the potential to facilitate a wide range of knowledge-based activities.

(In fact, some of the earliest AI systems were reasoning machines of exactly this genre, but based on discrete symbols instead of language. LLM's offer the potential to advance these systems by interpreting language-based information that's less clear-cut than mathematical symbols.)

Along with the technical aspects, we must also ask: what does this mean for society? From a business perspective, likely the same as what happens with all automation – the worst gets automated, the best gets accelerated, and humanity's relationship with production changes. Writers of low-quality or formulaic content may be out of a job. Better writers will no longer have to start from a blank page. The best writing will still be manual, bespoke, and rare. The tone of writing across all media will be homogenized, with the quality floor set to “confident white man” (potentially offering benefits toward diversity and inclusion). The quality of all professional communications will improve as LLM's are integrated into Word, PowerPoint, Outlook, and similar communication software. Knowledge management (think wiki's, CRM's, project management tools) becomes much faster and easier as it becomes more automated. Software comments will be automatically generated, letting programmers focus on system development. Sales becomes more effective as follow-ups become automated and messages are tailored to the customer. And that's just the beginning.

From a social standpoint, the outlook is more complex. Personalizing content becomes dramatically easier – one could imagine a system where the author just releases prompts for interaction, and an LLM interprets it uniquely for each reader in the way the reader finds most engaging. Video games, especially narrative video games, become deeper and richer. Social media may have more posts but be less interesting. Misinformation production becomes accelerated, and likely becomes more effective as the feedback cycle also accelerates. These new systems magnify many of society's existing challenges, while also opening up exciting new modes of interaction.

This has been a long time coming in the artificial intelligence community. After years of limited results, the availability of Big Computing has enabled revolutions in image processing, art creation – and now, language-based tasks. These are exciting times, with many more developments assuredly coming soon.

Tags: #ChatGPT #LLM #AI #ArtificialIntelligence

Written by Dulany Weaver. Copyright 2022-2024. All rights reserved.