Sam Bleckley | Writing

Build a Laboratory

2024-10-07T00:00:00-04:00

A product is something someone will pay for.
A viable product is one you can sell for more than it costs to produce.
A minimum viable product is the easiest thing you can make with those properties, where “ease” depends on your specific circumstances: the time, money, resources, and skills you have available to you.

In those terms, it’s inescapable that an MVP is the best thing for your business to build. It’s nearly tautological; a business can’t exist without a viable product, and the least risky way to get there would be the one with the least cost and complexity.

Nevertheless, in the past year, in meetups, conference talks, conversations with clients, and online, I’ve heard a fair amount of pushback against the MVP concept, with folks often trying to replace just one letter — recommending a “minimum viable test”, or a “minimum useful product”. And there are problems: before you build it, you don’t know what people will pay! You don’t know how much it will cost! You don’t know what features it needs and which can be eschewed! Building an entire business is an expensive way to learn that your idea isn’t viable.

But when addressed on its own terms, the idea of an MVP feels unassailable — because it’s tautological, it’s hard to argue against.

In order to address that tension and frustration, I think it might be easier to approach the idea side-on, by translating the concept of the MVP into a slightly different language, where there’s more room to examine it. I have found a lot of value in an experimental mindset: make a hypothesis, find a way to test it, adjust, repeat — so I instinctively turn to that framework when I need a way to think about actions and motivations.

Let me invent some vocabulary to fit that perspective:

A market hypothesis is a guess that a sufficiency of customers will pay some amount for some good or service. (This should not be confused with the broader “efficient market hypothesis”).

An operations hypothesis is a guess that you can produce a product at a given cost (in money, in labor, in time, etc).

A complete business hypothesis is a matching pair of a market hypothesis and an operations hypothesis.

In this language, an MVP is just the simplest experiment that demonstrates the validity of a complete business hypothesis.
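If it helps to see the vocabulary as a data structure, here is a minimal sketch. Every class name, field, and number in it is hypothetical and invented for illustration; it is one way to write the definitions down, not an established framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketHypothesis:
    """A guess that enough customers will pay some amount."""
    customers: int      # guessed number of paying customers per period
    price: float        # guessed price each one will pay


@dataclass(frozen=True)
class OperationsHypothesis:
    """A guess that the product can be produced at a given cost."""
    fixed_cost: float   # guessed cost to build and run, per period
    unit_cost: float    # guessed cost to serve one customer


@dataclass(frozen=True)
class BusinessHypothesis:
    """A matching pair of a market and an operations hypothesis."""
    market: MarketHypothesis
    operations: OperationsHypothesis

    def predicted_profit(self) -> float:
        revenue = self.market.customers * self.market.price
        cost = (self.operations.fixed_cost
                + self.market.customers * self.operations.unit_cost)
        return revenue - cost

    def is_viable(self) -> bool:
        # "Viable" as defined earlier: it sells for more than it costs.
        return self.predicted_profit() > 0


# All numbers below are made up.
guess = BusinessHypothesis(
    MarketHypothesis(customers=200, price=30.0),
    OperationsHypothesis(fixed_cost=4000.0, unit_cost=5.0),
)
print(guess.predicted_profit(), guess.is_viable())
```

Every field is itself a guess, which is the point: each one is a sub-hypothesis that could be tested separately and more cheaply than the whole.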

Importantly, that experiment may still be too complex, too expensive, or too risky to design or implement straight away. And that’s OK — by framing it this way, we already know that our big guess is made up of smaller guesses; every complex hypothesis is made up of lesser assumptions.

In the original language of the MVP, an MVP is atomic — it cannot be broken into smaller bits without losing its fundamental properties, because it’s already, by name and definition, “minimal”.

But in the experimental mindset, we can envision sub-hypotheses — the assumptions from which our larger guesses are built — and we can design and run experiments to test those sub-hypotheses. A proof of your complete business hypothesis is not the least thing you can do; it’s simply the least you can do to be home free. Those smaller experiments along the way aren’t viable products, or even necessarily products at all — if what you’re testing is an underlying assumption in your operations hypothesis, you may not even need to put it in front of users.

The end goal is still always to develop and prove a complete business hypothesis; it’s still just as tautologically true that you’re not a successful business until you have a viable product. Because we’re testing a hypothesis, a disproof is as genuine an outcome as a proof.

User interviews, user tests, market tests, prototypes — all the tools we’re already familiar with can be cast in this experimental light, as ways of testing sub-hypotheses. If an interview or a mockup can cheaply disprove one small assumption on which you based your larger business hypotheses, then you’ve cheaply disproved the whole thing; you know you need to make adjustments without having spent the cost of an entire MVP. (This isn’t a new conclusion, only a new lens: translated back into startup-speak, small, low-risk disproofs are covered by the mantra “fail fast”).

And that is part of the value of this way of thinking: it’s clear that our hypothesis, like most guesses, is probably wrong; rather than trying to build the easiest thing that is a successful business — rather than “build an MVP” — we want to discover a successful business in the easiest way — we want to develop a correct business hypothesis as simply as possible.

Once we acknowledge that our first experiment won’t be the last, the sensible thing to do is to plan for many experiments. A smart inventor doesn’t build a new workshop for every prototype; a smart scientist doesn’t buy all new equipment for every experiment.

So don’t build an MVP; build a laboratory.

Don’t build a product; build yourself tools that will allow you to keep testing and improving hypotheses until you find ones that work.

A lot of my clients balk at the idea of building a componentized design system; but if a slick brand or elegant UI are part of your market hypothesis, then a design system is a UI laboratory. Custom-built one-off UIs are useful only for a single experiment; a design system turns your product into an optical table, where you can stage a thousand experiments using the same building blocks.

A lot of my clients, especially those with small teams, don’t want to spend time and energy on robust feature flagging; it doesn’t feel like progress the same way that actually building features does. But being able to add, remove, and recombine entire features turns the product into an inventor’s workbench. It seems too expensive for an MVP, but it’s an obvious step for cheaply running a series of experiments.
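To make the workbench idea concrete, here is a minimal sketch of feature flagging. The `FeatureFlags` class, the flag names, and the checkout flow are all hypothetical; a real system would more likely use a dedicated flag service, but the shape is similar.

```python
class FeatureFlags:
    """A tiny in-memory flag registry (hypothetical, not a real library)."""

    def __init__(self, defaults: dict[str, bool]):
        self._flags = dict(defaults)

    def enable(self, name: str) -> None:
        self._flags[name] = True

    def disable(self, name: str) -> None:
        self._flags[name] = False

    def is_enabled(self, name: str) -> bool:
        # Unknown flags count as off, so a removed feature fails safe.
        return self._flags.get(name, False)


def checkout_steps(flags: FeatureFlags) -> list[str]:
    """Assemble a checkout flow from whichever features are switched on."""
    steps = ["cart"]
    if flags.is_enabled("gift_wrap"):
        steps.append("gift_wrap")
    if flags.is_enabled("one_click_pay"):
        steps.append("one_click_pay")
    else:
        steps.extend(["address", "payment"])
    steps.append("confirm")
    return steps


flags = FeatureFlags({"gift_wrap": False, "one_click_pay": False})
print(checkout_steps(flags))

# Running a new experiment means flipping a flag, not rebuilding the flow.
flags.enable("one_click_pay")
print(checkout_steps(flags))
```

Each combination of flags is a different experiment staged from the same building blocks.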

(Note that I’m explicitly not talking about A/B testing when I am talking about “experiments” — that is a kind of experiment, but it is more useful for refining and improving an already successful business hypothesis. A/B testing can extract a weak signal from a noisy channel, which is great when you’ve got a real business already, but tight margins and high volume. The gap between a disproven complete business hypothesis and a proven one, on the other hand, is usually visible to the naked eye. So when I say “build a laboratory” I don’t mean “construct a set of rigorous statistical tools like giant enterprises use”; I mean “provide yourself with building blocks that are easy to assemble in lots of ways, and a platform for exposing the results to users where you can watch what happens”. This is part of the difference between a “scientific” mindset and an “experimental” one — I am advocating for the latter.)

If this way of thinking seems appealing, here are some questions to ask yourself:

What are some ways in which your current project is like a laboratory? What experiments have you anticipated and left room for?
What are some ways in which your current project is a one-off? If it turns out to be unsuccessful, what will you have to throw away?
Reframe your business as a complete business hypothesis; how would you say it? What are the most vulnerable assumptions underlying that hypothesis? Are you testing those assumptions in the easiest way? Or are you building the machinery of a business while leaving the assumptions untested?

AI in 2024: Eat the fruit, leave the rind

2024-08-31T00:00:00-04:00

In my opinion, the most exciting uses of LLMs don’t involve generating text at all.

As one step of guessing what the next word of a text might be, existing LLMs turn all the previous text into a set of numbers.

There are the same number of numbers, no matter how short or long the text (with a strict upper limit of the context window). If 5 words produce 50 numbers, 500 words also produce 50 numbers – a different 50, but still 50 of ‘em. These numbers are called a ‘vector embedding’ of that text.

You can measure how similar or how distant those vector embeddings are, and that corresponds well to how similar or different the texts are – not just in a basic “how many words do they share” kind of way, but in a “do they mean the same thing; are they about the same topic” kind of way.
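One common way to measure that similarity is cosine similarity. The sketch below uses tiny made-up vectors of four numbers each. Real embeddings have hundreds or thousands of numbers and come from a model; these were written by hand purely to show the arithmetic.

```python
import math


def cosine_similarity(a, b):
    """Return a value near 1.0 for vectors pointing the same way, near 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written stand-ins for embeddings; not output from any real model.
cat_care = [0.9, 0.1, 0.0, 0.2]
kitten_tips = [0.8, 0.2, 0.1, 0.3]
tax_filing = [0.0, 0.9, 0.8, 0.1]

print(cosine_similarity(cat_care, kitten_tips))  # high: similar topics
print(cosine_similarity(cat_care, tax_filing))   # low: unrelated topics
```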

Multi-modal models can embed text, images, and other kinds of data all in the same space, allowing that measurement between very disparate sources.

That’s really cool! That’s a big deal! The ability to pick out related documents, to search by meaning instead of by word, to cluster documents in high-dimensional space… that’s an incredible, exciting tool, and one that can be used disconnected from generation entirely.

Most tutorials suggest using vector embeddings as a search engine, concealed behind retrieval augmented generation – but I say, why bother with the generative part? A more effective search is the win; allowing a generative process to rephrase the results is mostly risk with little to no upside.
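Here is a minimal sketch of search with no generative step. The document titles and their three-number vectors are invented for illustration; in practice the vectors would come from an embedding model, and the query would be embedded the same way.

```python
import math

# Hypothetical documents with hand-written stand-in embeddings.
DOCUMENTS = {
    "How to reset your password": [0.9, 0.1, 0.1],
    "Updating your billing address": [0.1, 0.9, 0.2],
    "Troubleshooting login failures": [0.8, 0.2, 0.3],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, documents, top_k=2):
    """Rank documents by similarity to the query and return their titles unchanged."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in documents.items()
    ]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Pretend this is the embedding of "I can't sign in".
query = [0.7, 0.1, 0.4]
print(search(query, DOCUMENTS))
```

The results are the original documents, word for word; nothing is rephrased, so nothing can be invented along the way.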

Vector embeddings can’t produce false text, because they only organize the text you give them.
Vector embeddings can’t secretly produce copyrighted content because they don’t produce content.
Vector embeddings can still help your users find the answers they’re looking for, but you have complete control over the wording of the results.
They have much lower energy costs than generation.

Every major LLM provider also offers an API for vector embeddings, and they’re cheap as dirt because they require far less compute than generation.

And that’s it for part III! You’ve now got a pretty complete picture of my opinions about large-model AI in 2024:

only use generation when bullshit is acceptable
know which arguments against it have weight, and
vector embedding is cheap, low-risk, and underutilized

If you’d like to discuss how any of these thoughts relate to your team, product, and business, reach out!

AI in 2024: Making a Case

2024-08-26T00:00:00-04:00

I hang out with writers and artists along with technical people, and the conversation frequently turns to generative AI, both the text and image kind. The sentiment is rarely positive, and I am sympathetic. I think generative AI is overhyped, the wrong choice for many situations, and is already costing people their livelihoods.

I am a pedant, though, so while I am sympathetic, I get frustrated when I hear that negative sentiment carried by specious and ineffective arguments.

My favorite reason to be cautious is the one I gave last week — What Good is Bullshit? — but today I thought I’d take on some other ways people have of saying “generative AI is a problem”, and fine-tune those arguments to align with the truth as best I understand it.

Tired of this relentless negativity? In Part III, I’ll change my tune and talk about some of the LLM use cases I’m most excited about (mostly ignoring the generative aspect entirely).

“LLMs Cost Too Much Energy”

This is, I think, mostly a hold-over argument from fighting against crypto. LLMs do cost energy: training, especially, is expensive, and individual queries are certainly more costly than an average web query; but scary articles about the immense increase in energy usage by major players don’t seem to hold up to scrutiny. While specific energy figures for the major LLMs are largely unavailable, here’s Google’s total energy consumption over the last decade; can you spot the moment Gemini started having its dramatic and damaging effect? (Remember that ChatGPT was released in 2022)

I can’t either.

Too, crypto was doing things that we could already do another way, much faster and much cheaper, with a central database; LLMs are doing something computers couldn’t do before. It’s perfectly fair to argue we don’t need that thing, but it’s a weaker argument, and hard to make the data speak loudly enough for this to be an effective deterrent, I think.

“Training generative models on scraped data is theft.”

This has a good argument at its core, but it’s almost always phrased in a way that I think is morally complicated and potentially legally unsuccessful.

I’ll pursue the legal argument in the US because that’s the law I’m most familiar with, but I believe a similar framework would apply to any attempt at allowing or banning AI training on open web data in the EU, too. Copyright law is complex, the existing precedent can seem contradictory and confusing, and I’m not a lawyer let alone an international intellectual property lawyer, so take all of this with a grain of salt.

In both cases, the only way an action can legally be theft is by distributing a reproduction. The act of scraping published work and storing it in a private database is not, as I understand it, an infringement. If an article is openly published, you’re allowed to save that article and read it later — in fact just by reading it a copy was made, stored on your computer, and displayed. You’re allowed to print it out, to cut it up, to lick it, whatever. What you’re not allowed to do is sell access to your copy, or make a billion copies to give away: distribution is an infringement.

The question of theft, then, is not in the training process at all; I am not a lawyer, but I’m pretty sure doing even very complicated statistics on open-web data is allowed. The question is whether the trained LLM “contains” the original copyrighted work, in whole or in part; and whether whatever gets distributed (the entire model or merely its output) will contain that data.

The answer is “yes, probably, maybe.” It’s reasonably easy to prove that at least some training data is ‘memorized’ by the model – not just analyzed and abstracted but memorized verbatim. The major players, like OpenAI, are doing their best to add layers on top of the LLM to prevent simple exposure of copyrighted training data – it’s easy to run into an “against terms of service” warning if you try obvious ways of prompting for such things – but the data is there, and can be exposed in more creative ways that don’t involve direct prompting at all: see this paper from late last year.

Because these methods produce random memorized data, rather than specific info, it might be hard to find someone with standing to sue; but in the US, a lawsuit about this would mean a fair use factor analysis, and the ‘market impact’ factor is as large as I think it’s possible to imagine a derivative work having.

I understand that this is a disappointing version of the argument — what creators want is to prevent the training at all, not to allow distribution of a version of the LLM with very careful shackles — but if you made me place a bet on a legal argument working in court, it would be this one. If you can provide court-approved proof that the work is present in the model verbatim, the simplest way for an LLM-builder to prove that it cannot appear in the output is to allow opting out as input.

So I’d rephrase the argument this way: Generative models memorize and reproduce copyrighted material, which is illegal; and they do so in unexpected ways, so the user might not even know the response contains plagiarized material.

“Generative AI can’t be creative, it can only regurgitate old ideas” or “It’s just autocomplete”

This is just straightforwardly a bad take. It’s trivial to get LLMs to generate new ideas. Creativity is easy to extract from randomness; it’s more technically impressive that generative AI can be uncreative, and produce trope-ridden schlock.

Try prompting an LLM with “invent and briefly describe a new art movement” or “describe an unusual artwork made by an imaginary artist”. The results are not necessarily thrilling artistic achievements, and I have no fear of LLMs taking over the conceptual art scene unaided, but they’re not descriptions of anything that exists.

Or, more prosaically, “write a sentence that has never been uttered before”.

You can make an argument about the quality, value, and validity of the ideas an LLM generates, but even shuffling a deck of index cards can generate new ideas from old ones.

“Using it is lazy” or “it takes no skill”

This is an old take. Many clever tools are most often used in lazy ways; but it is more morally sound to criticize the laziness, not the tool. A camera is a lazy way of making a picture — except when it’s not.

The LLM doesn’t determine how much or how little effort goes into the work around the act of generation; the user does.

I hate listening to people talk about using LLMs to write entire stories. But I have no objections to someone using an LLM to find 5 more ways of expressing a sentence when their first two tries didn’t work. Mechanically producing infinite variations on a single theme is something artists have been doing for ages, with whatever technology was available to do so. If the other objections can be dealt with, there’s no reason creative people can’t spend just as much sweat and blood with this in their toolbox as without it.

“Generative AI will cost us jobs”

This is a good one! I think the only tweaking it needs is to be more immediate and particular: “Generative AI is costing jobs.” The scaremongering of “art is dead, creative writers will have no jobs, and it’ll come for you next!” is maybe less honest than “copywriters are already losing work; so are stock photographers and some kinds of illustrators.” An argument about the future implications is hard, and it’s easy to be wrong. An argument about what is already occurring is undeniable, with no risk of being proven wrong in a year.

fin

Are there other arguments against LLM use that you think are more effective? Is your team currently building an LLM into your product, and you think it’s the right choice despite these arguments? Feel free to reach out and bend my ear.

AI in 2024: What Good is Bullshit?

2024-08-16T00:00:00-04:00

It’s been a year since I last wrote about generative AI and its practical application, and some of my thoughts have solidified in that time, so I thought I might be due to revisit the topic. I expect this to come in three relatively short parts, of which this is part one.

Generative AI is Bullshit

One of my favorite papers this year so far has been ChatGPT is Bullshit, in Ethics and Information Technology by Hicks, Humphries, and Slater. Despite its inflammatory title, it presents a straightforward and largely semantic argument that I think is very useful when thinking about potential uses of generative AI.

I find myself making reference to it over and over again, and I don’t want to force people to read a lengthy academic paper, so here’s a 3-point summary (the paper itself is very readable, though, and if that sort of thing is your idea of fun, click through and read it instead. If you’ve done that, or are already familiar with the paper, feel free to skip ahead to “What is bullshit good for?”)

“Bullshit” as a technical term

For the sake of this argument, “bullshit” means anything that is intended to be plausible but is unconcerned with truth. Something purely and purposefully false is a lie; but when it doesn’t matter whether it’s true or false as long as it is believable, that’s bullshit.

LLMs are explicitly trained to bullshit

An LLM is given all the text its trainers can get their hands on, true, false, fiction, non-fiction, opinion, fantasy, shitpost, diatribe, peer-reviewed academic study…

Training optimizes the LLM’s ability to predict the next word of that text, given some chunk of it. If the text is a complete fabrication written by a crackpot, it still predicts the next word. If the text is truth handed down by a god, it predicts the next word. If the text is a dadaist poem, it predicts the next word. It does not have the ability to say “hold on, that’s nonsense!”; it can only say “it is unlikely *anyone* would ever say that”.

‘Hallucinations’ and correct answers are the same side of the same coin

It’s so tempting to think about the times an LLM generates blatantly untrue text as somehow different from when it generates true text; that we can work to fix the former and keep the latter.

But the same process generates both, and that process is inherently indifferent to truth. Both are bullshit!
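The next-word objective described above can be illustrated with a toy stand-in. The sketch below is a simple bigram counter, which is vastly cruder than a real LLM and is not how one is actually built. It shares only the one property that matters here: it predicts whatever usually comes next in its training text, with no notion of whether that text is true. The corpus is made up.

```python
from collections import Counter, defaultdict


def train_bigrams(texts):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


corpus = [
    "the moon is made of cheese",  # false
    "the moon is made of cheese",  # false, and repeated
    "the moon is made of rock",    # true, but outnumbered
]

model = train_bigrams(corpus)
print(predict_next(model, "of"))  # prints "cheese"
```

The prediction follows frequency, not fact. Nothing in the counting step has any place to record which sentences were true.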

What is bullshit good for?

Just because generative LLMs produce bullshit doesn’t mean they’re useless. You just need to carefully pick tasks where good, high-quality bullshit is all you need.

That sounds very limiting, but one key quality of bullshit is that it’s internally consistent; as soon as it contradicts itself, it loses plausibility. LLMs are not perfect bullshit generators, but a lot of their odd behavior can be understood in terms of this pressure towards consistency; very little training text, even crackpot text, outright changes its tune mid-thought.

Tasks that require agreement with the real world will sometimes fail catastrophically. Tasks that only require plausible consistency with the prompt are safer bets. Some examples:

Walking someone through handling a medical crisis? Hell no!
Explaining scientific concepts to students? No, probably dangerous.
Summarizing a correct but lengthy description of a scientific topic down to a couple sentences? This is on the edge of its ability; it’s much harder to make a summary that’s plausible but incorrect when the document is right there.
Rephrasing a correct short summary in different words, or according to some constraint? Now we’re talking!

This is one reason why retrieval-augmented generation makes for more effective chatbots than training-based specialization: retrieving existing (and truthful) documents is a search task, and then rephrasing their contents to fit the conversation merely requires consistency, and so plausibility is often enough of a constraint to maintain that truth.

Is a generative LLM a good choice for my task? Questions to ask yourself:

Am I OK with producing falsehoods with some frequency? or
Can I provide enough context that even limited internal consistency of the response with the prompt will incidentally ensure correctness? and
Am I in total control of the prompt, or will users or other parties be able to make the prompt inconsistent, leading to poor results?

This is considerably simpler than most of my question-based discernment tools! I suspect you can guess from the last of these questions that I’m still bearish on AI-powered chatbots. I don’t think exposing a raw LLM to users is a sensible approach for reliability, security, expense, or user experience – even with RAG involved.

That’s not to say I am against the vector database portion of RAG. These questions apply only to generative uses of LLMs. It is the non-generative uses which I think are far more exciting in terms of application, and I intend to return to that subject in part III of this series.

The conclusion of this part, however, is that I still believe, like I did a year ago, that too many companies and individuals are relying on LLMs to produce concise, human-readable truth, when what they’re good at is producing plausible filler. LLMs don’t have to be a flash-in-the-pan fad forced into your product and then abandoned; play to their strengths, and you’ll see longer-lasting success.

Build a Product, Not a Chatbot

2023-09-04T00:00:00-04:00

A bare generative AI model by itself is not a viable product. You should stop rushing to build prompt UIs and chat-bots that serve up raw generative results and build a real product.

Let’s unpack why.

Generative AI is the hot new thing to include in your product. Before you do so, I would encourage you to self-reflect. Over the past few years have you also attempted to include blockchain? NFTs? Are you including a generative model because it is well-aligned with your business model? Or just because it’s cool and you think you should?

If your use case has passed that test, think about quality control. What does it mean if your product is serving up unreviewed content? Are you willing to accept that a voice connected to your brand will confidently say untrue things? Or frightening things? These are not solved problems! The former problem – making up untrue facts – cannot be solved by existing LLM technology.

None of the above is an attempt to say that generative AI is useless, or doesn’t belong in your business. Generative AI is shockingly, surprisingly effective, and you should be investigating it – but you should be thinking:

in the long-term
about how to use it indirectly
to deal with actual business problems you have

Adding a chatbot to your web app is not the strongest use of this technology; it’s easy, but it’s dangerous and expensive, and your customers will not appreciate it.

Finally, think extremely carefully about the business model you’re going to use to support your generative features.

For comparison, let’s talk about full-internet search engines as a product.

Customer value for search:

The functional value to the user comes from a small number of results; often just one
The emotional reward to the user comes from seeing a result that seems likely to be useful. It may take a couple of tries to find the right search, but most users will give up after a handful of attempts if the results aren’t promising.
Users will almost always digest the results in some way before using them; search is mostly a tool, rather than a white-labeled service

(I break the customer value into the functional value, which is what the product is providing to them when they are clear-eyed and paying the bill, and the emotional reward, which is what feels good in the heat of the moment while they’re using the product. A good product has both: the product needs to feel good to use and be valuable, but even when both parts are present, they aren’t always well-aligned.)

Cost for search:

There’s a massive setup cost for crawling and indexing
There’s a small but inescapable cost per use
At scale, some of the per-search cost can be mitigated by caching popular searches; no one minds if everyone gets the same results for the same query (though perhaps you tweak them based on history, location, etc.)

Business value for search:

Google and Bing are ad-based; they earn money per search, regardless of whether that search produces a successful result for the user
They earn more if the ads they serve are themselves valuable search results.

Search is a viable product because the costs and values align quite well! The user will only do as much searching as they need to; every search costs money but also earns money; and most searches provide value. There are economies to be had when scaling the number of people searching. The math for viability is easy: can the average search earn more than the average search costs?
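That viability question is simple enough to write down. In this sketch the function names are my own and every number is invented; it also assumes, for simplicity, that a cached search costs nothing to serve.

```python
def average_cost_per_search(compute_cost: float, cache_hit_rate: float) -> float:
    """Average cost when cached searches are (by assumption) free to serve."""
    return compute_cost * (1.0 - cache_hit_rate)


def is_viable(revenue_per_search: float, compute_cost: float,
              cache_hit_rate: float) -> bool:
    """Can the average search earn more than the average search costs?"""
    return revenue_per_search > average_cost_per_search(compute_cost, cache_hit_rate)


# Hypothetical figures, in dollars per search.
revenue = 0.004
compute = 0.005

print(is_viable(revenue, compute, cache_hit_rate=0.0))  # no caching
print(is_viable(revenue, compute, cache_hit_rate=0.5))  # half of searches cached
```

With these made-up numbers, the business only works once caching kicks in, which is the economy of scale mentioned in the cost list above.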

Now let’s talk about a pure, unfiltered Generative AI product in those same terms.

Customer value for Generative AI:

The functional value to the customer comes from the final generation they decide to use (the one that provides an answer, or a useful image, or whatever it is).
The emotional reward comes randomly, with many (but not all) generations; this is especially true of image generation, where lots of results are cool and exciting to look at without successfully meeting the demands of the prompt, but it is also true of a chatbot. Each response is a little jolt of emotion, if not always positive.

Cost for Generative AI:

There’s a very large upfront cost to training, which you may be paying via licensing if you didn’t train the model yourself
There’s a large cost per generation.
Unlike caching popular searches, the cost per generation doesn’t decrease at scale, because for most use cases, users expect unique results.

Business value for Generative AI:

There’s not an agreed-upon model for charging for generative AI, yet.
The method best aligned to costs is to charge per generation, but that’s not well aligned with the functional value to customers.
Most AI generation products today charge a monthly fee and limit the number of generations, with a hard cap, a throttle, or a per-use charge above a certain limit.
Advertising doesn’t pay enough to make up for the cost.

There are several financial challenges to viability, here; the first is simply that generative AI is expensive; the per-use compute cost is astronomical compared to most cloud software products.

The second is more insidious: the emotional incentives are almost perfectly Skinnerian. They’re like a slot machine. Some percentage of all the generations are in some way exciting. The images are titillating (either sexually or just by being cool and exciting to look at); the messages are unexpected. They are that way regardless of whether they suit the users’ actual needs. Because of this random-but-common reward, the user is strongly emotionally encouraged to keep pressing the generate button. If you’ve used any of these tools, I’m sure you’ve experienced this exact urge to see just one more image. Just one more set. Just one more prompt.

Setting the moral implications of that gambling-like feeling aside for a moment, it’s also surprisingly bad news financially. An addictive game is financially solid, because additional play is either free for the business (if it’s local) or earns more than it costs (if a server is required). But generative AI is expensive, and has no economy of scale! You want your users to get what they need with as few generative steps as possible; but in the heat of the moment, that’s not what users are doing.
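A rough sketch of that arithmetic under a flat monthly fee, which is the pricing model most of these products use today. The fee, the per-generation cost, and the usage figures are all invented; the only assumption carried over from the text is that each generation costs the same, with no caching discount.

```python
def monthly_margin(fee: float, cost_per_generation: float, generations: int) -> float:
    """What is left of one user's monthly fee after paying for their generations."""
    return fee - cost_per_generation * generations


def break_even_generations(fee: float, cost_per_generation: float) -> float:
    """How many generations a user can run before they cost more than they pay."""
    return fee / cost_per_generation


# Hypothetical figures, in dollars.
fee = 10.0
cost = 0.04

# A user who gets what they need in 20 generations...
print(monthly_margin(fee, cost, generations=20))
# ...versus one who keeps pressing the button 400 times.
print(monthly_margin(fee, cost, generations=400))
print(break_even_generations(fee, cost))
```

Both users got the same functional value; only one of them was profitable.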

The solution is to be an ounce more thoughtful! Consider ways the generative AI can power parts of your business without simply exposing it, unalloyed, to your users. In my opinion, the best generative AI products, the ones that will still be going strong in 5 years under their own financial steam, will be ones that users can’t tell are using generative AI at all. They’ll seem like very powerful and clever traditional products – because while the technology is new, the tenets of business and the tenets of product design are the same as they have always been.

Don’t Fire Your Illustrator

2023-08-20T00:00:00-04:00

My academic training is in Fine Art, painting, and printmaking. My professional career for the past 20 years has been in software engineering, including machine learning. This makes me uniquely situated to ~~panic about~~ discuss image-generative AI systems like Midjourney, DALL-E, etc.

This essay comes to you in two parts (both of which are right here on this page).

Part I is a mostly-un-opinionated technical description of how one popular branch of AI image generation currently works. If you’re already familiar enough with stable diffusion to understand the terms “latent space” and “text transformer,” you can skip ahead.

Part II is a very opinionated prediction of how this technology will be successfully used and by whom.

PART I: Stable diffusion

I a) The Latent Space

If you want to talk about colors, there are more and less useful ways to name them for different tasks. Take the color “pinkish purplish autumn mist” and make it a little warmer; what color is that? Mix a little ultramarine, a little alizarin crimson, a tiny dot of cadmium yellow, and a good blob of titanium white. Make that a little warmer; what color is that? Take the RGB color 234, 182, 227, and make it a little warmer; what color is that? Or the HSV color 308°, 22%, 92%. Or the Lch color 80/30/330.

When we use lists of numbers to name points in a physical space, we don’t do so randomly.

We pick a numbering to ensure that when the numbers are close, the locations they name are close, and when the numbers are very different, the locations they name are far apart.

We can also use a list of numbers to describe a non-physical space, like the space of colors:

And we want the same rules to apply: similar colors should have similar numbers, and similar numbers represent similar colors. There are many ways to give colors numbers — RGB, Lch, CMYK — and each numbering results in slightly different relationships between those colors. The goal is always “put similar things near to one another,” but we might define “similar things” in a variety of subtly different ways.

The numbers don’t contain the appearance of colors nor the literal pigments they’re made from — it’s a labeling system, not a filing system. The numbers for a color are both a label and a set of instructions: mix this much red light, this much green light, and this much blue light, and you’ll get the color that these numbers label.
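As a small illustration of "one color, several numberings," here is a sketch in Python using the standard library's colorsys module. The two comparison colors are invented for the example, and straight-line RGB distance is only one of many possible definitions of "similar."

```python
import colorsys
import math

# The RGB color quoted in the text, on the usual 0-255 scale.
rgb = (234, 182, 227)

# colorsys works on 0.0-1.0 floats and returns hue, saturation, value in 0.0-1.0.
h, s, v = colorsys.rgb_to_hsv(*(c / 255 for c in rgb))
hsv_label = (round(h * 360), round(s * 100), round(v * 100))


def rgb_distance(a, b):
    """Straight-line distance between two RGB triples."""
    return math.dist(a, b)


# Hypothetical neighbours, chosen only for illustration.
slightly_warmer = (238, 182, 215)  # a little more red, a little less blue
dark_green = (20, 90, 40)

print(hsv_label)  # (308, 22, 92): same color, different numbering
print(rgb_distance(rgb, slightly_warmer))
print(rgb_distance(rgb, dark_green))
```

The nearby pink lands a short distance away and the dark green lands far away, which is the "similar colors get similar numbers" property in action.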

Words are more complicated than colors. Still, one could imagine assigning every word a bunch of numbers — perhaps hundreds, instead of just three — in such a way that “mom” and “dad” have similar numbers, and “mom” and “prestidigitation” are further apart.

Researchers have used AI to take lots and lots of text and (by assuming that words that appear near each other in text are related in some way) build a space of words that’s like that.
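To make the "words that appear near each other are related" assumption concrete, here is a toy sketch in Python. It is not how production systems are trained; it simply counts neighbouring words in a five-sentence corpus I made up, and compares the resulting lists of numbers.

```python
import math
from collections import Counter, defaultdict

# A tiny made-up corpus; real systems use billions of words.
corpus = [
    "my mom cooked dinner for the family",
    "my dad cooked dinner for the family",
    "mom and dad drove the kids to school",
    "the magician performed prestidigitation on stage",
    "prestidigitation is a stage trick with cards",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = cooccurrence_vectors(corpus)
print(cosine(vectors["mom"], vectors["dad"]))
print(cosine(vectors["mom"], vectors["prestidigitation"]))
```

"mom" and "dad" share neighbours like "cooked" and "dinner," so they score as similar; "mom" and "prestidigitation" share none.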

Whole images are even more complicated than single words or colors, but (by using thousands of numbers) we can imagine “spaces” where similar images are represented by similar lists of numbers.

We could do that by listing every RGB color of every pixel in an image — that will make some similar images close to each other! There are downsides, though. That method takes millions of numbers. Worse, some similar pictures won’t be near to one another at all: for instance, the RGB pixels in a picture of a black cat and the RGB pixels in a picture of a white cat will be very, very different, even if they’re otherwise very similar pictures of cats.
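The black cat and white cat problem can be shown with a few lines of Python. This is a sketch under simplifying assumptions: the "images" are invented 5x5 grids, and they use a single grayscale number per pixel rather than three RGB numbers.

```python
import math

# 5x5 grayscale "images" (0 = black, 255 = white), invented for illustration.
CAT_MASK = [
    "X...X",
    "XXXXX",
    "X.X.X",
    "XXXXX",
    ".XXX.",
]
BLOB_MASK = [
    ".....",
    ".....",
    "XXXXX",
    "XXXXX",
    "XXXXX",
]
GRAY = 128  # background brightness


def render(mask, shade):
    """Flatten a mask into a list of pixel values: `shade` on X, gray elsewhere."""
    return [shade if ch == "X" else GRAY for row in mask for ch in row]


def pixel_distance(a, b):
    """Euclidean distance between two images treated as long lists of numbers."""
    return math.dist(a, b)


black_cat = render(CAT_MASK, 0)
white_cat = render(CAT_MASK, 255)
black_blob = render(BLOB_MASK, 0)

print(pixel_distance(black_cat, white_cat))
print(pixel_distance(black_cat, black_blob))
```

By raw pixel numbers, the black cat is closer to a shapeless black blob than to a white cat of exactly the same shape, which is the opposite of what we want from a useful image space.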

So how do we build a useful space for images? It’s one thing to assume words that appear near one another are similar because the same word gets used in millions of different situations, with heaps of nearby words. Most images only get used once, and maybe not near any other images at all!

We can borrow our latent space of words, though: Lots of pictures have captions, labels, or words nearby. Researchers have used AI to build image spaces where pictures with similar captions are near one another. Pictures labeled “cat” are near to each other and pictures labeled “car” are near to each other, and pictures labeled “Miyazaki cat-bus” are somewhere between.

In generative AI research, these spaces of images are called “image latent spaces.”

A latent space can give you a list of numbers that approximates any image you might ever want to see, and every random uninterpretable image, too. It’s Borges’ Library of Babel but for pictures.

How do we navigate that space full of sense and nonsense to find the numbers for images we want to see?

I b) Stable diffusion

(Understand that what I’m about to describe, called stable diffusion, isn’t the only way to accomplish this, but it’s a popular one.)

Imagine seeing a picture of a cat on a staticky television: If you squint, you can make out the cat. If you wanted to, you could paint out the static and reveal the cat more clearly.

Imagine seeing a cat behind a lot of static. You could squint, sketch in some lines, squint at those lines, and probably get a picture of a cat, though not exactly the same picture of a cat.

And maybe you can imagine seeing pure static and convincing yourself there’s a cat in there somewhere, and with a lot of squinting, slowly teasing out a staticky picture of a cat, then a less staticky one, and then a clear picture of a cat.

This is stable diffusion: We took millions of images, made them a little noisy, put them in latent space, and trained the computer to clean them up. And then, we took the noisy images, made them noisier, and trained the computer to make them less noisy. And we made those noisier, and then even noisier, until the images were obliterated, and the computer could, step by step, hallucinate its way back to some image. Not necessarily a good one, or a useful one, but a non-noisy image.
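Here is a minimal Python sketch of the "make it noisier, then noisier again" half of that process. It only builds the ladder of noisy copies that a denoiser would be trained on; it does not train anything. The four-number "image," the noise amount `beta`, and the step count are all assumptions picked for illustration, and real systems work on latent-space numbers rather than raw pixels.

```python
import math
import random


def add_noise(pixels, beta, rng):
    """One noising step: shrink the image slightly and mix in Gaussian static."""
    keep = math.sqrt(1.0 - beta)
    mix = math.sqrt(beta)
    return [keep * p + mix * rng.gauss(0.0, 1.0) for p in pixels]


def signal_remaining(steps, beta):
    """Fraction of the original image still present after `steps` noising steps."""
    return math.sqrt(1.0 - beta) ** steps


rng = random.Random(0)
image = [1.0, -1.0, 1.0, -1.0]  # a stand-in "image" of four pixel values
beta = 0.05                     # hypothetical noise amount per step

# Build the ladder: each rung is a slightly noisier copy of the one before.
ladder = [image]
for _ in range(100):
    ladder.append(add_noise(ladder[-1], beta, rng))

# A denoiser would be trained on (noisier rung, cleaner rung) pairs.
training_pairs = list(zip(ladder[1:], ladder[:-1]))

print(signal_remaining(1, beta))    # almost all of the image survives one step
print(signal_remaining(100, beta))  # very little survives a hundred
```

Each pair asks the same small question, "given this, what did it look like one step cleaner?", and chaining the answers is what lets the computer walk from pure static back to some image.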

We can also take our word space and train the computer to try and associate an image in image-latent-space to some words in word-latent-space and say how likely an image is to have a particular caption. A picture of a cat is likely to have the caption “my cute kitty mister french fry” and unlikely to have the caption “the engine from a 1959 Austin Healey.”
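A toy version of "how likely is this image to have this caption" might look like the following Python sketch. The three-number vectors are invented; in a real system they would come from trained image and text models, and the `sharpness` parameter is a hypothetical knob of mine.

```python
import math

# Made-up 3-number "latent" vectors; real ones have hundreds of dimensions.
image_vector = [0.9, 0.1, 0.0]  # stands in for a photo of a cat
caption_vectors = {
    "my cute kitty mister french fry": [0.8, 0.2, 0.1],
    "the engine from a 1959 Austin Healey": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def caption_probabilities(image, captions, sharpness=10.0):
    """Softmax over image/caption similarity: higher similarity, higher probability."""
    scores = {
        text: math.exp(sharpness * cosine(image, vec))
        for text, vec in captions.items()
    }
    total = sum(scores.values())
    return {text: score / total for text, score in scores.items()}


probs = caption_probabilities(image_vector, caption_vectors)
for text, p in probs.items():
    print(f"{p:.3f}  {text}")
```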

The last piece of the puzzle is called “cross attention,” which is a fancy way of saying, “glue several AI systems together, so they can do two things at once.” In particular, we can ask the computer to remove some static from an image AND nudge that image to be more likely to be captioned with some specific text.
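The "two nudges at once" idea can be shown with a deliberately tiny Python toy. To be clear about what this is not: it is not cross-attention and not a neural network. I am assuming a two-number latent space where every "clean image" sits on the unit circle, and a caption is just a target point on that circle.

```python
import math


def denoise_nudge(point, strength=0.2):
    """Step part of the way toward the nearest 'clean image' (the unit circle)."""
    x, y = point
    r = math.hypot(x, y)
    return (strength * (x / r - x), strength * (y / r - y))


def caption_nudge(point, target, strength=0.1):
    """Step part of the way toward the point that stands for the caption."""
    x, y = point
    return (strength * (target[0] - x), strength * (target[1] - y))


def generate(start, target=None, steps=200):
    """Repeatedly apply the denoising nudge, plus the caption nudge if given."""
    point = start
    for _ in range(steps):
        dx, dy = denoise_nudge(point)
        if target is not None:
            cx, cy = caption_nudge(point, target)
            dx, dy = dx + cx, dy + cy
        point = (point[0] + dx, point[1] + dy)
    return point


static = (-0.3, 2.0)          # a random-looking starting point
caption_target = (1.0, 0.0)   # hypothetical location of "a cat"

unguided = generate(static)
guided = generate(static, caption_target)
print(unguided)
print(guided)
```

Without the caption, the process still lands on a clean image, just whichever one happened to be nearest the static. With the caption, it lands on a clean image near the target.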

And that’s it — that’s generative AI.

Note some important things:

While every image in the training set is representable by some numbers in latent space, the images themselves are not there in any specific way.
To steer the process, the words you type get turned into points in a word latent space, which then gets retranslated into a movement in image latent space. That’s hard! And making small adjustments to the result using just words can be very hard!
The kinds of images that are very common — statistically likely — get organized in bigger and more well-organized parts of latent space.
There are other systems for steering the denoising process — using text labels is just one of them — but all of them involve cross-attention between the denoising and some other goal.

A test of understanding: why does this produce extra fingers and deformed hands? The wrong way of thinking: every input image the model has trained on has correct hands, so it should learn to draw correct hands! The better way: the latent space describes every possible image. The nudge away from noise prioritizes clarity. The nudge towards a text label is satisfied by any image that would best be labeled by that text. Image captions rarely mention people’s hands, especially not with correct and incorrect numbers of fingers. There’s not much pressure toward perfect hands, and adding negative labels like “no deformed hands” relies on a very small number of images out there labeled “deformed hands.” (You might have better luck with a negative label for “polydactyly” since that label correlates very strongly to extra fingers.) Most generative systems have separate components specifically to correct faces because the same issue applies; the core system is content if there’s something face-like, while our eyes are very picky about faces being correct, with the eyes pointed in the same direction most of the time.

A test of understanding: why does adding an artist’s name, like James Gurney, produce better images overall? Most image captions are nouns: a person, an object — the picture’s focal point. The background can be nonsense; if the foreground is a teapot, it’s fair to caption it as “a teapot.” Non-teapot parts of the image aren’t under much pressure. An artist’s style is a gestalt; it doesn’t exist in just one part of the image, but the whole thing; every pixel is under correlated pressure, so a coherent outcome is a little more likely.

PART II: Who should use AI image generation, and how

There are some opinions here that will be unpopular with my artist friends. Some will be unpopular with technologists or neophilic managers. I certainly don’t want you to think I am pleased about any of these opinions, or that I want them to come to pass; these are simply my predictions based on a goodish understanding of both generative AI and traditional image-making.

An opinion popular with creatives: Physical media artists will keep their jobs.

The human desire for real paintings, sculptures, woodcuts, embroideries, and so on isn’t going to vanish.

I don’t have much to say about that; I mostly mention it so we can set that segment of artists aside and concentrate on the larger swath of commercial image-makers.

An opinion popular with tech and unpopular with creatives: The output of physical media will continue to be used in training generative AIs.

I absolutely understand the desire to prevent this. Knowing your work has been used without permission to train a computer to replace people’s livelihoods is extremely violating. But understanding the technical basis, I don’t see any plausible way to outlaw it while still allowing fair use in all the ways human artists have relied on for thousands of years. Images similar to those used to build the latent space may be recoverable with the right prompt and some luck, but they’re not inherently there, any more than my memory of an Andy Warhol is inherently a copyright violation. I can sell Andy Warhol pastiches I make based on that memory. I can augment my memory by having a morgue file of images to train my memory on.

If you have a vision for how this can be structured legally, restricting ML uses of imagery without restricting human uses, I’d love to hear about it!

An opinion popular with tech-loving managers and unpopular with creatives: Generative AI will replace a slice of illustration and writing: in particular, the kind where the content doesn’t actually matter:

This blog post needs a header image that’s vaguely related, not because it needs illustration but to fit the page layout.

This spammy site needs a new blog post once a week for SEO reasons.

No one from one end of the process to the final consumer particularly cares about the image or the text as long as it doesn’t stand out; it is furniture.

This kind of work never paid particularly well and is now rapidly vanishing. My heart goes out to the people suffering because this work is vanishing, but I also can’t see any way out, even with much stronger legal regulation of generative AI than I expect we’ll ever see.

Why only that slice?

If you’ve ever used a generative system, I can pretty much guarantee that you spent an embarrassing amount of time making tiny adjustments to your prompt and retrying. Producing a compelling image with generative AI is pretty easy; maybe one in ten images it generates will make you say, “Wow, cool!” But producing a specific image with generative AI is sometimes almost impossible.

If you visit (often NSFW, beware!) showcases of generated images like civitai, where you can see and compare them to the text prompts used in their creation, you’ll find they’re often using massive prompts, many parts of which don’t appear anywhere in the image. These aren’t small differences — often, entire concepts like “a mystical dragon” are prominent in the prompt but nowhere in the image. These users are playing a gacha game, a picture-making slot machine. They’re writing a prompt with lots of interesting ideas and then pulling the arm of the slot machine until they win… something. A compelling image, but not really the image they were asking for.

Why is it so hard to get what you want?

Let’s return to the technical discussion for a second.

Text is a difficult way to steer an image because while the text latent space is related to the image latent space, there are still multiple translation steps: from the actual prompt to the text latent space to a function in the image latent space. The process can only accommodate so much text at a time (usually ~75 tokens, each roughly a word or a piece of one; if there are more than that, you must break the prompt into separate guiding systems in cross-attention). OK. Are there better ways to direct image generation to have specific results?
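Breaking a long prompt into pieces that fit under the limit is simple to sketch in Python. One loud assumption: real systems count units produced by their own tokenizer, while this example just counts whitespace-separated words as a stand-in.

```python
def chunk_prompt(prompt, limit=75):
    """Split a prompt into pieces of at most `limit` whitespace-separated words."""
    words = prompt.split()
    return [" ".join(words[i:i + limit]) for i in range(0, len(words), limit)]


# A hypothetical 180-word prompt.
long_prompt = " ".join(f"word{i}" for i in range(180))
chunks = chunk_prompt(long_prompt)
print([len(chunk.split()) for chunk in chunks])  # [75, 75, 30]
```

Each chunk then steers the image separately, which is one reason ideas buried deep in a massive prompt can end up with little influence on the result.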

Yes! It’s much easier to translate an image into a latent space constraint. Images translate very well into image latent space; that’s what image latent space is for. Here are some ways folks have invented to prompt image generation using images rather than words:

Create an image that would reduce to the same line art as another image
Create an image that would reduce to the same depth map as was pulled from another image
Create an image with matching poses pulled from another image
Create an image whose style matches another image even though the content differs
Match the perspective lines of another image
Match the color palette of another image

These are all very powerful constraints that can exert precise control over the content and composition of a generated image.
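To show why "reduces to the same line art" is such a useful constraint, here is a toy Python sketch. The edge detector is a crude neighbour-difference threshold I wrote for illustration, not the one any real product uses, and the 5x5 images are invented.

```python
CAT_MASK = [
    "X...X",
    "XXXXX",
    "X.X.X",
    "XXXXX",
    ".XXX.",
]
BLOB_MASK = [
    ".....",
    ".....",
    "XXXXX",
    "XXXXX",
    "XXXXX",
]


def render(mask, shade, background=128):
    """Turn a mask into a grayscale grid: `shade` on X, `background` elsewhere."""
    return [[shade if ch == "X" else background for ch in row] for row in mask]


def line_art(image, threshold=50):
    """Mark a pixel as 'line' when it differs sharply from its right or lower neighbour."""
    height, width = len(image), len(image[0])
    edges = []
    for y in range(height):
        row = []
        for x in range(width):
            right = abs(image[y][x] - image[y][x + 1]) if x + 1 < width else 0
            down = abs(image[y][x] - image[y + 1][x]) if y + 1 < height else 0
            row.append(max(right, down) > threshold)
        edges.append(row)
    return edges


black_cat = render(CAT_MASK, 0)
white_cat = render(CAT_MASK, 255)
black_blob = render(BLOB_MASK, 0)

print(line_art(black_cat) == line_art(white_cat))   # same drawing, different colors
print(line_art(black_cat) == line_art(black_blob))  # same color, different drawing
```

The constraint pins down composition while leaving color and texture free, which is exactly the part a line drawing is good at specifying and a sentence is not.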

The only challenge in using them is: where do all these guiding images come from? Who can take the time to understand the concepts we want illustrations of and turn them into a line drawing, a sketch of poses, or a style? Is there some existing job title for that?

An opinion popular with creatives and unpopular with techy managers: Generative AI isn’t much use for sophisticated needs if there isn’t an illustrator involved.

I believe that will continue to hold true even for future versions of Midjourney, DALL-E, and so on; I think the amount of text they can handle will increase, and the quality and resolution of images they produce will increase, but the fundamental challenge of getting specific imagery is not going to vanish without more fundamental changes in the approach.

Finally, an opinion popular with no one: Commercial illustrators will keep their jobs, but will mostly need to learn to use AI as a part of their workflow to maintain a higher pace of work.

This doesn’t mean illustrators will stop drawing and become prompt engineers. That will waste an immense amount of training and gain very little. Instead, I foresee illustrators concentrating even more on capturing the core features of an image, letting generative AI fill in details, and then correcting those details as necessary.

Here’s a process for digital painting that I’ve tested and found… plausible:

Produce a line drawing traditionally, focusing on the composition and key ideas
Have the generative AI suggest a dozen potential approaches to color and lighting; pick one or two
Paint almost entirely over those AI generated pixels, adjusting and correcting the color to suit my vision

Obviously, this is not a workable approach for artists that put great care and emotion into their color choices. I don’t think there will be any one approach that works for all artists. But for artists working on deadlines, I foresee them using AI to fill in whatever step is the least important and most tedious: crowd scenes, cityscapes, vegetation. Just like a blog post header image is furniture for the page, there is furniture for many images — not important, but still necessary. For better or worse, that furniture is becoming the territory of generative AI.

The more concerning problem is that while generative AI research is heading in this direction, offering more and more ways to direct image generation using image inputs, the products that are entering the market are not easy to slot into an illustrator’s workflow at all. All my experiments have been done running open-data models on my own computer in order to have useful levels of control.

I have more to say on the subject of machine creativity and also the gacha-like nature of generative AI, but I think it best to leave this post here, with that vision of commercial illustration yet to come, and the hope that generative AI products will start catering to it.

Garden Path Content

2023-06-04T00:00:00-04:00

I’ve fought with this essay for ages, partly because the vocabulary in it makes me sad. I want to talk generically about telling stories starting in prehistory, and continuing through printing political pamphlets in the Enlightenment, publishing novels in the 19th century, writing scientific papers in the 20th century, and making TikTok videos last week. The modern vernacular for all that might be “creating content” and a literary theorist might say “constructing a text” and repeating either of those phrases out loud makes me want to hide under a rock and never emerge.

So, just temporarily, allow me to use the verb “writing” and the noun “story” to describe the whole human history of creation, in all media, fiction and non-fiction. Allow this for the sake of my spirit. For reasons that will eventually become apparent, don’t let my word choice fool you into assuming that means solely written words, or even a human author doing the writing.

The oldest and simplest reason to write a story is to communicate an idea. You want someone else to know or feel something, and when they understand or feel that thing, you have your reward.

Writing for money is most often an iterated game, in the sense that when an author sells a story, their audience knows who they are and who their publisher was (if they had one). To convince the audience to buy more, the author must provide them with at least some satisfaction.

(Just like “writing” is standing in for a spectrum of activities, by “satisfaction” I mean any one of a whole rainbow of emotional and intellectual reactions; it doesn’t have to feel “good” so long as it feels purposeful, needful, valuable. Readers like to be tortured, just a little, if they understand in the end what the pain was for, and believe the author intended their experience.)

To be more beguiling in attracting readers to their next work, they might pull a Scheherazade and hint at the beginning of the next story as they end this one; or they might pull a Dickens and break one story into many parts; but each part still needs to produce enough satisfaction on its own that people are willing to continue paying to get the next. A soap opera with too many twists and too few resolutions eventually loses its audience.

There were advertisements in stories long before there were stories paid for by advertisements; my casual research suggests that a consistent form of that innovation took until 1836, with La Presse – a French newspaper that was sold more broadly and more cheaply, with some of the cost to readers offset by the advertisements within.

Despite the addition of a third party – the advertisers joining the authors and the readers – stories paid for by advertisements have much the same iterated-game constraints. If a publisher wants people to see ads, they have to want to read the story, and if the author or publisher hasn’t delivered satisfaction in the past, readers will pass them by. The only change is that whatever kind of satisfaction the author delivers mustn’t lessen the needs the advertisers are trying to fill. A story about being content with the simple life might sell garden planters but not luxury vehicles.

Until very recently, that covered the vast majority of the story-making world: free stories that only want to communicate; stories paid for directly by readers; and stories subsidized by advertisers. (Through this lens, propaganda is just stories with advertisements for the government.)

And so, I think, readers came to believe unconsciously that they have a tacit contract with authors: the stories we pay for with our attention will be written with intention, and will eventually resolve in some satisfactory way. If they don’t, readers have recourse: they can pan the story, and not read things from that author or that publisher. The gap between this tacit contract and the explicit one that goes “readers give writers money for whatever the writers have written, the end” is perhaps why there is tension and frustration from both sides for writers suffering from writer’s block, like George R.R. Martin or Patrick Rothfuss; the readers sense that a story that stops without a satisfactory ending violates the tacit contract, while the writers point to the explicit contract that says they don’t owe more than what the readers have already paid for. (Of course, the publishers may have not-so-tacit contracts to complain about.)

Even early algorithm-driven content platforms obey those standards. Satisfying stories get upvoted, unsatisfying non-stories get downvoted; writers who are effective get followed and writers who are not, do not. Admittedly, the attention economy led to the expansion of “satisfaction” to include “blind rage” – more things got watched and shared and commented on purely because they made people angry – and the tacit contract was left tattered, but still intact. Authors write stories with intention and readers reward those stories that affect them.

Over the last few years, the TikTok-style of infinite-scrolling algorithm-fed content treadmill has become widely adopted; either in its purest form of content-firehose or blended, like Twitter and Facebook which show you stories from people you follow mixed in with stories from people you don’t. And the small but significant change is that content can now get boosted not just by explicit voting, subscribing, or sharing, or through the reputation of the writer, but simply by being consumed. TikTok (or Instagram or Shorts or…) is interested in anything that I don’t skip past; watching to the end means my eyes are on the platform and consuming ads. The algorithm can serve me stories written by anonymous sources I’ve never heard of, and it does not need my opinion or my satisfaction when it can measure my attention.

This violates the tacit contract. The game is no longer iterated! My satisfaction with the author’s previous work does not matter, so long as I consumed it. The author’s incentives in this game are to capture and hold attention, regardless of the readers’ satisfaction. And so (after almost 1000 words) we arrive at what I set out to talk about: garden path content. These are story-shaped objects that have no intention to communicate at all. Instead, they are merely story-shaped; they seem at every moment to be leading to an exciting revelation or a grand conclusion but just keep leading you down the garden path until they stop at a dead end. It’s clickbait that doesn’t need you to click, only stare.

Now, this sort of thing has been done on purpose in the past, artistically; see Steve Martin:

“What if there were no punch lines? What if there were no indicators? What if I created tension and never released it? What if I headed for a climax, but all I delivered was an anticlimax? What would the audience do with all that tension? Theoretically, it would have to come out sometime. But if I kept denying them the formality of a punch line, the audience would eventually pick their own place to laugh, essentially out of desperation. This type of laugh seemed stronger to me, as they would be laughing at something they chose, rather than being told exactly when to laugh.”

But Martin doesn’t arrange an entire show that way; for the trick to work the way he describes it, the joke-shaped object has to be embedded in a larger fabric of real jokes, to set the audience’s expectation for tension and release. Garden path content relies on other authors to provide that fabric, and by itself gives no satisfaction at all.

[Embedded TikTok video from @tkfifd, tagged #viralvideo]

(Note that in addition to the central video, which has a storylike structure but no story, there’s the Mr. Incredible reaction image, which also implies something is going to change, something is going to happen.)

You may find yourself consuming garden path content repeatedly because it leaves you feeling you must have missed something. You’ve been so thoroughly conditioned to expect the intent to communicate that you assume it must be there if you just read more closely.

But since it has no intention, garden path content doesn’t even need to be original. It can be collaged and Frankenstein-ed together from attention-holding moments of other stories:

[Embedded TikTok video from @respectcali, tagged #respect #amazing]

A pastiche of setups with no punchlines.

The final piece of the unhappy puzzle is one I suspect you can guess for yourself: what if these authors had a tool that could generate endless story-shaped objects without intention – without any human intervention at all? Statistical machine learning models are here to provide exactly that service; limitless, practically-free text that is easily human enough to trigger our expectations of intention and satisfaction, and so hold our attention. Whether it fulfills those expectations is left entirely to chance; in this context it does not matter.

In written form, ML-generated garden path content frequently comes in the shape of a FAQ; paragraphs of text that wander from their subject might be slightly easier to catch than a series of related-but-not-quite-sensible questions:

Here’s a relatively obvious example that only takes a few paragraphs to catch on to. But this one archived from the same site is quite a bit trickier.

I’ve started to see these sites in the first SERP more and more often; often two or three different ones in one search. It used to be you’d occasionally see a site that was just a gigantic list of search terms, like a dictionary pasted onto a single page, used to improve the PageRank of other, spammy sites. This new form is different because it intends to fool not only the search engine but the human searcher as well. They are formatted, structured, and mostly grammatically correct.

And while fully ML-generated video that can fool a human may still be in the future, algorithmically collaged video, with computer-generated voiceover, is absolutely among us, and has been for years.

(You may be muttering intently at this point about Grice’s Maxims, or the cooperative principle – and it’s true, this kind of violation of storytelling is also a more fundamental violation of the expectations of communication. But the relevant context for garden path content is the structure of pressures that create it and the system that creates those pressures. The fact that it is maxim-violating is merely what makes it feel uncomfortable, not an explanation of how it comes about.)

And I don’t see any trend in the structure and design of social media that will decrease the rate or effectiveness of garden path content – as long as “any form of attention” is the metric that platforms seek to maximize, attention thieves will thrive.

And so I finally reach my prediction: media literacy in the coming decade will not only require us to identify new varieties of the malicious, rage-inducing fakery that has characterized the last ten years online but will also require recognizing as quickly as possible the non-emotional non-communication that is garden path content. Some questions to ask:

If it shows a process, do I know the intended result? Especially relevant for videos including cooking, DIY, manufacturing, cleaning, etc.
Why would someone film this/write this? Why did this person film this/write this?
Are there obviously collaged elements suggesting a big reveal is coming? Reactions, duets, react memes, “stitch incoming” text. Who applied those elements?
Why is someone reacting to this? Why is this person reacting to this?
What would a satisfactory conclusion to this even look like? Could there be one?

Most garden path content is short enough that asking these questions consciously won’t be fast enough: you’ll have consumed it before you’ve answered them. But by asking, I hope, we can train ourselves to recognise these story-shaped non-stories and dismiss them instinctively.

Have you run into a particularly interesting specimen of garden path content? Reach out! I’d love to hear about it.

On the UI of Selecting Options

2022-10-30T00:00:00-04:00

A switch, a radio group, and a select serve the same function: let the user pick one option from a set.

Which is best depends almost entirely on how many options there are.

If there are two or three options, a switch is very clear and efficient. It lets the user see all the options available as well as which one is selected.
If there are four or five, a switch starts to run out of horizontal space; if you can spare the vertical space, a radio group also lets the user see every option at once.
If there are more than that, a dropdown select probably makes the most sense: a user won’t be able to comfortably take in so many options all at once anyway, so tuck them out of view.

A checkbox group and a multiselect also share a function: let the user pick several options from a set. Which of the two is best depends on the number of options, in the same way as for a single-select.

If you’re building a design system, my recommendation is to build each of these UI elements with a shared interface — RadioGroup should take the same attributes as Select — and then wrap them all in a single component: PickOne.

Give developers the option to force PickOne into a specific appearance, but for most forms you can simply let it choose, based on the number of options available.
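The choosing logic itself is tiny. Here is a language-agnostic sketch of it in Python (in a real design system this would live in whatever language your components are written in). The function name, the appearance names, and the exact thresholds are my assumptions, taken from the rough guidance above rather than from any particular library.

```python
from typing import Optional, Sequence

APPEARANCES = ("switch", "radio-group", "select")


def pick_one_appearance(options: Sequence[str], forced: Optional[str] = None) -> str:
    """Choose how a PickOne should render, unless the developer forces an appearance."""
    if forced is not None:
        if forced not in APPEARANCES:
            raise ValueError(f"unknown appearance: {forced!r}")
        return forced
    count = len(options)
    if count <= 3:
        return "switch"       # every option visible in one row
    if count <= 5:
        return "radio-group"  # every option visible, stacked vertically
    return "select"           # too many to take in at once; tuck them away


print(pick_one_appearance(["S", "M", "L"]))
print(pick_one_appearance(["XS", "S", "M", "L", "XL"]))
print(pick_one_appearance([f"Option {i}" for i in range(50)]))
print(pick_one_appearance(["S", "M", "L"], forced="select"))
```

Because every appearance accepts the same attributes, the form author writes one component and the appearance follows the data, even when the list of options changes later.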

A PickOne with three options:

A PickOne with five options:

A PickOne with fifty options:

The PickOne in my personal design system actually has 5 different appearances, not all of which can be sensibly selected programmatically — but I encourage you to think generally about organizing your components by what they allow users to do, rather than what they look like; and pick their final appearance late in the game, rather than early.

On Inputting Numbers

2022-10-29T00:00:00-04:00

It’s easy to ignore little things that ought to be easy. Today’s case-in-point: numeric inputs on the web. Frustrating numeric inputs are something I run into again and again, and I find it hard to remember all the variables involved myself. Here are some reasons you might want users to type in some numbers:

a price
a weight loss goal
a zip code
a scale factor
a phone number
a PIN code
a house number
a volume amount

And here are some ways to write HTML inputs that accept numbers:

<input />
<input type="tel" />
<input type="number" />
<input type="text" pattern="\d*" />
<input type="text" pattern="[0-9]*" />
<input type="number" pattern="[0-9]*" />
<input type="text" inputmode="numeric" />

You can also consider a slider if precision doesn’t matter, or a select if there are actually only a few options.

Some key things to think about when making a numeric input:

Natural numbers only? The options using pattern usually preclude decimals, negative numbers, and scientific notation
Is the user interested in adjusting “one more” or “one less”? type="number" allows the user to use the mouse scroll wheel or the arrow keys to adjust the value of the input. In the case of zip codes, this can cause dangerous typos.
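Do you need a numeric value, or are you willing to parse the value before use? type="number" inputs can hand you a parsed number through valueAsNumber (NaN when the field is empty or invalid), although their value property is still a string; all the others give you only strings.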
Do you want to validate the input to provide a meaningful error message, or simply prevent users from typing invalid characters in the first place?
How do you want the input to behave on mobile? type and inputmode are the only ways to guarantee users get a numeric keyboard on mobile devices.
Does the user care about leading zeros?

My recommendation: if you’re building a design system, I think the most universal input for numbers is

<input type="text" inputmode="numeric" /> 

Write your component such that it parses the number from a string before passing it up to its parent.
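The parsing step might look something like this sketch, written in Python to keep it language-agnostic; a real component would do the same thing in its own language. The function name and its two options are hypothetical. It deliberately rejects anything that isn't plain digits with an optional sign and decimal part, so scientific notation is out.

```python
import re
from decimal import Decimal
from typing import Optional

# Optional minus sign, digits, optional decimal part. ASCII digits only.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?", re.ASCII)


def parse_number(
    raw: str, *, allow_decimal: bool = True, allow_negative: bool = True
) -> Optional[Decimal]:
    """Turn the text from an input into a number, or None if it isn't one."""
    text = raw.strip()
    if not _NUMBER.fullmatch(text):
        return None
    value = Decimal(text)
    if value < 0 and not allow_negative:
        return None
    if "." in text and not allow_decimal:
        return None
    return value


print(parse_number("19.99"))   # a price
print(parse_number(" 42 "))    # stray spaces are forgiven
print(parse_number("12abc"))   # None: report an error instead of guessing
print(parse_number("02134"))   # the leading zero is gone
```

That last line is the reason zip codes, phone numbers, and PINs should skip the parsing step and stay as strings: they are made of digits, but they are not quantities.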

Unreal Work

2022-03-20T00:00:00-04:00

For every person, in every occupation, there is work that feels central and work that feels peripheral.

When a painter is painting – looking, picking colors, applying them to a canvas – that feels like painting. When they are stretching the canvas, or building the stretchers, or prepping the ground, or cleaning brushes, or talking to galleries, or booking a model, or researching pricing, those things are necessary, but they can feel like they are somehow in the way of the real work of painting.

When a software developer is writing code that adds new features, that feels like software development. When they are writing documentation, doing code review, fixing small bugs, reading the codebase, talking to designers or users or stakeholders, those things can feel like obstacles in the way of the real work of making software.

Fight that feeling. If you do a sloppy job of those apparently ancillary tasks, you will get a sloppy result.

If you consider people who work exclusively on the “central” work to be more valuable than those who spend time and energy on the “accessory” work, your team will suffer.

If you convince yourself that half your job is an inconvenient time-suck, you will be unhappy doing it.

One way I’ve seen people deal with this challenge is by saying, “They pay me the same for the exciting work and the boring work.” I don’t think that’s a particularly effective way to think about the problem, for two reasons. First, the feeling of “unreal work” crops up just as much in hobbies and side projects, where no one is paying you to do anything, as it does in wage labor; and second, the best outcome is resigned acceptance – which doesn’t feel good. Is there any grinding stone that grinds you down faster than that? Obviously, paying the bills is important! If that is the best motivation available to you, take it, take the money, and live. But if you find meaning in what you think of as the “important” part of your job and not in “the rest”, take some time to consider how the latter supports the former. To whatever extent you can see the work as valuable, your day, and your work, will improve.

The work is the work. Take joy, pride, and care in the preparatory steps, and in the cleanup; in the planning and the assessment, and both you and what you produce will thrive.