Clif Davis on Text to Artwork

WARNING – This post gets into the weeds and is a bit longish, as in a wall of words long. It is in response to questions from Teresa Patterson and something Martha Wells said at Chicon 8. So if you want to skip it, or skip down to where I answer Teresa's questions, no problem. On the other hand, if you are interested in text to image applications, read on.

What Martha Wells said, and I've heard other people say similar things, is that these text to artwork programs have been written by programmers to go out on the Internet and mash pre-existing artwork together to create new images. Part of one of Teresa's questions seems to assume something similar. I can certainly understand why Martha (and others) think that is how these programs work, but in fact the way they work is nothing like that at all.

An image, any image, inside the computer is represented by a bunch of pixels that make up the picture, and each pixel can be described by four numbers: one describing how bright (or dark) the pixel is and three describing how much of each of three colors is in the pixel. There are different ways of combining three particular colors to get all the colors there are (subtractive and additive are the main two), but three numbers are all it takes to describe the color. You can also consider the colors as lying on a circle and have a number representing saturation. The different systems can be converted to one another, but the bottom line is that any pixel can be described with four numbers. The pixels are usually arranged in a rectangular array of rows and columns to make up a complete image.

Now if we just take random values for all the pixels, the result is going to look like noise: it's an image, but we don't recognize it as a picture of anything. The things we would recognize as pictures are a small subset of all possible images. Pictures can range from photographs, or photographic quality, to cartoons, to the world as seen by Pablo Picasso. What makes it a picture is that we can recognize it as a representation of something. A picture is an image, but an image may or may not be a picture.
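
To make that concrete, here is a minimal sketch in Python using NumPy. The four-numbers-per-pixel layout and the array sizes are just illustrative choices, not anything specific to these programs.

```python
import numpy as np

# A tiny 4x4 "image": each pixel is described by four numbers.
# Here: brightness plus three color values, all scaled to the range 0..1.
height, width = 4, 4
image = np.zeros((height, width, 4), dtype=np.float32)

# Make the top-left pixel fully bright and fully red.
image[0, 0] = [1.0, 1.0, 0.0, 0.0]   # brightness, red, green, blue

# An image with random values for every pixel is still an image,
# but it looks like noise, not a picture of anything.
noise = np.random.rand(height, width, 4).astype(np.float32)
```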

We now have enough to describe at a very high level how the text to image process works. We need three different mechanisms. The first takes a set of parameters, each of which can have values in a range of numbers, and translates that set of parameters into a picture (not just an image) in such a way that changes in any of the parameters result in each of the pixel values changing in a smooth way. This forms what we call a manifold in the space of images, and this manifold of images is probably a small subset of everything we recognize as pictures. That does not mean that we would call the manifold small.

The second thing we need is a function which can take both a text description of a picture and a picture as input and give a score that reflects how well that picture is described by the text. We want the text that describes a picture to score higher with the picture it describes than with a randomly chosen picture. Similarly, we want the text that describes a picture to score higher with the picture it describes than the same picture scores with a randomly chosen description of some other picture.

The third thing we need is an optimization technique to search through the space of all parameter settings to find a resulting picture with a score that is, if not a global optimum, at least a local maximum. By a local maximum we mean a picture whose score cannot be improved by tiny changes in the values of the parameters that generate it. We take that picture as the result of the descriptive text prompt. We can likely find multiple local maxima by starting our search with different random values of the parameters.
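
Putting the three mechanisms together, the overall search can be sketched like this. This is a purely illustrative Python sketch: generate (the parameters-to-picture network) and score (the text-and-picture scoring function) are placeholders passed in from outside, and the step counts, step size, and number of restarts are arbitrary numbers.

```python
import numpy as np

def text_to_picture(prompt, generate, score, n_params=512,
                    n_restarts=4, n_steps=200, step_size=0.05, eps=1e-3):
    """Search the generator's parameter space for a picture that scores
    highly against the prompt. generate(params) and score(prompt, picture)
    stand in for the two networks described above."""
    best_picture, best_score = None, -np.inf
    for _ in range(n_restarts):                # different random starting points
        params = np.random.randn(n_params)     # lead to different local maxima
        for _ in range(n_steps):
            # Estimate the slope of the score with respect to each parameter
            # and take a small step uphill (the third mechanism).
            base = score(prompt, generate(params))
            grad = np.zeros_like(params)
            for i in range(n_params):
                bumped = params.copy()
                bumped[i] += eps
                grad[i] = (score(prompt, generate(bumped)) - base) / eps
            params += step_size * grad
        picture = generate(params)
        final = score(prompt, picture)
        if final > best_score:
            best_picture, best_score = picture, final
    return best_picture
```

In practice the slope is computed analytically by the gradient machinery described below rather than by bumping each parameter one at a time, but the overall loop is the same.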

We can use neural networks for parts one and two. These are actually artificial neural networks, to distinguish them from biological neural networks, but everyone just says neural network if the meaning is clear. A neural network is a mathematical function that mimics, in an idealized fashion, the way we used to believe networks of neurons work in the brain. The actual behavior of an artificial neural network is controlled by a bunch of parameters, numbers typically representing what are called the weights and thresholds of the individual neurons. Once the parameters are selected, the neural net will take a vector of numbers as input and eventually output a vector of numbers, not necessarily of the same size. The computation for a neural network involves performing lots of matrix multiplications combined with the same operations being performed on each element of a vector. As it happens, the special hardware for doing computer graphics for games on your computer, the GPU, operates by doing lots of very fast matrix multiplications combined with doing the same operations simultaneously to each element of a vector.
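
As a minimal illustration of "matrix multiplications plus the same operation on every element," here is a tiny fully connected network in Python with NumPy. The layer sizes and the choice of tanh are arbitrary, and real networks are enormously larger.

```python
import numpy as np

def forward(x, weights, biases):
    # Each layer: one matrix multiplication, then the same operation (tanh)
    # applied to every element of the resulting vector.
    for W, b in zip(weights, biases):
        x = np.tanh(W @ x + b)
    return x

# The parameters (weights and biases/thresholds) control the network's behavior.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
biases  = [rng.standard_normal(16), rng.standard_normal(4)]

output = forward(rng.standard_normal(8), weights, biases)  # 8 numbers in, 4 out
```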

Running and training neural networks on GPUs has been successful enough that special hardware is now built just for running and training very large neural networks.

I’m kind of hiding a bunch of detail here by talking about all artificial neural networks this way, because they can be organized very differently. Deep neural networks have a lot of layers. Convolutional neural networks use the same subnetwork repeatedly in different places in the network. Attention-based networks, or transformers, pay more or less attention to some input based on context. But all of them are controlled by parameters and involve repeated matrix multiplication and the same operations being performed on the different elements of a vector.

Machine learning is done by adjusting the values of the parameters to optimize the expected performance of the neural network. Most neural networks are trained using some variation of an optimization technique called gradient descent. The idea behind gradient descent is very simple.  Suppose you were blindfolded and placed somewhere on a hill and told to climb to the top. You can’t see where the top of the hill is, but you can tell what the slope is where you are standing. You take a step uphill. You are now probably higher than you were before. If so, you again want to take a step upwards in the direction the slope is steepest, and rinse and repeat. If things are going well, you might want to take larger steps. If you take a step up in the direction of the steepest slope and you don’t go up, you need to take smaller steps because you went too far.  Eventually you should come to the top of the hill where there is no uphill and the slope is zero and you can quit. In practice, we would quit when the slope is close enough to zero as to not matter.

What I have described is technically gradient ascent, as I am trying to go up to a local maximum. Gradient descent is going down to a local minimum. But the algorithm is named gradient descent, and the problem of finding a local maximum can be changed to finding a local minimum just by multiplying the heights by -1, so we tend to refer to it as gradient descent either way. The local part of “local optimum” is important. When I find the top of the hill, this may completely ignore the mountain right next to it. If I want to find the top of the mountain, I need a different starting point for my search. On the surface of the Earth, we are dealing with only two dimensions, but with neural networks we may have thousands or even millions of parameters. If you are Google, this may go up to a billion or so parameters. But regardless of whether we are working in two dimensions or a billion dimensions, the way gradient descent works is the same.
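
Here is the hill-climbing idea as a few lines of Python. The "hill" is a made-up smooth function with its top at (1, -2); the step size and stopping threshold are arbitrary. With a neural network, the same loop runs over thousands or millions of parameters instead of two.

```python
import numpy as np

def height(x, y):
    # A smooth hill whose top is at (1, -2).
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)

def slope(x, y):
    # The gradient: the direction of steepest ascent at (x, y).
    return np.array([-2.0 * (x - 1.0), -2.0 * (y + 2.0)])

position = np.array([5.0, 5.0])       # blindfolded, dropped somewhere on the hill
step_size = 0.1
for _ in range(10_000):
    g = slope(*position)
    if np.linalg.norm(g) < 1e-6:      # slope close enough to zero: stop
        break
    position = position + step_size * g   # take a step uphill

# position is now essentially (1, -2), a local maximum of the height.
```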

There are a lot of variations of gradient descent that are used to train neural networks. Back propagation concentrates on the slope of one layer of the neural network at a time, taking different size steps for each layer. Stochastic gradient descent is helpful when we have a huge amount of training data: before each step we calculate the slope using a statistically significant random subset of our data rather than the whole immense thing. Some variations try other things to speed up the search. But no matter what variation of gradient descent we use, at each round we have to be able to find the slope, what mathematicians would call calculating the gradient or taking the complete derivative, the vector made up of the partial derivatives in each direction.
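
The minibatch idea behind stochastic gradient descent can be shown with a toy problem that has nothing to do with images: fitting a line y = a*x + b to a million noisy points, estimating the slope from a small random subset at each step. The batch size, learning rate, and made-up data are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A million training examples of a simple relationship, y is roughly 3x + 0.5.
xs = rng.uniform(-1.0, 1.0, 1_000_000)
ys = 3.0 * xs + 0.5 + 0.1 * rng.standard_normal(xs.shape)

a, b = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    # Estimate the gradient from a random minibatch of 64 points
    # instead of recomputing it over all one million.
    idx = rng.integers(0, len(xs), size=64)
    x, y = xs[idx], ys[idx]
    error = (a * x + b) - y
    a -= learning_rate * 2.0 * np.mean(error * x)   # partial derivative w.r.t. a
    b -= learning_rate * 2.0 * np.mean(error)       # partial derivative w.r.t. b

# a and b end up close to 3.0 and 0.5.
```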

Being able to find the slope requires two things. First of all, things have to be continuous, with no cliffs where values suddenly change. Secondly, the slope must be completely described by a vector, a magnitude and direction. In the real world this is sometimes not the case. If we are on a ridge running north and south, standing on the ridge-line facing north, there may be a slight slope down ahead of us while to the south there is a slight slope up. But on our right, to the east, there is a sharp slope down, and to the west, instead of a slope upwards, it also slopes sharply down. The complete slope cannot be described with a simple vector; the mathematician would say that the partial derivative in the east/west direction is ill-defined. Because of this, we usually design both the way we evaluate the neural network and the network itself so that these two properties hold and are preserved through the different layers of the network and its output as well.

Neural networks applied to pictures can do some interesting things. For example, suppose I have a lot of pictures of my wife, Margaret. The pictures are taken from various directions in various lighting conditions under various colored lights. In each of the pictures I use a variety of ways of messing up her face to get an image, and I train a deep neural network (one with a lot of layers) to recreate, as closely as possible, the picture with her original face. As the neural network is trained, the values of the parameters come to encode an idea of what Margaret’s face looks like. But in order to be successful, the network also has to learn how to make the lighting on her face consistent with the lighting illuminating the rest of the photograph and deal with shadows. It has to learn how to match the skin tone to her other skin. Now I take the fully trained network and give it a picture of a belly dancer dancing. The network will happily take the picture, treat it as damaged, and change it to show Margaret’s face on the belly dancer. This type of deep fake can be hard to detect.
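
A rough sketch of that training loop, assuming PyTorch and a supply of (messed-up picture, original picture) pairs. The tiny convolutional network here is a placeholder, nowhere near what a real face-replacement model looks like.

```python
import torch
import torch.nn as nn

# Placeholder "repair" network: takes a damaged image, outputs a repaired one.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(damaged, original):
    # Adjust the parameters so the repaired image gets closer to the original.
    optimizer.zero_grad()
    repaired = model(damaged)
    loss = ((repaired - original) ** 2).mean()   # how far off is the repair?
    loss.backward()                              # gradients via backpropagation
    optimizer.step()                             # one gradient descent step
    return loss.item()
```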

The idea of a diffusion model is also based on correcting photos. Suppose I have a whole bunch of pictures of people’s faces. I can add a little bit of random noise to a picture, what the mathematician would call Gaussian noise, and I can notice it, but it hardly affects my ability to see and appreciate what is in the picture at all. I can recognize the face and mostly see details. We can of course train a neural network on lots of these pictures plus a little noise to reproduce the picture without any noise. It learns what pictures of faces without noise look like and reproduces that. I can take a picture of a face it’s never seen before, add noise, and it will do a pretty good job of removing the noise and giving me a picture of the face without the noise, because it has learned what faces without noise look like and what noise looks like.

But we don’t have to stop there. I could take a random picture of a face, give it noise, and then add a second layer of noise and train a second neural network to reproduce as closely as possible the picture with only one layer of noise. In some respects this is a harder problem because there is a little less information to work with, and the second network may not do quite as well at producing the particular photo with just one layer of noise. But if it does a reasonable job of producing some related picture with one layer of noise, when we run that back through the original network we should get a picture of a face without noise. Again, I don’t have to stop there. I can train a third neural network that will take a picture of a face with three layers of noise added and try to produce a picture of a face with two layers of noise. And so forth, with the original picture becoming more and more diffuse, up to, say, 500 layers of noise. I don’t even really need to have 500 different neural networks; I can have one if I give that neural network information on how many layers of noise we are dealing with at any given time.

Now the reason for saying a number like 500 is that after adding 500 layers of noise you have an image that’s pretty much indistinguishable from pure random noise. After completely training the network, we can give it an image where each of the four numbers describing each pixel is chosen at random and run it through the network 500 times to get a picture of a face. Except that this picture of a face is almost certainly different than any face in the training data. Rather, the diffusion network has learned what pictures of faces look like generally and has translated the random image, by removing noise, onto the learned manifold of faces.

Right now you can go to thispersondoesnotexist(dot)com and after a very short delay it will generate a picture of a face that no one has seen before. You can keep hitting reload and it will show you convincing photo after convincing photo of people that have never existed in the history of the world. These are not mashups of faces; they are random points on a manifold of the network’s understanding, based on a learned generalization of what a picture of a face is. That generalization was formed by being trained on a lot of photos of faces, but this isn’t a mashup; it’s not essentially different than if we had an artist draw a new imagined face based on their learned generalization of what a face looks like. Except that the network was trained on photos, and so its faces tend to be photo-realistic.
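
Here is a hedged sketch of that noising and de-noising loop, again assuming PyTorch. The network, the flat noise schedule, and the image size are placeholders; real diffusion models, including Stable Diffusion, use carefully designed noise schedules and much larger architectures, but the shape of the idea is the same.

```python
import torch
import torch.nn as nn

STEPS = 500                      # how many layers of noise we go up to

class Denoiser(nn.Module):
    # One network handles every noise level: it is given the noisy image plus
    # a number saying how many layers of noise it is currently looking at.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, noisy, step):
        level = torch.full_like(noisy[:, :1], step / STEPS)  # noise level as an extra channel
        return self.net(torch.cat([noisy, level], dim=1))

model = Denoiser()

def add_noise(image, amount=0.05):
    # One "layer" of Gaussian noise.
    return image + amount * torch.randn_like(image)

def training_loss(original):
    # Pick a random noise level, noise the picture that many times, and ask
    # the network to recover the version with one less layer of noise.
    step = int(torch.randint(1, STEPS + 1, (1,)))
    less_noisy = original
    for _ in range(step - 1):
        less_noisy = add_noise(less_noisy)
    more_noisy = add_noise(less_noisy)
    return ((model(more_noisy, step) - less_noisy) ** 2).mean()

def generate(shape=(1, 3, 64, 64)):
    # Start from an image of pure random values and remove noise 500 times.
    image = torch.randn(shape)
    with torch.no_grad():
        for step in range(STEPS, 0, -1):
            image = model(image, step)
    return image
```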

What we have done for photos of faces, we could do for photos of dogs, and people have, with the network learning a manifold of images that are a sizable subset of photos of dogs. Biases in our training set tend to be reflected in the learned manifold. If the faces in our training set are mostly white, our learned manifold will not contain much in the way of faces of people of color. The training set of dogs is unlikely to result in a picture of a wolf. But we could also go for generality. We could take a really huge and varied collection of different pictures found on the Internet and use that for training. Our diffusion network would create new images that we couldn’t tell from a random picture somewhere on the Internet. We might even have enough generality that we would not be completely surprised if we got an image here and there where we just couldn’t say what something even was, although it would look to us like something that could exist. But because some part of the images found on the Internet are pornographic or disturbing, we might not want to just put the random outputs of the network up on the web. Still, this kind of broadly trained diffusion network meets the requirements for the first part of our text to image software. We have a set of parameters in the input image which can have a range of values, and the network translates that set of parameters into a coherent picture (not just an image). The network is designed in such a way that both continuity and the existence of a well defined slope are preserved, and changes in any of the input parameters result in each of the pixel values of the output image changing only in a smooth way as the output images move through the manifold of possible generated pictures.

We have concentrated on diffusion models here because Stable Diffusion is a diffusion model, although it is a special type of diffusion model that works in a smaller latent space, basically a smaller vector described with fewer numbers, with additional neural networks that learn to translate pictures to and from that latent space. Hopefully it is clear that there can be other types of neural network models that could fill this role as well.
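
As a rough illustration of the latent-space trick: the sizes, the single-layer encoder and decoder, and the denoise function below are all placeholders (Stable Diffusion's actual autoencoder is far more elaborate), but they show where the smaller latent tensor fits in.

```python
import torch
import torch.nn as nn

# Stand-ins for the extra networks: one squeezes a full picture down to a much
# smaller latent tensor, the other expands a latent tensor back into a picture.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)            # 512x512x3 -> 64x64x4
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)   # 64x64x4 -> 512x512x3

def generate_in_latent_space(denoise, steps=500):
    # The diffusion process runs on the small latent tensor, which is cheaper,
    # and only at the very end is the result expanded into a full image.
    latent = torch.randn(1, 4, 64, 64)
    with torch.no_grad():
        for step in range(steps, 0, -1):
            latent = denoise(latent, step)   # placeholder latent-space denoiser
        return decoder(latent)
```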

We are not restricted to applying neural networks to images. They can also be applied to text. There are things called language models where the text so far can be used to predict the probability that a given word will be used next. These models translate words into learned vectors, and the sequence of vectors is fed into a large attention-based neural network. The neural network can learn patterns in the relationships of words that are very powerful, and the resulting language models have been used in some surprising and useful ways, including language translation and conversational bots. When we want to compare a picture to a textual description of the picture to see how well they match, we can steal some things from language models. This includes useful word to vector encodings and the general neural net architecture. Again we need a large batch of matched images and descriptions to use for training. One source for this is photo sharing sites where people post photos and commentary on the photos. Another is images incorporated into the web that use the alt tag; the alt tag allows attaching descriptive text to images inside of HTML. The selection of readily available images tagged with descriptions is small compared to the total number of images available to train a diffusion model, but it is still very large. We could train a model to score how well a description matches a picture, but we could also take a publicly available model, such as the scoring network CLIP, which was trained and released by OpenAI. Just using CLIP is, in fact, what many of the text to image programs do. (My understanding is that Stable Diffusion uses CLIP but augments it with scoring that selects for images more pleasing to the eye.) Either way, this network serves as our evaluation mechanism.
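
Conceptually, the scoring boils down to something like the following: a text encoder and an image encoder (both stand-ins here) map their inputs into the same vector space, and the score is how well the two vectors line up. This is a sketch of the idea, not CLIP's actual interface.

```python
import torch
import torch.nn.functional as F

def match_score(text_embedding, image_embedding):
    # Cosine similarity: high when the description fits the picture,
    # lower for a randomly chosen picture or a randomly chosen description.
    return F.cosine_similarity(text_embedding, image_embedding, dim=-1)

# Toy usage with made-up 512-number embeddings standing in for the outputs
# of the text encoder and the image encoder.
text_vector = torch.randn(1, 512)
image_vector = torch.randn(1, 512)
score = match_score(text_vector, image_vector)   # one number per text/image pair
```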

We now need one more component: the optimization technique to guide us through the space of all parameter settings of the input image to find a resulting output picture whose score from the evaluation network is at least a local maximum. Because both neural networks are set up to preserve the two properties needed for gradient descent, we can use gradient descent as our third component. The computer can calculate the gradient of the score back through the evaluation network to the pixel values of the output picture, back through the 500 (if that’s the number being used) applications of the diffusion network, and back to the input image, where the gradients guide our search to a local peak of the score.
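
In code, that chained gradient calculation looks roughly like this. It is a sketch under the assumptions of the earlier examples: denoise(image, step) stands in for the trained diffusion network, score_for_prompt(picture) for the evaluation network's score on a fixed prompt, and the iteration count and learning rate are arbitrary.

```python
import torch

def search_for_picture(denoise, score_for_prompt, steps=500, iters=50, lr=0.05):
    # Start from a random input image and let the gradient of the score flow
    # back through every de-noising step to nudge that starting image toward
    # a local maximum of the score.
    start = torch.randn(1, 3, 64, 64, requires_grad=True)
    optimizer = torch.optim.Adam([start], lr=lr)
    for _ in range(iters):
        image = start
        for step in range(steps, 0, -1):
            image = denoise(image, step)       # the whole diffusion chain
        loss = -score_for_prompt(image)        # maximizing score = minimizing -score
        optimizer.zero_grad()
        loss.backward()                        # gradients all the way back to `start`
        optimizer.step()
    # Run the chain once more from the final starting point to get the result.
    with torch.no_grad():
        image = start
        for step in range(steps, 0, -1):
            image = denoise(image, step)
    return image
```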

Now you may be wondering why we need the first network. Why don’t we just start with a random image and use gradient descent on the image itself to find a local optimum? And indeed we can do that, but it is likely, to the point of near certainty, that the local optimum we find will still be indistinguishable from noise. Even worse, because the scoring network isn’t trained on noise, a local optimum that is noise may score even higher than the best local optimum on the manifold of pictures. And the best-scoring picture may not even be a local optimum in the space of raw images. The use of the first neural network ensures the scoring is only being applied to a manifold of pictures.

With this under our belts, let’s go back to the claim Martha Wells made that the system is producing mashups of existing pictures, and the part of Teresa Patterson’s question that asked, “Are they borrowed from other art?” Not in a meaningful sense. The diffusion network has learned a generalization, a manifold that it “believes” corresponds to pictures that might be found in its training set, effectively what might be found on the Internet. It has formed this generalization from being trained with those pictures, but the pictures it was trained on make up a very, very teeny tiny part of the learned manifold. The actual picture that is produced comes from taking an image that would look like random noise to us and repeatedly removing noise until we are left with a picture on the manifold or very close to it. It’s the de-noising process that produces the image. We humans too have a generalization of what a picture looks like. Our idea of what a picture looks like was also formed from seeing lots of pictures. But that doesn’t mean that when we sit down to draw a new picture it is merely a mashup of pictures we have seen in the past, or that it isn’t original.

The other part of this is that the particular picture we wind up with, though it may have been created by this process of repeatedly trying to remove noise, was selected, among those pictures that could be so created, to have a high score for matching a given description. If we input the prompt “A cat,” we are going to wind up with an image that roughly matches the generalization the scoring network has of a cat, and this is a generalization that has been formed from the pictures in the training set that had the word “cat” in their description. We are likely to wind up with what appears to be a photo of a cat, though quite likely different than any cat in the training set. If we give it the prompt “A cat, an oil painting by Pablo Picasso,” we are likely to wind up with something quite different.

The scoring network has likely had numerous pictures in its training set with the words “oil painting” in their descriptions. From this it has a generalization of what pictures that are oil paintings should look like. It also has what is probably a smaller number of pictures with “Pablo Picasso” or “Picasso” in their description. The network’s concept of something “by Pablo Picasso” is likely to be quite different than ours, but because of the training process there is also likely to be substantial agreement on what looks like it might be by Picasso. The odds of the training set containing a picture with the actual description “A cat, an oil painting by Pablo Picasso” are very small. Since it is only a local optimum that is being selected using the scores the network provides, depending on the area in which the initial random image occurs, we might say, “That’s exactly what I asked for,” but with another random starting point we might say, “That looks something like a cat and it’s an oil painting, but Picasso would have never painted something that looks like that.” The score the neural network provides is a function of an amalgamation of the different generalizations it has formed and of the way its “understanding” of language shapes how those generalizations interact.

If there are only five pieces of art by Picasso in the training set, then it seems somewhat fair to describe its generalization of “by Pablo Picasso” as a mashup of those five pictures. It is also likely that it’s not a particularly useful or accurate generalization. But this generalization is not being used to generate the picture; it is being used to select a picture generated by the diffusion network in some particular random area of the manifold. And the result is still not, in any meaningful way, a mashup of those five pictures, though undoubtedly its selection has been influenced by them and their commonalities.

Let’s look at Teresa Patterson’s entire question. “But with this tool, who actually owns the rights to the images? Are they borrowed from other art or do they “belong” to the AI program? I know you are adding your personal touches, and I love your art, but does this count as your original art? If so why or why not?”

It’s easier to start with the question of rights, because that’s easier to answer. Under US law, the copyright on a piece of art cannot belong to an AI program. We know this because there have been attempts to register copyrights on pieces of art in an AI’s name. The courts have referred to precedent, an earlier case where a monkey was taught to “paint” and an attempt was made to assign the copyright to the monkey. Under US law, rights can only be owned by a natural person or by a corporation, and so the courts have ruled. And indeed, while the CLIP model may have some kind of a concept of “an oil painting by Pablo Picasso,” it does not have a concept of itself. The idea of such a program exercising rights is very strange. That does not mean that we might not create AI software in the future that does have a rich concept of self, but that day is not today. But just because the artwork cannot be copyrighted by the software, it does not follow that it cannot be copyrighted.

Stable Diffusion is a piece of software just as Photoshop is a piece of software. Both kinds of software can be used to create pieces of digital art. The user interface for Photoshop may be mouse based while the user interface for Stable Diffusion is text based, but it’s still the user’s input that guides what the software does. Artwork created through the use of tools such as Photoshop is copyrighted daily. No difficulty attaches to using Stable Diffusion to create original artwork. Artwork produced using text to image software, including Stable Diffusion, has been copyrighted, and to date I am aware of no challenges to these copyrights on the basis that an AI was used to create them. If you use text to image software to create original pictures, the rights belong to you, except for whatever rights are assigned by any license to use the software. There are, for example, some web sites that will create some limited number of images from your text prompts, to which you own the rights, but you grant them a non-revocable world-wide license to use your image, without notice or payment, in any manner they choose, such as in a collection of art which has been produced by their site.

You will note the use of the word “original” twice in the above paragraph. Copyright under US law protects the right to make derivative works. “Derivative” is not the same as “inspired by” or “in the style of.” Because the pictures that the diffusion model is trained on are in, or should be close to, the learned manifold of pictures, it is not completely impossible that a local optimum derived from a prompt would be sufficiently similar to a copyrighted member of the training set that it would count as a derivative work. It’s just sufficiently improbable not to worry about. The training set is a minuscule part of the learned manifold. The prompt “a cat, an oil painting by Pablo Picasso” is, trust me, going to result in something original.

Oddly, my usual mixed approach to creating digital art runs up against making a “derivative work” far more than software like Stable Diffusion does. I have used copyrighted images to build 3-D models in imitation and then incorporated renderings of those models, from other points of view, with different skins and lighting, into my art. I photo-bash and use small chunks of other people’s art to form part of mine, whereby a steel girder may become part of a dog collar or a dress design part of a spaceship. I steal textures and paint with them, such as the snow-on-evergreen textures I used in a recent piece. Generally I’m taking these from sources not protected by copyright, but not always. And in any case, these are all derivative works, whether in violation or not. If these uses are unobtrusive and only use small parts of things in ways that don’t reduce the value of the original, as a practical matter I am unlikely to wind up in court for creating derivative works from copyrighted elements, not only because such infringement is difficult to detect or prove, but also because the actual damages are simply not large enough to make it worthwhile. That said, with Stable Diffusion I’m on far more solid ground.

Leaving the legal question behind, we enter murkier waters. Teresa’s question, “Does this count as your original art?” is more than a legal question. If I create a whole bunch of prompts and put Stable Diffusion to work with different seeds giving me different random starting images, and then search for the few I like best and post them on Facebook saying, “Look what a great artist I am,” then I’m fooling no one but myself, if that. Putting prompts into Stable Diffusion does not make one a good artist. At best, it makes one skilled at crafting prompts. Don’t misunderstand: there is a skill to crafting a good prompt, and there is an art to getting something out that looks anything like what you were wanting. But the skills involved are nothing like the skills required to draw or paint. They are of a different, incomparable order altogether.

As it is, I already suffered from a certain amount of impostor syndrome with respect to digital art. My drawing and painting skills are laughable at best, and getting to the point where I could use enough tools to work around my limitations to create a digital art-like image a day, most days, one that I was sometimes happy with, was frankly not easy. The skills that I do have were hard won. And Stable Diffusion has trivialized most of those skills to the point of absurdity. As delightful a play toy as Stable Diffusion is, I also can’t help feeling upset and resenting it on some level.

But I am a latecomer to art. My whole life was not dedicated to being the best artist I could be. Commercial art does not put food on my table. And for commercial art, in any case, the landscape is about to change. I expected truck drivers to be the first to feel the economic impact of the coming AI machine learning revolution, as driverless cars, and more importantly, driverless trucks come online. Artists were not high on my list to feel the disruption.

I should be clear: I don’t expect art created by human artists to go away. Not even commercially. But if you are a publisher, are you going to pay big bucks for a human artist if the output of Stable Diffusion’s successors will sell books just as well? There will be more capable successors to Stable Diffusion. None of these programs existed two years ago. As a friend of mine pointed out recently, TV did not kill movies. TV did not kill radio. But what radio was changed. What cinema was changed. What commercial art is, and how money is earned from it, is also going to change. I’m not yet certain how fast, but the direction is clear.

So what am I going to do?  For a while, I’m going to play with Stable Diffusion. Occasionally I’ll revert to not using Stable Diffusion to keep my hand in. After the year of a piece of art a day is over, I’ll probably look to cut back and switch to YouTube and music. That’s the plan for now.