Editor's Note: In the months since we published this article on
Nov. 5, 2022, AI technology has advanced dramatically (as we
predicted it would in the story), particularly via mind-boggling
image generation. So rapid has been that advancement that on
Jan. 13, 2023, the New York Times published a story with the same
title, "This Film Does Not Exist," which features the eye-popping imagery of a fictional mashup
of the movie "Tron" and the visual style of Chilean director
Alejandro Jodorowsky, as generated by the engine Midjourney from
a prompt by artist Johnny Darrell. We're linking to the Times
article so readers can compare Loka's admittedly rudimentary AI
experiments with Darrell's beautifully rendered images.
From spam filtering to global navigation to bank loan default prediction, artificial intelligence has invisibly influenced our daily lives for years. In fact, AI’s out-of-sight, out-of-mind status might be the reason This Person Does Not Exist recently set the tech world buzzing. Using an AI-driven, style-based generative adversarial network, or GAN, developed by Nvidia, this public-facing experiment generated photorealistic images of fictional humans. Encountering the “face” of a neural network was startling, even unnerving, for many of us. Since then, GANs have created realistic-looking images of pretty much everything and the uncanny valley has only widened. (Or narrowed?)
At the same time, AI-enabled natural-language processing, or NLP, generates the predictive text that pops up as we type our email messages and allows Alexa to reply to spoken queries with coherent spoken answers. Robots have become so adept at carrying on sophisticated conversation–complete with humor, irony and original insight–that in July a former Google engineer was convinced that the AI he was messaging with had gained sentience.
The AI team at Loka was curious about this emerging tech, and as obsessive cinephiles, we were inspired to use it to make movies. Not full-length films–not yet!–but the seeds of them: plot outlines and movie posters, plus an associated director, all automatically generated via a few user-supplied keywords. Our method involved using a natural language processor and text-to-image generator, both available via open source. After many months of iterating, we’ve arrived at a level of oddity/quality that we’re compelled to share. Spoiler alert: It gets weird.
In the months since our Hollywoodbot experiment began, the technology behind open-source, AI-generated imagery and narrative text has advanced dramatically, as has the human artistry that drives it. Some might say our humble Hollywoodbot already feels dated; we like to think of it as “first gen”. The capabilities of AI are going supernova at this very moment. As you’ll see here, its capabilities are always unpredictable–and potentially profound. Starting with the latest installment of Iron Man…
‘I Know Who You Were Looking For’
Synopsis: Steel Man finds himself alone in an elevator which is supposed to go up some club but turns out that someone wants it above ground so that people will forget all about the place. This New York city nightscape has no atmosphere for metal workers. This nightclub doesn’t exist until Bruno Mars shows up. Steel Man becomes intrigued by the music from upstairs where Bruno goes clubbing daily, and asks his girlfriend Maggie if Bruno can dance on stage in front of them. But then, Claire sees something that makes her question why the guy chose Bruno over her. She thinks back to what was happening earlier and realizes how odd that it happened the same day in October 1979; how one minute everyone seemed happy with each other and suddenly a strange fog started covering things and leaving the party behind!
Step 1. Gathering Data
The typical movie-making process incorporates dozens, maybe hundreds, of component pieces and systems. But at the core level, the whole thing starts with an idea, a concept, a plot. For that we turned to GPT.
Short for Generative Pre-trained Transformer, GPT is a natural-language model developed by OpenAI. It uses an attention mechanism that focuses on the previous words most relevant to the context of a written prompt and learns to predict the words that come next. Even a small amount of input text is enough for GPT to create articles, poetry, short stories, news reports and dialogue. The model was trained on a variety of data derived from Common Crawl, WebText, books and Wikipedia. It was up to us to tailor it to come up with a movie plot.
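For readers who want to see the attention idea in miniature, here is a rough sketch in plain Python. It is not GPT's actual code: the function names and the toy two-number "word vectors" are our own assumptions, and a real model learns separate query, key and value projections across many layers. The sketch only shows the core move, where each position weighs itself and the positions before it, then blends their values.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees only positions 0..i."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Score the current position against itself and every earlier position.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Blend the visible value vectors according to those weights.
        blended = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Toy vectors standing in for three tokens (purely illustrative values).
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(vectors, vectors, vectors)
```

The first token can only attend to itself, so its weight list has a single entry; the third token spreads its attention over all three positions, leaning most on the one it resembles.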
Searching for data was not as easy as we thought it would be. Believe it or not, there are more resources for movie reviews than there are for movie descriptions and plot summaries. Thanks to D. Bamman, B. O’Connor and N. Smith, as well as Kaggle user Samruddhi Mhatre, we managed to obtain data for thousands of movies, which included title, release year, rating, runtime, genre, plot, character information and other attributes. This data supported several tasks, including statistics, exploratory data analysis (EDA) and NLP.
Synopsis: Two young men are robbed by armed robbers. After having left town for New York City and living off unemployment on welfare for years with little else but each other they find that there is someone more important looking out for them. Charlie has become obsessed by finding him but finds himself becoming distracted with all sorts of weird occurrences around town - which includes a serial killer from hell and Charlie falling in an obsession of revenge against the robber named Michael “Stinky Man” Johnson and his evil boss.
Step 2. Cleaning up and Tokenization
With such a vast dataset, we had to take several steps before tackling the predictive model. Alfred Hitchcock said “Drama is life with the dull bits cut out,” which is maybe why most of the movies in this dataset belong to the drama genre. Comedies are second, followed by thriller and crime. Biography, documentary, adventure, horror and animation also register, with another 10 or so genres barely showing up. Maybe we could expect some Guy Ritchie storylines? 🤔
Our goal was to generate a movie title and plot from a simple theme of a few words–some of them wildly ambiguous–such as "friends robbing a bank". But because we can’t work with text data until we transform it into a machine-readable form, we processed it through tokenization. Tokenization separates a piece of text into smaller units, or tokens, which are the building blocks of natural language. For our purposes they can be words, characters or subwords.
GPT uses byte-level Byte Pair Encoding (BPE) tokenization. This means that "words" in the vocabulary are not full words, but groups of characters (or for byte-level BPE, bytes) which occur often in text.
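To make the idea concrete, here is a toy sketch of how byte-level BPE builds its vocabulary: start from raw bytes and repeatedly glue together the most frequent adjacent pair. The function name and sample text are our own, and this is a simplification. GPT's real tokenizer first splits text into word-like chunks and ships with a fixed table of tens of thousands of learned merges, neither of which is reproduced here.

```python
from collections import Counter


def learn_bpe_merges(text, num_merges):
    """Learn up to `num_merges` merges, starting from the text's raw UTF-8 bytes."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so another merge would not help
        merges.append((left, right))
        # Rewrite the token list, fusing every occurrence of the chosen pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


sample = "low lower lowest"
merges, tokens = learn_bpe_merges(sample, num_merges=2)
```

After two merges, the three bytes of "low" have fused into a single token that appears three times, while rarer endings such as "e", "r", "s" and "t" remain single bytes.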
Step 3. Fine-tuning
The text sequences we used to fine-tune GPT were constructed using the synopsis, title and full plot for every movie in the dataset. Like many generative transformer models, GPT is pretrained on the task of predicting the most probable next token in the sequence. So if we input text such as “Mike was cooking ___,” GPT would try to predict the most probable word to fill the blank from a huge vocabulary of English-language words. In this case, “dinner” would be a much more probable option for GPT than, say, “chair.”
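As an illustration of what such a training sequence might look like, here is a hedged sketch. The marker strings for synopsis, title and plot, the Movie class and the helper names are all hypothetical; the article does not publish the exact format, and the sample movie is invented for the example. The only real identifier is GPT-2's end-of-text marker.

```python
from dataclasses import dataclass

# Hypothetical field markers; the exact wording used for Hollywoodbot is not published.
SYNOPSIS_TAG = "<|synopsis|>"
TITLE_TAG = "<|title|>"
PLOT_TAG = "<|plot|>"
END_TAG = "<|endoftext|>"  # GPT-2's end-of-text marker


@dataclass
class Movie:
    synopsis: str
    title: str
    plot: str


def build_training_text(movie):
    """Join one movie's fields into a single sequence for fine-tuning."""
    return (
        f"{SYNOPSIS_TAG} {movie.synopsis.strip()} "
        f"{TITLE_TAG} {movie.title.strip()} "
        f"{PLOT_TAG} {movie.plot.strip()} {END_TAG}"
    )


def build_generation_prompt(synopsis):
    """At generation time, stop after the title marker so the model continues from there."""
    return f"{SYNOPSIS_TAG} {synopsis.strip()} {TITLE_TAG}"


# Invented sample record, for illustration only.
example = Movie(
    synopsis="friends robbing a bank",
    title="The Long Lunch Break",
    plot="Four coworkers plan a heist during their lunch hour.",
)
training_text = build_training_text(example)
prompt = build_generation_prompt(example.synopsis)
```

Because every training example follows the same layout, a prompt that ends at the title marker nudges the fine-tuned model to write a title, then a plot, then stop.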
In our case, we fed GPT the synopses and hoped that it would generate titles and plots. GPT tries to generate the next token in the sequence until it reaches its defined limit of maximum length or its special STOP token.
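The generation loop itself is simple enough to sketch with a toy stand-in for the model. Everything below is an assumption made for illustration: the probability table is invented, the lookup considers only the last word (real GPT conditions on the whole sequence), and we always pick the single most probable token, whereas practical systems usually sample to keep the output varied.

```python
STOP_TOKEN = "<STOP>"  # stand-in for the model's special end-of-sequence token

# Invented next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "Mike": {"was": 0.9, "is": 0.1},
    "was": {"cooking": 0.7, "sleeping": 0.3},
    "cooking": {"dinner": 0.8, "lunch": 0.19, "chair": 0.01},
    "dinner": {STOP_TOKEN: 0.6, "again": 0.4},
}


def generate(prompt_tokens, max_length):
    """Append the most probable next token until STOP or the length limit is hit."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == STOP_TOKEN:
            break
        tokens.append(next_token)
    return tokens


finished = generate(["Mike", "was", "cooking"], max_length=10)
# ['Mike', 'was', 'cooking', 'dinner']  (ends on the STOP token)
truncated = generate(["Mike"], max_length=3)
# ['Mike', 'was', 'cooking']  (ends on the length limit)
```

The two calls show the two exits described above: the first run ends because the model predicts its stop token, the second because it runs out of room.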
We ran the samples from our dataset a couple of times through GPT, expecting that it would recognize the patterns hidden inside movie plots and discover the knowledge it needs for generating its own plots. Aaaand… it worked! Sort of.
The first fine-tuned models generated a lot of nonsense. Turns out coming up with a coherent plot is hard! Let alone writing the screenplay for a feature film. But the stuff sounded pretty funny, so we were encouraged to continue. After a few sleepless nights, a lot of coffee, and $$$ spent on hosted GPU-equipped servers, we finally found the super-secret combination of hyperparameters that allowed us to fine-tune a higher-quality model. (Without getting overly technical, it has to do with the very specific way we worded the input prompt that trains the model.)
‘Sauce’s Sauce!! Sauce! An Epic Journey in Italian Verandas of Olives, Peppers and Tomatoes’
Synopsis: Italy’s sauce was put up for sale in the New York area and this year’s barbeque is ready to be cooked by its owners who have just moved in. With the assistance of three seasoned chefs including chef Manuel Garcia and restaurateur Paolo Goethe (Carlos Santos). They are making their first cooking contest since the culinary revolution of 1965. Despite being unable to cook very well, these men are confident in their ability in preparing delicious food that will make them famous among many people at the restaurant and even infamous around town as legends. After all these years, Antonio Valmont is back in Rome expecting his daughter Jeanne to return to celebrate their anniversary dinner. But instead, she suddenly disappears without telling him where her body has been hidden or what had happened to it, leaving behind nothing and no trace of himself except Paulo De La Vega and his wife Maria.
Step 4. Posterize
To make our Hollywoodbot experience even more robust, we decided to complement our movie plot with an official poster. Rather than a GAN like the one behind This Person Does Not Exist, we used Stable Diffusion, an open-source, text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI and LAION.
This model is trained on 512×512-pixel images from a subset of the LAION-5B database, which itself contains 5.85 billion image-text pairs. It uses a text encoder to condition the model on text prompts and, being relatively lightweight, enables fast creation of quality images. Additionally, specific magic words such as "highly detailed," "surrealism" or "movie" direct the AI to produce better and more relevant images.
We supplied a pool of around 50 image tags, such as “sharp focus” to make the images clearer, “vibrant colors,” “fantasy,” etc., and ten were selected at random for each poster. We also added “movie,” “film” and “movie poster” to refine it further. Usually a single prompt yields a decent result, the kind you see here. These images weren’t amended or altered in any way from the original Stable Diffusion creations. We’re using the model as-is, without significant retraining, because it’s pretty new and produces excellent–or at least interesting–results. Stable Diffusion apparently isn’t so great at generating legible or sensible text, but it does seem to understand the general design principles that most movie posters adhere to.
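The prompt assembly can be sketched in a few lines of Python. This is a guess at the shape of the process, not the Hollywoodbot source: we assume the random pick happens in a script before the prompt reaches the model, that the movie title leads the prompt, and that tags are joined with commas. Only the first five tags in the pool are named in this article; the rest are hypothetical fillers, and the real pool has around 50 entries.

```python
import random

# First five tags come from the article; the remainder are hypothetical fillers.
STYLE_TAGS = [
    "sharp focus", "vibrant colors", "fantasy", "highly detailed", "surrealism",
    "cinematic lighting", "dramatic composition", "film grain", "wide angle",
    "moody atmosphere", "bold typography", "high contrast",
]
FIXED_TAGS = ["movie", "film", "movie poster"]  # always appended, per the article


def build_poster_prompt(title, style_tags=STYLE_TAGS, num_tags=10, rng=None):
    """Combine a title, a random subset of style tags and the fixed movie tags."""
    rng = rng or random.Random()
    chosen = rng.sample(style_tags, min(num_tags, len(style_tags)))
    return ", ".join([title] + chosen + FIXED_TAGS)


# Passing a seeded generator makes a given poster prompt reproducible.
poster_prompt = build_poster_prompt("I Saw Him Coming", rng=random.Random(0))
```

Sampling without replacement means no tag appears twice in a prompt, and each run with a fresh random generator yields a different stylistic mix for the same title.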
‘I Saw Him Coming’
Synopsis: After an investigation into multiple murders at a Miami-based hotel where Castro meets his wife (who works as an administrative assistant), he gets caught up with the criminal underworld that has organized this high crime syndicate using their money to finance their nefarious scheme against his former lover. Now, his only chance to get out alive will be by going back to Florida, but there are complications on both sides. This includes a local sheriff and a mob boss who are trying to kill each other, but the guns are already drawn.
Criminal underworld, guns, Miami… I Saw Him Coming sounds like the next feature from A24! Nice work, Hollywoodbot!
Coming Attractions: Life Imitates Art
Our experiment in AI microcinema is a fun playground with serious implications. Much of the technology we used powers other models that are currently revolutionizing real-world fields, particularly healthtech. AlphaFold, for instance, uses attention mechanisms to predict a protein’s 3D structure from its amino acid sequence, a process that helps researchers better understand the biological function of the protein and thereby harness or hinder its effects. And Nvidia, the company whose StyleGAN model powers This Person Does Not Exist, recently launched BioNeMo, a framework for large language models that aims to simplify the process of training massive neural networks on biomolecular data. The goal is to make it easier for researchers to discover new patterns in biological sequences, eventually leading to new medications and therapies to improve human health.
The more Loka’s AI engineers experiment in this field, the more we master the technology in all its forms and applications. And in the meantime, we’re all looking forward to watching Denis Villeneuve's upcoming blockbuster, the aptly titled Robot!
Synopsis: Story for tomorrow, then back to 2081 when the government has taken over and created robots in all different states. After some time with this new world, they decide that their only hope is “the best solution” which involves using real-life human language to help them out by making it possible for humans to speak.