I’ve been playing around with them occasionally and I just can’t get them to produce what I’m actually asking for. For example, this is Luma’s response to the prompt “A deeply content capybara luxuriating in my suburban back garden with a fluffy black cat reclining on his belly”:
Is this a capybara or is it a cross between a pig and a warthog? Where’s the black cat? Perhaps it’s a bad prompt but it’s a pattern I’ve noticed across these systems. This is Runway’s response to the prompt “We want to make a video about a capybara in a tuxedo singing a song about his life, while holding an umbrella because it’s raining. His capybara wife has got tears in her eyes as she hears him sing because his singing is terrible. It hurts her ears. remember the umbrella and sound” (I’m doing this with kids in case you’re wondering…)
On one level stunning. On another, more practical level, kind of useless. I’ve yet to put any significant thought into how I’m prompting video systems, so I suspect there might be at least some user failure here. But I’m confident there’s a broader weakness in the reliability of video systems because you just can’t infer robustly for multimodal content in the way you can for text.
Will these systems always be superficially spectacular but deeply useless in practice? If so, is a lot of money being wasted? Or do we risk a future in which a much smaller number of designers are employed to clean up after chronically dysfunctional video models?
