Nano Banana Pro should work with my gemimg package (https://github.com/minimaxir/gemimg) without pushing a new version by passing the new model name. I'll add the new output resolutions and other features ASAP. However, looking at the pricing (https://ai.google.dev/gemini-api/docs/pricing#standard_1), I'm definitely not changing the default model to Pro, as $0.13 per 1k/2k output image will make it a tougher sell.
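Presumably something like the following; a sketch only, since the GemImg constructor's model parameter is an assumption, and the model ID is the preview name mentioned elsewhere in this thread:

    # Sketch: assumes gemimg accepts a model override at construction time;
    # check the gemimg README for the actual API before relying on this.
    from gemimg import GemImg

    g = GemImg(model="gemini-3-pro-image-preview")  # Nano Banana Pro preview ID
    result = g.generate("A photo of a banana wearing a tiny top hat.")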
EDIT: Something interesting in the docs: https://ai.google.dev/gemini-api/docs/image-generation#think...

> The model generates up to two interim images to test composition and logic. The last image within Thinking is also the final rendered image.
Maybe that's partially why the cost is higher: it's hard to tell whether intermediate images are billed in addition to the final output. However, this could cause an issue with base gemimg, having it return an intermediate image instead of the final image depending on how the output is constructed, so I will need to double-check.
I've been using a bespoke Generative Model -> VLM Validator -> LLM Prompt Modifier REPL as part of my benchmarks for a while now, so I'd be curious to see how this stacks up. From some preliminary testing (9-pointed star, 5-leaf clover, etc.), NB Pro seems slightly better than NB, though it still seems to get them wrong. It's hard to tell what's happening under the covers.
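That loop is easy to sketch; a minimal version follows, where the three model calls are hypothetical stand-ins for whichever generative model, VLM, and LLM you wire in:

    from dataclasses import dataclass

    @dataclass
    class Verdict:
        passed: bool
        feedback: str

    # Hypothetical stand-ins; replace with real model calls.
    def generate_image(prompt: str) -> bytes: ...
    def vlm_check(image: bytes, prompt: str) -> Verdict: ...
    def llm_revise(prompt: str, feedback: str) -> str: ...

    def generation_repl(prompt: str, max_attempts: int = 5) -> bytes:
        # Generate -> validate -> revise until the VLM accepts the image.
        image = generate_image(prompt)
        for _ in range(max_attempts - 1):
            verdict = vlm_check(image, prompt)
            if verdict.passed:
                break
            prompt = llm_revise(prompt, verdict.feedback)
            image = generate_image(prompt)
        return image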
I was going off the footnote of "Image input is set at 560 tokens or $0.067 per image" but 560 * 2 / 1_000_000 is indeed $0.0011 so I have no idea where the $0.067 came from. Fixed, and this is why I typically don't read docs without coffee.
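For reference, the corrected arithmetic (560 image-input tokens at the $2 per million input tokens that the calculation implies):

    # 560 image-input tokens at $2 per 1M tokens:
    print(560 * 2 / 1_000_000)  # 0.00112 -> ~$0.0011 per image, not $0.067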
In case anyone missed Max's Nano Banana prompting guide (https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nan...), it's absolutely the definitive manual for prompting the original Nano Banana... and I tried some of the prompts in there against Nano Banana Pro and found it to be very applicable to the new model as well.

My recreations of those pancake batter skulls using Nano Banana Pro: https://simonwillison.net/2025/Nov/20/nano-banana-pro/#tryin...
How do you know, Simon? It's certainly a blog post, with content about prompting in it. If your goal is to make generative art that uses specific IP, I wouldn't use it.

GDM folks, get Max on!
Yes, they are pricey, but the price will go down over time and then you can switch. vlm.run got access as early customers and is releasing it for free with unlimited generations (until they are bottlenecked by Google). Some results here, combining image gen (Nano Banana Pro) with video gen (Veo 3.1) in a single chat: https://chat.vlm.run/c/38b99710-560c-4967-839b-4578a4146956
this is pretty cool!
Have you found success with image editing in Nano Banana? I mean Photoshop-like stuff.

From your article, I'm left wondering whether Nano Banana is better for editing than for generating new images.
That IS the use-case for Nano Banana (as opposed to pure generative like Imagen4).
In my benchmarks, Nano-Banana scores a 7 out of 12. Seedream4 managed to outpace it, but Seedream can also introduce slight tone mapping variations. NB is the gold standard for highly localized edits.
Comparisons of Seedream4, NanoBanana, gpt-image-1, etc.: https://genai-showdown.specr.net/image-editing
It's written to mimic that style, but without meaning that the work has been done for them, just that there is new work to be done, making it an odd, perhaps unconscious, reference.
I tried the same prompt as one of the examples (https://i.imgur.com/iQTPJzz.png), in the two ways they say you can run it, via Google Gemini and Google AI Studio (I suppose they're different somehow?). The prompt was "Create an infographic that shows how to make elaichi chai". Google Gemini created an infographic (https://i.imgur.com/aXlRzTR.png), but it was all different from what the example showed. Google AI Studio instead created an interactive website, again with different directions: https://i.imgur.com/OjBKTkJ.png

I tried this prompt:

Here's the result: https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creat...

> "Data Ingestion (Read-Only)" is a bit off.

But boy was it beautiful.
There is not a single mention of accuracy, risks, or anything else in the blog post, just how awesome the thing is. It's clearly not meant to be reliable just yet, but they don't make this clear up front. Isn't this almost intentionally misleading people, something that should be illegal?
The interesting tidbit here is SynthID. While a good first step, it doesn't solve the problem of AI generated content NOT having any kind of watermark. So we can prove that something WITH the ID is AI generated but we can't prove that something without one ISN'T AI generated.
Like it would be nice if all photo and video generated by the big players would have some kind of standardized identifier on them - but now you're left with the bajillion other "grey market" models that won't give a damn about that.
Some days it feels like I'm the only hacker left who doesn't want government-mandated watermarking in creative tools. Had politicians 20 years ago been as overreactive, they'd have demanded Photoshop leave a trace on anything it edited. The amount of moral panic is off the charts. It's still a computer, and we still shouldn't trust everything we see. The fundamentals haven't changed.
https://www.nbcnews.com/tech/tech-news/ai-generated-evidence...

> “My wife and I have been together for over 30 years, and she has my voice everywhere,” Schlegel said. “She could easily clone my voice on free or inexpensive software to create a threatening message that sounds like it’s from me and walk into any courthouse around the country with that recording.”
> “The judge will sign that restraining order. They will sign every single time,” said Schlegel, referring to the hypothetical recording. “So you lose your cat, dog, guns, house, you lose everything.”
At the moment, the only alternative is courts simply never accept photo/video/audio as evidence. I know if I were a juror I wouldn't.
> It's still a computer, and we still shouldn't trust everything we see. The fundamentals haven't changed.
I think that by now it should be crystal clear to everyone that the sheer scale a new technology permits for $nefarious_intent matters a lot.
Knives (under a certain size) are not regulated. Guns are regulated in most countries. Atomic bombs are definitely regulated. They can all kill people if used badly, though.
When a photo was faked/composed with old tech, it was relatively easy to spot. With Photoshop, it became more complicated to spot, but at the same time it wasn't easy to mass-produce altered images. Large models are changing the rules here as well.
I think we're overreacting. Digital fakes will proliferate, and we'll freak out because it's new. But after a certain amount of time, we'll just get used to it and realize that the world goes on, and whatever major adverse effects there are actually aren't that difficult to deal with. Which is not the case with nuclear proliferation or things like that.
The story of human history is newer generations freaking out about progress and novel changes that have never been seen before, and later generations being perfectly okay with it and adapting to a new style of life.
In general I concur, but the adaptation doesn't come out of the blue, or only because people get used to it: countermeasures are taken, regulations are written, and adjustments are made to reduce the negative impact. Also, the hyperconnected society is still relatively new, and I'm not sure we have adapted to it yet.
It shouldn’t be that we panic about it and regulate the hell out of it.
We could use the opportunity to deploy robust systems of verification and validation for all digital works, ones that allow for proving authenticity while respecting privacy if desired. For example… it’s insane that in the US we revolve around a paper Social Security number that we know damn well isn’t unique. Or that it’s a massive pain in the ass for most people to even check the hash of a download.

Guess which we’ll do!
> Knives (under a certain size) are not regulated. Guns are regulated in most countries. Atomic bombs are definitely regulated
I don’t think this is a good comparison: knives are easy to produce, guns a bit harder, atomic bombs definitely harder. You should find something that is as easy to produce as a knife, but regulated.

Or, if you see the altered photo as the "product", then the "product" of the knife/gun/bomb is the damage it creates to a human body.
> Had politicians 20 years ago been as overreactive, they'd have demanded Photoshop leave a trace on anything it edited.

Politicians absolutely were doing this 20-30 years ago. Plenty of folks here are old enough to remember debates on Slashdot around the Communications Decency Act, Child Online Protection Act, Children's Online Privacy Protection Act, Children's Internet Protection Act, et al.

https://en.wikipedia.org/wiki/Communications_Decency_Act
I suspect watermarking ends up being a net negative, as people learn to trust that lack of a watermark indicates authenticity. Propaganda won’t have the watermark.
You do know that every color copier comes with the ability to identify US currency and would refuse to copy it? And that every color printer leaves a pattern of faint yellow dots on every printout that uniquely identifies the printer?
It would be more productive for camera manufacturers to embed a per-device digital signature. Those who care to prove their image is genuine could publish both pre- and post-processed images for transparency.
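A minimal sketch of that idea, assuming an Ed25519 keypair provisioned into the camera's secure hardware (the pyca/cryptography library stands in for camera firmware here):

    # Sketch: the camera signs the raw sensor readout; anyone with the
    # device's published public key can verify the capture.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    device_key = Ed25519PrivateKey.generate()  # stand-in for a hardware-held key

    def sign_capture(raw_image: bytes) -> bytes:
        return device_key.sign(raw_image)

    def verify_capture(raw_image: bytes, signature: bytes) -> bool:
        try:
            device_key.public_key().verify(signature, raw_image)
            return True
        except InvalidSignature:
            return False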
I'm sure Apple will roll something out in the coming years. Now that just anyone can easily AI themselves into a picture in front of the Eiffel tower, they'll want a feature that will let their users prove that they _really_ took that photo in front of the Eiffel tower (since to a lot of people sharing that you're on a Paris vacation is the point, more than the particular photo).
I bet it will be called "Real Photos" or something like that, and the pictures will be signed by the camera hardware. Then iMessage will put a special border around it or something, so that when people share the photos with other Apple users they can prove that it was a real photo taken with their phone's camera.
How "real" are iPhone photos? They're also computationally generated, not just the light that came through the lens.
Even without any other post-processing, iPhones generate gibberish text when attempting to sharpen blurry images, delete actual textures and replace them with smooth, smeared surfaces that look like watercolor or oil paintings, and combine data from multiple frames to give dogs five legs.

There used to be a joke about people who did slideshows (on an actual slide projector) of their vacation photos at parties.
I don't see how it would defeat the cat and mouse game.

For example, it's trivial to post an advertisement without disclosure. Yet it's illegal, so large players mostly comply and harm is less likely on the whole.

It still won't prevent it, but it would prevent large players from doing it.

Plus, any service good at reverse-image search (like Google) can basically apply that to determine whether they generated it.

There will always be a way to defeat anything, but I don't see why this won't work for like 90% of cases.
> I don't think it will be easy to just remove it.
No, but model training technology is out in the open, so it will continue to be possible to train models and build model toolchains that just don't incorporate watermarking at all, which is what any motivated actor seeking to mislead will do; the only thing watermarking will do is train people to accept its absence as a sign of reliability, increasing the effectiveness of fakes by motivated bad actors.
It's an image. There's simply no way to add a watermark to an image that's both imperceptible to the user and non-trivial to remove. You'd have to pick one of those options.
I'm not sure that's correct. I'm not an expert, but there's a lot of literature on digital watermarks that are robust to manipulation.
It may be easier if you have an oracle on your end to say "yes, this image has/does not have the watermark," which could be the case for some proposed implementations of an AI watermark. (Often the use case for digital watermarks assumes that the watermarker keeps the evaluation tool secret - this lets them find, e.g., people who leak early screenings of movies.)
Always has been so far. You add noise until the signal gets swamped. In order to remain imperceptible it's a tiny signal, so it's easy to swamp.

Exactly, a diffusion model can denoise the watermark out of the image. If you wanted to be doubly sure, you could add noise first and then denoise, which should completely overwrite any encoded data. Those are trivial operations, so it would be easy to create a tool or service explicitly for that purpose.
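A rough sketch of that noise-then-denoise pass using a generic img2img pipeline; the model choice and strength are arbitrary assumptions, and whether this actually strips any given watermark (Google claims SynthID survives many such edits) is an empirical question:

    # Sketch: re-synthesize an image at low strength so pixel-level
    # watermark signal is overwritten while the content barely changes.
    import torch
    from diffusers import AutoPipelineForImage2Image
    from PIL import Image

    pipe = AutoPipelineForImage2Image.from_pretrained(
        "stabilityai/sd-turbo", torch_dtype=torch.float16
    ).to("cuda")

    src = Image.open("watermarked.png").convert("RGB")
    out = pipe(prompt="", image=src, strength=0.2,
               guidance_scale=0.0, num_inference_steps=10).images[0]
    out.save("laundered.png")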
The incentive for commercial providers to apply watermarks is so that they can safely route and classify generated content when it gets piped back in as training or reference data from the wild. That it's something that some users want is mostly secondary, although it is something they can earn some social credit for by advertising.
You're right that there will exist generated content without these watermarks, but you can bet that all the commercial providers burning $$$$ on state-of-the-art models will gradually coalesce around some means of widespread, by-default/non-optional watermarking for content they let the public generate, so that they can all avoid drowning in their own filth.
Regardless of how you feel about this kind of steganography, it seems clear that outside of a courtroom, deepfakes still have the potential to do massive damage.
Unless the watermark randomly replaces objects in the scene with bananas, these images/videos will still spread like wildfire on platforms like TikTok, where the average netizen's idea of due diligence is checking for a six‑fingered hand... at best.
It solves some problems! For example, if you want to run a camgirl website based on AI models and want to also prove that you're not exploiting real people.

> It solves some problems! For example, if you want to run a camgirl website based on AI models and want to also prove that you're not exploiting real people.
So, you exploit real people, but run your images through a realtime AI video transformation model doing either a close-to-noop transformation or something like changing the background so that it can't be used to identify the actual location if people do figure out you are exploiting real people, and then you have your real exploitation watermarked as AI fakery.
I don't think this is solving a problem, unless you mean a problem for the would-be exploiter.
Your use case doesn't even make sense. Which customers are clamoring for that feature? I doubt any paying customer in the market for (that product) cares. If the law cares, the law has tools to inquire.
All of this is ceremony that is trivially easy to circumvent.
Google is doing this to deflect litigation and to preserve their brand in the face of negative press.
They'll do this (1) as long as they're the market leader, (2) as long as there aren't dozens of other similar products - especially ones available as open source, (3) as long as the public is still freaked out / new to the idea anyone can make images and video of whatever, and (4) as long as the signing compute doesn't eat into the bottom line once everyone in the world has uniform access to the tech.
The idea here is that {law enforcement, lawyers, journalists} find a deep-fake {illegal, porn, libelous, controversial} image and go to Google to ask who made it. That only works for so long, if at all. Once everyone can do this and the lookup hit rates (or even inquiries) are < 0.01%, it'll go away.
It's really so you can tell journalists "we did our very best" so that they shut up and stop writing bad articles about "Google causing harm" and "Google enabling the bad guys".
We're just in the awkward phase where everyone is freaking out that you can make images of Trump wearing a bikini, Tim Cook saying he hates Apple and loves Samsung, or the South Park kids deep faking each other into silly circumstances. In ten years, this will be normal for everyone.
Writing the sentence "Dr. Phil eats a bagel" is no different than writing the prompt "Dr. Phil eats a bagel". The former has been easy to do for centuries and required the brain to do some work to visualize. Now we have tools that previsualize and get those ideas as pixels into the brain a little faster than ASCII/UTF-8 graphemes. At the end of the day, it's the same thing.
And you'll recall that various forms of written text - and indeed, speech itself - have been illegal in various times, places, and jurisdictions throughout history. You didn't insult Caesar, you didn't blaspheme the medieval church, and you don't libel in America today.
How can they distinguish real people being exploited from AI models autogenerating everything? I mean, right now this is possible, largely because a lot of the AI videos have shortcomings. But imagine 5 years from now...

> How can they distinguish real people being exploited from AI models autogenerating everything?
The people who care don't consume content which even just plausibly looks like real people being exploited. They wouldn't consume the content even if you pinky promised that the exploited-looking people are not real people. Even if you digitally signed that promise.

The people who don't care don't care.
Reminder that even in the hypothetical world where every AI image is digitally watermarked, and all cameras have a TPM that writes a hash of every photo to the blockchain, there’s nothing to stop you from pointing that perfectly-verified camera at a screen showing your perfectly-watermarked AI image and taking a picture.
Image verification has never been easy. People have been airbrushed out of and pasted into photos for over a century; AI just makes it easier and more accessible. Expecting a “click to verify” workflow is as unreasonable as it has ever been; only media literacy and a bit of legwork can accomplish this task.
It is terrifying, but inevitable. Perhaps AI companies flooding the commons with excrement wasn't the best idea, now we all have to suffer the consequences.
We will always have local models. Eventually the Chinese will release a Nano Banana equivalent as open source.

If watermarking becomes a legal mandate, it will inevitably include a prohibition on distributing (and using, and maybe even possessing, but the distribution ban is the thing that will have the most impact, since it is the part that is most policeable, and most people aren't going to be training their own models, except, of course, the most motivated bad actors) open models that do not include watermarking as a baked-in model feature. So, for most users, it'll be much less accessible (and, at the same time, it won't solve the problem).
We need to be super careful with how legislation around this is passed and implemented. As it currently stands, I can totally see this as a backdoor to surveillance and government overreach.
If social media platforms are required by law to categorize content as AI generated, this means they need to check with the public "AI generation" providers. And since there is no agreed-upon (public) standard for imperceptible watermark hashing, that means the content (image, video, audio) in its entirety needs to be uploaded to the various providers to check if it's AI generated.
Yes, it sounds crazy, but that's the plan; imagine every image you post on Facebook/X/Reddit/Whatsapp/whatever gets uploaded to Google / Microsoft / OpenAI / UnnamedGovernmentEntity / etc. to "check if it's AI". That's what the current law in Korea and the upcoming laws in California and EU (for August 2026) require :(
I don't believe that you can do this for photography. For AI-images, if the embedded data has enough information (model identification and random seed), one can prove that it was AI by recreating it on the fly and comparing. How do you prove that a photographic image was created by a CCD? If your AI-generated image were good enough to pass, then hacking hardware (or stealing some crypto key to sign it) would "prove" that it was a real photograph.
Hell, it might even be possible for some arbitrary photographs to come up with an AI prompt that produces them or something similar enough to be indistinguishable to the human eye, opening up the possibility of "proving" something is fake even when it was actually real.
What you want just can't work, not even from a theoretical or practical standpoint, let alone the other concerns mentioned in this thread.
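As a toy version of the regenerate-and-compare check described above, assuming the embedded payload really did carry a model ID, prompt, and seed (all hypothetical): bitwise-identical regeneration across different hardware and software stacks is unrealistic, hence the tolerance.

    # Sketch: "prove it's AI" by regenerating from (model, prompt, seed)
    # and comparing. The payload fields are hypothetical.
    import numpy as np
    import torch
    from diffusers import StableDiffusionPipeline
    from PIL import Image

    def regenerates_to(image_path: str, model_id: str, prompt: str,
                       seed: int, tol: float = 1e-3) -> bool:
        pipe = StableDiffusionPipeline.from_pretrained(model_id).to("cuda")
        gen = torch.Generator("cuda").manual_seed(seed)
        candidate = pipe(prompt, generator=gen).images[0]
        a = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
        b = np.asarray(candidate, dtype=np.float32)
        # Exact equality is too strict; accept a tiny mean absolute error.
        return a.shape == b.shape and float(np.abs(a - b).mean()) / 255.0 < tol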
It solves a real problem - if you have something sketchy, the big players can repudiate it, the authorities can more formally define the black market, and we can have a ‘war on deepfakes’ to further enable the authorities in their attempts to control the narratives.
Every model is "grey market". They're all trained on data without complying with any licensing terms that may exist, be they proprietary or copyleft. Every major AI model is an instance of IP theft.
This is the first image model I’ve used that passed my piano test. It actually generated an image of a keyboard with the proper pattern of black keys repeated per octave – every other model I’ve tried this with since the first Dall-E has struggled to render more than a single octave, usually clumping groups of two black keys or grouping them four at a time. Very impressive grasp of recursive patterns.
DeepMind Page: https://deepmind.google/models/gemini-image/pro/
Model Card: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
SynthID in Gemini: https://blog.google/technology/ai/ai-image-verification-gemi...

Google needs to pace themselves. AI Studio, Antigravity, Banana, Banana Pro, Grape Ultra, Gemini 3, etc. This information overload doesn't do them any good whatsoever.
Why? They're mostly different markets. Most people using Nano Banana Pro aren't using Antigravity.
A cluster of launches reinforces the idea that Google is growing and leading in a bunch of areas.
In other words, if it's having so many successes it feels like overload, that's an excellent narrative. It's not like it's going to prevent people from using the tools.
Powell Doctrine, but for AI. No one should dispute that Google is the leader in every(?) category of AI: LLM, image gen, video editing, world models, etc.

https://killedbygoogle.com/
You can try it out for free on LMArena [0]: New Chat -> Battle dropdown -> Direct Chat -> Click on Generate Image in the chat box -> Click dropdown from hunyuan-image-3.0 -> gemini-3-pro-image-preview (nano-banana-pro).
I've only managed to get a few prompts to go through; if it takes longer than 30 seconds, it seems to just time out. Image quality seems to vary wildly; the first image I tried looked really good, but then I tried to refresh a few times and it kept getting worse.

[0] https://lmarena.ai/
First model I've seen that was consistently compositional, easily handling requests like
“Generate an image of an african elephant painted in the New England flag, doing a backflip in front of the russian federal assembly.”
OpenAI made the biggest step change towards compositionality in image generation when they started directly generating image tokens for decoders from foundation LLMs, and it worked very well (OpenAI's images were better in this regard than Nano Banana 1, but struggled with some OOD images like elephants doing backflips), but Banana 2 nails this stuff in a way I haven't seen anywhere else.
If video follows the same trends as images in terms of prompt adherence, that will be very valuable... and interesting.
It's crazy how good these models are at text now. Remember when text was literally impossible? Now the models can diegetically render any text. It's so good now that it seems like a weird blip that it _wasn't_ possible before.
I agree, it's improving by leaps. I'm still patiently waiting for my niche use case of creating new icons, though: one that can match the existing curvature, weight, spacing, and balance. It seems AI is struggling in the overlap of visuals <-> code, or perhaps there's less business incentive to train on that front. I know the pelican-on-bicycle SVG is getting better, but it's still really rough looking and hard to modify with prompts versus just spending some time upfront to do it yourself in an editor.
SynthID seems interesting but in classic Google fashion, I haven't a clue on how to use it and the only button that exists is join a waitlist. Apparently it's been out since 2023? Also, does SynthID work only within gemini ecosystem? If so, is this the beginning of a slew of these products with no one standard way? i.e "Have you run that image through tool1, tool2, tool3, and tool4 before deciding this image is legit?"
edit: apparently people have been able to remove these watermarks with a high success rate so already this feels like a DOA product
> SynthID seems interesting but in classic Google fashion, I haven't a clue on how to use it and the only button that exists is join a waitlist. Apparently it's been out since 2023? Also, does SynthID work only within gemini ecosystem? If so, is this the beginning of a slew of these products with no one standard way
No, it's not the beginning; multiple different watermarking standards, watermark checking systems, and, of course, published countermeasures of various effectiveness for most of them have been around for a while.
I've tried to repaint the exterior of my house. More than 20 times with very detailed prompts. I even tried to optimize it with Claude. No matter what, every time it added one, two or three extra windows to the same wall.
Here, it mostly poisons your test, because that exact photo probably exists in the underlying training data and the trained network will be more or less optimized on working with it. It's really the same consideration you'd want to make when testing classifiers or other ML techs 10 years ago.
Most people taking on a task like this will be using an original photo -- missing entirely from any training data, poorly framed, unevenly lit, etc. -- and you need to be careful to capture as much of that as possible when trying to evaluate how a model will work in that kind of use case.
The failure and stress points for AI tools are generally kind of alien and unfamiliar because the way they operate is totally different from the way a human operates. If you're not especially attentive to their weird failure shapes and biases when you test them, you'll easily get false positives (and false negatives) that lead you to misleading conclusions.
I also tried that in the past with poor results. I just tried it this morning with nano banana pro and it nailed it with a very short prompt: "Repaint the house white with black trim. Do not paint over brick."
I don't know what it is with Gemini (and even other models), but I swear they must be doing some kind of active load-dependent quantization or a/b/c/d testing behind the scenes, because sometimes the model is stellar and hitting everything, and other times it's tripping all over itself.
The most effective fix I have found is that when the model is acting dumb, just turn it off, come back in a few hours to a new chat, and try again.
Maybe somewhere in the original comment it would have been fair to mention that you can barely see the house in the original photo. This is actually a hilarious complaint.
That cannot be a valid excuse. Other than adding extra windows to the clearly visible wall, it's obvious the model is perfectly capable of "seeing" the house. It just cannot "believe" that there can be a big empty wall on a garden house.
I was at a tech conference yesterday, and I asked someone if they had tried Nano Banana. They looked at me like I was crazy. These names aren't helping! (But honestly I love it; easier to remember than Gemini-2.whatever.)
Honestly I give Google credit for realizing that they had something that people were talking about and running with it instead of just calling it gemini-image-large-with-text-pro
They tried calling it gemini-2.5-whatever, but social media obsessed over the name "Nano Banana", which was just its codename that got teased on Twitter for a few weeks prior to launch.
After launch, Google's public branding for the product was "Gemini" until Google just decided to lean in and fully adopt the vastly more popular "Nano Banana" label.
The public named this product, not Google. Google's internal codename went viral and upstaged the official name.
Branding matters for distribution. When you install yourself into the public consciousness with a name, you'd better use the name. It's free distribution. You own human wetware market share for free. You're alive in the minds of the public.
Renaming things every human has brand recognition of, eg. HBO -> Max, is stupid. It doesn't matter if the name sucks. ChatGPT as a name sucks. But everyone in the world knows it.
This will forever be Nano Banana unless they deprecate the product.
What can Nano Banana do that ChatGPT-made images can't? Or is it only better for image editing, from what I can gather from these comments so far? I haven't used it, so genuinely curious.
I made some direct comparisons in my Nano Banana post (https://news.ycombinator.com/item?id=45917875), but Nano Banana can handle photorealistic photos with nuanced prompts much better. And there is no yellow filter.
The rollout doesn't seem to have reached my userid yet. How successful are people at getting these things to actually produce useful images? I was trying recently with the (non-Pro) Nano Banana to see what the fuss was about. As a test case, I tried to get it to make a diagram of a zipper merge (in driving), using numbered arrows to indicate what the first, second, third, etc. cars should do.
I had trouble reliably getting it to...
* produce just two lanes of traffic
* have all the cars facing the same way—sometimes even within one lane they'd be facing in opposite directions.
* contain the construction within the blocked-off area. I think similarly it wouldn't understand which side was supposed to be blocked off. It'd also put the lane closure sign in lanes that were supposed to be open.
* have the cars be in proportion to the lane and road instead of two side-by-side within a lane.
* have the arrows go in the correct direction instead of veering into the shoulder or U-turning back into oncoming traffic
* use each number once, much less on the correct car
This is consistent with my understanding of how LLMs work, but I don't understand how you can "visualize real-time information like weather or sports" accurately with these failings.
Below is one of the prompts I tried to go from scratch to an image:
> You are an illustrator for a drivers' education handbook. You are an expert on US road signage and traffic laws. We need to prepare a diagram of a "zipper merge". It should clearly show what drivers are expected to do, without distracting elements.
> First, draw two lanes representing a single direction of travel from the bottom to the top of the image (not an entire two-way road), with a dotted white line dividing them. Make sure there's enough space for the several car-lengths approaching a construction site. Include only the illustration; no title or legend.
> Add the construction in the right lane only near the top (far side). It should have the correct signage for lane closure and merging to the left as drivers approach a demolished section. The left lane should be clear. The sign should be in the closed lane or right shoulder.
> Add cars in the unclosed sections of the road. Each car should be almost as wide as its lane.
> Add numbered arrows #1–#5 indicating the next cars to pass to the left of the "lane closed" sign. They should be in the direction the cars will move: from the bottom of the illustration to the top. One car should proceed straight in the left lane, then one should merge from the right to the left (indicate this with a curved arrow), another should proceed straight in the left, another should merge, and so on.
I did have a bit better luck starting from a simple image and adding an element to it with each prompt. But on the other hand, when I did that it wouldn't do as well at keeping space for things. And sometimes it just didn't make any changes to the image at all. A lot of dead ends.
I also tried sketching myself and having it change the illustration style. But it didn't do it completely. It turned some of my boxes into cars but not necessarily all of them. It drew a "proper" lane divider over my thin dotted line but still kept the original line. etc.
Much better than previous attempts. Still has an extra lane with the cars on the right cutting off the cars in the middle. Still has the numbers in the wrong order.
I'd try some more if I were you. I saw an example of a generated infographic that was greatly improved over anything I've seen an image generator do before. What you desire seems in the realm of possibility.
Imagen4 did no better. edit: example https://imgur.com/Dl8PWgm with a so-so result: four lanes, cars at least facing the same way, lane block looks good, weird extra division in the center, some numbers repeated, one arrow going straight into construction, one arrow going backwards
edit: or Imagen4 Ultra. https://imgur.com/a/xr2ElXj cars facing opposite directions within a lane, 2-way (4 lanes total), double-ended arrow, confused disaster. pretty though.
I'd be curious about how well the inline verification works - an easy example is to have it generate a 9-pointed star, a classic example that many SOTA models have difficulties with.
In the past, I've deliberately stuck a Vision-language model in a REPL with a loop running against generative models to try to have it verify/try again because of this exact issue.
EDIT: Just tested it in Gemini - it either didn't use a VLM to actually look at the finished image or the VLM itself failed.
Output:
I have finished cross-referencing the image against the user's specific requests. The primary focus was on confirming that the number of points on the star precisely matched the requested nine. I observed a clear visual representation of a gold-colored star with the exact point count that the user specified, confirming a complete and precise match.
"Inline verification of images following the prompt is awesome, and you can do some _amazing_ stuff with it." - could you elaborate on this? sounds fascinating but I couldn't grok it via the blog post (like, it this synthid?)
LLMs might be a dead end, but we're going to have amazing images, video, and 3D.
To me the AI revolution is making visual media (and music) catch up with the text-based revolution we've had since the dawn of computing.
Computers accelerated typing and text almost immediately, but we've had really crude tools for images, video, and 3D despite graphics and image processing algorithms.
AI really pushes the envelope here.
I think images/media alone could save AI from "the bubble" as these tools enable everyone to make incredible content if you put the work into it.
Everyone now has the ingredients of Pixar and a music production studio in their hands. You just need to learn the tools and put the hours in and you can make chart-topping songs and Hollywood grade VFX. The models won't get you there by themselves, but using them in conjunction with other tools and understanding as to what makes good art - that can and will do it.
Screw ChatGPT, Claude, Gemini, and the rest. This is the exciting part of AI.
LLMs are useful, but they've hit a wall on the path to automating our jobs. Benchmark scores are just getting better at test taking. I don't see them replacing software engineers without overcoming obstacles.
AI for images, video, music - these tools can already make movies, games, and music today with just a little bit of effort by domain experts. They're 10,000x time and cost savers. The models and tools are continuing to get better on an obvious trend line.
Neat use-case, though the sword literally telescopically inverts itself at the beginning of the scene like a light saber where you would have expected it to be drawn from its scabbard.
I'd be interested to see how Wan 2.2 First/Last frame handles those images though...
That is an interesting error, actually. It happened because both orientations of the sword are visually plausible, but abrupt transitions from one to the other are not; there needs to be physical continuity.
Here is a reproduction of the Matrix bullet time shot with and without pose guidance to illustrate the problem: https://youtu.be/iq5JaG53dho?t=1125
Yeah, sadly Veo 3.1 has not caught up to the image generation capabilities. Maybe we need to work on making video generation more physically consistent. But the image generation results from Banana Pro are great.
Just last night I was using Gemini "Fast" to test its output for a unique image we would have used in some consumer research, if there had been a good stock image back in the day. I have been testing this prompt since the early days of AI images. The improvement in quality has been pretty remarkable for the same prompt. Composition across this time has been consistent. What I initially thought was "good enough" now is... fantastic. Just so many little details got more life-like with each new generation.

Funnily enough, our images must be 3:2 aspect ratio. I kept asking GFast to change its square output to 3:2. It kept saying it would, but each image was square or nearly square. GFast in the end was very apologetic, and said it would alert about this issue. Today I read that GPro does aspect ratios. Tried the same prompt again, burning up some "Thinking" credits, and got another fantastically life-like image in 3:2.

We have a new project coming up. We have relied entirely on stock or, in some cases, custom-shot images to date. Now, apart from the time needed to get the prompts right whilst meeting with the client, I cannot see how stock or custom images can compete. I mean the GPro images -- again, for what is a very specific and unusual prompt -- are just "Wow".

Want to emphasize again: we are looking for specific details that many would not. So the thoughts above are specific to this. Still, while many faults can be found with AI, Nano Banana has certainly proven itself to me.
edit: I was thinking about this, and am not sure I even saw Pro3 as my image option last night. Today it was clearly there.
I tried the studio ghibli prompt on a photo of me and my wife in Japan and it was... not good. It looked more like a hand-drawn sketch made with colored pencils, but none of the colors were correct. Everything was a weird shade of yellow/brown.
This has been an oddly difficult benchmark for Gemini's NB models. Google's image models have always been pretty bad at the studio ghibli prompt, but I'm shocked at how poorly it still performs at this task.
Could be they are specifically training against it. There was some controversy about "studio ghibli style". It's similar to how in the early days of Stable Diffusion "Greg Rutkowski style" was a very popular prompt to get a specific look. These days, modern Stable Diffusion-based models like SD 3 or FLUX have mostly removed references to specific artists from their datasets.
This is really impressive. As a former designer, I'm equally excited that people will be able to generate images like this with a prompt, and sad that there will be much less incentive for people to explore design / "photoshopping" as a craft or a career.
At the end of the day, a tool is a tool, and the computer had the same effect on the creative industry when people started using them in place of illustrating by hand, typesetting by hand, etc. I don't want my personal bias to get in the way too much, but every nail that AI hammers into the creative industry's coffin is hard to witness.
I feel you. In fact, IMO, the SWE1-level coding industry seems to be lagging a couple of years behind on this front.
The trouble is that learning the fundamentals is now a large trough to get through, just the way grade 3-10 children learn their math fundamentals despite there being calculators. It's no longer "easy mode" in creative careers.
I wonder how hard it is to remove that SynthID watermark...
Looks like: "When tested on images marked with Google’s SynthID, the technique used in the example images above, Kassis says that UnMarker successfully removed 79 percent of watermarks." From https://spectrum.ieee.org/ai-watermark-remover
My experience with Nano Banana is a constant struggle to get consistent images when dealing with multiple objects in an image, I mean creating consistent sequences etc.

We spent a lot of money trying but eventually gave up. If it is easier in Pro, then it probably stands a chance.
To expand, it comes from the stealth name it was given on LMArena I believe. The model made news while still in "stealth mode" and so Google capitalised on the PR they'd already built around that and just launched it officially with the same name.
The SynthID check for fishy photos is a step in the right direction, but without tighter integration into everyday tooling it's not going to move the needle much. Like when I hold the power button on my Pixel 9, it would be great if it could identify synthetic images on the screen before I think to ask about it. For what it's worth, it would be great if the power button shortcut on Pixel did a lot more things.
1. Trigger Circle to Search by long-holding the home button/bar
2. Select the image
3. Navigate to About this image on the Google search top bar all the way to the right - check if it says "Made by Google AI" - which means it detected the SynthID watermark.
I'll be running it through my GenAI Comparison benchmark shortly - but so far it seems to be failing on the same tests that the original Nano Banana struggled with (such as SHRDLU).
> Generate better visuals with more accurate, legible text directly in the image in multiple languages
Assuming that this new model works as advertised, it's interesting to me that it took this long to get an image generation model that can reliably generate text. Why is text generation in images so hard?
It’s not necessarily harder than other aspects. However:
- It requires an AI that actually understands English, i.e. an LLM. Older, diffusion-only models were naturally terrible at that, because they weren't trained on it.
- It requires the AI to make no mistakes in image rendering, and that's a high bar. Mistakes in image generation are so common we have memes about them, and for all that hands generally work fine now, the rest of the picture is full of mistakes you can't tell are mistakes. That's entirely impossible with text.
Nano Banana Pro seems to somewhat reliably produce entire pictures without any mistakes at all.
As a complete layman, it seems obvious that it should be hard? Like, text is a type of graphic that needs to be coherent both in its detail and its large structure, and there’s a very small amount of variation that we don’t immediately notice as strange or flat out incorrect. That’s not true of most types of imagery.
I'm trying to create a team T-shirt from a bunch of kids' drawings. The model has to synthesize a bunch of disparate drawings into a cohesive concept, incorporate the team's name in the appropriate color and font, and make it simple enough for a T-shirt.
1) I have a tricep tendon injury and ChatGPT wants me to check my tricep reflex. I have no idea where on the elbow you're supposed to tap to trigger the reflex.
2) I'm measuring my body fat using skin fold calipers. Show me where the measurement sites are.
3) I'm going hiking. Remind me how to identify poison ivy and dangerous snakes.
I think that's a fair assessment. I write a lot of bizarre fiction in my spare time, so Text2Image tools are a fun way to see my visions visualized.
Like this one:
A piano where the keyboard is wrapped in a circular interface surrounding a drummer's stool connected to a motor that spins the seat, with a foot-operated pedal to control rotation speed for endless glissandos.
Nano Banana is more of an image editing model, which probably has more broad use cases for non-generative applications: interior decorating, architecture, picking wardrobes, etc.
Definitely, but don't sleep on its generative capacities either. You can give it an image and instruct it to "Use the attached image purely as a stylistic reference" and then proceed to use it as a regular generative model.
Yeah... For some reason none of these are use cases in my day to day life. That said, I also don't open Photoshop very often. And maybe that's what this is meant to replace.
Not for everyone everyday, but a good tool to have in the toolbox. I recently was very easily able to mock up what a certain Christmas decoration would look like on the house. By next year, I'm sure that feature will be part of the product page.
Honestly I think this is exactly how we're all feeling right now. Racing towards an unknown horizon in a nitrous powered dragster surrounded by fire tornadoes.
Super important for Google as a search engine so they can filter out and downrank AI generated results. However I expect there are many models out there which don’t do this, that everyone could use instead. So in the end a “feature” like this makes me less likely to use their model because I don’t know how Google will end up treating my blog post if I decide to include an AI generated or AI edited image.
> Today, we are putting a powerful verification tool directly in consumers’ hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.
Re-rolling a few times got it to mention trying SynthID, but as a false negative, assuming it actually did the check and isn't just bullshitting.
> No Digital Watermark Detected: I was unable to detect any digital watermarks (such as Google's SynthID) that would definitively label it as being generated by a specific AI tool.
This would be a lot simpler if they just exposed the detector directly, but apparently the future is coaxing an LLM into doing a tool call and then second guessing whether it actually ran the tool.
By anybody's AI using SynthID watermarking, not just Google's AI using SynthID watermarking (it looks like partnership is not open to just anyone though, you have to apply).
Interesting they didn’t post any benchmark results - lmarena/artificial analysis etc. I would’ve thought they’d be testing it behind the scenes the same way they did with Gemini 3.
I wouldn't trust any of the info in those images in the first carousel if I found them in the wild. It looks like AI image slop and I assume anyone who thinks those look good enough to share did not fact check any of the info and just prompted "make an image with a recipe for X"
I would do the same. But the reason for that is because I’m terrible at drawing and digital art, so I would need some help with the graphics in an infographics anyways. I don’t really need help with writing text or typesetting the text. I feel like if I were better at creating art I would not want AI involved at all.
Currently, it’s rolling out in the Gemini app. When you use the “Create image” option, you’ll see a tooltip saying “Generating image with Nano Banana Pro.”
And in AI Studio, you need to connect a paid API key to use it:
Adobe's stock is down 50% from last year's peak. It's humbling and scary that entire industries with millions of jobs evaporate in a matter of few years.
On the contrary, it's encouraging to know that maliciously greedy companies like Adobe are getting screwed for being so malicious and greedy :thumbsup:
I had second thoughts about this comment, but if I stopped typing in the middle of it, I would've had to pay a cancellation fee.
Did... someone make a bot to try to post a summary to HN with an LLM that also completely fails at being accurate? (Which is incredibly fitting, given what the topic here is.)
But ... it comes from Google. My goal is to eventually degoogle completely. I am not going to add any more dependency - I am way too annoyed at having to use the search engine (getting constantly worse though), google chrome (long story ...) and youtube.
Nano Banana Pro sounds like classic Google branding: quirky name, serious tech underneath. I’m curious whether the “Pro” here is about actual professional‑grade features or just marketing polish. Either way, it’s another reminder that naming can shape expectations as much as specs.
Nano Banana Pro should work with my gemimg package (https://github.com/minimaxir/gemimg) without pushing a new version by passing:
I'll add the new output resolutions and other features ASAP. However, looking at the pricing (https://ai.google.dev/gemini-api/docs/pricing#standard_1), I'm definitely not changing the default model to Pro as $0.13 per 1k/2k output will make it a tougher sell.EDIT: Something interesting in the docs: https://ai.google.dev/gemini-api/docs/image-generation#think...
> The model generates up to two interim images to test composition and logic. The last image within Thinking is also the final rendered image.
Maybe that's partially why the cost is higher: it's hard to tell if intermediate images are billed in addition to the output. However, this could cause an issue with the base gemimg and have it return an intermediate image instead of the final image depending on how the output is constructed, so will need to double-check.
I've been using a bespoke Generative Model -> VLM Validator -> LLM Prompt Modifier REPL as part of my benchmarks for a while now so I'd be curious to see how this stacks up. From some preliminary testing (9 pointed star, 5 leaf clover, etc) - NB Pro seems slightly better than NB though it still seems to get them wrong. It's hard to tell what's happening under the covers.
https://minimaxir.com/2025/11/nano-banana-prompts/#hello-nan...
My recreations of those pancake batter skulls using Nano Banana Pro: https://simonwillison.net/2025/Nov/20/nano-banana-pro/#tryin...
How do you know Simon? It's certainly a blog post, with content about prompting in it. If your goal is to make generative art that uses specific IP, I wouldn't use it.
GDM folks, get Max on!
In my benchmarks, Nano-Banana scores a 7 out of 12. Seedream4 managed to outpace it, but Seedream can also introduce slight tone mapping variations. NB is the gold standard for highly localized edits.
Comparisons of Seedream4, NanoBanana, gpt-image-1, etc.
https://genai-showdown.specr.net/image-editing
I tried this prompt:
Here's the result: https://simonwillison.net/2025/Nov/20/nano-banana-pro/#creat...> “Data Ingestion (Read-Only)” is a bit off.
But boy was it beautiful.
There is not a single mention about accuracy, risks or anything else in the blogpost, just how awesome the thing is. It's clearly not meant to be reliable just yet, but not making this clear up front. Isn't this almost intentionally misleading people, something that should be illegal?
Like it would be nice if all photo and video generated by the big players would have some kind of standardized identifier on them - but now you're left with the bajillion other "grey market" models that won't give a damn about that.
https://www.nbcnews.com/tech/tech-news/ai-generated-evidence...
> “My wife and I have been together for over 30 years, and she has my voice everywhere,” Schlegel said. “She could easily clone my voice on free or inexpensive software to create a threatening message that sounds like it’s from me and walk into any courthouse around the country with that recording.”
> “The judge will sign that restraining order. They will sign every single time,” said Schlegel, referring to the hypothetical recording. “So you lose your cat, dog, guns, house, you lose everything.”
At the moment, the only alternative is courts simply never accept photo/video/audio as evidence. I know if I were a juror I wouldn't.
I think that by now it should be crystal clear to everyone that it matters a lot the sheer scale a new technology permits for $nefarious_intent.
Knives (under a certain size) are not regulated. Guns are regulated in most countries. Atomic bombs are definitely regulated. They can all kill people if used badly, though.
When a photo was faked/composed with old tech, it was relatively easy to spot. With photoshop, it became more complicated to spot it but at the same time it wasn't easy to mass-produce altered images. Large models are changing the rules here as well.
The story of human history is newer generations freaking about progress and novel changes that have never been seen before. And later generations being perfectly okay with it and adapting to a new style of life.
We could use the opportunity to deploy robust systems of verification and validation to all digital works. One that allows for proving authenticity while respecting privacy if desired. For example… it’s insane in the US we revolve around a paper social security number that we know damn well isn’t unique. Or that it’s a massive pain in the ass for most people to even check the hash of a download.
Guess which we’ll do!
I don’t think this is a good comparison: knives are easy to produce, guns a bit harder, atomic bombs definitely harder. You should find something that is as easy to produce as a knife, but regulated.
Or, if you see the altered photo as the "product", then the "product" of the knife/gun/bomb is the damage it creates to a human body.
https://en.wikipedia.org/wiki/Communications_Decency_Act
> Were politicians 20 years ago as overreative they'd have demanded Photoshop leave a trace on anything it edited.
I bet it will be called "Real Photos" or something like that, and the pictures will be signed by the camera hardware. Then iMessage will put a special border around it or something, so that when people share the photos with other Apple users they can prove that it was a real photo taken with their phone's camera.
How "real" are iPhone photos? They're also computationally generated, not just the light that came through the lens.
Even without any other post-processing, iPhones generate gibberish text when attempting to sharpen blurry images, they delete actual textures and replace them with smooth, smeared surfaces that look like a watercolor or oil paintings, and combine data from multiple frames to give dogs five legs.
There used to be a joke about people who did slideshows (on an actual slide projector) of their vacation photos at parties.
I don't see how it would defeat the cat and mouse game.
For example, it's trivial to post an advertisement without disclosure. Yet it's illegal, so large players mostly comply and harm is less likely on the whole.
It still won't prevent it, but it would prevent large players from doing it.
Plus, any service good at reverse-image search (like Google) can basically apply that to determine whether they generated it.
There will always be a way to defeat anything, but I don't see why this won't work for like 90% of cases.
No, but model training technology is out in the open, so it will continue to be possible to train models and build model toolchains that just don't incorporate watermarking at all, which is what any motivated actor seeking to mislead will do; the only thing watermarking will do is train people to accept its absence as a sign of reliability, increasing the effectiveness of fakes by motivated bad actors.
It may be easier if you have an oracle on your end to say "yes, this image has/does not have the watermark," which could be the case for some proposed implementations of an AI watermark. (Often the use-case for digital watermarks assumes that the watermarker keeps the evaluation tool secret - this lets them find, e.g, people who leak early screenings of movies.)
Always has been so far. You add noise until the signal gets swamped. In order to remain imperceptible it's a tiny signal, so it's easy to swamp.
You're right that there will existed generated content without these watermarks, but you can bet that all the commercial providers burning $$$$ on state of the art models will gradually coalesce around some means of widespread by-default/non-optional watermarking for content they let the public generate so that they can all avoid drowning in their own filth.
Unless the watermark randomly replaces objects in the scene with bananas, these images/videos will still spread like wildfire on platforms like TikTok, where the average netizen's idea of due diligence is checking for a six‑fingered hand... at best.
So, you exploit real people, but run your images through a realtime AI video transformation model doing either a close-to-noop transformation or something like changing the background so that it can't be used to identify the actual location if people do figure out you are exploiting real people, and then you have your real exploitation watermarked as AI fakery.
I don't think this is solving a problem, unless you mean a problem for the would-be exploiter.
All of this is trivially easy to circumvent ceremony.
Google is doing this to deflect litigation and to preserve their brand in the face of negative press.
They'll do this (1) as long as they're the market leader, (2) as long as there aren't dozens of other similar products - especially ones available as open source, (3) as long as the public is still freaked out / new to the idea anyone can make images and video of whatever, and (4) as long as the signing compute doesn't eat into the bottom line once everyone in the world has uniform access to the tech.
The idea here is that {law enforcement, lawyers, journalists} find a deep fake {illegal, porn, libelous, controversial} image and goes to Google to ask who made it. That only works for so long, if at all. Once everyone can do this and the lookup hit rates (or even inquiries) are < 0.01%, it'll go away.
It's really so you can tell journalists "we did our very best" so that they shut up and stop writing bad articles about "Google causing harm" and "Google enabling the bad guys".
We're just in the awkward phase where everyone is freaking out that you can make images of Trump wearing a bikini, Tim Cook saying he hates Apple and loves Samsung, or the South Park kids deep faking each other into silly circumstances. In ten years, this will be normal for everyone.
Writing the sentence "Dr. Phil eats a bagel" is no different than writing the prompt "Dr. Phil eats a bagel". The former has been easy to do for centuries and required the brain to do some work to visualize. Now we have tools that previsualize and get those ideas as pixels into the brain a little faster than ASCII/UTF-8 graphemes. At the end of the day, it's the same thing.
And you'll recall that various forms of written text - and indeed, speech itself - have been illegal in various times, places, and jurisdictions throughout history. You didn't insult Caesar, you didn't blaspheme the medieval church, and you don't libel in America today.
How can they distinguish real people being exploited from AI models autogenerating everything?
I mean, right now this is possible, largely because a lot of the AI videos have shortcomings. But imagine 5 years from now ...
The people who care don't consume content that even plausibly looks like real people being exploited. They wouldn't consume the content even if you pinky promised that the exploited-looking people are not real people. Even if you digitally signed that promise.
The people who don't care don't care.
Image verification has never been easy. People have been airbrushed out of and pasted into photos for over a century; AI just makes it easier and more accessible. Expecting a “click to verify” workflow is as unreasonable as it has ever been; only media literacy and a bit of legwork can accomplish this task.
We will always have local models. Eventually the Chinese will release a Nano Banana equivalent as open source.
If watermarking becomes a legal mandate, it will inevitably include a prohibition on distributing open models that do not include watermarking as a baked-in model feature (and probably on using and maybe even possessing them, but the distribution ban is what will have the most impact, since it is the part that is most policeable, and most people aren't going to train their own models - except, of course, the most motivated bad actors). So, for most users, it'll be much less accessible, and, at the same time, it won't solve the problem.
If social media platforms are required by law to categorize content as AI-generated, they need to check with the public "AI generation" providers. And since there is no agreed-upon (public) standard for imperceptible watermark hashing, that means the content (image, video, audio) in its entirety needs to be uploaded to the various providers to check if it's AI-generated.
Yes, it sounds crazy, but that's the plan; imagine every image you post on Facebook/X/Reddit/Whatsapp/whatever gets uploaded to Google / Microsoft / OpenAI / UnnamedGovernmentEntity / etc. to "check if it's AI". That's what the current law in Korea and the upcoming laws in California and EU (for August 2026) require :(
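A sketch of what that checking obligation looks like in practice; every endpoint below is hypothetical, since no shared public detection API exists - which is exactly the problem:

    # Hypothetical: none of these detector endpoints exist publicly today.
    # The point is that without a shared standard, the full image has to be
    # shipped to every provider in turn.
    import requests

    DETECTOR_ENDPOINTS = [
        "https://detector.google.example/check",  # hypothetical
        "https://detector.openai.example/check",  # hypothetical
        "https://detector.msft.example/check",    # hypothetical
    ]

    def looks_ai_generated(image_bytes: bytes) -> bool:
        for url in DETECTOR_ENDPOINTS:
            resp = requests.post(url, files={"image": image_bytes}, timeout=30)
            if resp.ok and resp.json().get("watermark_detected"):
                return True
        return False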
Hell, for some arbitrary photographs it might even be possible to come up with an AI prompt that produces them, or something similar enough to be indistinguishable to the human eye, opening up the possibility of "proving" something is fake even when it was actually real.
What you want just can't work, theoretically or practically, even before you get to the other concerns mentioned in this thread.
DeepMind Page: https://deepmind.google/models/gemini-image/pro/
Model Card: https://storage.googleapis.com/deepmind-media/Model-Cards/Ge...
SynthID in Gemini: https://blog.google/technology/ai/ai-image-verification-gemi...
A cluster of launches reinforces the idea that Google is growing and leading in a bunch of areas.
In other words, if it's having so many successes it feels like overload, that's an excellent narrative. It's not like it's going to prevent people from using the tools.
https://killedbygoogle.com/
I've only managed to get a few prompts to go through, if it takes longer than 30 seconds it seems to just time out. Image quality seems to vary wildly; the first image I tried looked really good but then I tried to refresh a few times and it kept getting worse.
“Generate an image of an african elephant painted in the New England flag, doing a backflip in front of the russian federal assembly.”
OpenAI made the biggest step change towards compositionality in image generation when they started directly generating image tokens for decoders from foundation LLMs, and it worked very well (OpenAI's images were better in this regard than Nano Banana 1, but struggled with some OOD images like elephants doing backflips), but Banana 2 nails this stuff in a way I haven't seen anywhere else.
if video follows the same trends as images in terms of prompt adherence, that will be very valuable... and interesting
Not to mention all the other stuff.
edit: apparently people have been able to remove these watermarks with a high success rate so already this feels like a DOA product
No, it's not the beginning: multiple different watermarking standards, watermark-checking systems, and, of course, published countermeasures of varying effectiveness have been around for a while.
Results: https://imgur.com/a/9II0Aip
The white house was the original (random photo from Google). The prompt was "What paint color would look nice? Paint the house."
Careful with that kind of thing.
Here, it mostly poisons your test, because that exact photo probably exists in the underlying training data and the trained network will be more or less optimized on working with it. It's really the same consideration you'd want to make when testing classifiers or other ML techs 10 years ago.
Most people taking on a task like this will be using an original photo -- missing entirely from any training data, poorly framed, unevenly lit, etc. -- and you need to be careful to capture as much of that as possible when trying to evaluate how a model will work in that kind of use case.
The failure and stress points for AI tools are generally kind of alien and unfamiliar because the way they operate is totally different from the way a human operates, and if you're not especially attentive to their weird failure shapes and biases when you test them, you'll easily get false positives (and false negatives) that lead you to misleading conclusions.
At some point, this is probably gonna result in you coming home to a painted house and a big bill, lol.
The most effective fix I have found is that when the model is acting dumb, just turn it off, come back in a few hours to a new chat, and try again.
After launch, Google's public branding for the product was "Gemini" until Google just decided to lean in and fully adopt the vastly more popular "Nano Banana" label.
The public named this product, not Google. Google's internal codename went viral and upstaged the official name.
Branding matters for distribution. When you install yourself into the public consciousness with a name, you'd better use the name. It's free distribution. You own human wetware market share for free. You're alive in the minds of the public.
Renaming things every human has brand recognition of, e.g. HBO -> Max, is stupid. It doesn't matter if the name sucks. ChatGPT as a name sucks. But everyone in the world knows it.
This will forever be Nano Banana unless they deprecate the product.
I had trouble reliably getting it to...
* produce just two lanes of traffic
* have all the cars facing the same way—sometimes even within one lane they'd be facing in opposite directions.
* contain the construction within the blocked-off area. I think similarly it wouldn't understand which side was supposed to be blocked off. It'd also put the lane closure sign in lanes that were supposed to be open.
* have the cars be in proportion to the lane and road instead of two side-by-side within a lane.
* have the arrows go in the correct direction instead of veering into the shoulder or U-turning back into oncoming traffic
* use each number once, much less on the correct car
This is consistent with my understanding of how LLMs work, but I don't understand how you can "visualize real-time information like weather or sports" accurately with these failings.
Below is one of the prompts I tried to go from scratch to an image:
> You are an illustrator for a drivers' education handbook. You are an expert on US road signage and traffic laws. We need to prepare a diagram of a "zipper merge". It should clearly show what drivers are expected to do, without distracting elements.
> First, draw two lanes representing a single direction of travel from the bottom to the top of the image (not an entire two-way road), with a dotted white line dividing them. Make sure there's enough space for the several car-lengths approaching a construction site. Include only the illustration; no title or legend.
> Add the construction in the right lane only near the top (far side). It should have the correct signage for lane closure and merging to the left as drivers approach a demolished section. The left lane should be clear. The sign should be in the closed lane or right shoulder.
> Add cars in the unclosed sections of the road. Each car should be almost as wide as its lane.
> Add numbered arrows #1–#5 indicating the next cars to pass to the left of the "lane closed" sign. They should be in the direction the cars will move: from the bottom of the illustration to the top. One car should proceed straight in the left lane, then one should merge from the right to the left (indicate this with a curved arrow), another should proceed straight in the left, another should merge, and so on.
I did have a bit better luck starting from a simple image and adding an element to it with each prompt. But on the other hand, when I did that it wouldn't do as well at keeping space for things. And sometimes it just didn't make any changes to the image at all. A lot of dead ends.
I also tried sketching myself and having it change the illustration style. But it didn't do it completely. It turned some of my boxes into cars but not necessarily all of them. It drew a "proper" lane divider over my thin dotted line but still kept the original line. etc.
https://imgur.com/a/3PDUIQP
https://imgur.com/a/ENNk68B
Much better than previous attempts. Still has an extra lane with the cars on the right cutting off the cars in the middle. Still has the numbers in the wrong order.
edit: or Imagen4 Ultra. https://imgur.com/a/xr2ElXj cars facing opposite directions within a lane, 2-way (4 lanes total), double-ended arrow, confused disaster. pretty though.
The inline verification of images following the prompt is awesome, and you can do some _amazing_ stuff with it.
It's probably not as fun anymore though (in the early access program, it doesn't have censoring!)
In the past, I've deliberately stuck a Vision-language model in a REPL with a loop running against generative models to try to have it verify/try again because of this exact issue.
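Roughly, the loop looks like the sketch below; the three helpers are placeholders for whatever image model, VLM judge, and prompt-rewriting LLM you wire in:

    from dataclasses import dataclass

    @dataclass
    class Verdict:
        passed: bool
        feedback: str

    # Placeholders: swap in real image-gen / VLM / LLM client calls.
    def generate_image(prompt: str) -> bytes: ...
    def vlm_check(image: bytes, prompt: str) -> Verdict: ...
    def revise_prompt(prompt: str, feedback: str) -> str: ...

    def generate_until_valid(prompt: str, max_rounds: int = 5) -> bytes:
        """Generate -> VLM-validate -> LLM-revise until the image passes."""
        image = generate_image(prompt)
        for _ in range(max_rounds - 1):
            verdict = vlm_check(image, prompt)
            if verdict.passed:
                break
            prompt = revise_prompt(prompt, verdict.feedback)
            image = generate_image(prompt)
        return image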
EDIT: Just tested it in Gemini - it either didn't use a VLM to actually look at the finished image or the VLM itself failed.
Output:
Result:

To me the AI revolution is making visual media (and music) catch up with the text-based revolution we've had since the dawn of computing.
Computers accelerated typing and text almost immediately, but we've had really crude tools for images, video, and 3D despite graphics and image processing algorithms.
AI really pushes the envelope here.
I think images/media alone could save AI from "the bubble" as these tools enable everyone to make incredible content if you put the work into it.
Everyone now has the ingredients of Pixar and a music production studio in their hands. You just need to learn the tools and put the hours in and you can make chart-topping songs and Hollywood grade VFX. The models won't get you there by themselves, but using them in conjunction with other tools and understanding as to what makes good art - that can and will do it.
Screw ChatGPT, Claude, Gemini, and the rest. This is the exciting part of AI.
AI for images, video, music - these tools can already make movies, games, and music today with just a little bit of effort by domain experts. They're 10,000x time and cost savers. The models and tools are continuing to get better on an obvious trend line.
I'd be interested to see how Wan 2.2 First/Last frame handles those images though...
Here is a reproduction of the Matrix bullet time shot with and without pose guidance to illustrate the problem: https://youtu.be/iq5JaG53dho?t=1125
edit: I was thinking about this, and am not sure I even saw Pro3 as my image option last night. Today it was clearly there.
This has been an oddly difficult benchmark for Gemini's NB models. Google's image models have always been pretty bad at the studio ghibli prompt, but I'm shocked at how poorly it still performs at this task.
At the end of the day, a tool is a tool, and the computer had the same effect on the creative industry when people started using them in place of illustrating by hand, typesetting by hand, etc. I don't want my personal bias to get in the way too much, but every nail that AI hammers into the creative industry's coffin is hard to witness.
The trouble is that learning fundamentals now is a large trough to go past, just the way grade 3-10 children learn their math fundamentals despite there being calculators. It's no longer "easy mode" in creative careers.
Looks like: "When tested on images marked with Google’s SynthID, the technique used in the example images above, Kassis says that UnMarker successfully removed 79 percent of watermarks." From https://spectrum.ieee.org/ai-watermark-remover
We spent a lot of money trying but eventually gave up. If it is easier in Pro, then it probably stands a chance.
1. Trigger Circle to Search with long holding the home button/bar
2. Select the image
3. Navigate to About this image on the Google search top bar all the way to the right - check if it says "Made by Google AI" - which means it detected the SynthID watermark.
https://genai-showdown.specr.net/image-editing
Assuming that this new model works as advertised, it's interesting to me that it took this long to get an image generation model that can reliably generate text. Why is text generation in images so hard?
- It requires an AI that actually understands English, i.e. an LLM. Older, diffusion-only models were naturally terrible at that, because they weren't trained on it.
- It requires the AI to make no mistakes on image rendering, and that's a high bar. Mistakes in image generation are so common we have memes about them, and for all that hands generally work fine now, the rest of the picture is full of mistakes you can't tell are mistakes. That's entirely impossible with text, where every mistake is visible.
Nano Banana Pro seems to somewhat reliably produce entire pictures without any mistakes at all.
For people that use them (regularly or not), what do you use them for?
1) I have a tricep tendon injury and ChatGPT wants me to check my tricep reflex. I have no idea where on the elbow you're supposed to tap to trigger the reflex.
2) I'm measuring my body fat using skin-fold calipers. Show me where the measurement sites are.
3) I'm going hiking. Remind me how to identify poison ivy and dangerous snakes.
4) What would I look like with a buzz cut?
https://mordenstar.com/portfolio/gorgonzo
https://mordenstar.com/portfolio/brawny-tortillas
https://mordenstar.com/portfolio/ms-frizzle-lava
Like this one:
A piano where the keyboard is wrapped in a circular interface surrounding a drummer's stool connected to a motor that spins the seat, with a foot-operated pedal to control rotation speed for endless glissandos.
but concept art, try-it-on for clothes or paint, stock art, etc
https://www.youtube.com/watch?v=5mZ0_jor2_k
Honestly I think this is exactly how we're all feeling right now. Racing towards an unknown horizon in a nitrous powered dragster surrounded by fire tornadoes.
https://deepmind.google/models/synthid/
But of course there’s no way to enforce it on local generation.
Not sure how that makes any sense
https://i.imgur.com/WKckRmi.png
Google doesn't claim that Gemini would call SynthID detector at this point.
Edit: well they actually do. I guess it is not rolled out yet.
> Today, we are putting a powerful verification tool directly in consumers’ hands: you can now upload an image into the Gemini app and simply ask if it was generated by Google AI, thanks to SynthID technology. We are starting with images, but will expand to audio and video soon.
Re-rolling a few times got it to mention trying SynthID, but as a false negative, assuming it actually did the check and isn't just bullshitting.
> No Digital Watermark Detected: I was unable to detect any digital watermarks (such as Google's SynthID) that would definitively label it as being generated by a specific AI tool.
This would be a lot simpler if they just exposed the detector directly, but apparently the future is coaxing an LLM into doing a tool call and then second guessing whether it actually ran the tool.
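For comparison, the current roundabout route via the google-genai Python SDK looks roughly like this (the model name is my assumption; whether the SynthID tool call actually fires is exactly the part you can't verify):

    # Ask Gemini whether the image carries a SynthID watermark. There is no
    # public detector endpoint to call directly; you get a free-text answer
    # and no confirmation that the check actually ran.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment

    with open("suspect.png", "rb") as f:
        image_bytes = f.read()

    resp = client.models.generate_content(
        model="gemini-3-pro-preview",  # assumption; use whatever model you have
        contents=[
            types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            "Was this image generated by Google AI? Check for a SynthID watermark.",
        ],
    )
    print(resp.text)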
But I wouldn't mind being easily able to make infographics like these, I'd just like to supply the textual and factual content myself.
> Rolling out globally in the Gemini app
wanna be any more vague? is it out or not? where? when?
And in AI Studio, you need to connect a paid API key to use it:
https://aistudio.google.com/prompts/new_chat?model=gemini-3-...
> Nano Banana Pro is only available for paid-tier users. Link a paid API key to access higher rate limits, advanced features, and more.
The 2nd take is that AI is costing companies so much money that they need to cut workforce to pay for their AI investments.
I'm inclined to think the latter represents what's happening more than the former.
I had second thoughts about this comment, but if I stopped typing in the middle of it, I would've had to pay a cancellation fee.
Adobe, at least, makes money by selling software. Google makes money by capturing eyeballs; only incidentally does anything they do benefit the user.
ChatGPT's imagegen was released half a year ago, but there still isn't anything remotely similar to it in the open-weight realm.
Failed to generate content: permission denied. Please try again.
If you triggered the safeguard it'll give you the typical "sorry, I can't..." LLM response.
Not just are they making slop machines, they seem to be run by them.
I am too old for this shit.
(The Gemini 3 post has a million comments too many to ask this now)
But ... it comes from Google. My goal is to eventually degoogle completely. I am not going to add any more dependency - I am way too annoyed at having to use the search engine (getting constantly worse though), google chrome (long story ...) and youtube.
I'll eventually find solutions to these.