The warehouse safety video example is really funny, because the people don't react at all.
SOTA open source model for image and vid generation. Beats all others, but at 64B params it is too big to run on most people’s computers.
Still impressive nonetheless given its artificially generated training sets.
Beats nano banana 1 but is not yet competitive with 2, seedance2, grok imagine, etc.
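For a rough sense of why the parameter count matters here, a back-of-envelope sketch of weights-only memory for the 16B and 64B sizes mentioned in this thread. The precisions listed are assumptions for illustration, not what the model actually ships in, and the estimate ignores activations and runtime overhead, so real requirements will be higher.

```python
# Weights-only memory estimate. Ignores activations, caches, and runtime
# overhead. The precisions below are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Return weight storage in GB, using 1 GB = 1e9 bytes."""
    return params_billions * 1e9 * bytes_per_param / 1e9


if __name__ == "__main__":
    for size in (16, 64):  # parameter counts quoted in the thread
        for name, nbytes in BYTES_PER_PARAM.items():
            print(f"{size}B @ {name}: {weight_memory_gb(size, nbytes):.0f} GB")
```

At 2 bytes per parameter the 64B size needs about 128 GB for weights alone, which is more than any single consumer GPU offers.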
Great summary. I find image and video generation models are a more understandable reality check for how close local models are to frontier models.
Most of the examples they've chosen seem.. not good? What an odd mix of bad game engine and AI slop. I can't imagine that this stuff makes good training data for real-world applications.
> Cosmos 3 Nano is the compact version with 16B parameters and optimized for efficient inference. It’s designed to run on workstation-grade compute, like the NVIDIA RTX PRO 6000 GPU for real-time robotics inference and physical AI applications.
Looking forward to trying this out on my $10,000+ workstation-grade GPU that I need an equally expensive setup to run.
Good news, Nvidia will happily sell you one of their new RTX Spark laptops to run this.
I'm struggling to understand what this does.
> Generates future observations and action sequences.
Is that just a complicated way of saying video gen?
It can be used to generate synthetic data to train physical AI for robots, cars, drones, etc. The world can be simulated from a first-person perspective to generate training data without sending robots into people's homes.
As I understand it, they mean both computer vision and video gen, linked by a pretty robust world model. One of their hosted examples purely analyses an existing video; the other predicts a video from a static image (i.e. video gen).
Look at the table of supported modalities. It can take image/video/text/actions as input and produce image/video/text/actions as output.
That just raises more questions. What kind of "observation or action" does an image input generate? What is an action output if it's not text?
You can fine-tune it so, given an image and a task description, it generates a corresponding set of actions.