My first impression coming away from this is skepticism.
Anything with voice controls for routine use is a pretty tough sell. Doing this when you're not completely alone would be annoying to everyone around you.
Most of their examples seem like they could have been done with a right-click drop-down menu, so they don't really need to "re-invent the mouse pointer".
So is this thing talking to Google's servers all the time for the AI integration? So it won't work if you're not connected to the internet? Privacy concerns are obvious; now Google wants to have an AI watching literally everything you do on your computer?
Does it cost the user anything for the LLM use? If it's free, will it stay free forever? That's quite a lot to give away if they're expecting people to use it to change a single word like in one of their examples. I guess they're expecting to make the money back by gathering data about literally everything you do on your computer.
There might be a killer app for AI integration with personal computers that has yet to be invented, but this doesn't look like it.
Right — it does seem cool but the voice is patching over a major gap. If I'm talking already, why wouldn't I just describe what I'm looking at and have the AI grab it for me?
The "Edit an Image" Demo at the bottom is pretty fun. Maybe this is just Google flexing their LLM inference capacity.
Please don't.
I like text selection exactly how it is. I want precise controls.
It's fine for a touch interface like a phone, but on a computer I expect precision. As much as I can get.
Oh interesting, this is very cool. At first I thought it was just focus-follows-mouse but it's more interesting. You have certain keywords trigger "add to prompt". Ignoring the voice functionality (which is admittedly crucial right now because other inputs take over focus), I've often wanted to just have a continuous conversation with the LLM as I 'point and click' (or tab over and select) at various things. Might be neat to have text input focus continue to go to the LLM where I'm typing text, etc. (rough sketch of what I mean at the end of this comment)
Sometimes I go to a different page to take a screenshot, other times I'm browsing for a file, and other times I'm highlighting some log lines. Cursor did this well: selecting text in the terminal auto-focused the agent textbox, so you could talk to the agent, select some more text, and never have to re-select the original agent textbox. The agent is a top-level function in that system, not "just another app I have to switch to" to take my context with.
I have some small amount of bias because I've always felt input-constrained on computers. I have to move my hands to go places and that's exasperating. I've tried head tracking, had a vim pedal for a while, and used tiling WMs and things like this to help, but while my vim-fu is pretty good and I function inside individual apps very well with it, my cross-application workflow isn't.
In the end, perhaps we all have our home offices with our Apple Vision Pros and we talk to them like this to manoeuvre faster through our machines and get our ideas into them.
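Rough sketch of the kind of keyword-triggered loop I have in mind, purely hypothetical: the listen/capture/LLM helpers below are stand-ins I made up, not anything from the demo.

    # Hypothetical sketch: listen for demonstrative keywords, snapshot the
    # region under the pointer when one fires, and keep one continuous LLM
    # conversation going. All four helpers are injected stand-ins.
    TRIGGER_WORDS = {"this", "that", "here", "there"}

    def run_session(listen_for_keyword, pointer_position, capture_region, llm):
        context = []  # screenshots/snippets accumulated as I point around
        while True:
            _trigger, utterance = listen_for_keyword(TRIGGER_WORDS)
            x, y = pointer_position()
            # Grab whatever is around the pointer when the keyword is spoken.
            context.append(capture_region(x, y, radius=200))
            # The conversation keeps going; I never re-focus a chat box.
            print(llm.chat(utterance, attachments=context))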
Cool research. I wonder what we'll end up with.
I sense a privacy problem brewing.
It reminds me of Microsoft Recall in the sense that some portion of the screen is going to be continuously transmitted outside of the user's control.
What happens when someone browses something very private (planning a surprise engagement, looking at medical data, planning a protest)? All that data gets slurped to Google, where it's subject to a warrant or discovery, or gets folded into your advertising fingerprint.
Maybe the idea is that the data is sent to AI only when you right click, but that seems like a very thin firewall that a product manager will breach in the interests of delivering "predictive AI" via some kind of precomputed results.
Wiggle at CAPTCHAs, wiggle at Termux, wiggle at Emacs, wiggle at the Godot Editor, wiggle at my remote desktop.
(Not going to happen)
Of course, it isn't a Google demo if you can't use it to book a table at a restaurant. (Shown at the bottom of the page.)
It's beautiful how the human mind can take something very obvious but overlooked and make it into this fantastic innovation. Fab stuff.
Don't build these things; instead, build protocols and expose system-level APIs for application developers to build things.
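Very rough, hypothetical sketch of the shape such a system-level API could take; the names and types are invented for illustration, not anything proposed in the article.

    # Hypothetical sketch of a system-level pointer-context API, so any app
    # (or a local model) could consume it instead of one vendor baking the
    # feature into the OS. All names are made up for illustration.
    from typing import Protocol

    class PointerContextProvider(Protocol):
        def focused_region(self) -> bytes:
            """Pixels or an accessibility-tree snippet around the pointer."""
            ...

        def selection_text(self) -> str | None:
            """The current text selection, if any."""
            ...

    class PointerContextConsumer(Protocol):
        def on_context(self, region: bytes, selection: str | None) -> None:
            """Called only when the user explicitly shares pointer context."""
            ...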
Reminds me of Put That There https://m.youtube.com/watch?v=RyBEUyEtxQo
This seems like one of those things that is usable infrequently enough to be forgotten/poorly developed/never used. (Even before accounting for the actual failure rate of the LLM, which will be non-zero.)
Perhaps a text box and file upload isn't the perfect interface for every use case, but it is versatile, which is a huge barrier for any replacement to overcome.
so will Google be monitoring whatever is on the screen continuously, or only when the user says the magic words (this, that, here, there)?
Indeed. "AI-enabled pointer" is misdirection. This isn't an AI-enabled pointer; it's sending the screen to AI, which, yes, includes the pointer position. The AI doesn't live in the pointer. The AI lives, apparently, so thoroughly in the system that it can see and do anything, and the pointer is just a way of giving it context.
Google Recall. Hey, it's all about the marketing.
Interesting! I wonder how UI will evolve in the long term. If there are browser-use/computer-use and clicky-clones automating pointer actions, do we really need complex UI anymore? If yes, when?
I've been playing with writing a visionOS app that allows an AI agent to be aware of what you're looking at at any given time.
At some point I fully expect eye tracking (or attention tracking) to be common enough to be a first-class input method.
It only took Google and their AI offering to come up with Graffiti.
No thanks
I wonder what sort of monstrous power would be unleashed if Google used Plan9 as a foundation.
They'd half-finish it then bury it, like they did with Fuchsia, which is heavily Plan 9-inspired.
Just seven hours ago there was a plea on HN [0] to please not do this. Seriously, what are they smoking at Google right now?
[0] https://news.ycombinator.com/item?id=48107027
There's already a product that does this lol
Aaaaand now I can't remember the name of it
being able to make precise edits would be huge for AI
Both of the text-based demos would have been simpler and faster with traditional mouse and keyboard interactions. What is the AI adding?
They're going to take your ability to do anything and spread it across many places so you have to run around to do things, same as all the moneyed technology.
It tracks what's on the screen and sends it back to Alphabet. If you're watching a video about BBQ, enjoy a bunch of ads for Omaha Steaks and Big Green Egg in your Gmail.
On a less serious note, the audience for this is people who want to optimize for what seems like the least amount of effort.
It feels like everything modern is like this. No value added, just the appearance of it.
Maybe I'm misunderstanding, but what is new about the pointer itself? Seems to be functionally the same as selecting + tooltips / context menus.
Shush, how is anyone going to get promoted with that kind of talk!?
> but what is new about the pointer itself?
I'm hoping for a const-reference joke.
Like a dream come true...
Nightmares are dreams as well and this is a nightmare like Windows Recall.
Technically wonderful though.
do not want
> We’ve been exploring new AI-powered capabilities to help the pointer not only understand what it’s pointing at, but also why it matters to the user.
We couldn't quite track you well enough before. So we're fixing that under the guise of "AI powered capabilities."
what the hell is going on at google
Thanks, I hate it