Astro - Hacker News

57 comments

oefrha 11 minutes ago

> If you were a Mercor contractor and you believe your voice may already be in circulation, ORAVYS will analyze the first three suspect samples free of charge.
Awesome, if you're a victim of an AI company having your voice, you can help yourself by sending another AI company your voice!
> Audio is never used to train commercial models without explicit consent
I'm sure Mercor has explicit consent as well; legal teams are reasonably good at legally covering their asses with license terms.
ethagnawl 4 minutes ago

So, they should all just rotate their voices ... right?
I jest but the majority of the "normal" people I know are happy to hand over biometrics because _it's easier_. We need to start branding biometrics as "forever passwords" or something to help people understand just what they're handing over when they validate access to their checking account or enter Disney World or whatever else.
eqvinox 2 hours ago

The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.
Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.
- tgv 15 minutes ago
  
  > Germans (because of course)
  I don't know if it's the reason you imply. In the 70s, there were big debates in Germany about privacy and data storage. They spoke of one's data shadow (Datenschatten). I suspect this word comes from that tradition. The reason the word exists would then be the reflection (Vergangenheitsbewältigung) on WW2.
  - xenocratus 4 minutes ago
    
    I took the "because of course" to be about having a word for everything - a stereotypical idea about the German language.
    
  - theptip 6 minutes ago
    
    The Stasi would be the obvious cultural context.
    In the US of course the government buys this sort of information legally from corporations.
  - mrsvanwinkle 9 minutes ago
    
    Love it, also love how Datenschatten can also imply that it disappears when someone shines light on it
    
    reactordev 5 minutes ago
    
    If only our past 20 year old self data could be so ephemeral…
    Who doesn’t want that old post, from when they were shit-faced outside of a bar in Nashville, going extinct forever now that they are in mid-life and “respectable” members of society?
- hiccuphippo 17 minutes ago
  
  Data that is publicly available also can't be stolen or leaked. Nobody can steal Mozilla's common voice dataset.
- wlesieutre an hour ago
  
  I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
  - CincinnatiMan an hour ago
    
    Were you not around for the Big Data heyday a decade ago?
    
    ToucanLoucan 15 minutes ago
    
    Hell you mean a decade ago? I still see businesses running losses left, right, and center saying that they're gonna monetize user data, any day now.
    Relatedly, "monetizing user data" seems to just mean ads. Ads on everything, forever, until the userbase gets fed up and moves to a new service that definitely won't do that, and the cycle repeats about every 3 years.
    
    varispeed an hour ago
    
    Once thumb drives became large enough to fit most datasets, it stopped being Big Data. Just normal data.
    
    ffsm8 15 minutes ago
    
    We have thumb drives that can store petabytes of data?
    Or did you mean the "big data" crowd which thought 500GB was noteworthy? I don't think anyone took those seriously, either in the 2010s or now. That was always "small" data.
    
    butlike 2 minutes ago
    
    > We have thumb drives that can store petabytes of data
    We do?
    
    varispeed 11 minutes ago
    
    Most companies using the term "big data" had datasets in the TB range. One company I had a gig at had a full Hadoop cluster set up and their whole dataset was 40GB. Their marketing had all the big-data-adjacent keywords all over the brochures for clients.
  - citrin_ru an hour ago
    
    Data hoarding predates LLMs. There were other machine learning methods which also needed data for training.
    
    Forgeties79 an hour ago
    
    “Before LLMs there was _____”
    I see this whenever an LLM’s impact is assessed. We know. The issue is scale and the ability for smaller and smaller groups (down to individuals) to execute at scale.
    Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
    
    dpoloncsak 32 minutes ago
    
    Do LLMs require that much more data than the traditional ML approaches we've seen over the years?
    
    sigmoid10 22 minutes ago
    
    Yes. This is pretty well established. Neural networks in general are considerably less sample-efficient than traditional ML methods. The reason they became so successful is that they scale better as you increase training data and model size. But only with modern compute power did they become useful outside of academic toy model applications.
    
    b00ty4breakfast 24 minutes ago
    
    I really hate this when it's something negative that humans also do. It's like, yeah, people do do that, but why are we automating {negativeTrait}?
    
    ToucanLoucan 12 minutes ago
    
    > Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.
    I have the faintest possible hope that such things are going to be the death knell of social media. Yeah a lot of credulous idiots are happily giving AI thirst traps their money for stroking their confirmation bias, but that's just who's left at this point. It feels like every social media app I use is gradually bleeding users who aren't hopelessly addicted to the dopamine treadmill, because what's left is just plain unappealing to them, which selects for the people who are most vulnerable to AI shit, which is far from ideal, but also means those platforms are comprised ever more of that vulnerable population and nobody else. And the problem with all these businesses going through that is without a diverse, growing audience, you just become InfoWars, slinging the same slop to the same people every day, and every ounce of said slop is great for what's left of your audience, but absolute garbage for getting anyone new in it. And it just goes on that way until you sputter out and die (or harass the wrong group of parents I guess).
    I wish all social media sites a very haha die in a fire.
- littlecranky67 15 minutes ago
  
  Data can never be stolen, because it is not a physical thing. Data can be copied, and it can be erased - sometimes both happen at the same time. Data can be lost; that is when its last existing copy is erased.
  - Peritract 8 minutes ago
    
    The use of "steal" for non-physical things [0] pre-dates the use of "data" in the modern sense [1]. Policing language incorrectly is not reasonable.
    [0] https://www.opensourceshakespeare.org/views/plays/play_view....
    [1] https://www.etymonline.com/word/data
  - altruios 9 minutes ago
    
    pedantic and true. What was stolen was not data, but future revenue based on exclusive access to that data.
Oravys 5 hours ago
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.
Happy to discuss the forensic detection side: AudioSeal watermarks, AASIST anti-spoofing, and how the detection landscape changes once voice biometrics start leaking at scale.
- davsti4 32 minutes ago
  
  Interesting - thanks for the rabbit hole today. ;)
  Mercor hasn't released many public statements about the incident. Social media posts aren't necessarily public, but I did find this breach notification sample filed with CA - https://oag.ca.gov/ecrime/databreach/reports/sb24-621099 . I guess we'll see if our legislators finally take data privacy seriously.
barrenko 21 minutes ago

It looks more like the purpose of such a company was to steal such data.
- 52-6F-62 14 minutes ago
  
  Look at their privacy policies. It absolutely is. They are harvesting video, voice, and much more.
VladVladikoff an hour ago

Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
- throwa356262 17 minutes ago
  
  What happens now is that a lot of clueless CTOs who didn't know about this company now know its name. So the outcome of this mess is probably more business for Mercor.
  I mean, just look at what happened to Crowdstrike....
embedding-shape an hour ago

I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
- hirako2000 an hour ago
  
  It's already there. And keeps moving.
  Even have a nice UI on top.
  https://voicebox.sh/
- jubilanti 19 minutes ago
  
  Not really, Mozilla Common Voice (the ImageNet of speech) is larger than this. Their English database has 3814 hours, 1.6 million sentences, from 100k speakers.
  https://commonvoice.mozilla.org/en/languages
amarcheschi an hour ago

I've been doing similar things on a different platform because as a uni student the pay is kinda nice, but I limit myself to tasks without voice/video, using just mouse/keyboard input to do reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work out to.
john_strinlai an hour ago

>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.
good luck with this. most finance people deal with hundreds to thousands of clients. they obviously cant remember everyones code word. commonly used finance systems arent setup to securely store these codewords. they dont have processes or policies in place to implement or adhere to any sort of codeword verification.
>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.
would this even have an effect? i have never heard of "rotating" a voice print. isnt the whole point of a voice print that you cant really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.
- wongarsu 34 minutes ago
  
  Someone who has hundreds or thousands of clients presumably couldn't remember every client's voice either, so no meaningful security is lost. They are approximately as secure or insecure as before
  - john_strinlai 33 minutes ago
    
    >presumably couldn't remember every client's voice either, so no meaningful security is lost
    there are automated systems for this already. my bank, isp, etc. use them when you call in to skip the traditional verification steps. this fact is also highlighted in the article.
    the problem is that there isnt typically a system in place for setting up or validating code words, so the advice given is not practical to implement.
- iterateoften 37 minutes ago
  
  Yeah, seems like nonsense advice. Have a code word that was never recorded? I don’t see how that would tote y anything. Like, the point of these systems is they can convincingly say stuff you never said.
  - MarsIronPI 5 minutes ago
    
    The idea is that the attacker doesn't know the codeword. If the attacker finds out about the codeword then the attacker could indeed fake it. Hence why you shouldn't say/write it in recordings or chat messages.
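
On the objection upthread that finance systems aren't set up to store codewords securely: a minimal sketch of how one could be kept and checked without retaining the plaintext, using only the Python standard library. The function names, the normalization rules, and the iteration count are assumptions for illustration, not anything from the article or from a real banking system.

```python
import hashlib
import hmac
import os
import unicodedata

# Assumed work factor for the sketch; a real deployment would tune this.
PBKDF2_ITERATIONS = 200_000


def _normalize(codeword: str) -> bytes:
    # A spoken codeword gets typed in by whoever takes the call, so fold
    # case and collapse whitespace before hashing (an assumed policy).
    text = unicodedata.normalize("NFKC", codeword).casefold()
    return " ".join(text.split()).encode("utf-8")


def enroll(codeword: str) -> tuple[bytes, bytes]:
    # Store only a random salt and the derived digest, never the codeword.
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", _normalize(codeword), salt, PBKDF2_ITERATIONS
    )
    return salt, digest


def verify(attempt: str, salt: bytes, digest: bytes) -> bool:
    # Re-derive from the attempt and compare in constant time.
    candidate = hashlib.pbkdf2_hmac(
        "sha256", _normalize(attempt), salt, PBKDF2_ITERATIONS
    )
    return hmac.compare_digest(candidate, digest)
```

Usage would be `salt, digest = enroll("...")` when the client sets the phrase, then `verify(spoken_phrase, salt, digest)` during a call. This only covers storage; it does nothing about the process and policy gap raised above.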
jacquesm 2 hours ago

You could have seen this coming a mile away. So far I have gotten away with never uploading my ID and/or interacting with one of those companies (though one idiot working for some VC thought it was ok to sign a document on my behalf by uploading my signature!!, never mind a bit of fraud) but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?
I had to open a bank account for a company here a few years ago, and that was right on the bubble of this happening. They still had an option to come by in person with the proper documentation, which I did. Now it is all outsourced.
These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.
- Schlagbohrer 24 minutes ago
  
  Tell us more about that fraud story! Was the person your attorney or accountant? Or just some "smart" person who decided to wisely save time by doing fraud?
- hiccuphippo 9 minutes ago
  
  Why is the ID a hidden secret that can be used for anything regarding security in the first place?
Havoc an hour ago

I love how the check for whether you're affected involves giving a voice sample to whatever the fuck that website is
- 2ndorderthought 5 minutes ago
  
  It's like those "have I been pwned"-type websites, where you type in your name and email and they grab your IP, location, and anything else to sell off.
josefritzishere 2 hours ago

This kind of event is the best argument against needless data hoarding. But it would help if the law better provided for some kind of consequences for negligence.
throw0101c an hour ago

"My voice is my passport. Verify Me."
:)
- java-man 34 minutes ago
  
  HSBC did that. I could never understand that - the exact phrase was in the movie!
  - NitpickLawyer 26 minutes ago
    
    Someone probably did it for an internal demo, as a joke. Then people pushed it upwards, until someone clueless approved it.
globalnode 11 minutes ago

not to be conspiratorial but stolen? or given away...